
Unveiling Customer Dynamics: PySpark Analysis for Daily New and Repeat Customer Counts.

Introduction:


In the fast-paced world of retail, understanding how customers behave is essential. In this blog post, we'll use PySpark to analyze transactional data and determine how many new customers and repeat customers visit a store each day.


Problem Statement:


In a retail setting, transactional data stored in a PySpark DataFrame holds vital information such as order ID, customer ID, order date, and order amount. Our goal is to analyze customer behavior by counting new and repeat customers for each day: a customer is new on the date of their first-ever order and a repeat customer on every later order date. For example, a customer who orders on January 1 and again on January 2 counts as new on the 1st and repeat on the 2nd.


Data Preparation:


We kickstart our journey with a CSV file of customer sales, stored as question5.csv in the code below. This file comprises four essential columns:

  • order_id: Unique identifier for each order.

  • cust_id: Identifier for each customer.

  • order_date: Date when the order was placed.

  • order_amount: The amount of the order.


Expected Output:


For each order_date, we want two counts: new_cust (customers placing their first-ever order on that date) and old_cust (repeat customers who have ordered before).

We will leverage Databricks Community Edition to solve the problem.
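
Given the sample data we create in STEP 1 below, the final result should look like this:

+----------+--------+--------+
|order_date|new_cust|old_cust|
+----------+--------+--------+
|2022-01-01|       3|       0|
|2022-01-02|       2|       1|
|2022-01-03|       1|       2|
|2022-01-04|       0|       3|
+----------+--------+--------+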

Let's begin by creating our dataset and reading it into a PySpark DataFrame.


STEP 1: Create and Store the Dataset


# Write the sample CSV to DBFS; the final True overwrites any existing file
dbutils.fs.put('/FileStore/rohit/question5.csv', '''
order_id,cust_id,order_date,order_amount
1,100,2022-01-01,2500
2,200,2022-01-01,2500
3,300,2022-01-01,2105
4,100,2022-01-02,2000
5,400,2022-01-02,2200
6,500,2022-01-02,2600
7,100,2022-01-03,3050
8,400,2022-01-03,1080
9,600,2022-01-03,3000
10,100,2022-01-04,3050
11,500,2022-01-04,1080
12,200,2022-01-04,3000''', True)


STEP 2: Read Data into a PySpark DataFrame and View Your Data


df = spark.read.options(header=True, inferSchema=True).csv('/FileStore/rohit/question5.csv')
df.show()



STEP 3: Import Necessary Libraries

We leverage the power of the DENSE_RANK window function to address our analysis needs. First, import the required functions and the Window class:


# note: this sum is PySpark's column aggregate, which shadows Python's built-in sum
from pyspark.sql.functions import col, dense_rank, when, sum
from pyspark.sql.window import Window


STEP 4: Create the Window Specification


window = Window.partitionBy('cust_id').orderBy('order_date')


Code explanation:

  • partitionBy('cust_id'): This part of the specification defines how the data should be partitioned or grouped. In this case, we are partitioning the data based on the cust_id column. It means that rows with the same cust_id will be grouped together in the window.

  • orderBy('order_date'): This part specifies the order in which the rows within each partition should be sorted. Here, we're ordering the rows by order_date, so within each partition (the group of rows with the same cust_id), the rows are sorted in ascending order of order_date, as illustrated below.
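
As a concrete illustration using the sample data: customer 100 orders on all four dates, so dense_rank() over this window assigns it rank 1 on 2022-01-01, rank 2 on 2022-01-02, rank 3 on 2022-01-03, and rank 4 on 2022-01-04. A rank of 1 therefore marks the date of a customer's first-ever order, which is exactly the signal we need.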


STEP 5: Apply the Window and View the Data


# rank = position of this order's date among the customer's distinct order dates
temp_df = df.withColumn('rank', dense_rank().over(window))
temp_df.show()
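
Given the sample data, temp_df should contain the rows below (the order in which show() prints them may vary):

+--------+-------+----------+------------+----+
|order_id|cust_id|order_date|order_amount|rank|
+--------+-------+----------+------------+----+
|       1|    100|2022-01-01|        2500|   1|
|       4|    100|2022-01-02|        2000|   2|
|       7|    100|2022-01-03|        3050|   3|
|      10|    100|2022-01-04|        3050|   4|
|       2|    200|2022-01-01|        2500|   1|
|      12|    200|2022-01-04|        3000|   2|
|       3|    300|2022-01-01|        2105|   1|
|       5|    400|2022-01-02|        2200|   1|
|       8|    400|2022-01-03|        1080|   2|
|       6|    500|2022-01-02|        2600|   1|
|      11|    500|2022-01-04|        1080|   2|
|       9|    600|2022-01-03|        3000|   1|
+--------+-------+----------+------------+----+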



STEP 6: Final Result


final_df = temp_df.groupBy('order_date').agg(
    sum(when(col('rank') == 1, 1).otherwise(0)).alias('new_cust'),
    sum(when(col('rank') != 1, 1).otherwise(0)).alias('old_cust')
).orderBy('order_date')


Let's break down the code:


  1. groupBy('order_date'): Groups the data by the order_date column, so all rows sharing the same order date are processed together.

  2. agg(): Performs aggregations on the grouped data; here we compute two aggregates for each order date.

  3. sum(when(col('rank') == 1, 1).otherwise(0)).alias('new_cust'): Within the aggregation, this checks whether the rank column equals 1, which means we are seeing the customer for the first time, i.e., a new customer. If so, it assigns 1, otherwise 0, and then sums these values. This effectively counts the rows where rank equals 1, giving the number of new customers for each order date. The result is aliased as new_cust.

  4. sum(when(col('rank') != 1, 1).otherwise(0)).alias('old_cust'): Similarly, this checks whether rank is not equal to 1, which means we have seen the customer before, i.e., a repeat customer. If so, it assigns 1, otherwise 0, and then sums these values, counting the rows where rank is greater than 1 and giving the number of repeat (old) customers for each order date. The result is aliased as old_cust.
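
To trace one group through this logic: on 2022-01-02, customer 100 carries rank 2 (a repeat customer), while customers 400 and 500 each carry rank 1 (new customers), so the aggregation yields new_cust = 2 and old_cust = 1 for that date.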


STEP 7: Show the Final Output
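
A simple show() call displays the result:

final_df.show()

Given our sample data, this should print:

+----------+--------+--------+
|order_date|new_cust|old_cust|
+----------+--------+--------+
|2022-01-01|       3|       0|
|2022-01-02|       2|       1|
|2022-01-03|       1|       2|
|2022-01-04|       0|       3|
+----------+--------+--------+

This matches the expected output shown earlier: on 2022-01-01 every customer is new, while by 2022-01-04 every buyer that day has purchased before.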





Thanks! If you have a better solution, please leave it in the comments section below.





