Introduction:
In the busy world of stores and shopping, it's really important to know how customers behave. In this blog post, we're going to talk about using a tool called PySpark to look at data and find out how many new customers and repeat customers visit a store each day.
Problem Statement:
In a retail setting, transactional data stored in a PySpark DataFrame holds vital information such as order ID, customer ID, order date, and order amount. Our goal is to analyze customer behavior by discerning the count of new and repeat customers for each day.
Data Preparation:
We kickstart our journey with a CSV file named customer_sales.csv. This file comprises essential columns:
order_id: Unique identifier for each order.
cust_id: Identifier for each customer.
order_date: Date when the order was placed.
order_amount: The amount of the order.
Expected Output:
We will leverage Databricks Community Edition for solving the problem.
Let's begin with creating our data set and reading it into Pyspark Dataframe.
STEP 1:Create and Store Datasets
dbutils.fs.put('/FileStore/rohit/question5.csv','''
order_id,cust_id,order_date,order_amount
1,100,2022-01-01,2500
2,200,2022-01-01,2500
3,300,2022-01-01,2105
4,100,2022-01-02,2000
5,400,2022-01-02,2200
6,500,2022-01-02,2600
7,100,2022-01-03,3050
8,400,2022-01-03,1080
9,600,2022-01-03,3000
10,100,2022-01-04,3050
11,500,2022-01-04,1080
12,200,2022-01-04,3000'''
,True)
STEP 2: Read Data into PYSPARK Data Frame and view your data
df=spark.read.options(header=True,inferSchema=True).csv('/FileStore/rohit/question5.csv')
df.show()
STEP 4: Window Function - Dense Rank
Leverage the power of DENSE_RANK window function to address our analysis needs.
Importing necessary libraries
STEP 5: Import necessary libraries
from pyspark.sql.functions import col, dense_rank, when, sum
from pyspark.sql.window import Window
STEP 6: Create Window Specification
window=Window.partitionBy('cust_id').orderBy('order_date')
Code explanation:
partitionBy('cust_id'): This part of the specification defines how the data should be partitioned or grouped. In this case, we are partitioning the data based on the cust_id column. It means that rows with the same cust_id will be grouped together in the window.
orderBy('order_date'): This part specifies the order in which the rows within each partition should be sorted. Here, we're ordering the rows based on the order_date. It means that within each partition (group of rows with the same cust_id , the rows will be sorted in ascending order of order_date.
STEP 7: Apply Window and View Data
temp_df=df.withColumn('rank',dense_rank().over(window))
temp_df.show()
STEP 8 : FINAL RESULT
final_df=temp_df.groupBy('order_date').agg(
sum(when(col('rank')==1,1).otherwise(0)).alias('new_cust'),
sum(when(col('rank')!=1,1).otherwise(0)).alias('old_cust')
).orderBy('order_date')
Let's break down the code:
groupBy('order_date'):This part groups the data by the 'order_date' column, meaning it will create groups where all rows with the same order date are together.
agg():This function is used for aggregation, meaning it's used to perform some calculation on grouped data. Here, we're aggregating data based on the order date.
sum(when(col('rank') == 1, 1).otherwise(0)).alias('new_cust'): Within the aggregation, this line calculates the sum of a condition. It checks if the rank column is equal to 1, means we have seen it for the first time and its a new customer. If it is, it assigns 1, otherwise 0. Then, it sums up these values. This effectively counts the number of rows where the rank is equal to 1, indicating new customers for each order date. Finally, the result is aliased as 'new_cust'.
sum(when(col('rank') != 1, 1).otherwise(0)).alias('old_cust'): Similarly, this line calculates the sum of another condition. It checks if the rank column is not equal to 1, means we have seen it for the previously and its a old customer. If it is, it assigns 1, otherwise 0. Then, it sums up these values. This effectively counts the number of rows where the rank is not equal to 1, indicating old customers (repeat customers) for each order date. Finally, the result is aliased as 'old_cust'.
STEP 9 : SHOW THE FINAL OUTPUT
Thanks, If you have any better solution, please leave in the comment section below.
Comments