Many organizations use Clickstream for reporting but struggle to turn that data into valuable Data Products and production-ready ML models. Thanks to Apache Spark and Cloud Computing, it’s not as daunting a task as you may think.
I’m going to cover in detail a recipe for taking your Clickstream, exploring it, ETLing it into a cloud platform, and then creating and publishing data products as a REST API for consumption. This approach is designed to be nimble and iterative, with no back-end engineering needed.
Created by Josh Janzen, Senior Data Scientist
In Four Simple Steps, ETL Clickstream to Data Product APIs (no Engineer needed!)
1. GM/DM 1:1 1
In Four Simple Steps, ETL Clickstream to Data Products (No Engineer Needed!)
Senior Data Scientist | Josh Janzen
2. Josh Janzen, Senior Data Scientist
Degrees from:
Data Science Tools:
About: Life Time champions a healthy and happy life for its members across 138 destinations in 38 major markets in the U.S. and Canada
3.
Step 1. Data Feed
- FTP to S3 w/bucket credentials
Step 2. Explore
- Sample data and explore
- Find columns of interest
Step 3. ETL/ML
- ETL columns of interest
- Apply ML algorithms
Step 4. Deploy
- Create web APIs with Azure ML
- Interactive Web Apps
4. Step 1. Data Feed
- FTP to S3 w/bucket credentials
- Start off as batch (nightly)
[Effort and progress gauges, scale 25% to 100%]
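As a rough illustration of the nightly batch idea in Step 1, the sketch below copies one night's feed files into a date-partitioned location in a bucket. It assumes the files have already landed in a local directory; the function names, the key layout, and the bucket and prefix values are hypothetical and do not come from the slides. The `client` argument is anything with an `upload_file(Filename, Bucket, Key)` method, which is the shape of the boto3 S3 client.

```python
from datetime import date
from pathlib import Path


def nightly_key(prefix: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return f"{prefix.strip('/')}/{day:%Y/%m/%d}/{filename}"


def upload_nightly(client, bucket: str, prefix: str, day: date, local_dir: str) -> list:
    """Upload every file in local_dir for one nightly batch; return the keys written."""
    keys = []
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file():
            key = nightly_key(prefix, day, path.name)
            # Assumption: client follows boto3's S3 upload_file(Filename, Bucket, Key).
            client.upload_file(str(path), bucket, key)
            keys.append(key)
    return keys
```

In practice `client` would likely be `boto3.client("s3")`, which reads the bucket credentials from the environment or a credentials file, and the function would be run once a night by a scheduler.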
5. Step 2. Explore
- Sample data and explore
- Find columns of interest
[Effort and progress gauges, scale 25% to 100%]
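One way to approach the sampling and exploring in Step 2 is sketched below in plain Python. It assumes the feed is a tab-delimited text file without a header row, which the slides do not state; the function names are hypothetical. The first function takes a fixed-size random sample without loading the whole file, and the second reports, for each column position, how often it is empty and how many distinct values it holds, which helps in spotting columns of interest.

```python
import csv
import random


def sample_rows(path, k, delimiter="\t", seed=0):
    """Reservoir-sample k rows from a delimited file in a single pass."""
    rng = random.Random(seed)
    sample = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f, delimiter=delimiter)):
            if i < k:
                sample.append(row)
            else:
                # Keep the new row with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = row
    return sample


def profile_columns(rows):
    """Per column position: share of empty values and count of distinct values."""
    if not rows:
        return []
    n_cols = max(len(r) for r in rows)
    profile = []
    for c in range(n_cols):
        values = [r[c] if c < len(r) else "" for r in rows]
        non_empty = [v for v in values if v != ""]
        profile.append({
            "column": c,
            "empty_share": 1 - len(non_empty) / len(values),
            "distinct": len(set(non_empty)),
        })
    return profile
```

On a full feed the same profile would more likely be computed with Spark; the plain-Python version is only meant to show the idea on a sample.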
6. GM/DM 1:1 6
STEP 2. EXPLOREeffort
25%
50%
75%
100%
progress
Func RemoveNullColumns:
    for column in dataframe:
        if column is null:
            remove column

Int threshold = 2
Func RemoveLowVariationColumns:
    for column in dataframe:
        if count(distinct values) in column < threshold:
            remove column
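
The pseudocode above could be written in Python roughly as follows. This is a minimal sketch that uses a plain dict of column name to list of values as a stand-in for a Spark DataFrame. It reads "column is null" as "every value in the column is null" and ignores nulls when counting distinct values; both readings are assumptions, as are the function and column names.

```python
def remove_null_columns(table):
    """Drop columns whose values are all None."""
    return {
        name: values
        for name, values in table.items()
        if any(v is not None for v in values)
    }


def remove_low_variation_columns(table, threshold=2):
    """Drop columns with fewer than `threshold` distinct non-null values."""
    return {
        name: values
        for name, values in table.items()
        if len({v for v in values if v is not None}) >= threshold
    }


# Hypothetical three-column sample.
table = {
    "visitor_id": ["a", "b", "c"],
    "country": ["US", "US", "US"],
    "unused": [None, None, None],
}
cleaned = remove_low_variation_columns(remove_null_columns(table))
print(list(cleaned))  # ['visitor_id']
```

With the threshold of 2 from the slide, a column is kept only if it has at least two distinct values, so constant columns such as `country` in this sample are dropped along with the all-null ones.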