Multi Source Data Analysis using Spark and Tellius
1. Multi Source Data Analysis
Using Apache Spark and Tellius
https://github.com/phatak-dev/spark2.0-examples
2. ● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
3. Agenda
● Multi Source Data
● Challenges with Multi Source
● Traditional and Data Lake Approach
● Spark Approach
● Data Source and Data Frame API
● Tellius Platform
● Multi Source analysis in Tellius
5. Multi Source Data
● In the era of cloud computing and big data, data for
analysis can come from various sources
● In every organization, it has become very common to use
multiple different storage systems holding a wide variety
of data
● The nature of the data will vary from source to source
● Data can be structured, semi-structured, or fully
unstructured
6. Multi Source Example in Ecommerce
● Relational databases are used to hold product details
and customer transactions
● Big data warehousing tools like Hadoop/Hive/Impala are
used to store historical transactions and ratings for
analytics
● Google Analytics stores the website analytics data
● Log data lives in S3/Azure Blob storage
● Every storage system is optimized to store a specific type
of data
8. Need of Multi Source Analysis
● If the analysis of the data is restricted to only one
source, then we may lose sight of interesting patterns in
our business
● A complete, 360-degree view of the business is not
possible unless we consider all the data available to us
● Advanced analytics like ML or AI is more useful when
there is more variety in the data
9. Traditional Approach
● The traditional way of doing multi source analysis needed
all data to be moved to a single data store
● This approach made sense when the number of sources
was small and the data was well structured
● With an increasing number of sources, ETL time grows
● Normalizing the data to a common schema becomes
challenging for semi-structured sources
● Traditional databases also cannot hold the data at this
volume
10. Data Lake Approach
● Move the data to big data enabled repository from
different sources
● It solves the problem of volume, but there are still
challenges with it
● All the rich schema information in the source may not
translate well to the data lake repository
● ETL time is still significant
● We will not be able to use the processing capabilities of
the underlying sources
● Not good for exploratory analysis
12. Requirements
● Ability to load the data uniformly from different sources
irrespective of their type
● Ability to represent the data in a single format
irrespective of its source
● Ability to combine the data from the sources naturally
● Ability to query the data across the sources naturally
● Ability to use the underlying source's processing
capabilities whenever possible
13. Apache Spark Approach
● The Data Source API of Spark SQL allows users to load
data uniformly from a wide variety of sources
● The DataFrame/Dataset API of Spark allows users to
represent data from all sources uniformly
● Spark SQL has the ability to join data from different
sources
● Spark SQL pushes down filters and prunes columns if the
underlying source supports it
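The pushdown idea can be sketched without Spark at all: instead of loading every row and filtering in the application, the predicate is handed to the source. Below is a minimal Python illustration using sqlite3 as a stand-in for a real source; the table, columns, and values are invented for this sketch:

```python
import sqlite3

# Toy illustration (not Spark): "pushdown" means the engine sends the
# predicate to the source instead of filtering after loading everything.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customerid INTEGER, revenue REAL, city TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 120.0, "Austin"), (2, 80.0, "Boston"), (3, 45.0, "Austin")],
)

# Without pushdown: pull every row, then filter in the application layer.
all_rows = conn.execute(
    "SELECT customerid, revenue, city FROM transactions ORDER BY customerid"
).fetchall()
filtered_in_app = [r for r in all_rows if r[2] == "Austin"]

# With pushdown: the predicate travels to the source, which moves less data.
pushed_down = conn.execute(
    "SELECT customerid, revenue, city FROM transactions "
    "WHERE city = ? ORDER BY customerid", ("Austin",)
).fetchall()

assert filtered_in_app == pushed_down  # same answer, less data transferred
```

Either path gives the same rows; the point of pushdown is that the second one moves far less data out of the source.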
15. Customer 360
● Four different datasets from two different sources
● We will be using flat file and MySQL data sources
● Demographics - Primarily focuses on customer information like
age, gender, location etc. (MySQL)
● Transactions - Cost of product, purchase date, store id, store
type, brands, retail department, retail cost (MySQL)
● Credit Information - Reward member, redemption method
● Marketing Information - Ad source, promotional code
16. Loading Data
● We are going to use the CSV and JDBC connectors for
Spark to load the data
● Due to automatic schema inference, we will get all the
needed schema in the data frame
● After that, we are going to preview the data using the
show method
● Ex : MultiSourceLoad
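As a rough stand-in for what the `MultiSourceLoad` example does with the CSV and JDBC connectors, the sketch below loads one source from a flat file and one from a SQL database (sqlite3 instead of MySQL, so it runs anywhere) into the same rows-of-dicts shape. All names and values here are invented:

```python
import csv, io, sqlite3

# Source 1: a flat file (an in-memory CSV stands in for a file on disk).
csv_text = "customerid,ad_source\n1,InstagramAds\n2,Email\n"
marketing = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: a SQL database (sqlite3 stands in for MySQL over JDBC).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customerid INTEGER, revenue REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", [(1, 120.0), (2, 80.0)])
cols = ["customerid", "revenue"]
transactions = [dict(zip(cols, row))
                for row in conn.execute("SELECT customerid, revenue FROM transactions")]

# Both sources now share one in-memory representation, much as Spark's
# Data Source API lands everything in a DataFrame.
print(marketing[0])      # {'customerid': '1', 'ad_source': 'InstagramAds'}
print(transactions[0])   # {'customerid': 1, 'revenue': 120.0}
```

Note that the CSV path yields strings while the database preserves types; Spark's schema inference is precisely what smooths over that difference.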
17. Multi Source Data Model
● We can define a data model using Spark's join
● Here we will be joining the 4 datasets on customerid as
the common column
● After an inner join, we get a data model which
combines all the sources
● Ex : MultiSourceDataModel
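The inner join in the `MultiSourceDataModel` example can be mimicked in a few lines of plain Python; two toy datasets keyed by customerid stand in for the four real ones, and the field names are invented:

```python
# Two toy datasets keyed by customerid (invented fields and values).
demographics = {1: {"city": "Austin"}, 2: {"city": "Boston"}}
credit = {1: {"reward_member": True}, 3: {"reward_member": False}}

def inner_join(left, right):
    # Keep only customerids present in both datasets, merging their columns,
    # just as an inner join on the key column would.
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}

model = inner_join(demographics, credit)
print(model)  # {1: {'city': 'Austin', 'reward_member': True}}
```

Chaining this over all four datasets gives the flat, combined model the slide describes.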
18. Multi Source Analysis
● Show us the sales by different sources
● Average Cost and Sum Revenue by City and
Department
● Revenue by Campaign
● Ex : MultiSourceDataAnalysis
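A toy version of one of the aggregations above (sum of revenue grouped by a dimension), which the Spark example `MultiSourceDataAnalysis` would express as groupBy/agg; the data is invented:

```python
from collections import defaultdict

# Invented rows from the flat data model.
rows = [
    {"city": "Austin", "revenue": 120.0},
    {"city": "Boston", "revenue": 80.0},
    {"city": "Austin", "revenue": 45.0},
]

# Group by city and sum revenue, mirroring a Spark groupBy("city").sum().
revenue_by_city = defaultdict(float)
for r in rows:
    revenue_by_city[r["city"]] += r["revenue"]

print(dict(revenue_by_city))  # {'Austin': 165.0, 'Boston': 80.0}
```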
20. About Tellius
Search and AI-powered analytics platform,
enabling anyone to get answers from their business data
using an intuitive search-driven interface and automatically
uncover hidden insights with machine learning
22. So much business data, but very few insights
● Low analytics adoption - Takes days/weeks to get
answers to ad-hoc questions
● Analysis process not scalable - Time consuming manual
process of analyzing millions of combinations and charts
● Trust with AI for business outcomes - No easy way for
business users and analysts to understand, trust and
leverage ML/AI techniques
23. Tellius is disrupting data analytics with AI
Combining modern search driven user experience with
AI-driven automation to find hidden answers
24. Tellius Modern Analytics experience
● From time consuming, canned reports and dashboards
to an on-demand, personalized experience
● Get instant answers: start exploring and reduce your
analysis time from hours to minutes
● Explainable AI for business analysts
● Self-service data prep
● Scalable in-memory data platform
● Search-driven conversational analytics
● Automated discovery of insights
● Automated machine learning
25. Only AI Platform that enables collaboration between roles
● Roles: Business User, Data Analyst, Data Engineer,
Data Science Practitioner
● DATA MANAGEMENT - Visual data prep with
SQL/Python support
● VISUAL ANALYSIS - Voice-enabled search-driven
interface for asking questions
● DISCOVERY OF INSIGHTS - Augmented discovery of
insights with natural language narrative
● MACHINE LEARNING - AutoML and deployment of
ML models with Explainable AI
26. Why Tellius?
● Intuitive UX - Google-like search-driven conversational
interface
● AI-Driven Automation - Reveals hidden relevant insights,
saving 1000's of hours
● Unified Analytics Experience - Eliminates friction
between self-service data prep, ad-hoc analysis and
explainable ML models
● Scalable Architecture - In-memory architecture capable
of handling billions of records
The only company providing an instant natural language search
experience, surfacing AI-driven relevant insights across billions of
records across data sources at scale, and enabling users to easily
create and explain ML/AI models
27. Business Value Proposition
● Ease of Use - Get instant answers with a conversational
search-driven approach
● Uncover Hidden Insights - Automate discovery of
relevant hidden insights in your data
● Save Time - Augment the manual discovery process with
automation powered by machine learning
30. Loading Data
● Tellius exposes various kinds of data sources to connect
to using the Spark Data Source API
● In this use case, we will be using the MySQL and CSV
connectors to load the data into the system
● Tellius collects metadata about the data as part of
loading
● Some of the connectors, like Salesforce and Google
Analytics, are homegrown using the same Data Source API
31. Defining Data Model
● Tellius calls data models business views
● Business views allow users to create data models across
datasets seamlessly
● Internally, all datasets in Tellius are represented as Spark
DataFrames
● Defining a business view in Tellius is like defining a
join in Spark SQL
32. Multi Source analysis using NLP
● Which top 6 sources by avg revenue
● Hey Tellius what’s my revenue broken down by
department
● show revenue by city
● show revenue by department for InstagramAds
● These ultimately run as Spark queries and produce the
results
● We can also use voice
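Purely as an illustration of the idea that a search phrase ends up running as a query, here is a naive pattern match for one fixed shape of question; Tellius's actual NLP layer is of course far richer than this sketch, and the table name `model` is invented:

```python
# Naive sketch: translate "show <measure> by <dimension>" into SQL.
# Real natural-language layers handle synonyms, typos, filters, and more.
def question_to_sql(question):
    _, measure, _, dimension = question.split()
    return f"SELECT {dimension}, SUM({measure}) FROM model GROUP BY {dimension}"

print(question_to_sql("show revenue by city"))
# SELECT city, SUM(revenue) FROM model GROUP BY city
```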
33. Multi Source analysis using Assistant
● Show total revenue
● By city
● What about cost
● for InstagramAds
● Use Voice
● Try out Google Home
35. Spark DataModel
● A Spark join creates a flat data model, which is different
from a typical data warehouse data model
● This flat data model is fine when there is no
duplication of primary keys, as in a star model
● But if there is duplication, we end up double counting
values when we run the queries directly
● Example : DoubleCounting
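The `DoubleCounting` example can be reproduced with toy data: a customer-level measure joined against a one-to-many table lands on every matching row, so a naive sum over the flat model doubles it. The names and numbers below are invented:

```python
# One customer-level measure and a one-to-many table of store visits.
customers = [{"customerid": 1, "lifetime_revenue": 100.0}]
visits = [{"customerid": 1, "store": "A"}, {"customerid": 1, "store": "B"}]

# Flat inner join: the single revenue value now appears on two rows.
joined = [{**c, **v} for c in customers for v in visits
          if c["customerid"] == v["customerid"]]

# Summing directly over the flat model double counts the measure.
naive_sum = sum(row["lifetime_revenue"] for row in joined)
print(naive_sum)  # 200.0, double the true value of 100.0
```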
36. Handling Double Counting in Tellius
● Tellius has implemented its own query language on top
of the Spark SQL layer to implement data warehouse-like
strategies that avoid this double counting
● This layer allows Tellius to provide multi source analysis
on top of Spark with the accuracy of a data warehouse
system
● Ex : show point_redemeption_method
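One warehouse-style remedy, sketched here in plain Python and not a description of Tellius's actual query layer: collapse the flat joined table back to the measure's own grain (one row per customerid) before aggregating. The data is invented:

```python
# A flat joined table where the customer-level measure was duplicated
# across the one-to-many join (invented data).
joined = [
    {"customerid": 1, "lifetime_revenue": 100.0, "store": "A"},
    {"customerid": 1, "lifetime_revenue": 100.0, "store": "B"},
]

# Deduplicate to the measure's grain: one value per customerid.
per_customer = {row["customerid"]: row["lifetime_revenue"] for row in joined}
correct_sum = sum(per_customer.values())
print(correct_sum)  # 100.0 rather than the double-counted 200.0
```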
37. References
● Dataset API -
https://www.youtube.com/watch?v=hHFuKeeQujc
● Structured Data Analysis -
https://www.youtube.com/watch?v=0jd3EWmKQfo
● Anatomy of Spark SQL -
https://www.youtube.com/watch?v=TCWOJ6EJprY