The recent boom in big data processing and the democratization of the big data space have been enabled by the fact that most of the concepts that originated in the research labs of companies such as Google, Amazon, Yahoo and Facebook are now available as open source. Technologies such as Hadoop and Cassandra let businesses around the world become more data driven and tap into their massive data feeds to mine valuable insights.
At the same time, these new big data technologies, and the big data technology stack as a whole, are still at an early stage of the maturity curve. Many of the technologies originated from a particular use case, and attempts to apply them in a more generic fashion are hitting the limits of their technological foundations. In some areas, several technologies compete for the same set of use cases, which increases the risks and costs of big data implementations.
We will show how GoodData solves the entire big data pipeline today, starting from raw data feeds all the way up to actionable business insights. All of this is provided as a hosted multi-tenant environment that lets customers solve their own analytical use case, or many analytical use cases for thousands of their own customers, all using the same platform and tools, while abstracting away the technological details of the big data stack.
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
1. GoodData – the Case Study #2: Big Data Pipeline for Analytics at Scale
DB Technologies for Big Data @ FIT CVUT
© 2014 GoodData Corporation. All Rights Reserved.
November 19, 2014
3. End to End, Analytics Platform as a Service
• Traditional BI
○ Data Visualization: Tableau, Qlikview, Spotfire, etc.
○ Analytics Engine: Cognos, Oracle, Business Objects, etc.
○ Data Marts: MySQL, PostgreSQL, etc.
○ Data Warehouse: Oracle, Teradata, Netezza, Microsoft, etc.
○ ETL: Informatica, DataStage, Boomi, Snaplogic, etc.
○ Infrastructure: Servers, Storage, Networking, etc.
• GoodData Platform
○ Data Collaboration
○ Data Visualization
○ Analytics Engine
○ Data Marts
○ Data Warehouse
○ ELT / ETL
○ Infrastructure
4. One Platform. Two Markets.
• For Your Customers
○ Powered By GoodData Partner Program for disruptive ISVs including Zendesk, Switchfly, and Phizzle
• For Your Business
○ Drive your business with your data.
○ Experience and accelerators for Social, Sales, Marketing, Yammer
11. GoodData Platform Zoom-In
• End to End, Analytics Platform as a Service
○ Data Collaboration
○ Data Visualization
○ Analytics Engine
○ Data Marts
○ Data Warehouse
○ ELT / ETL
○ Infrastructure
14. Let’s Start With The Outcome - The Insights
• User Experience
○ Visual Appeal
○ Ease of Use
○ Performance
• Analytical Power
• Many Data Sources
○ Need to cross analyze all of them
○ Need to add/remove sources as needed
• Cost Efficiency
○ Computational density allowed by multi-tenancy
15. Let’s Start With The Outcome - The Insights
• Analytical Engine / MAQL
• Exploration, Visualization and Distribution Layer
• Pluggable Database Backends
• 10s of GB up to TBs
16. Behind The Scenes - The Big Data Pipeline
• Large Data Throughput
○ Close to Real-time Updates
• Many Data Sources
○ Need to cross analyze all of them
○ Need to add/remove sources as needed
• Agility
○ Capture all data without knowing the analytical use case in advance
• Cost Efficiency
○ Computational density allowed by multi-tenancy
17. Behind The Scenes - The Big Data Pipeline
• Big Data Store
○ 100s of TBs per customer
○ Persist All Incoming Data
○ CSV, XML, JSON, ...
• Immutable
○ Append Only
○ Keep Ingestion History
• Technologies
○ Amazon S3
○ Cloud Files
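The append-only, history-keeping behaviour described on this slide can be illustrated with a small sketch. It is not GoodData's implementation: the `AppendOnlyStore` class, its file layout, and the `_ingestion_history.jsonl` log name are all hypothetical, and a local directory stands in for an object store such as Amazon S3 or Cloud Files.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


class AppendOnlyStore:
    """Hypothetical append-only landing zone; a local directory stands in for an object store."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        # Assumed layout: one JSON-lines log that records every ingestion.
        self.history_path = self.root / "_ingestion_history.jsonl"

    def ingest(self, source, payload, fmt):
        """Persist one raw payload (bytes) as-is and record it in the ingestion history."""
        digest = hashlib.sha256(payload).hexdigest()
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        target = self.root / source / f"{stamp}_{digest[:12]}.{fmt}"
        target.parent.mkdir(parents=True, exist_ok=True)
        # Mode "xb" raises FileExistsError instead of overwriting an existing object.
        with open(target, "xb") as handle:
            handle.write(payload)
        record = {
            "source": source,
            "path": str(target),
            "sha256": digest,
            "format": fmt,
            "ingested_at": stamp,
        }
        # The history log is only ever opened in append mode.
        with open(self.history_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        return target

    def history(self):
        """Return all ingestion records in the order they were written."""
        if not self.history_path.exists():
            return []
        with open(self.history_path, encoding="utf-8") as log:
            return [json.loads(line) for line in log]
```

Because nothing is updated in place, a later stage can always re-read the raw CSV, XML, or JSON exactly as it arrived, which is what makes it possible to capture data before the analytical use case is known.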
18. Behind The Scenes - The Big Data Pipeline
• Agile Data Warehouse
○ 10s of TBs per customer
○ Relational Model
○ Semi-Cleansed
○ Complete History Captured
• Technologies
○ HP Vertica
○ GoodData BI Integration Services
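One way to read "Complete History Captured" in a relational model is that every load adds new rows rather than updating old ones, so the state at any past date can be rebuilt with a query. The sketch below is an assumption-laden illustration only: Python's built-in `sqlite3` stands in for the warehouse, and the `account_history` table, its columns, and the sample rows are invented for the example.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_history (account_id INTEGER, plan TEXT, loaded_at TEXT)"
)
# Hypothetical sample data: every change is a new row, nothing is updated in place.
conn.executemany(
    "INSERT INTO account_history VALUES (?, ?, ?)",
    [
        (1, "trial", "2014-01-01"),
        (1, "pro", "2014-06-01"),
        (2, "trial", "2014-03-01"),
    ],
)


def state_as_of(connection, as_of):
    """Return the latest known (account_id, plan) per account on the given ISO date."""
    query = """
        SELECT h.account_id, h.plan
        FROM account_history AS h
        JOIN (
            SELECT account_id, MAX(loaded_at) AS loaded_at
            FROM account_history
            WHERE loaded_at <= ?
            GROUP BY account_id
        ) AS latest
          ON latest.account_id = h.account_id
         AND latest.loaded_at = h.loaded_at
        ORDER BY h.account_id
    """
    return connection.execute(query, (as_of,)).fetchall()
```

With the sample rows, `state_as_of(conn, "2014-04-01")` returns both accounts on the trial plan, while a date after June 2014 shows account 1 on the pro plan.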
19. Behind The Scenes - The Big Data Pipeline
• Combine Input Stage Data Sets
○ Mapping, Cleansing
• Perform Data Transformations in Data Warehouse
○ Benchmarking, Snapshotting, Sampling
• Generate Data Mart Input Data
○ Data Warehouse : Data Mart relation is typically 1 : N
○ Tens of thousands of Data Marts in the Powered by GoodData (PbG, OEM) use case!
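The mapping, cleansing, and 1 : N fan-out steps above can be sketched as follows. This is a simplified illustration, not the actual GoodData transformation logic: the column names in `COLUMN_MAP`, the cleansing rules, and the choice of one data mart per tenant are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping from source column names to warehouse column names.
COLUMN_MAP = {"Customer": "tenant_id", "Amt": "amount", "Stage": "stage"}


def cleanse(raw):
    """Map source columns, normalise values, and return None for unusable rows."""
    row = {COLUMN_MAP[key]: value for key, value in raw.items() if key in COLUMN_MAP}
    tenant = str(row.get("tenant_id", "")).strip().lower()
    if not tenant:
        return None
    try:
        amount = float(row.get("amount", ""))
    except (TypeError, ValueError):
        # Rows without a numeric amount are dropped in this sketch.
        return None
    stage = str(row.get("stage", "")).strip().lower()
    return {"tenant_id": tenant, "amount": amount, "stage": stage}


def build_data_mart_inputs(raw_rows):
    """Split one shared set of warehouse rows into N per-tenant data mart inputs."""
    marts = defaultdict(list)
    for raw in raw_rows:
        row = cleanse(raw)
        if row is not None:
            marts[row["tenant_id"]].append(row)
    return dict(marts)
```

Each key of the returned dictionary corresponds to one data mart input, so a single warehouse load can feed many marts, which is the 1 : N relation the slide refers to.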
20. Behind The Scenes - The Big Data Pipeline
• GoodData BI Integration Services
○ CloudConnect Runtime
○ Ruby Runtime
○ Data Integration Console
Over 2M ETL jobs per week!
21. The Wrap-Up - The Big Data Pipeline
Progression Through:
• Big Data Store
• Data Warehouse
• Data Marts
As a means to satisfy the end user:
• User Experience
• Analytical Power
• Many Data Sources
• Cost Efficiency