The Future of Big Data Tooling
Alexander Aldev
About me
Translator | between business and IT
CTO and co-founder | MammothDB
17 years | various shades of analytics, DWH, BI
Nerd | making scaled data infrastructure practical
Spoiler Alert!
This talk
Can we predict the future?
How do Big Data tools work today?
How did they evolve?
… and some examples
What is their environment?
yes, we can! It already happened.
HOW MANY Z’S IN THIS
SOUP?
photo Ursus Wehrli
THE BIG DATA TOOL APPROACH
photo Ursus Wehrli
1
2
3
4
MAYBE THIS WOULD HELP…
photo Ursus Wehrli
Working Definition
Just what’s Big Data?
Datasets so large and/or complex
that traditional data processing techniques
are inadequate to handle them
Examples
Indexing 100PB of crawled web content
Providing online interactive analytics to 10 million clients
IT’S RELATIVE
photo Anders Rasmusen
Today
The Big Data toolset?
… for analytics, this is mostly synonymous with Hadoop
Hadoop architecture
Cluster of Commodity Servers
Distributed File Store
(HDFS)
Resource
Management
(YARN)
Distributed Compute
(MapReduce)
Higher-level Apps
NoSQL Data Store
(HBase)
Data Flow
(Pig)
Query
(Hive)
Machine Learning
(Mahout)
the DFS
Data Node 1
File 1
Data Node 2
Data Node 10
File 2
High throughput
Linear scalability
Fault tolerance
block
replication
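A minimal sketch of how block replication gives fault tolerance. It assumes a simple round-robin placement and a replication factor of 3; the function names are hypothetical, and real HDFS placement is rack-aware and more involved.

```python
def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Map each block index to `replication` distinct nodes (round-robin, illustrative only)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable while every block keeps at least one live replica."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


nodes = [f"node{i}" for i in range(1, 11)]  # 10 data nodes, as on the slide
placement = place_blocks(file_size_mb=640, block_size_mb=64, nodes=nodes)

print(len(placement))                                         # number of blocks
print(readable_after_failure(placement, ["node1", "node2"]))  # two nodes lost
```

With three replicas per block, losing any two nodes leaves every block readable; losing all three holders of one block does not.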
classical workflow
1 MapReduce Job
Input File on DFS
Split
Extract
Structure
Shuffle
Aggregate
Output File on DFS
Store on DFS
Read from DFS
Store on Local FS
Analytical Query
Input on DFS Input on DFS
M/R Job
M/R Job
M/R Job
Intermediate/DFS
M/R Job
Intermediate/DFS
Intermediate/DFS
Output on DFS
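The stages of a single MapReduce job can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the function names are hypothetical, and the "soup" string is made-up sample input echoing the quiz from the opening slides.

```python
from collections import defaultdict


def split(text, n_splits):
    """Split: cut the input into roughly equal chunks, one per map task."""
    size = max(1, -(-len(text) // n_splits))  # ceiling division, at least 1
    return [text[i:i + size] for i in range(0, len(text), size)]


def map_task(chunk):
    """Extract structure: emit a (letter, 1) pair for every alphabetic character."""
    return [(ch.upper(), 1) for ch in chunk if ch.isalpha()]


def shuffle(mapped):
    """Shuffle: group all values by key across the outputs of every map task."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Aggregate: sum the counts collected for one key."""
    return key, sum(values)


def run_job(text, n_splits=4):
    mapped = [map_task(chunk) for chunk in split(text, n_splits)]
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in grouped.items())


soup = "ZAQZXZ BZZT PIZZA"  # made-up sample input
counts = run_job(soup)
print(counts["Z"])
```

In a real cluster each stage boundary involves disk or network I/O, which is why chaining several jobs for one analytical query is expensive.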
programmability
Map()
Reduce()
complex queries
require running many
Map/Reduce jobs!!!
JOINs are difficult
WHEREs are difficult
File 1 k
File 2k
node 1
node 2
shuffle
File 2k
= k ?
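Why JOINs are difficult: matching keys from two files only meet after a shuffle moves them to the same node. A minimal sketch of a reduce-side join follows, assuming records are (key, value) pairs; the data and function names are hypothetical.

```python
from collections import defaultdict


def map_tagged(records, tag):
    """Map: tag each record with its source file so the reducer can tell them apart."""
    return [(key, (tag, value)) for key, value in records]


def shuffle(pairs, n_nodes):
    """Shuffle: route every pair to the node that owns its key."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    for key, tagged in pairs:
        nodes[hash(key) % n_nodes][key].append(tagged)
    return nodes


def reduce_join(key, tagged_values):
    """Reduce: pair every File 1 value with every File 2 value sharing the key."""
    left = [v for tag, v in tagged_values if tag == "file1"]
    right = [v for tag, v in tagged_values if tag == "file2"]
    return [(key, l, r) for l in left for r in right]


def join(file1, file2, n_nodes=2):
    pairs = map_tagged(file1, "file1") + map_tagged(file2, "file2")
    out = []
    for node in shuffle(pairs, n_nodes):
        for key, tagged in node.items():
            out.extend(reduce_join(key, tagged))
    return sorted(out)


file1 = [(1, "a"), (2, "b"), (3, "c")]            # made-up sample data
file2 = [(1, 10.0), (3, 30.0), (3, 31.0), (4, 40.0)]
result = join(file1, file2)
print(result)
```

Every record of both inputs crosses the network once, regardless of how few keys actually match.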
resource management
1 Task = 1 Core
Split
400 cores
= 100 nodes x 4 cores
2.5 GB/s
= 400 tasks x 64 MB/task / 10 sec/task
14.6 GB/s
= 100 nodes * 150 MB/s
20 cores
= 5 nodes x 4 cores
128 MB/s
= 20 tasks x 64 MB/task / 10 sec/task
2.9 GB/s
= 20 nodes * 150 MB/s
theoretical
theoretical
max
max
in reality, multiple M/R
~ 3MB/s
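The slide's arithmetic can be reproduced as a short calculation. The sketch below uses the slide's own parameters (64 MB per task, 10 seconds per task, 150 MB/s of disk throughput per node) and assumes 1 GB = 1024 MB, which is what matches the quoted figures. Note that the small-cluster disk figure of 2.9 GB/s is computed for 20 nodes, whereas the core count is given for 5 nodes; the same formula with 5 nodes gives 750 MB/s.

```python
def task_throughput_mb_s(nodes, cores_per_node, mb_per_task=64, sec_per_task=10):
    """Throughput when every core runs one task (1 task = 1 core)."""
    tasks = nodes * cores_per_node
    return tasks * mb_per_task / sec_per_task


def disk_throughput_mb_s(nodes, mb_s_per_node=150):
    """Theoretical maximum if every node streams from disk at full speed."""
    return nodes * mb_s_per_node


big_tasks = task_throughput_mb_s(nodes=100, cores_per_node=4)
big_disk = disk_throughput_mb_s(nodes=100)
small_tasks = task_throughput_mb_s(nodes=5, cores_per_node=4)
small_disk_20 = disk_throughput_mb_s(nodes=20)  # node count used on the slide
small_disk_5 = disk_throughput_mb_s(nodes=5)    # node count matching 20 cores

print(f"{big_tasks / 1024:.1f} GB/s vs {big_disk / 1024:.1f} GB/s")
print(f"{small_tasks:.0f} MB/s vs {small_disk_20 / 1024:.1f} GB/s")
print(f"{small_disk_5:.0f} MB/s with 5 nodes")
```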
Spark architecture
Cluster of Commodity Servers
Distributed File Store
(HDFS)
Resource
Management
(YARN)
Distributed Compute
(Spark)
Higher-level Apps
NoSQL Data Store
(HBase)
Data Flow
(Scala)
Query
(Spark SQL)
Machine Learning
(MLlib)
Optimized Execution
what’s different?
Pipelines for batches of jobs
Memory caching of intermediate results
Programmability
Rich set of high-level data flow operations
Support for popular languages Scala, Java, Python
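Pipelining and in-memory caching can be illustrated without a cluster. The class below is a hypothetical pure-Python stand-in, not the Spark API, although its method names deliberately echo Spark's `map`, `filter`, `cache` and `collect`. The input lines are made-up sample data.

```python
class LazyDataset:
    """Chains operations lazily and optionally keeps the computed records in memory."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable yielding records
        self._cache_enabled = False
        self._cached = None

    def _records(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_enabled:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._records()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._records() if pred(x)))

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        return list(self._records())


stats = {"scans": 0}  # counts how often the "file" is actually read


def scan_file():
    stats["scans"] += 1
    yield from ["a,1", "b,2", "a,3", "c,4"]


parsed = (
    LazyDataset(scan_file)
    .map(lambda line: line.split(","))
    .map(lambda parts: (parts[0], int(parts[1])))
    .cache()
)

total = sum(value for _, value in parsed.collect())
only_a = parsed.filter(lambda kv: kv[0] == "a").collect()
print(total, only_a, stats["scans"])
```

The two `map` steps run as one pass over the input, and the second query reuses the cached intermediate result instead of rescanning the file.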
Workflow
what’s the same?
Scan the file
Interpret data structure in user code
Perform analysis
Philosophy
Ingest and collect all data now
Analyze later
Hadoop Storage
other improvements
Columnar data formats
Compression
SQL-on-Hadoop
Friendlier interface to analysts and tools
Optimized implementation (Impala, PrestoDB)
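A rough sketch of why columnar formats help analytics: a query that needs one column reads only that column, and values of one type sit next to each other, which typically compresses better. The example uses JSON and zlib purely as stand-ins for a real columnar format; the field names and data are hypothetical.

```python
import json
import zlib

# Made-up row-oriented records.
rows = [
    {"id": i, "country": "BG" if i % 2 else "DE", "amount": i * 10}
    for i in range(1000)
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total = sum(columns["amount"])

# Compare compressed sizes of the two layouts (illustrative only).
row_bytes = zlib.compress(json.dumps(rows).encode())
col_bytes = zlib.compress(json.dumps(columns).encode())
print(total, len(row_bytes), len(col_bytes))
```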
Data Sources
now, an enterprise
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms
Typical Challenges
Data quality
Data integration
Interactive analytics
Business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!
Scalable Storage and Computation
Big Data tools offer
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale
Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user and difficult / slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run
Top Uses in 2015 (Gartner)
Hadoop adoption
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH
Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10TB of storage
Top Reasons for Slow Adoption
Lack of adequate skill
No business case
Especially good at …
so Hadoop is …
Batch-processing
of web-scale
unstructured data
on large expensive infrastructures
But not that good at …
data integration and unification
concurrent use
interactive querying
accessibility to business users
Yeah, mainframes of old days…
sounds familiar…?
Batch-processed
Centralized
Users queuing for system access
CODASYL-style programming
What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users
Interfacing data management and presentation tools
Data integration methodologies
scaled-out DBMS
Cluster of Commodity Servers
Distributed File Store
Resource
Management
Distributed Execution & Aggregation
Higher-level Apps
Declarative Query Language
Distributed Database Engine
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
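Partitioned storage with distributed execution and aggregation can be sketched with one in-memory SQLite database per "node". This is an illustration of the general scatter-gather idea only, not MammothDB's implementation; the table, column names and data are hypothetical.

```python
import sqlite3


def make_node(rows):
    """Create one node: a local database holding its partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (lane TEXT, cost REAL)")
    conn.executemany("INSERT INTO facts VALUES (?, ?)", rows)
    return conn


def partition(rows, n_nodes):
    """Distribute rows across nodes round-robin."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts


def distributed_avg(nodes):
    """Scatter a partial aggregate to every node, then merge the partial results."""
    partials = {}
    for conn in nodes:
        query = "SELECT lane, SUM(cost), COUNT(*) FROM facts GROUP BY lane"
        for lane, part_sum, part_count in conn.execute(query):
            total, count = partials.get(lane, (0.0, 0))
            partials[lane] = (total + part_sum, count + part_count)
    return {lane: total / count for lane, (total, count) in partials.items()}


rows = [("A-B", 100.0), ("A-B", 200.0), ("A-C", 50.0),
        ("A-B", 300.0), ("A-C", 150.0)]  # made-up sample data
nodes = [make_node(part) for part in partition(rows, 3)]
result = distributed_avg(nodes)
print(result)
```

Each node returns SUM and COUNT rather than AVG, because per-node averages cannot be combined correctly when partitions differ in size. The user still writes one declarative query; the engine decides how to split and merge it.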
MammothDB architecture
Cluster of Commodity Servers
Resource
Management
Interactive Map/Reduce
Higher-level Apps
SQL
Columnar RDBMS (per Node)
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
Business Challenge
use case logistics
Predict cost of moving cargo between pairs of cities
Integrate into ERP
Validate at country level globally
Track historical accuracy
Outputs: 3 levels of service, 15,000 trade lanes, 4 charges
Client DWH
Solution
MammothDB Web Portal
E-LT
prediction
model
MS SSAS
ROLAP cube
Rate
Calculator
SAP extract
generator
Business Challenge
use case media planning
Track campaign across different media
Integrate online feeds
Store extended historical data
Load into downstream system
Provide ad-hoc reporting
Google
Solution
MammothDB Web Portal
E-LT
pull &
consolidate
MS SSAS
ROLAP cube
extract
generator
Facebook
Gemius
…
QlikView
Q & A
Thank you!


Editor's Notes

  1. … and the next two slides summarize the key points.
  2. Let’s take a quiz. How many letter Z’s are in this soup?
  3. Let’s do it like a big data tool.
  4. Although Big Data is typically associated with web scale, hundreds of petabytes, and a mix of structured and unstructured data, even copying a 1 TB external HDD is still a challenge: it takes 2 hours.