SlideShare a Scribd company logo
1 of 41
© 2012 IBM Corporation1
Information Retrieval, Applied Statistics and Mathematics
on BigData
Romeo Kienzler
Data Scientist and Architect
IBM Innovation Center Zurich
© 2012 IBM Corporation2
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< 500 EURO
100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1,5 d
© 2012 IBM Corporation3
Supercomputer in a Rack
Supercomputer before
➔
Weather
➔
Atom Bombs
➔
Science
➔
Crash Tests
Supercomputer in a Rack
➔
18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st
TOP500 2013: 17590 TFLOPS 2004: 71 TFLOPS)
© 2012 IBM Corporation4
Hadoop / BigInsights
© 2012 IBM Corporation5
Hadoop Distributed File System
© 2012 IBM Corporation6
Hadoop Job Scheduling
© 2012 IBM Corporation7
Aggregated Bandwith between CPU, Main
Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
© 2012 IBM Corporation8
Watson
1 TB (at 45.5 GByte/s)
- 1 Core - 22 sec
- 10 Core - 2.2 sec
- 100 Core - 220 msec
- 1000 Core - 22 msec
- 10000 Core - 2.2 msec
© 2012 IBM Corporation9
Data Streaming
X86
Box
X86 Blade Cell
Blade
X86 BladeFPGA
Blade
X86
Blade
X86 Blade X86
Blade
X86 BladeX86
Blade
Operating System
Transport
System S Data Fabric
Processing Element Container Processing Element Container Processing Element Container Processing Element Container Processing Element Container
© 2012 IBM Corporation10
Massive Parallel DataWarehousing
© 2012 IBM Corporation11
Why do we need to process so much data?
© 2012 IBM Corporation12
12
Data Growth
Data AVAILABLE to an
organization
data an organization can
PROCESS
Missed
opportunity
100 Million Tweets are posted every day, 35 hours of video are being uploaded every
minute,6.1 x 10^12 text messages have been sent in 2011 and 247 x 10^9 E-Mails passed
through the net.80 % spam and viruses. => Filtering is more and more important.
Up to 2003 the same amount of data has been produced as between 2003 and now
© 2012 IBM Corporation13
Separate the Signal From the Noise¹
¹http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/
© 2012 IBM Corporation14
The Unreasonable Effectiveness of Data¹
"sometimes it's not
who has the best
algorithm that wins;
it's who has the most
data."
(C) Google Inc.
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
© 2012 IBM Corporation15
Statistical Modeling of Physical Systems
© 2012 IBM Corporation16
From Unstructured Data to Structured Data -
Feature Extraction
Feature extraction involves simplifying the amount of resources
required to describe a large set of data accurately¹
¹: Wikipedia
© 2012 IBM Corporation17
Dimension Reduction
Principal Component Analysis / Singular Value Decomposition
Linear Discriminant Analysis
Source: coursera.org
© 2012 IBM Corporation18
Data Parallelism
© 2012 IBM Corporation19
Data Parallelism
Calculate the empirical mean along each dimension m = 1, ..,M (step in Principal Component Analysis)
N-gram Models (NLP)
Ordinary Least-Square Parameter Estimator for Linear Regression
© 2012 IBM Corporation20
BUT: Do I want to care about algorithm
parallelization?
© 2012 IBM Corporation21
High-Level Languages
Source: Hadoopsphere.com
© 2012 IBM Corporation22
High-Level Languages (IBM SystemML)
Extensible Library
Linear SVMs,
Logistic Reg
K-means
Classification
Linear
Regression
Regression
SGD solver,
NMF
Matrix Factorizations Clustering
PageRank,
HITS
Ranking
Parser
High-Level Ops
Low-Level Ops
Runtime Ops
Optimizations
Hadoop
DML
Scripts
Open Source Variant:
Apache Mahout
- less algorithms
- no optimizer
© 2012 IBM Corporation23
High-Level Languages (RHadoop)
Source: http://www.revolutionanalytics.com
© 2012 IBM Corporation24
High-Level Languages (R on IBM PureData)
Source: http://www.revolutionanalytics.com
© 2012 IBM Corporation25
Push Back
Application Algorithm Compile Engine Execution Language Engine
© 2012 IBM Corporation26
Push Back
© 2012 IBM Corporation27
Push Back
© 2012 IBM Corporation28
© 2012 IBM Corporation29
© 2012 IBM Corporation30
© 2012 IBM Corporation31
© 2012 IBM Corporation32
© 2012 IBM Corporation33
© 2012 IBM Corporation34
© 2012 IBM Corporation35
© 2012 IBM Corporation36
© 2012 IBM Corporation37
Source: coursera.org Linear Discriminant Analysis
© 2012 IBM Corporation38
© 2012 IBM Corporation39
Outlook
Theory: With BigData the machines are thinking for us
Reality: Existing algorithms are now beginning to be applied on a large scale basis
Presence: Every company thinks they have to urgently participate in BigData, but don't know how
Future: Every company will have access to BigData technologies and will use them
Hype: The whole world is doing BigData
Vision: BigData Analytics is usable for everybody at their fingertips
© 2012 IBM Corporation40
Questions?
© 2012 IBM Corporation41
Links
www.ibm.com/developerworks
www.ibm.com/ibm/university/academic
romeo.kienzler@ch.ibm.com
rkie@ch.ibm.com
U6K8qm_HFas
Jqq66INlQ0U

More Related Content

What's hot

Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Datainside-BigData.com
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Aerospike
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveJason Shih
 
High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance ComputingDivyen Patel
 
High Performance Computing - The Future is Here
High Performance Computing - The Future is HereHigh Performance Computing - The Future is Here
High Performance Computing - The Future is HereMartin Hamilton
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyDavid Lecomber
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...Pradeep Redddy Raamana
 
HPC on Azure for Reserach
HPC on Azure for ReserachHPC on Azure for Reserach
HPC on Azure for ReserachJürgen Ambrosi
 
Superior Cloud Economics with IBM Power Systems
Superior Cloud Economics with IBM Power SystemsSuperior Cloud Economics with IBM Power Systems
Superior Cloud Economics with IBM Power SystemsIBM Power Systems
 
Steve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept FinalSteve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept FinalSteven Totman
 
The IBM Data Engine for NoSQL on IBM Power Systems™
The IBM Data Engine for NoSQL on IBM Power Systems™The IBM Data Engine for NoSQL on IBM Power Systems™
The IBM Data Engine for NoSQL on IBM Power Systems™IBM Power Systems
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM Ganesan Narayanasamy
 
High Performance Computing in the Cloud?
High Performance Computing in the Cloud?High Performance Computing in the Cloud?
High Performance Computing in the Cloud?Ian Lumb
 
Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsorthuguk
 
High performance computing
High performance computingHigh performance computing
High performance computingMaher Alshammari
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Amazon Web Services
 

What's hot (20)

Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Data
 
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
Theresa Melvin, HP Enterprise - IOT/AI/ML at Hyperscale - how to go faster wi...
 
High performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspectiveHigh performance computing - building blocks, production & perspective
High performance computing - building blocks, production & perspective
 
High Performance Computing
High Performance ComputingHigh Performance Computing
High Performance Computing
 
High Performance Computing - The Future is Here
High Performance Computing - The Future is HereHigh Performance Computing - The Future is Here
High Performance Computing - The Future is Here
 
EC2 Foundations - Laura Thomson
EC2 Foundations - Laura ThomsonEC2 Foundations - Laura Thomson
EC2 Foundations - Laura Thomson
 
Optimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for Energy
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...
 
HPC on Azure for Reserach
HPC on Azure for ReserachHPC on Azure for Reserach
HPC on Azure for Reserach
 
Superior Cloud Economics with IBM Power Systems
Superior Cloud Economics with IBM Power SystemsSuperior Cloud Economics with IBM Power Systems
Superior Cloud Economics with IBM Power Systems
 
Steve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept FinalSteve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept Final
 
The IBM Data Engine for NoSQL on IBM Power Systems™
The IBM Data Engine for NoSQL on IBM Power Systems™The IBM Data Engine for NoSQL on IBM Power Systems™
The IBM Data Engine for NoSQL on IBM Power Systems™
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
 
High Performance Computing in the Cloud?
High Performance Computing in the Cloud?High Performance Computing in the Cloud?
High Performance Computing in the Cloud?
 
Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsort
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
 
Met Office.PDF
Met Office.PDFMet Office.PDF
Met Office.PDF
 

Viewers also liked

Development in the cloud for the cloud – Guest Lecture - University of Applie...
Development in the cloud for the cloud – Guest Lecture - University of Applie...Development in the cloud for the cloud – Guest Lecture - University of Applie...
Development in the cloud for the cloud – Guest Lecture - University of Applie...Romeo Kienzler
 
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichData Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichRomeo Kienzler
 
IBM Watson for Healthcare
IBM Watson for HealthcareIBM Watson for Healthcare
IBM Watson for HealthcareRomeo Kienzler
 
Architecturesfor massive parallel data base clustersproviding linear scale ou...
Architecturesfor massive parallel data base clustersproviding linear scale ou...Architecturesfor massive parallel data base clustersproviding linear scale ou...
Architecturesfor massive parallel data base clustersproviding linear scale ou...Romeo Kienzler
 
Apache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningApache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningRomeo Kienzler
 
Real-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataReal-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataRomeo Kienzler
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Romeo Kienzler
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 

Viewers also liked (8)

Development in the cloud for the cloud – Guest Lecture - University of Applie...
Development in the cloud for the cloud – Guest Lecture - University of Applie...Development in the cloud for the cloud – Guest Lecture - University of Applie...
Development in the cloud for the cloud – Guest Lecture - University of Applie...
 
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center ZurichData Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
 
IBM Watson for Healthcare
IBM Watson for HealthcareIBM Watson for Healthcare
IBM Watson for Healthcare
 
Architecturesfor massive parallel data base clustersproviding linear scale ou...
Architecturesfor massive parallel data base clustersproviding linear scale ou...Architecturesfor massive parallel data base clustersproviding linear scale ou...
Architecturesfor massive parallel data base clustersproviding linear scale ou...
 
Apache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningApache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine Learning
 
Real-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataReal-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor Data
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 

Similar to Information Retrieval, Applied Statistics and Mathematics onBigData - German Society of Physicists Conference 13

Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...
Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...
Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...Romeo Kienzler
 
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...Romeo Kienzler
 
xTech2006_DB2onRails
xTech2006_DB2onRailsxTech2006_DB2onRails
xTech2006_DB2onRailswebuploader
 
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTXDecision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTXSanjayKPrasad2
 
Introduction to Big Data by Manouj Bongirr
Introduction to Big Data by Manouj BongirrIntroduction to Big Data by Manouj Bongirr
Introduction to Big Data by Manouj BongirrPranav Kulkarni
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIBM Switzerland
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DanceEMC
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Smarter planet and mega trends presentation 2012
Smarter planet and mega trends presentation 2012Smarter planet and mega trends presentation 2012
Smarter planet and mega trends presentation 2012Joergen Floes
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015Daniela Zuppini
 
Disaggregated Hadoop Stacks
Disaggregated Hadoop StacksDisaggregated Hadoop Stacks
Disaggregated Hadoop StacksDataWorks Summit
 
Future of Power: IBM Trends & Directions - Erik Rex
Future of Power: IBM Trends & Directions - Erik RexFuture of Power: IBM Trends & Directions - Erik Rex
Future of Power: IBM Trends & Directions - Erik RexIBM Danmark
 
A27 Vectorwise Performance Considerations_implementation_best_practices
A27 Vectorwise Performance Considerations_implementation_best_practicesA27 Vectorwise Performance Considerations_implementation_best_practices
A27 Vectorwise Performance Considerations_implementation_best_practicesInsight Technology, Inc.
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deckKeithETD_CTO
 

Similar to Information Retrieval, Applied Statistics and Mathematics onBigData - German Society of Physicists Conference 13 (20)

Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...
Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...
Software and Hardware Infrastructures to conquer Data Explosion in Life Scien...
 
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
 
xTech2006_DB2onRails
xTech2006_DB2onRailsxTech2006_DB2onRails
xTech2006_DB2onRails
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTXDecision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
Decision Optimization - CPLEX Optimization Studio - Product Overview(2).PPTX
 
Greenplum feature
Greenplum featureGreenplum feature
Greenplum feature
 
Introduction to Big Data by Manouj Bongirr
Introduction to Big Data by Manouj BongirrIntroduction to Big Data by Manouj Bongirr
Introduction to Big Data by Manouj Bongirr
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bk
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant Dance
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Smarter planet and mega trends presentation 2012
Smarter planet and mega trends presentation 2012Smarter planet and mega trends presentation 2012
Smarter planet and mega trends presentation 2012
 
IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015IBMHadoopofferingTechline-Systems2015
IBMHadoopofferingTechline-Systems2015
 
Disaggregated Hadoop Stacks
Disaggregated Hadoop StacksDisaggregated Hadoop Stacks
Disaggregated Hadoop Stacks
 
Future of Power: IBM Trends & Directions - Erik Rex
Future of Power: IBM Trends & Directions - Erik RexFuture of Power: IBM Trends & Directions - Erik Rex
Future of Power: IBM Trends & Directions - Erik Rex
 
Expect More from Hadoop
Expect More from Hadoop Expect More from Hadoop
Expect More from Hadoop
 
A27 Vectorwise Performance Considerations_implementation_best_practices
A27 Vectorwise Performance Considerations_implementation_best_practicesA27 Vectorwise Performance Considerations_implementation_best_practices
A27 Vectorwise Performance Considerations_implementation_best_practices
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 

More from Romeo Kienzler

Parallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingParallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingRomeo Kienzler
 
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkCognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkRomeo Kienzler
 
Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Romeo Kienzler
 
Blockchain Technology Book Vernisage
Blockchain Technology Book VernisageBlockchain Technology Book Vernisage
Blockchain Technology Book VernisageRomeo Kienzler
 
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Romeo Kienzler
 
IBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarIBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarRomeo Kienzler
 
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Romeo Kienzler
 
DeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTDeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTRomeo Kienzler
 
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceScala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceRomeo Kienzler
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...Romeo Kienzler
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASRomeo Kienzler
 
Cloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamCloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamRomeo Kienzler
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...Romeo Kienzler
 
DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14Romeo Kienzler
 
Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Romeo Kienzler
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursRomeo Kienzler
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursRomeo Kienzler
 
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...Romeo Kienzler
 
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...Romeo Kienzler
 

More from Romeo Kienzler (20)

Parallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingParallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network Training
 
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkCognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
 
Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...
 
Blockchain Technology Book Vernisage
Blockchain Technology Book VernisageBlockchain Technology Book Vernisage
Blockchain Technology Book Vernisage
 
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
 
IBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarIBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, Qatar
 
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
 
DeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTDeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoT
 
Geo Python16 keynote
Geo Python16 keynoteGeo Python16 keynote
Geo Python16 keynote
 
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceScala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAAS
 
Cloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamCloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa Neddam
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14
 
Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
 
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ET...
 
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...
BlueMix – IBM CIO Leadership Exchange Europe 27/28.5.14 - Berlin - Romeo Kien...
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Information Retrieval, Applied Statistics and Mathematics onBigData - German Society of Physicists Conference 13

  • 1. © 2012 IBM Corporation1 Information Retrieval, Applied Statistics and Mathematics on BigData Romeo Kienzler Data Scientist and Architect IBM Innovation Center Zurich
  • 2. © 2012 IBM Corporation2 Fault Tolerance / Commodity Hardware AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM, 3TB SEAGATE Barracuda 7200.14 < 500 EURO 100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD MTBF ~ 365 d > 1,5 d
  • 3. © 2012 IBM Corporation3 Supercomputer in a Rack Supercomputer before ➔ Weather ➔ Atom Bombs ➔ Science ➔ Crash Tests Supercomputer in a Rack ➔ 18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st TOP500 2013: 17590 TFLOPS 2004: 71 TFLOPS)
  • 4. © 2012 IBM Corporation4 Hadoop / BigInsights
  • 5. © 2012 IBM Corporation5 Hadoop Distributed File System
  • 6. © 2012 IBM Corporation6 Hadoop Job Scheduling
  • 7. © 2012 IBM Corporation7 Aggregated Bandwith between CPU, Main Memory and Hard Drive 1 TB (at 10 GByte/s) - 1 Node - 100 sec - 10 Nodes - 10 sec - 100 Nodes - 1 sec - 1000 Nodes - 100 msec
  • 8. © 2012 IBM Corporation8 Watson 1 TB (at 45.5 GByte/s) - 1 Core - 22 sec - 10 Core - 2.2 sec - 100 Core - 220 msec - 1000 Core - 22 msec - 10000 Core - 2.2 msec
  • 9. © 2012 IBM Corporation9 Data Streaming X86 Box X86 Blade Cell Blade X86 BladeFPGA Blade X86 Blade X86 Blade X86 Blade X86 BladeX86 Blade Operating System Transport System S Data Fabric Processing Element Container Processing Element Container Processing Element Container Processing Element Container Processing Element Container
  • 10. © 2012 IBM Corporation10 Massive Parallel DataWarehousing
  • 11. © 2012 IBM Corporation11 Why do we need to process so much data?
  • 12. © 2012 IBM Corporation12 12 Data Growth Data AVAILABLE to an organization data an organization can PROCESS Missed opportunity 100 Million Tweets are posted every day, 35 hours of video are being uploaded every minute,6.1 x 10^12 text messages have been sent in 2011 and 247 x 10^9 E-Mails passed through the net.80 % spam and viruses. => Filtering is more and more important. Up to 2003 the same amount of data has been produced as between 2003 and now
  • 13. © 2012 IBM Corporation13 Separate the Signal From the Noise¹ ¹http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/
  • 14. © 2012 IBM Corporation14 The Unreasonable Effectiveness of Data¹ "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc. ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
  • 15. © 2012 IBM Corporation15 Statistical Modeling of Physical Systems
  • 16. © 2012 IBM Corporation16 From Unstructured Data to Structured Data - Feature Extraction Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately¹ ¹: Wikipedia
  • 17. © 2012 IBM Corporation17 Dimension Reduction Principal Component Analysis / Singular Value Decomposition Linear Discriminant Analysis Source: coursera.org
  • 18. © 2012 IBM Corporation18 Data Parallelism
  • 19. © 2012 IBM Corporation19 Data Parallelism Calculate the empirical mean along each dimension m = 1, ..,M (step in Principal Component Analysis) N-gram Models (NLP) Ordinary Least-Square Parameter Estimator for Linear Regression
  • 20. © 2012 IBM Corporation20 BUT: Do I want to care about algorithm parallelization?
  • 21. © 2012 IBM Corporation21 High-Level Languages Source: Hadoopsphere.com
  • 22. © 2012 IBM Corporation22 High-Level Languages (IBM SystemML) Extensible Library Linear SVMs, Logistic Reg K-means Classification Linear Regression Regression SGD solver, NMF Matrix Factorizations Clustering PageRank, HITS Ranking Parser High-Level Ops Low-Level Ops Runtime Ops Optimizations Hadoop DML Scripts Open Source Variant: Apache Mahout - less algorithms - no optimizer
  • 23. © 2012 IBM Corporation23 High-Level Languages (RHadoop) Source: http://www.revolutionanalytics.com
  • 24. © 2012 IBM Corporation24 High-Level Languages (R on IBM PureData) Source: http://www.revolutionanalytics.com
  • 25. © 2012 IBM Corporation25 Push Back Application Algorithm Compile Engine Execution Language Engine
  • 26. © 2012 IBM Corporation26 Push Back
  • 27. © 2012 IBM Corporation27 Push Back
  • 28. © 2012 IBM Corporation28
  • 29. © 2012 IBM Corporation29
  • 30. © 2012 IBM Corporation30
  • 31. © 2012 IBM Corporation31
  • 32. © 2012 IBM Corporation32
  • 33. © 2012 IBM Corporation33
  • 34. © 2012 IBM Corporation34
  • 35. © 2012 IBM Corporation35
  • 36. © 2012 IBM Corporation36
  • 37. © 2012 IBM Corporation37 Source: coursera.org Linear Discriminant Analysis
  • 38. © 2012 IBM Corporation38
  • 39. © 2012 IBM Corporation39 Outlook Theory: With BigData the machines are thinking for us Reality: Existing algorithms are now beginning to be applied on a large scale basis Presence: Every company thinks they have to urgently participate in BigData, but don't know how Future: Every company will have access to BigData technologies and will use them Hype: The whole world is doing BigData Vision: BigData Analytics is usable for everybody at their fingertips
  • 40. © 2012 IBM Corporation40 Questions?
  • 41. © 2012 IBM Corporation41 Links www.ibm.com/developerworks www.ibm.com/ibm/university/academic romeo.kienzler@ch.ibm.com rkie@ch.ibm.com U6K8qm_HFas Jqq66INlQ0U