Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists Conference 13
- 1. © 2012 IBM Corporation1
Information Retrieval, Applied Statistics and Mathematics
on BigData
Romeo Kienzler
Data Scientist and Architect
IBM Innovation Center Zurich
- 2.
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2 x 1.5 GHz / 2 MB / 15 W), 8 GB RAM,
3 TB Seagate Barracuda 7200.14
< 500 EUR
100 K EUR => 200 x (2, 4, 3) => 400 cores, 1.6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1.5 d
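A rough sketch of the arithmetic behind the MTBF line, under assumptions the slide does not state: components fail independently at a constant rate, so the expected time to the first failure among N identical components is the single-component MTBF divided by N. The function name `cluster_mtbf_days` is hypothetical. The slide does not say which components it counts, so this only illustrates the order of magnitude (days instead of a year), not the exact figure.

```python
def cluster_mtbf_days(component_mtbf_days: float, components: int) -> float:
    """Expected days until the first failure among identical components.

    Assumption (not from the slides): independent failures at a constant
    rate, so the failure rates of all components simply add up.
    """
    if components < 1:
        raise ValueError("components must be >= 1")
    return component_mtbf_days / components


if __name__ == "__main__":
    # 365 d per node is the slide's figure; the node counts are illustrative.
    for nodes in (1, 10, 200):
        days = cluster_mtbf_days(365, nodes)
        print(f"{nodes:>4} nodes: first failure expected after ~{days:.2f} days")
```

With a few hundred commodity nodes, some component fails every day or two, which is why fault tolerance has to be handled in software.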
- 3.
Supercomputer in a Rack
Supercomputer before
➔ Weather
➔ Atom Bombs
➔ Science
➔ Crash Tests
Supercomputer in a Rack
➔ 18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st in TOP500, 2013: 17590 TFLOPS; 2004: 71 TFLOPS)
- 5.
Hadoop Distributed File System
- 7.
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
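The scan times above follow from dividing the data volume by the aggregated bandwidth. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 GByte = 10^9 bytes) and perfectly linear scaling with no coordination overhead; the function name `scan_seconds` is hypothetical.

```python
TB = 10**12  # bytes, decimal units (assumption)
GB = 10**9   # bytes, decimal units (assumption)


def scan_seconds(data_bytes: float, bytes_per_sec_per_worker: float, workers: int) -> float:
    """Time to read data_bytes once when every worker scans its share in parallel."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return data_bytes / (bytes_per_sec_per_worker * workers)


if __name__ == "__main__":
    # 1 TB at 10 GByte/s per node, as on the slide.
    for nodes in (1, 10, 100, 1000):
        print(f"{nodes:>5} nodes: {scan_seconds(1 * TB, 10 * GB, nodes):g} s")
```

The same function with 45.5 GByte/s per worker gives roughly 22 s for a single worker, matching the first row of the Watson slide.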
- 8.
Watson
1 TB (at 45.5 GByte/s)
- 1 Core - 22 sec
- 10 Cores - 2.2 sec
- 100 Cores - 220 msec
- 1000 Cores - 22 msec
- 10000 Cores - 2.2 msec
- 9.
Data Streaming
[Diagram: System S architecture. Layers shown: hardware (x86 boxes, x86 blades, Cell blades, FPGA blades), Operating System, Transport, System S Data Fabric, Processing Element Containers]
- 10.
Massively Parallel Data Warehousing
- 11.
Why do we need to process so much data?
- 12.
Data Growth
Data AVAILABLE to an organization
Data an organization can PROCESS
Missed opportunity
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is becoming more and more important.
As much data was produced between 2003 and now as in all the time up to 2003.
- 13.
Separate the Signal From the Noise¹
¹http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/
- 14.
The Unreasonable Effectiveness of Data¹
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
- 15.
Statistical Modeling of Physical Systems
- 16.
From Unstructured Data to Structured Data - Feature Extraction
Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately¹
¹: Wikipedia
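As one possible illustration of turning unstructured data into structured data, the sketch below maps free text to fixed-length term-count vectors (a bag-of-words model). The slide does not name a specific method; bag-of-words is chosen here only as a simple example, and the function names `bag_of_words` and `to_vectors` are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count runs of the letters a-z as tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vectors(docs):
    """Return (vocabulary, rows): one count vector per document.

    Every row has one column per vocabulary term, so documents of any
    length become rows of a regular table.
    """
    counts = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = to_vectors(["Big data is big", "Data about data"])
    print(vocab)
    for row in rows:
        print(row)
```

The resulting table is what downstream steps such as dimension reduction or regression operate on.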
- 17.
Dimension Reduction
Principal Component Analysis / Singular Value Decomposition
Linear Discriminant Analysis
Source: coursera.org
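A minimal sketch of Principal Component Analysis on a small in-memory data set. It finds only the first principal component and uses power iteration on the sample covariance matrix instead of a full Singular Value Decomposition; that substitution is made here to keep the example dependency-free and is not taken from the slides. The function name `first_principal_component` is hypothetical, and the data set needs at least two rows.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (mean, direction) of the leading principal component.

    Caveat: power iteration does not converge if the start vector
    happens to be orthogonal to the leading eigenvector.
    """
    n = len(rows)
    dims = len(rows[0])
    # Step 1: empirical mean along each dimension.
    mean = [sum(r[j] for r in rows) / n for j in range(dims)]
    # Step 2: center the data.
    centered = [[r[j] - mean[j] for j in range(dims)] for r in rows]
    # Step 3: sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Step 4: power iteration, starting from a unit vector.
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all points identical: no direction of variance
            break
        v = [x / norm for x in w]
    return mean, v


if __name__ == "__main__":
    # Points on the line y = 2x: all variance lies along direction (1, 2).
    mean, direction = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
    print(mean, direction)
```

Projecting each centered row onto the returned direction reduces the two-dimensional data to a single coordinate.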
- 19.
Data Parallelism
Calculate the empirical mean along each dimension m = 1, ..., M (step in Principal Component Analysis)
N-gram Models (NLP)
Ordinary Least-Squares Parameter Estimator for Linear Regression
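All three examples are data parallel for the same reason: each partition can be reduced to a small summary, and the summaries can be added in any order. The sketch below illustrates this in plain Python, with lists standing in for partitions on different nodes. All function names are hypothetical, the regression is restricted to a single feature (y = a + b*x) to keep it short, and partitions are assumed to be non-empty.

```python
from collections import Counter
from functools import reduce


def partial_mean_stats(rows):
    """Map: per-partition column sums and row count."""
    dims = len(rows[0])
    return [sum(r[j] for r in rows) for j in range(dims)], len(rows)


def combine_mean_stats(a, b):
    """Reduce: add sums and counts (associative and commutative)."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def empirical_mean(partitions):
    sums, n = reduce(combine_mean_stats, map(partial_mean_stats, partitions))
    return [s / n for s in sums]


def bigram_counts(partitions):
    """Count 2-grams per partition, then merge the counters.

    Assumption: partitions are split at document boundaries, so no
    n-gram spans two partitions.
    """
    partial = (Counter(zip(tokens, tokens[1:])) for tokens in partitions)
    return sum(partial, Counter())


def ols_stats(points):
    """Map: sufficient statistics (n, sum x, sum y, sum x^2, sum x*y)."""
    return (len(points),
            sum(x for x, _ in points),
            sum(y for _, y in points),
            sum(x * x for x, _ in points),
            sum(x * y for x, y in points))


def ols_fit(partitions):
    """Reduce: add the statistics, then solve the normal equations."""
    n, sx, sy, sxx, sxy = (sum(t) for t in zip(*map(ols_stats, partitions)))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


if __name__ == "__main__":
    print(empirical_mean([[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0]]]))
    print(bigram_counts([["big", "data"], ["big", "data", "rocks"]]))
    print(ols_fit([[(0, 1), (1, 3)], [(2, 5), (3, 7)]]))
```

Only the small per-partition summaries cross the network; the raw rows stay where they are stored.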
- 20.
BUT: Do I want to care about algorithm parallelization?
- 21.
High-Level Languages
Source: Hadoopsphere.com
- 22.
High-Level Languages (IBM SystemML)
Extensible Library
- Classification: Linear SVMs, Logistic Regression
- Clustering: K-means
- Regression: Linear Regression
- Matrix Factorizations: SGD solver, NMF
- Ranking: PageRank, HITS
Compilation stack (diagram labels): DML Scripts, Parser, High-Level Ops, Low-Level Ops, Runtime Ops, Optimizations, Hadoop
Open Source Variant: Apache Mahout
- fewer algorithms
- no optimizer
- 23.
High-Level Languages (RHadoop)
Source: http://www.revolutionanalytics.com
- 24.
High-Level Languages (R on IBM PureData)
Source: http://www.revolutionanalytics.com
- 25.
Push Back
[Diagram; extracted labels: Application Algorithm Compile Engine Execution Language Engine]
- 37.
Linear Discriminant Analysis
Source: coursera.org
- 39.
Outlook
Theory: With BigData the machines are thinking for us
Reality: Existing algorithms are now beginning to be applied on a large scale
Present: Every company thinks it urgently has to participate in BigData, but does not know how
Future: Every company will have access to BigData technologies and will use them
Hype: The whole world is doing BigData
Vision: BigData Analytics is usable for everybody at their fingertips
- 41.
Links
www.ibm.com/developerworks
www.ibm.com/ibm/university/academic
romeo.kienzler@ch.ibm.com
rkie@ch.ibm.com
U6K8qm_HFas
Jqq66INlQ0U