Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists Conference 13

© 2012 IBM Corporation
Information Retrieval, Applied Statistics and Mathematics
on BigData
Romeo Kienzler
Data Scientist and Architect
IBM Innovation Center Zurich

Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1.5 GHz / 2 MB / 15 W), 8 GB RAM,
3 TB Seagate Barracuda 7200.14
< 500 EUR
100 K EUR => 200 x (2, 4, 3) => 400 cores, 1.6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1.5 d
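
The MTBF line can be read as a back-of-the-envelope estimate: if a single node fails about once a year, a cluster of many such nodes sees a failure every day or two. A minimal sketch, assuming independent nodes with a constant failure rate (an assumption not stated on the slide) and the 200 nodes from the line above; `cluster_mtbf_days` is a hypothetical helper name:

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    failure rates simply add up and the MTBF divides by the node count.
    """
    return node_mtbf_days / nodes


# 200 commodity nodes, each with an MTBF of roughly one year
print(round(cluster_mtbf_days(365, 200), 1))  # 1.8
```

Under these assumptions the result is about 1.8 days, the same order of magnitude as the ~1.5 d quoted on the slide. Either way, fault tolerance has to be handled in software.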

Supercomputer in a Rack
Supercomputer before
- Weather
- Atom Bombs
- Science
- Crash Tests
Supercomputer in a Rack
- 18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st TOP500 2013: 17590 TFLOPS; 2004: 71 TFLOPS)

Hadoop / BigInsights

Hadoop Distributed File System

Hadoop Job Scheduling

Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
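
The figures above follow from dividing the data volume by the aggregated bandwidth. A minimal sketch, assuming each node contributes 10 GByte/s and that throughput scales perfectly linearly with the node count (an idealization); `scan_seconds` is a hypothetical helper name:

```python
TB = 10**12  # bytes
GB = 10**9   # bytes


def scan_seconds(data_bytes: float, bytes_per_second: float, workers: int) -> float:
    """Time to scan the data once if all workers read their share in parallel."""
    return data_bytes / (bytes_per_second * workers)


for nodes in (1, 10, 100, 1000):
    print(nodes, "nodes:", scan_seconds(1 * TB, 10 * GB, nodes), "s")
# 1 nodes: 100.0 s
# 10 nodes: 10.0 s
# 100 nodes: 1.0 s
# 1000 nodes: 0.1 s
```

Real clusters fall short of this ideal because of scheduling overhead, data skew and network transfers.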

Watson
1 TB (at 45.5 GByte/s)
- 1 Core - 22 sec
- 10 Cores - 2.2 sec
- 100 Cores - 220 msec
- 1000 Cores - 22 msec
- 10000 Cores - 2.2 msec

Data Streaming
[Figure: System S stack. Processing Element Containers run on the System S Data Fabric, above the Transport and Operating System layers, across heterogeneous hardware: x86 boxes, x86 blades, Cell blades and FPGA blades.]

Massive Parallel DataWarehousing

Why do we need to process so much data?

Data Growth
[Figure: data AVAILABLE to an organization vs. data an organization can PROCESS; the gap between the two is the missed opportunity]
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is becoming more and more important.
As much data has been produced between 2003 and now as in all the time up to 2003.

Separate the Signal From the Noise¹
¹http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/

The Unreasonable Effectiveness of Data¹
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

Statistical Modeling of Physical Systems

From Unstructured Data to Structured Data -
Feature Extraction
Feature extraction involves simplifying the amount of resources
required to describe a large set of data accurately¹
¹: Wikipedia
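
As an illustration of the step from unstructured to structured data, the sketch below turns free text into a fixed-length vector of term counts (a bag-of-words representation). This is one simple feature extractor chosen for illustration, not necessarily the one used in the talk; `extract_features` and the vocabulary are hypothetical:

```python
import re
from collections import Counter


def extract_features(text: str, vocabulary: list[str]) -> list[int]:
    """Map free text to one count per vocabulary term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vocab = ["data", "signal", "noise"]
print(extract_features("Big data: more data, less noise.", vocab))  # [2, 0, 1]
```

Every document, whatever its length, is reduced to a vector with as many entries as the vocabulary has terms, which statistical models can then consume.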

Dimension Reduction
Principal Component Analysis / Singular Value Decomposition
Linear Discriminant Analysis
Source: coursera.org
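
For reference, the standard relation between Principal Component Analysis and the Singular Value Decomposition can be written as follows. The notation (N samples, M dimensions, data matrix X, k retained components) is an assumption chosen here, not taken from the slides:

```latex
% X \in \mathbb{R}^{N \times M}: N samples, M dimensions
\mu_m = \frac{1}{N} \sum_{n=1}^{N} x_{nm}, \qquad m = 1, \dots, M
\\
B = X - \mathbf{1}\,\mu^{\top}
\quad \text{(centered data)}
\\
C = \frac{1}{N-1}\, B^{\top} B
\quad \text{(empirical covariance)}
\\
B = U \Sigma V^{\top}
\;\Longrightarrow\;
C = V \,\frac{\Sigma^{2}}{N-1}\, V^{\top}
\\
Y = B\, V_k
\quad \text{(projection onto the first } k \text{ principal components)}
```

The columns of V are the principal directions, and the squared singular values divided by N - 1 are the variances along them.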

Data Parallelism

Data Parallelism
Calculate the empirical mean along each dimension m = 1, ..., M (a step in Principal Component Analysis)
N-gram Models (NLP)
Ordinary Least-Squares Parameter Estimator for Linear Regression
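
These examples are data-parallel because each one reduces to sums that can be computed per partition and then added up. A minimal sketch for the empirical mean and a one-variable least-squares fit, assuming the data is already split into partitions; the function names and toy data are hypothetical, and a real job would run the map step on separate nodes:

```python
from functools import reduce


def map_sums(partition):
    """Per-partition sufficient statistics for rows (x, y)."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)


def reduce_sums(a, b):
    """Combine two partial results by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def finish(stats):
    """Turn the global sums into means and least-squares parameters."""
    n, sx, sy, sxx, sxy = stats
    mean_x, mean_y = sx / n, sy / n
    slope = (sxy - n * mean_x * mean_y) / (sxx - n * mean_x ** 2)
    intercept = mean_y - slope * mean_x
    return (mean_x, mean_y), slope, intercept


# Two partitions of points lying on y = 2x + 1
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
means, slope, intercept = finish(reduce(reduce_sums, map(map_sums, partitions)))
print(means, slope, intercept)  # (1.5, 4.0) 2.0 1.0
```

N-gram counts follow the same pattern: count per partition, then add the counters.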

BUT: Do I want to care about algorithm
parallelization?

High-Level Languages
Source: Hadoopsphere.com

High-Level Languages (IBM SystemML)
Extensible Library
- Classification: Linear SVMs, Logistic Reg
- Regression: Linear Regression
- Matrix Factorizations: SGD solver, NMF
- Clustering: K-means
- Ranking: PageRank, HITS
Components: DML Scripts, Parser, High-Level Ops, Low-Level Ops, Runtime Ops, Optimizations, Hadoop
Open Source Variant: Apache Mahout
- fewer algorithms
- no optimizer
High-Level Languages (RHadoop)
Source: http://www.revolutionanalytics.com

High-Level Languages (R on IBM PureData)
Source: http://www.revolutionanalytics.com

Push Back
Application Algorithm Compile Engine Execution Language Engine

Push Back

Source: coursera.org Linear Discriminant Analysis

Outlook
Theory: With BigData the machines are thinking for us
Reality: Existing algorithms are now beginning to be applied on a large scale
Present: Every company thinks it urgently has to participate in BigData, but doesn't know how
Future: Every company will have access to BigData technologies and will use them
Hype: The whole world is doing BigData
Vision: BigData Analytics is at everybody's fingertips

Questions?

Links
www.ibm.com/developerworks
www.ibm.com/ibm/university/academic
romeo.kienzler@ch.ibm.com
rkie@ch.ibm.com
U6K8qm_HFas
Jqq66INlQ0U
