© 2013 IBM Corporation1
The Data Scientist's Workplace of the Future - Workshop
SwissRE, 11.6.14
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(A joint-venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
© 2013 IBM Corporation2
The Data Scientist's Workplace of the Future -
* * C R E D I T S * *
Romeo Kienzler
IBM Innovation Center
● Parts of these slides have been copied from and/or revised by
– Dr. Anand Ranganathan, IBM Watson Research Lab
– Dr. Stefan Mück, IBM BigData Leader Europe
– Dr. Berthold Rheinwald, IBM Almaden Research Lab
– Dr. Diego Kuonen, Statoo Consulting
– Dr. Abdel Labbi, IBM Zurich Research Lab
– Brandon MacKenzie, IBM Software Group
© 2013 IBM Corporation3
What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0
© 2013 IBM Corporation4
What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0
© 2013 IBM Corporation5
DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
– SQL (42%)
– R (33%)
– Python (26%)
– Excel (25%)
– Java, Ruby, C++ (17%)
– SPSS, SAS (9%)
● Limitations (Single Node usage)
– Main Memory
– CPU <> Main Memory Bandwidth
– CPU
– Storage <> Main Memory Bandwidth (either Single node or SAN)
© 2013 IBM Corporation6
DataScience at present - Demo
● Assume a 1 TB file on the hard drive
● Split into 16 chunks (x00 … x15)
  split -d -n 16 output.json
● Distribute across 4 nodes (round robin)
  for i in $(seq 0 15); do scp x$(printf '%02d' $i) id@node$((i % 4 + 1)):~/; done
● Perform the calculation in parallel
  for n in 1 2 3 4; do
    ssh id@node$n 'cat x* | awk -F":" "{print \$6}" | grep -i samsung | grep breathtaking | wc -l'
  done > result
● Merge the result
  awk '{s += $1} END {print s}' result
(An equivalent single-machine R sketch follows below.)
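A minimal single-machine sketch of the same split/compute/merge pattern in R, assuming the sixteen chunk files x00…x15 from the split step are readable locally (the file names and the 4-core setting are illustrative, not part of the original demo):

library(parallel)

# Hypothetical chunk files produced above by `split -d -n 16 output.json`
chunks <- sprintf("x%02d", 0:15)

count_matches <- function(path) {
  lines  <- readLines(path, warn = FALSE)
  field6 <- vapply(strsplit(lines, ":", fixed = TRUE),
                   function(x) if (length(x) >= 6) x[6] else "",
                   FUN.VALUE = character(1))
  sum(grepl("samsung", field6, ignore.case = TRUE) &
        grepl("breathtaking", field6))
}

# One worker per chunk (mclapply forks; on Windows use parLapply instead),
# then merge the partial counts -- the same pattern as the ssh loop above.
partial <- mclapply(chunks, count_matches, mc.cores = 4)
Reduce(`+`, partial)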
Source: http://sergeytihon.wordpress.com/2013/03/20/the-data-science-venn-diagram/
© 2013 IBM Corporation7
What is BIG data?
© 2013 IBM Corporation8
What is BIG data?
© 2013 IBM Corporation9
What is BIG data?
Big Data
Hadoop
© 2013 IBM Corporation10
What is BIG data?
Business Intelligence
Data Warehouse
© 2013 IBM Corporation11
BigData == Hadoop?
Hadoop BigData
Hadoop
© 2013 IBM Corporation12
What is beyond “Data Warehouse”?
Data Lake
Data Warehouse
© 2013 IBM Corporation13
First “BigData” UseCase ?
● Google Index
– 40 X 10^9 = 40.000.000.000 => 40 billion pages indexed
– Will break 100 PB barrier soon
– Derived from MapReduce
– now "caffeine" based on "percolator"
• Incremental vs. batch
• In-Memory vs. disk
© 2013 IBM Corporation14
Map-Reduce → Hadoop → BigInsights
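The slide itself is a diagram; as a reminder of the programming model it refers to, here is a minimal word-count sketch of the map/reduce pattern in plain base R (illustrative only — a real Hadoop job distributes the same pattern over HDFS blocks, and the sample documents are made up):

# Word count with the map/reduce pattern in base R (illustration only).
docs <- c("big data on hadoop", "data science on hadoop", "big big data")

# Map phase: each document emits (word, 1) pairs.
mapped <- Map(function(doc) {
  words <- strsplit(doc, " ")[[1]]
  setNames(rep(1, length(words)), words)
}, docs)

# Shuffle + reduce phase: sum the emitted counts per word.
reduce_pairs <- function(acc, kv) {
  for (w in names(kv)) {
    acc[w] <- if (w %in% names(acc)) acc[w] + kv[[w]] else kv[[w]]
  }
  acc
}
counts <- Reduce(reduce_pairs, mapped, init = numeric(0))
counts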
© 2013 IBM Corporation15
BigData UseCases
● CERN LHC
– 25 petabytes per year
● Facebook
– Hive data warehouse
– 300 PB, growing 600 TB / day
– > 100 k servers
● Genomics
● Enterprises
– Data center analytics (logfiles, OS/NW monitors, ...)
– Predictive maintenance, cybersecurity
– Social media analytics
– DWH offload
– Call Detail Record (CDR) data preservation
http://www.balthasar-glaettli.ch/vorratsdaten/
© 2013 IBM Corporation16
Why is Big Data important?
© 2013 IBM Corporation17
BigData Analytics
Source: http://www.strategy-at-risk.com/2008/01/01/what-we-do/
© 2013 IBM Corporation18
BigData Analytics – Predictive Analytics
"sometimes it's not
who has the best
algorithm that wins;
it's who has the most
data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore
© 2013 IBM Corporation19
We need Data Parallelism
© 2013 IBM Corporation20
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
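The arithmetic behind these figures, as a quick check in R (the 10 GB/s aggregate per-node bandwidth is taken from the slide):

# Time to scan 1 TB when every node streams ~10 GB/s from its local disks.
size_gb <- 1000                      # 1 TB in GB
bw_gb_s <- 10                        # aggregate per-node bandwidth (GB/s)
nodes   <- c(1, 10, 100, 1000)
data.frame(nodes, seconds = size_gb / (bw_gb_s * nodes))
# 1 node: 100 s, 10 nodes: 10 s, 100 nodes: 1 s, 1000 nodes: 0.1 s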
© 2013 IBM Corporation21
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
● CHF 100 K => 200 X (2, 4, 3) => 400 Cores, 1.6 TB RAM, 200 TB HD
● MTBF per node ~ 365 d => cluster MTBF ~ 1.5 d
Source: http://www.cloudcomputingpatterns.org/Watchdog
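A rough back-of-the-envelope check in R, assuming independent node failures with exponential lifetimes (the slide's ~1.5 d figure presumably assumes a slightly larger cluster or a stricter failure model):

# Expected time until the *first* node failure in a cluster of n nodes
# is roughly MTBF_node / n when failures are independent.
mtbf_node_days <- 365
nodes          <- 200
mtbf_node_days / nodes     # ~1.8 days until some node in the cluster fails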
© 2013 IBM Corporation22
NoSQL Databases
● Column Store
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB
● JSON / Document Store
– MongoDB
– CouchDB
● Key / Value Store
– Amazon DynamoDB
– Voldemort
● Graph DBs
– DB2 SPARQL Extension
– Neo4J
● MP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum
● http://nosql-database.org/ lists > 150 systems
© 2013 IBM Corporation23
CAP Theorem / Brewer's Theorem¹
● It is impossible for a distributed computer system to simultaneously guarantee all 3 properties:
– Consistency (all nodes see the same data at the same time)
– Availability (every request receives a response indicating whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)
● What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability
● BASE, the new ACID
– Basically Available
– Soft state
– Eventual consistency
• Monotonic Read Consistency
• Monotonic Write Consistency
• Read Your Own Writes
© 2013 IBM Corporation24
What role is the cloud playing here?
© 2013 IBM Corporation25
“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
© 2013 IBM Corporation26
“Elastic” Scale-Out
of
© 2013 IBM Corporation27
“Elastic” Scale-Out
of
CPU Cores
© 2013 IBM Corporation28
“Elastic” Scale-Out
of
CPU Cores Storage
© 2013 IBM Corporation29
“Elastic” Scale-Out
of
CPU Cores Storage Memory
© 2013 IBM Corporation30
“Elastic” Scale-Out
linear
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
© 2013 IBM Corporation31
How do Databases Scale-Out?
Shared Disk Architectures
© 2013 IBM Corporation32
How do Databases Scale-Out?
Shared Nothing Architectures
© 2013 IBM Corporation33
Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?
© 2013 IBM Corporation34
Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop
© 2013 IBM Corporation35
Large Scale Data Ingestion
● Traditionally
– Crawl to local file system (e.g. wget http://www.heise.de/newsticker/)
– Export RDBMS data to CSV (local file system)
– Batched FTP server uploads
– Then: copy to HDFS
● BigInsights
– Use one of the built-in importers
– Imports directly into HDFS
– Use Eclipse tooling to deploy custom importers easily
© 2013 IBM Corporation36
Large Scale Data Ingestion (ETL on M/R)
● Modern ETL (Extract, Transform, Load) tools support Hadoop as
– Source, Sink (HDFS)
– Engine (MapReduce)
● Example: InfoSphere DataStage
© 2013 IBM Corporation37
Real-Time/ In-Memory Data Ingestion
● If volume can be reduced dramatically during the first processing steps
– Feature extraction of
• Video
• Audio
• Semistructured text (e.g. logfiles)
• Structured text
– Filtering
– Compression
● Recommendation: use a streaming engine (see the sketch after this list)
– IBM InfoSphere Streams
– Twitter Storm (now Apache Incubator)
– Apache Spark Streaming
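A toy illustration of this volume-reduction step in R rather than in a streaming engine (all names and numbers are invented): one summary row per one-second window replaces roughly a million raw events before anything is persisted.

# Reduce ~1e6 raw readings to ~600 per-window feature rows at ingestion time.
set.seed(7)
raw <- data.frame(ts    = sort(runif(1e6, 0, 600)),  # 10 minutes of raw events
                  value = rnorm(1e6))
raw$window <- floor(raw$ts)                          # 1-second tumbling windows

# Keep only one summary row per window before writing anything to HDFS.
features <- aggregate(value ~ window, data = raw, FUN = mean)
c(raw_rows = nrow(raw), feature_rows = nrow(features))   # ~1e6 -> ~600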
© 2013 IBM Corporation38
Real-Time/ In-Memory Data Ingestion
● If volume can be reduced dramatically during the first processing steps
– Feature extraction of
• Video
• Audio
• Semistructured text (e.g. logfiles)
• Structured text
– Filtering
– Compression
© 2013 IBM Corporation39
SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● Hive (SQL dialect)
● Cloudera Impala
● Lingual
● ...
SQL Hadoop
© 2013 IBM Corporation40
BigSQL V3.0 – ANSI SQL 92 compliant
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to
successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without
modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
© 2013 IBM Corporation41
BigSQL V3.0 – Architecture
© 2013 IBM Corporation42
BigSQL V3.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation43
BigSQL V3.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

select count(hour), hour from trace group by hour order by hour;
-- This query runs on 32 GB / ~650.000.000 rows in HDFS
© 2013 IBM Corporation44
BigSQL V3.0 – Demo (small)
© 2013 IBM Corporation45
BigSQL V3.0 – Demo (small)
© 2013 IBM Corporation46
R on Hadoop
● IBM BigR (based on the SystemML Almaden Research project)
● RHadoop
● RHIPE
● ...
“R” Hadoop
© 2013 IBM Corporation47
BigR (based on SystemML)
Example: Gaussian Non-negative Matrix Factorization
package gnmf;
import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
public class MatrixGNMF
{
public static void main(String[] args) throws IOException, URISyntaxException
{
if(args.length < 10)
{
System.out.println("missing parameters");
System.out.println("expected parameters: [directory of v] [directory of w] [directory
of h] " +
"[k] [num mappers] [num reducers] [replication] [working directory] " +
"[final directory of w] [final directory of h]");
System.exit(1);
}
String vDir = args[0];
String wDir = args[1];
String hDir = args[2];
int k = Integer.parseInt(args[3]);
int numMappers = Integer.parseInt(args[4]);
int numReducers = Integer.parseInt(args[5]);
int replication = Integer.parseInt(args[6]);
String outputDir = args[7];
String wFinalDir = args[8];
String hFinalDir = args[9];
JobConf mainJob = new JobConf(MatrixGNMF.class);
String vDirectory;
String wDirectory;
String hDirectory;
FileSystem.get(mainJob).delete(new Path(outputDir));
vDirectory = vDir;
hDirectory = hDir;
wDirectory = wDir;
String workingDirectory;
String resultDirectoryX;
String resultDirectoryY;
long start = System.currentTimeMillis();
System.gc();
System.out.println("starting calculation");
System.out.print("calculating X = WT * V... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = WT * W * H... ");
workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
wDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating H = H .* X ./ Y... ");
workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
System.out.println("done");
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back H... ");
FileSystem.get(mainJob).delete(new Path(hDirectory));
hDirectory = workingDirectory;
System.out.println("done");
System.out.print("calculating X = V * HT... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = W * H * HT... ");
workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
hDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating W = W .* X ./ Y... ");
workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
System.out.println("done");
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back W... ");
FileSystem.get(mainJob).delete(new Path(wDirectory));
package gnmf;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep2
{
static class UpdateWHStep2Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixVector, TaggedIndex,
MatrixVector>
{
@Override
public void map(TaggedIndex key, MatrixVector value,
OutputCollector<TaggedIndex, MatrixVector> out,
Reporter reporter) throws IOException
{
out.collect(key, value);
}
}
static class UpdateWHStep2Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixVector, TaggedIndex,
MatrixObject>
{
@Override
public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
OutputCollector<TaggedIndex, MatrixObject> out, Reporter
reporter)
throws IOException
{
MatrixVector result = null;
while(values.hasNext())
{
MatrixVector current = values.next();
if(result == null)
{
result = current.getCopy();
} else
{
result.addVector(current);
}
}
if(result != null)
{
out.collect(new TaggedIndex(key.getIndex(),
TaggedIndex.TYPE_VECTOR_X),
new MatrixObject(result));
}
}
}
public static String runJob(int numMappers, int numReducers, int
replication,
String inputDir, String outputDir) throws IOException
{
String workingDirectory = outputDir + System.currentTimeMillis() +
"-UpdateWHStep2/";
JobConf job = new JobConf(UpdateWHStep2.class);
job.setJobName("MatrixGNMFUpdateWHStep2");
job.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputDir));
package gnmf;
import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep1
{
public static final int UPDATE_TYPE_H = 0;
public static final int UPDATE_TYPE_W = 1;
static class UpdateWHStep1Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject>
{
private int updateType;
@Override
public void map(TaggedIndex key, MatrixObject value,
OutputCollector<TaggedIndex, MatrixObject> out,
Reporter reporter) throws IOException
{
if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL)
{
MatrixCell current = (MatrixCell) value.getObject();
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
} else
{
out.collect(key, value);
}
}
@Override
public void configure(JobConf job)
{
updateType = job.getInt("gnmf.updateType", 0);
}
}
static class UpdateWHStep1Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector>
{
private double[] baseVector = null;
private int vectorSizeK;
@Override
public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
throws IOException
{
if(key.getType() == TaggedIndex.TYPE_VECTOR)
{
if(!values.hasNext())
throw new RuntimeException("expected vector");
MatrixFormats current = values.next().getObject();
if(!(current instanceof MatrixVector))
throw new RuntimeException("expected vector");
baseVector = ((MatrixVector) current).getValues();
} else
{
while(values.hasNext())
{
MatrixCell current = (MatrixCell) values.next().getObject();
if(baseVector == null)
{
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
new MatrixVector(vectorSizeK));
} else
{
if(baseVector.length == 0)
throw new RuntimeException("base vector is corrupted");
MatrixVector resultingVector = new MatrixVector(baseVector);
resultingVector.multiplyWithScalar(current.getValue());
if(resultingVector.getValues().length == 0)
throw new RuntimeException("multiplying with scalar failed");
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
resultingVector);
}
}
baseVector = null;
}
}
@Override
public void configure(JobConf job)
{
vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
Java Implementation
(>1500 lines of code)
Equivalent SystemML Implementation
(10 lines of code)
Experimenting with multiple variants!
W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)
W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H))
H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
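To make the comparison concrete, here is a minimal in-memory sketch of the standard multiplicative NMF updates expressed above, in plain R on a tiny random matrix (the point of SystemML/BigR is that the same formulas also compile to MR plans when the data no longer fits in memory):

# Multiplicative GNMF/NMF updates on a small random matrix (illustration only).
set.seed(1)
V <- matrix(runif(100 * 80), 100, 80)   # data matrix, V ~ W %*% H
k <- 5
W <- matrix(runif(100 * k), 100, k)
H <- matrix(runif(k * 80),  k, 80)

for (i in 1:50) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}
sum((V - W %*% H)^2)   # reconstruction error shrinks over the iterations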
© 2013 IBM Corporation48
BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single-machine (CP) to large-scale cluster (MR) compute.
● Challenge
– Guaranteed hard memory constraints (budget of JVM size)
– for arbitrarily complex ML programs
● Key Technical Innovations
– CP & MR runtime: single-machine & MR operations, integrated runtime
– Caching: reuse and eviction of in-memory objects
– Cost model: accurate time and worst-case memory estimates
– Optimizer: cost-based runtime plan generation
– Dynamic recompiler: re-optimization for initial unknowns
[Figure: runtime vs. data size — CP (high-performance computing for small data sizes), hybrid CP/MR plans that gradually exploit MR parallelism, and MR (scalable computing for large data sizes)]
© 2013 IBM Corporation49
BigR Architecture
[Architecture diagram: R clients connect to the SystemML statistics engine over the data sources; (1) pull data (summaries) to the R client, (2) or push R functions right onto the data, (3) embedded R execution via the IBM R packages]
© 2013 IBM Corporation50
BigR Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation51
BigR Demo (small)
library(bigr)
bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()

tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)

h <- bigr.histogram.stats(tbr$V1, nbins=24)
© 2013 IBM Corporation52
BigR Demo (small)
class bins counts centroids
1 ALL 0 18289280 1.583333
2 ALL 1 15360 2.750000
3 ALL 2 55040 3.916667
4 ALL 3 189440 5.083333
5 ALL 4 579840 6.250000
6 ALL 5 5292160 7.416667
7 ALL 6 8074880 8.583333
8 ALL 7 15653120 9.750000
...
© 2013 IBM Corporation53
BigR Demo (small)
© 2013 IBM Corporation54
BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()
© 2013 IBM Corporation55
BigR Demo (small)
Sampling, Resampling, Bootstrapping
vs
Whole Dataset Processing
What is your experience?
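As a starting point for that discussion, a small R comparison with made-up data standing in for the 32 GB table: a 1% sample plus a bootstrap confidence interval versus the exact answer from whole-dataset processing.

set.seed(42)
full <- rlnorm(1e6)                     # stand-in for the full dataset
smp  <- sample(full, 1e4)               # 1% sample

# Bootstrap CI for the mean, computed from the sample only.
boot_means <- replicate(1000, mean(sample(smp, length(smp), replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))

mean(full)                              # exact value from whole-dataset processing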
© 2013 IBM Corporation56
Python on Hadoop
python Hadoop
© 2013 IBM Corporation57
SPSS on Hadoop
© 2013 IBM Corporation58
SPSS on Hadoop
© 2013 IBM Corporation59
BigSheets Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation60
BigSheets Demo (small)
© 2013 IBM Corporation61
BigSheets Demo (small)
This command runs on 32 GB /
~650.000.000 rows in HDFS
© 2013 IBM Corporation62
BigSheets Demo (small)
© 2013 IBM Corporation63
Text Extraction (SystemT, AQL)
© 2013 IBM Corporation64
Text Extraction (SystemT, AQL)
© 2013 IBM Corporation65
What if this is not enough? → BigData AppStore
© 2013 IBM Corporation66
BigData AppStore, Eclipse Tooling
● Write your apps in
– Java (MapReduce)
– Pig Latin, Jaql
– BigSQL / Hive / BigR
● Deploy them to BigInsights via Eclipse
● Automatically
– Schedule
– Update
• HDFS files
• BigSQL tables
• BigSheets collections
© 2013 IBM Corporation67
Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
© 2013 IBM Corporation68
DFT/Audio Analytics (as promised)
library(tuneR)

a <- readWave("whitenoisesine.wav")
f <- fft(a@left)
jpeg('rplot_wnsine.jpg')
plot(Re(f)^2)
dev.off()

a <- readWave("whitenoise.wav")
f <- fft(a@left)
jpeg('rplot_wn.jpg')
plot(Re(f)^2)
dev.off()

a <- readWave("whitenoisesine.wav")
brv <- as.bigr.vector(a@left)
al <- as.list(a@left)
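A hedged variant of the snippet above, assuming the same whitenoisesine.wav file is available: the full power spectrum uses the complex modulus rather than only the real part of the FFT, and scaling the axis by the sample rate labels it in Hz.

library(tuneR)

a    <- readWave("whitenoisesine.wav")   # same file as above (assumed present)
f    <- fft(a@left)
n    <- length(f)
freq <- (0:(n - 1)) * a@samp.rate / n    # frequency axis in Hz
half <- 1:floor(n / 2)                   # keep the non-mirrored half

jpeg('rplot_power.jpg')
plot(freq[half], Mod(f[half])^2, type = "l",
     xlab = "Frequency [Hz]", ylab = "Power")
dev.off()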
© 2013 IBM Corporation69
Backup Slides
© 2013 IBM Corporation70
© 2013 IBM Corporation71
© 2013 IBM Corporation72
© 2013 IBM Corporation73
© 2013 IBM Corporation74
© 2013 IBM Corporation75
© 2013 IBM Corporation76
© 2013 IBM Corporation77
© 2013 IBM Corporation78
© 2013 IBM Corporation79
© 2013 IBM Corporation80
© 2013 IBM Corporation81
© 2013 IBM Corporation82
© 2013 IBM Corporation83
© 2013 IBM Corporation84
Map-Reduce
Source: http://www.cloudcomputingpatterns.org/Map_Reduce
© 2013 IBM Corporation85
© 2013 IBM Corporation86
© 2013 IBM Corporation87
© 2013 IBM Corporation88
© 2013 IBM Corporation89
© 2013 IBM Corporation90
© 2013 IBM Corporation91
© 2013 IBM Corporation92
© 2013 IBM Corporation93
© 2013 IBM Corporation94
© 2013 IBM Corporation95
© 2013 IBM Corporation96
© 2013 IBM Corporation97
© 2013 IBM Corporation98
© 2013 IBM Corporation99
© 2013 IBM Corporation100
© 2013 IBM Corporation101
© 2013 IBM Corporation102
© 2013 IBM Corporation103
© 2013 IBM Corporation104
© 2013 IBM Corporation105
© 2013 IBM Corporation106
© 2013 IBM Corporation107
© 2013 IBM Corporation108
  • 3. © 2013 IBM Corporation3 What is DataScience? Source: Statoo.com http://slidesha.re/1kmNiX0
  • 4. © 2013 IBM Corporation4 What is DataScience? Source: Statoo.com http://slidesha.re/1kmNiX0
  • 5. © 2013 IBM Corporation5 DataScience at present ● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html) ● SQL (42%) ● R (33%) ● Python (26%) ● Excel (25%) ● Java, Ruby, C++ (17%) ● SPSS, SAS (9%) ● Limitations (Single Node usage) ● Main Memory ● CPU <> Main Memory Bandwidth ● CPU ● Storage <> Main Memory Bandwidth (either Single node or SAN)
  • 6. © 2013 IBM Corporation6 DataScience at present - Demo ● Assume 1 TB file on Hard Drive ● Spit into 16 files ● split -d -n 16 output.json ● Distribute on 4 Nodes ● for node in `seq 1 16`; do scp x$node id@node$i:~/; done ● Perform calculation in paralell ● for node in `seq 1 16`; do ssh id@node$i 'cat $file |awk -F":" '{print $6}' |grep -i samsung |grep breathtaking |wc -l'; done > result ● Merge Result ● cat result |sum Source: http://sergeytihon.wordpress.com/2013/03/20/the-data-science-venn-diagram/
  • 7. © 2013 IBM Corporation7 What is BIG data?
  • 8. © 2013 IBM Corporation8 What is BIG data?
  • 9. © 2013 IBM Corporation9 What is BIG data? Big Data Hadoop
  • 10. © 2013 IBM Corporation10 What is BIG data? Business Intelligence Data Warehouse
  • 11. © 2013 IBM Corporation11 BigData == Hadoop? Hadoop BigData Hadoop
  • 12. © 2013 IBM Corporation12 What is beyond “Data Warehouse”? Data Lake Data Warehouse
  • 13. © 2013 IBM Corporation13 First “BigData” UseCase ? ● Google Index ● 40 X 10^9 = 40.000.000.000 => 40 billion pages indexed ● Will break 100 PB barrier soon ● Derived from MapReduce ● now “caffeine” based on “percolator” ● Incremental vs. batch ● In-Memory vs. disk ●
  • 14. © 2013 IBM Corporation14 Map-Reduce → Hadoop → BigInsights
  • 15. © 2013 IBM Corporation15 BigData UseCases ● CERN LHC ● 25 petabytes per year ● Facebook ● Hive Datawarehouse ● 300 PB, Growing 600 TB / d ● > 100 k servers ● Genomics ● Enterprises ● Data center analytics (Logflies, OS/NW monitors, ...) ● Predictive Maintenance, Cybersecurity ● Social Media Analytics ● DWH offload ● Call Detail Record (CDR) data preservation http://www.balthasar-glaettli.ch/vorratsdaten/
  • 16. © 2013 IBM Corporation1616 Why is Big Data important?
  • 17. © 2013 IBM Corporation17 BigData Analytics Source: http://www.strategy-at-risk.com/2008/01/01/what-we-do/
  • 18. © 2013 IBM Corporation18 BigData Analytics – Predictive Analytics "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc. The Unreasonable Effectiveness of Data¹ ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf No Sampling => Work with full dataset => No p-Value/z-Scores anymore
  • 19. © 2013 IBM Corporation19 We need Data Parallelism
  • 20. © 2013 IBM Corporation20 Aggregated Bandwidth between CPU, Main Memory and Hard Drive 1 TB (at 10 GByte/s) - 1 Node - 100 sec - 10 Nodes - 10 sec - 100 Nodes - 1 sec - 1000 Nodes - 100 msec
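For reference, here is a minimal R sketch of the scan-time arithmetic behind these numbers, assuming the stated 10 GByte/s of aggregated scan bandwidth per node and a perfectly parallel scan with no coordination overhead (both idealisations):

  # Time to scan 1 TB when every node contributes ~10 GByte/s of scan bandwidth
  # and the scan parallelises perfectly across nodes (idealised assumption).
  data_gb       <- 1000                    # 1 TB expressed in GByte
  per_node_gbps <- 10                      # GByte/s per node, as stated on the slide
  nodes         <- c(1, 10, 100, 1000)
  scan_seconds  <- data_gb / (per_node_gbps * nodes)
  data.frame(nodes, scan_seconds)          # 100 s, 10 s, 1 s, 0.1 s - matching the slide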
  • 21. © 2013 IBM Corporation21 Fault Tolerance / Commodity Hardware ● AMD Turion II Neo N40L (2x 1.5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB SEAGATE Barracuda 7200.14 < CHF 500 ● 100 K => 200 X (2, 4, 3) => 400 Cores, 1.6 TB RAM, 200 TB HD ● MTBF ~ 365 d per node → > 1.5 d for the cluster Source: http://www.cloudcomputingpatterns.org/Watchdog
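As a rough plausibility check of the MTBF figure, a minimal R sketch; it assumes independent node failures and that the cluster-level MTBF scales as the per-node MTBF divided by the node count, which is a simplification:

  # Crude cluster MTBF estimate under independent node failures:
  # MTBF_cluster ≈ MTBF_node / number_of_nodes
  mtbf_node_days <- 365                    # per-node MTBF assumed on the slide
  nodes          <- 200                    # the 200-node commodity cluster above
  mtbf_node_days / nodes                   # ≈ 1.8 days, consistent with "> 1.5 d"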
  • 22. © 2013 IBM Corporation22 NoSQL Databases ● Column Store – Hadoop / HBASE – Cassandra – Amazon Simple DB ● JSON / Document Store – MongoDB – CouchDB ● Key / Value Store – Amazon DynamoDB – Voldemort ● Graph DBs – DB2 SPARQL Extension – Neo4J ● MP RDBMS – DB2 DPF, DB2 pureScale, PureData for Operational Analytics – Oracle RAC – Greenplum ● http://nosql-database.org/ > 150
  • 23. © 2013 IBM Corporation23 CAP Theorem / Brewer's Theorem¹ ● It is impossible for a distributed computer system to simultaneously guarantee all 3 properties – Consistency (all nodes see the same data at the same time) – Availability (guarantee that every request knows whether it was successful or failed) – Partition tolerance (continues to operate despite failure of part of the system) ● What about ACID? – Atomicity – Consistency – Isolation – Durability ● BASE, the new ACID – Basically Available – Soft state – Eventual consistency • Monotonic Read Consistency • Monotonic Write Consistency • Read Your Own Writes
  • 24. © 2013 IBM Corporation24 What role is the cloud playing here?
  • 25. © 2013 IBM Corporation25 “Elastic” Scale-Out Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
  • 26. © 2013 IBM Corporation26 “Elastic” Scale-Out of
  • 27. © 2013 IBM Corporation27 “Elastic” Scale-Out of CPU Cores
  • 28. © 2013 IBM Corporation28 “Elastic” Scale-Out of CPU Cores Storage
  • 29. © 2013 IBM Corporation29 “Elastic” Scale-Out of CPU Cores Storage Memory
  • 30. © 2013 IBM Corporation30 “Elastic” Scale-Out linear Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
  • 31. © 2013 IBM Corporation31 How do Databases Scale-Out? Shared Disk Architectures
  • 32. © 2013 IBM Corporation32 How do Databases Scale-Out? Shared Nothing Architectures
  • 33. © 2013 IBM Corporation33 Hadoop? Shared Nothing Architecture? Shared Disk Architecture?
  • 34. © 2013 IBM Corporation34 Data Science on Hadoop SQL (42%) R (33%) Python (26%) Excel (25%) Java, Ruby, C++ (17%) SPSS, SAS (9%) Data Science Hadoop
  • 35. © 2013 IBM Corporation35 Large Scale Data Ingestion ● Traditionally ● Crawl to the local file system (e.g. wget http://www.heise.de/newsticker/) ● Export RDBMS data to CSV (local file system) ● Batched FTP server uploads ● Then: copy to HDFS ● BigInsights ● Use one of the built-in importers ● Imports directly into HDFS ● Use the Eclipse tooling to deploy custom importers easily
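A minimal R sketch of the "traditional" path listed above; it assumes wget and the standard hdfs command-line client are installed on the node running R, and the target directory /user/biadmin/raw is an illustrative assumption, not taken from the slides:

  # Traditional ingestion: crawl to the local file system, then copy into HDFS.
  local_file <- "newsticker.html"
  system2("wget", c("-O", local_file, "http://www.heise.de/newsticker/"))   # crawl
  system2("hdfs", c("dfs", "-mkdir", "-p", "/user/biadmin/raw"))            # prepare HDFS dir
  system2("hdfs", c("dfs", "-put", local_file, "/user/biadmin/raw/"))       # copy to HDFS
  system2("hdfs", c("dfs", "-ls", "/user/biadmin/raw"))                     # verify the upload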
  • 36. © 2013 IBM Corporation36 Large Scale Data Ingestion (ETL on M/R) ● Modern ETL (Extract, Transform, Load) tools support Hadoop as ● Source, Sink (HDFS) ● Engine (MapReduce) ● Example: InfoSphere DataStage
  • 37. © 2013 IBM Corporation37 Real-Time/ In-Memory Data Ingestion ● If volume can be reduced dramatically during first processing steps ● Feature Extraction of ● Video ● Audio ● Semistructured Text (e.g. Logfiles) ● Structured Text ● Filtering ● Compression ● Recommendation: Usage of Streaming Engines ● IBM InfoSphere Streams ● Twitter Storm (now Apache incubator) ● Apache Spark Streaming
  • 38. © 2013 IBM Corporation38 Real-Time/ In-Memory Data Ingestion ● If volume can be reduced dramatically during first processing steps ● Feature Extraction of ● Video ● Audio ● Semistructured Text (e.g. Logfiles) ● Structured Text ● Filtering ● Compression
  • 39. © 2013 IBM Corporation39 SQL on Hadoop ● IBM BigSQL (ANSI 92 compliant) ● HIVE (SQL dialect) ● Cloudera Impala ● Lingual ● ... SQL Hadoop
  • 40. © 2013 IBM Corporation40 BigSQL V3.0 – ANSI SQL 92 compliant IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
  • 41. © 2013 IBM Corporation41 BigSQL V3.0 – Architecture
  • 42. © 2013 IBM Corporation42 BigSQL V3.0 – Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 43. © 2013 IBM Corporation43 BigSQL V3.0 – Demo (small)
  CREATE EXTERNAL TABLE trace (
    hour integer,
    employeeid integer,
    departmentid integer,
    clientid integer,
    date string,
    timestamp string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION '/user/biadmin/32Gtest';

  select count(hour), hour from trace group by hour order by hour
  -- This command runs on 32 GB / ~650.000.000 rows in HDFS
  • 44. © 2013 IBM Corporation44 BigSQL V3.0 – Demo (small)
  • 45. © 2013 IBM Corporation45 BigSQL V3.0 – Demo (small)
  • 46. © 2013 IBM Corporation46 R on Hadoop ● IBM BigR (based on the SystemML Almaden Research project) ● RHadoop ● RHIPE ● ... “R” Hadoop
  • 47. © 2013 IBM Corporation47 BigR (based on SystemML) Example: Gaussian Non-negative Matrix Factorization. (Slide shows an excerpt of the hand-written Hadoop MapReduce implementation – package gnmf: the MatrixGNMF driver plus the UpdateWHStep1–UpdateWHStep5 mapper/reducer classes.) Java Implementation (>1500 lines of code) vs. Equivalent SystemML Implementation (10 lines of code) – Experimenting with multiple variants! W = W*max(V%*%t(H) – alphaW JW, 0)/(W%*%H%*%t(H)) H = H*max(t(W)%*%V – alphaH JH, 0)/(t(W)%*%W%*%H) W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H)) H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H))) W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H)) H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
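To make the comparison concrete, here is a minimal single-machine R sketch of the basic multiplicative NMF updates that the SystemML one-liners above express; the regularised variants (alphaW/alphaH) and the weighting matrices S and E are omitted, and matrix sizes, iteration count and the eps term are illustrative assumptions:

  # Basic multiplicative-update NMF (the unregularised variant), in memory.
  # V (m x n) is approximated by W (m x k) %*% H (k x n); eps avoids division by zero.
  set.seed(42)
  m <- 100; n <- 80; k <- 5; eps <- 1e-9
  V <- matrix(runif(m * n), m, n)
  W <- matrix(runif(m * k), m, k)
  H <- matrix(runif(k * n), k, n)
  for (iter in 1:50) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)   # update H, stays non-negative
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)   # update W, stays non-negative
  }
  norm(V - W %*% H, "F") / norm(V, "F")                # relative reconstruction error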
  • 48. © 2013 IBM Corporation48 BigR (based on SystemML) SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute ● Challenge ● Guaranteed hard memory constraints (budget of JVM size) ● for arbitrarily complex ML programs ● Key Technical Innovations ● CP & MR Runtime: Single machine & MR operations, integrated runtime ● Caching: Reuse and eviction of in-memory objects ● Cost Model: Accurate time and worst-case memory estimates ● Optimizer: Cost-based runtime plan generation ● Dyn. Recompiler: Re-optimization for initial unknowns (Chart: Runtime vs. Data size – hybrid plans gradually exploit MR parallelism, moving from CP to CP/MR to MR: high performance computing for small data sizes, scalable computing for large data sizes)
  • 49. © 2013 IBM Corporation49 BigR Architecture (diagram): R Clients – SystemML Statistics Engine – Data Sources, with Embedded R Execution and IBM R Packages; either pull data (summaries) to the R client, or push R functions right on the data
  • 50. © 2013 IBM Corporation50 BigR Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 51. © 2013 IBM Corporation51 BigR Demo (small)
  library(bigr)
  bigr.connect(host="bigdata", port=7052, database="default", user="biadmin", password="xxx")
  is.bigr.connected()
  tbr <- bigr.frame(dataSource="DEL",
                    coltypes = c("numeric","numeric","numeric","numeric","character","character"),
                    dataPath="/user/biadmin/32Gtest", delimiter=",", header=F, useMapReduce=T)
  h <- bigr.histogram.stats(tbr$V1, nbins=24)
  • 52. © 2013 IBM Corporation52 BigR Demo (small)
    class bins   counts centroids
  1 ALL   0    18289280  1.583333
  2 ALL   1       15360  2.750000
  3 ALL   2       55040  3.916667
  4 ALL   3      189440  5.083333
  5 ALL   4      579840  6.250000
  6 ALL   5     5292160  7.416667
  7 ALL   6     8074880  8.583333
  8 ALL   7    15653120  9.750000
  ...
  • 53. © 2013 IBM Corporation53 BigR Demo (small)
  • 54. © 2013 IBM Corporation54 BigR Demo (small)
  jpeg('hist.jpg')
  bigr.histogram(tbr$V1, nbins=24)  # This command runs on 32 GB / ~650.000.000 rows in HDFS
  dev.off()
  • 55. © 2013 IBM Corporation55 BigR Demo (small) Sampling, Resampling, Bootstrapping vs Whole Dataset Processing What is your experience?
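As a toy illustration of the question this slide raises, here is a minimal R sketch comparing the exact mean of a synthetic "full" dataset with a bootstrap confidence interval computed from a 1% sample; the data, sample fraction and replicate count are arbitrary assumptions:

  # Exact statistic on the full data vs. bootstrap CI from a small sample.
  set.seed(1)
  full <- rlnorm(1e6, meanlog = 0, sdlog = 1)            # stand-in for the "whole dataset"
  mean(full)                                             # exact answer, needs a full scan
  sample_1pct <- sample(full, size = length(full) * 0.01)
  boot_means  <- replicate(1000, mean(sample(sample_1pct, replace = TRUE)))
  quantile(boot_means, c(0.025, 0.975))                  # bootstrap 95% CI from the sample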
  • 56. © 2013 IBM Corporation56 Python on Hadoop python Hadoop
  • 57. © 2013 IBM Corporation57 SPSS on Hadoop
  • 58. © 2013 IBM Corporation58 SPSS on Hadoop
  • 59. © 2013 IBM Corporation59 BigSheets Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 60. © 2013 IBM Corporation60 BigSheets Demo (small)
  • 61. © 2013 IBM Corporation61 BigSheets Demo (small) This command runs on 32 GB / ~650.000.000 rows in HDFS
  • 62. © 2013 IBM Corporation62 BigSheets Demo (small)
  • 63. © 2013 IBM Corporation63 Text Extraction (SystemT, AQL)
  • 64. © 2013 IBM Corporation64 Text Extraction (SystemT, AQL)
  • 65. © 2013 IBM Corporation65 If this is not enough? → BigData AppStore
  • 66. © 2013 IBM Corporation66 BigData AppStore, Eclipse Tooling ● Write your apps in ● Java (MapReduce) ● PigLatin,Jaql ● BigSQL/Hive/BigR ● Deploy it to BigInsights via Eclipse ● Automatically ● Schedule ● Update ● hdfs files ● BigSQL tables ● BigSheets collections
  • 67. © 2013 IBM Corporation67 Questions? http://www.ibm.com/software/data/bigdata/ Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
  • 68. © 2013 IBM Corporation68 DFT/Audio Analytics (as promised)
  library(tuneR)
  a <- readWave("whitenoisesine.wav")
  f <- fft(a@left)
  jpeg('rplot_wnsine.jpg')
  plot(Re(f)^2)
  dev.off()
  a <- readWave("whitenoise.wav")
  f <- fft(a@left)
  jpeg('rplot_wn.jpg')
  plot(Re(f)^2)
  dev.off()
  a <- readWave("whitenoisesine.wav")
  brv <- as.bigr.vector(a@left)
  al <- as.list(a@left)
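A self-contained variant of the same idea, as a minimal R sketch that synthesises a noisy sine wave instead of reading the .wav files used on the slide; sample rate, tone frequency and noise level are illustrative assumptions:

  # DFT demo without external audio files: 440 Hz tone plus white noise.
  fs     <- 8000                                   # sample rate in Hz (assumed)
  time_s <- (0:(fs - 1)) / fs                      # 1 second of samples
  sig    <- sin(2 * pi * 440 * time_s) + rnorm(fs, sd = 0.5)
  spec   <- Mod(fft(sig))^2                        # power spectrum (magnitude squared)
  half   <- spec[2:(fs / 2)]                       # drop DC, keep the non-redundant half
  which.max(half)                                  # ≈ 440: bin of the tone at 1 Hz resolution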
  • 69. © 2013 IBM Corporation69 Backup Slides
  • 70. © 2013 IBM Corporation70
  • 71. © 2013 IBM Corporation71
  • 72. © 2013 IBM Corporation72
  • 73. © 2013 IBM Corporation73
  • 74. © 2013 IBM Corporation74
  • 75. © 2013 IBM Corporation75
  • 76. © 2013 IBM Corporation76
  • 77. © 2013 IBM Corporation77
  • 78. © 2013 IBM Corporation78
  • 79. © 2013 IBM Corporation79
  • 80. © 2013 IBM Corporation80
  • 81. © 2013 IBM Corporation81
  • 82. © 2013 IBM Corporation82
  • 83. © 2013 IBM Corporation83
  • 84. © 2013 IBM Corporation84 Map-Reduce Source: http://www.cloudcomputingpatterns.org/Map_Reduce
  • 85. © 2013 IBM Corporation85
  • 86. © 2013 IBM Corporation86
  • 87. © 2013 IBM Corporation87
  • 88. © 2013 IBM Corporation88
  • 89. © 2013 IBM Corporation89
  • 90. © 2013 IBM Corporation90
  • 91. © 2013 IBM Corporation91
  • 92. © 2013 IBM Corporation92
  • 93. © 2013 IBM Corporation93
  • 94. © 2013 IBM Corporation94
  • 95. © 2013 IBM Corporation95
  • 96. © 2013 IBM Corporation96
  • 97. © 2013 IBM Corporation97
  • 98. © 2013 IBM Corporation98
  • 99. © 2013 IBM Corporation99
  • 100. © 2013 IBM Corporation100
  • 101. © 2013 IBM Corporation101
  • 102. © 2013 IBM Corporation102
  • 103. © 2013 IBM Corporation103
  • 104. © 2013 IBM Corporation104
  • 105. © 2013 IBM Corporation105
  • 106. © 2013 IBM Corporation106
  • 107. © 2013 IBM Corporation107
  • 108. © 2013 IBM Corporation108