Bio Bigdata 
2014.9.30 
Bio Bigdata 
BICube min-kyung Kim 
Toz Tower Gangnam Branch, Room 401
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd Party Biobigdata 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 2 daengky@naver.com
Bio Bigdata 
Bigdata 
What is Bigdata? 
• Data is massive, unstructured, and dirty 
• Data sets so large and complex 
• Difficult to process using traditional data processing 
applications 
• Massively parallel software running on tens, hundreds, or 
even thousands of servers 
BICube min-kyung Kim 3 daengky@naver.com
Bio Bigdata 
What is Bigdata? 
Why Biobigdata? 
• Asian Genome: 3.3 billion 35 bp, 104 GB 
• African Genome: 4.0 billion 35 bp, 144 GB 
• 1000 Genomes = 5,994,000 processes 
= 23,976,000 hours 
≈ 2,737 years (at 8,760 hours per year) 
by Deepak Singh 
BICube min-kyung Kim 4 daengky@naver.com
Bio Bigdata 
Bigdata 
• Rethink algorithms 
• Rethink Computing 
• Rethink Data Management 
• Rethink Data Sharing 
Unstructured 
Deep Analytics Machine Learning 
Out of core 
Bigdata Intelligence Platform 
Distributed Parallel Scalability 
Distributed In-Memory 
BICube min-kyung Kim 5 daengky@naver.com
Bio Bigdata 
Bigdata 
More Accurate Decision Making 
BICube min-kyung Kim 6 daengky@naver.com
Bio Bigdata 
Bigdata 
Why Bigdata? 
Central limit theorem: 
the arithmetic mean of a sufficiently large number of iterates of independent 
random variables is approximately normally distributed 
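In symbols (the standard statement, added here for clarity; $\bar{X}_n$ is the mean of $n$ i.i.d. variables with mean $\mu$ and variance $\sigma^2$): 

$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0,\ \sigma^2)$ 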
Overfitting, Six Sigma 
So bigdata is not necessary! 
Large data is enough 
But …. 
Why Markov assumptions? 
Why independence assumptions? 
BICube min-kyung Kim 7 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd Party Biobigdata 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 8 daengky@naver.com
Bio Bigdata 
Hadoop 
What is Hadoop? 
• An open-source software framework for reliable, scalable, distributed 
computing. 
• A framework that allows for the distributed processing of 
large data sets across clusters of computers using simple 
programming models 
• scale up from single servers to thousands of machines, 
each offering local computation and storage. 
BICube min-kyung Kim 9 daengky@naver.com
Bio Bigdata 
Hadoop 
Included modules (Hadoop 2) 
• Hadoop Common: Utilities that support the other Hadoop 
modules. 
• Hadoop Distributed File System (HDFS™): A distributed file 
system 
• Hadoop YARN: A framework for job scheduling and cluster 
resource management. 
• Hadoop MapReduce: A YARN-based system for parallel 
processing of large data sets. 
BICube min-kyung Kim 10 daengky@naver.com
Bio Bigdata 
Hadoop 
Hadoop Distributed Filesystem (HDFS) 
Overview 
• Runs on commodity servers 
• The idea of "write once, read many times" 
• Large streaming reads 
• High throughput is more important than low latency 
BICube min-kyung Kim 11 daengky@naver.com
Bio Bigdata 
Hadoop 
HDFS Architecture 
• Stored as Block 
• Provide Reliability through Replication 
• NameNode stores Metadata and Manage Access 
• No data caching (large datasets) 
BICube min-kyung Kim 12 daengky@naver.com
Bio Bigdata 
Hadoop 
HDFS Federation 
– Block management 
BICube min-kyung Kim 13 daengky@naver.com
Bio Bigdata 
Hadoop 
File Storage 
• NameNode 
Metadata: filename, location of blocks, file attributes 
Keeps metadata in RAM 
Metadata size is limited by the amount of NameNode RAM 
• DataNode 
Stores file contents as blocks 
The same file is stored across different DataNodes 
The same block is replicated across several DataNodes 
Periodically sends a block report to the NameNode 
# The NameNode exchanges heartbeats with each DataNode 
# Clients always read from the nearest node 
BICube min-kyung Kim 14 daengky@naver.com
Bio Bigdata 
Hadoop 
Internals of file read 
• The client is guided by the NameNode 
• The client contacts DataNodes directly 
• Scales to a large number of clients (a short read sketch follows below) 
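A minimal sketch of this read path using the standard Hadoop FileSystem API from Scala (the path is illustrative): 

import org.apache.hadoop.conf.Configuration 
import org.apache.hadoop.fs.{FileSystem, Path} 

val conf = new Configuration() 
val fs = FileSystem.get(conf) 
// open() asks the NameNode for block locations; 
// the bytes are then streamed directly from DataNodes 
val in = fs.open(new Path("/user/hadoop/file")) 
val buf = new Array[Byte](4096) 
val n = in.read(buf) 
in.close() 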
BICube min-kyung Kim 15 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce 
• A method for distributing a task across nodes 
• Each node processes the data stored locally on it 
• Two phases: map, reduce 
Map: (K1, V1) → (K2, V2) 
Reduce: (K2, list(V2)) → list(K3, V3) 
• Automatic parallelization 
• Fault tolerance 
• Usually written in Java 
BICube min-kyung Kim 16 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce: Google 
BICube min-kyung Kim 17 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce 
Shuffling: groups and sorts the map output by key to form the reducer input 
BICube min-kyung Kim 18 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce 
import java.io.IOException; 
import java.util.*; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 
public class WordCount { 
  public static void main(String[] args) { 
    // driver implementation shown on a later slide 
  } 
} 
BICube min-kyung Kim 19 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce 
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 
  private final static IntWritable one = new IntWritable(1); 
  private Text word = new Text(); 

  public void map(LongWritable key, Text value, Context context) 
      throws IOException, InterruptedException { 
    String line = value.toString(); 
    StringTokenizer tokenizer = new StringTokenizer(line); 
    while (tokenizer.hasMoreTokens()) { 
      word.set(tokenizer.nextToken()); 
      context.write(word, one); 
    } 
  } 
} 
BICube min-kyung Kim 20 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce 
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 
  public void reduce(Text key, Iterable<IntWritable> values, Context context) 
      throws IOException, InterruptedException { 
    int sum = 0; 
    for (IntWritable val : values) { 
      sum += val.get(); 
    } 
    context.write(key, new IntWritable(sum)); 
  } 
} 
BICube min-kyung Kim 21 daengky@naver.com
Bio Bigdata 
MapReduce 
public class WordCount { 
  public static void main(String[] args) throws Exception { 
    Configuration conf = new Configuration(); 
    Job job = new Job(conf, "wordcount"); 
    job.setInputFormatClass(TextInputFormat.class); 
    job.setOutputFormatClass(TextOutputFormat.class); 
    job.setMapperClass(Map.class); 
    job.setReducerClass(Reduce.class); 
    job.setOutputKeyClass(Text.class); 
    job.setOutputValueClass(IntWritable.class); 
    FileInputFormat.addInputPath(job, new Path(args[0])); 
    FileOutputFormat.setOutputPath(job, new Path(args[1])); 
    job.waitForCompletion(true); 
  } 
} 
BICube min-kyung Kim 22 daengky@naver.com
Bio Bigdata 
Hadoop 
FS Shell 
Options 
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, dus, 
expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, 
put, rm, rmr, setrep, stat, tail, test, text, touchz 
hadoop fs -cat /user/hadoop/file 
hadoop fs -put localfile /user/hadoop/hadoopfile 
BICube min-kyung Kim 23 daengky@naver.com
Bio Bigdata 
Hadoop 
MapReduce(Hadoop 1) 
• JobTracker 
• TaskTracker 
BICube min-kyung Kim 24 daengky@naver.com
Bio Bigdata 
Hadoop 
NextGen MapReduce (YARN:Hadoop 2) 
• JobTracker → ResourceManager 
• TaskTracker → NodeManager 
BICube min-kyung Kim 25 daengky@naver.com
Bio Bigdata 
Hadoop 
• Hadoop Streaming 
A utility for writing MapReduce jobs in languages other than Java 
Uses stdin and stdout 
Text only 
BICube min-kyung Kim 26 daengky@naver.com
Bio Bigdata 
Hadoop 
Python MapReduce using Hadoop Streaming 
Use sys.stdin to read data and sys.stdout to print data 
$ cat data | mapper.py | sort | reducer.py 
$ cat mobydick.txt | ./mapper.py | sort | ./reducer.py > out/mobydick_cooccur_local.txt 
$ head out/mobydick_cooccur_local.txt 
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ 
  $PARAMS \ 
  -mapper mapper.py \ 
  -reducer reducer.py \ 
  -input cooccur_example/mobydick.txt \ 
  -output cooccur_example/mobydick_out \ 
  -file mapper.py -file reducer.py 
BICube min-kyung Kim 27 daengky@naver.com
Bio Bigdata 
Hadoop 
Python Streaming Mapper 
##### mapper.py ##### 
#!/usr/bin/env python 
import sys, re 

for line in sys.stdin: 
    # remove non alpha-numeric characters 
    line = re.sub("[^a-zA-Z0-9]", " ", line) 
    # remove leading and trailing whitespace 
    line = line.strip() 
    # split the line into words 
    words = line.split() 
    # increase counters 
    for word in words: 
        # write the results to STDOUT (standard output); 
        # what we output here will be the input for the 
        # reduce step, i.e. the input for reducer.py 
        # tab-delimited; the trivial word count is 1 
        print '%s\t%s' % (word, 1) 
BICube min-kyung Kim 28 daengky@naver.com
Bio Bigdata 
Python Streaming Reducer 
##### reducer.py ##### 
#!/usr/bin/env python 
from operator import itemgetter 
import sys 

current_word = None 
current_count = 0 
word = None 

for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 
    # parse the mapper output 
    word, count = line.split('\t', 1) 
    # convert count (currently a string) to int 
    try: 
        count = int(count) 
    except ValueError: 
        # count wasn't a number, so we 
        # (silently) discard the line 
        continue 
    # this IF-switch only works because Hadoop sorts map output 
    # by key (here: word) before it is passed to the reducer 
    if current_word == word: 
        current_count += count 
    else: 
        if current_word: 
            # write result to STDOUT 
            print '%s\t%s' % (current_word, current_count) 
        current_count = count 
        current_word = word 

# do not forget to output the last word if needed! 
if current_word == word: 
    print '%s\t%s' % (current_word, current_count) 
BICube min-kyung Kim 29 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd party 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 30 daengky@naver.com
Bio Bigdata 
Spark 
What is Spark 
• A fast and general engine for large-scale data processing 
• Spark 1.1.0 (latest release) 
BICube min-kyung Kim 31 daengky@naver.com
Bio Bigdata 
Spark 
Process 
val conf = new SparkConf().setAppName("Simple Application") 
val sc = new SparkContext(conf) 
val textFile = sc.textFile("README.md") 
// e.g. trigger an action on the lazily-defined RDD 
println(textFile.count()) 
BICube min-kyung Kim 32 daengky@naver.com
Bio Bigdata 
Spark 
Components: 
BICube min-kyung Kim 33 daengky@naver.com
Bio Bigdata 
Spark 
RDD 
Resilient Distributed Datasets 
• Collections of objects spread across a cluster, 
stored in RAM or on Disk 
• Built through parallel transformations 
• Automatically rebuilt on failure 
• Standard RDD operations: map, countByValue, reduce, join (see the sketch below) 
• Stateful operations: window, countByValueAndWindow, … 
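A minimal sketch of building an RDD through parallel transformations in the Spark shell (the file name is illustrative): 

val lines = sc.textFile("README.md") 
// transformations are lazy and recorded as lineage, 
// which is what lets Spark rebuild partitions on failure 
val words = lines.flatMap(_.split(" ")) 
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) 
counts.take(5).foreach(println)   // action: triggers the computation 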
BICube min-kyung Kim 34 daengky@naver.com
Bio Bigdata 
Spark 
Language Support 
Standalone Programs 
• Python, Scala, & Java 
Interactive Shells 
• Python & Scala 
Performance 
• Java & Scala are faster 
due to static typing 
• …but Python is often fine 
Python 
lines = sc.textFile(...) 
lines.filter(lambda s: "ERROR" in s).count() 
Scala 
val lines = sc.textFile(...) 
lines.filter(x => x.contains("ERROR")).count() 
Java 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(new Function<String, Boolean>() { 
  Boolean call(String s) { 
    return s.contains("error"); 
  } 
}).count(); 
BICube min-kyung Kim 35 daengky@naver.com
Bio Bigdata 
Spark 
Spark SQL 
• Unifies access to structured data 
• integrated APIs in Python, Scala and Java 
sqlCtx = new HiveContext(sc) 
results = sqlCtx.sql( 
"SELECT * FROM people") 
names = results.map(lambda p: p.name) 
BICube min-kyung Kim 36 daengky@naver.com
Bio Bigdata 
Spark 
Spark SQL 
Shark Architecture 
BICube min-kyung Kim 37 daengky@naver.com
Bio Bigdata 
Spark 
Spark SQL 
Shark 
Partial DAG Execution (PDE) 
BICube min-kyung Kim 38 daengky@naver.com
Bio Bigdata 
Spark 
Spark SQL 
Shark 
Compare to hive 
BICube min-kyung Kim 39 daengky@naver.com
Bio Bigdata 
Spark 
Spark SQL 
Shark 
Compare to Impala 
BICube min-kyung Kim 40 daengky@naver.com
Bio Bigdata 
Spark 
Spark Streaming 
• makes it easy to build scalable fault-tolerant streaming 
applications. 
• powerful interactive applications 
• provide results in near-real-time 
BICube min-kyung Kim 41 daengky@naver.com
Bio Bigdata 
Spark 
Spark Streaming 
import org.apache.spark._ 
import org.apache.spark.streaming._ 
import org.apache.spark.streaming.StreamingContext._ 

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") 
val ssc = new StreamingContext(conf, Seconds(1)) 
val lines = ssc.socketTextStream("localhost", 9999) 
// count words in each 1-second mini-batch 
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start() 
ssc.awaitTermination() 

Batches held in memory: mini-batch processing 
BICube min-kyung Kim 42 daengky@naver.com
Bio Bigdata 
Spark 
Spark Streaming 
• Kafka spark-streaming-kafka_2.10 
• Flume spark-streaming-flume_2.10 
• Kinesis spark-streaming-kinesis-asl_2.10 [Apache Software License] 
• Twitter spark-streaming-twitter_2.10 
• ZeroMQ spark-streaming-zeromq_2.10 
• MQTT spark-streaming-mqtt_2.10 
BICube min-kyung Kim 43 daengky@naver.com
Bio Bigdata 
Spark 
MLlib 
A scalable machine learning library, usable in Java, Scala, and Python; up to 
100x faster than MapReduce (a k-means sketch follows the list below). 
• linear models (SVMs, logistic regression, linear regression) 
• decision trees 
• naive Bayes 
• alternating least squares (ALS) 
• k-means 
• singular value decomposition (SVD) 
• principal component analysis (PCA) 
• Feature extraction and transformation 
• stochastic gradient descent 
• limited-memory BFGS (L-BFGS) 
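As referenced above, a minimal sketch of MLlib k-means in the Spark 1.x API (the input path and the choice of k = 2 are illustrative): 

import org.apache.spark.mllib.clustering.KMeans 
import org.apache.spark.mllib.linalg.Vectors 

// one point per line, e.g. "1.0 2.0 3.0" 
val data = sc.textFile("data/points.txt") 
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() 
val model = KMeans.train(parsed, 2, 20)   // k = 2, 20 iterations 
println("Within-set sum of squared errors: " + model.computeCost(parsed)) 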
BICube min-kyung Kim 44 daengky@naver.com
Bio Bigdata 
Spark 
GraphX 
API for graphs and graph-parallel computation (a PageRank sketch follows below). 
• Graph-parallel primitives on Spark. 
• Currently slower than GraphLab, but 
No need for specialized systems 
• Easier ETL, and easier consumption of output 
Interactive graph data mining 
• Future work will bring performance closer to 
specialized engines. 
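A minimal sketch of graph-parallel computation with the GraphX API (the edge-list file name is illustrative): 

import org.apache.spark.graphx._ 

// each line of the file is "srcId dstId" 
val graph = GraphLoader.edgeListFile(sc, "data/edges.txt") 
val ranks = graph.pageRank(0.0001).vertices   // run PageRank to tolerance 
ranks.take(5).foreach(println) 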
BICube min-kyung Kim 45 daengky@naver.com
Bio Bigdata 
Spark 
GraphX 
[Figure: a social graph of users and posts; users with unknown (?) political 
leaning are inferred to be Liberal or Conservative from their neighbors' posts] 
Conditional Random Field 
Belief Propagation 
BICube min-kyung Kim 46 daengky@naver.com
Bio Bigdata 
Spark 
GraphX 
More Graph algorithms 
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient 
Descent, Tensor Factorization, SVD 
Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, 
Gibbs Sampling 
Semi-supervised ML: Graph SSL, CoEM 
Graph Analytics: PageRank, Single-Source Shortest Path, Triangle Counting, 
Graph Coloring, K-core Decomposition, Personalized PageRank 
Classification: Neural Networks, Lasso, … 
BICube min-kyung Kim 47 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
• High-throughput, fault-tolerant in-memory storage 
• Interface-compatible with HDFS 
• Further improves performance for Spark and Hadoop 
• Growing community with 10+ organizations contributing 
RAM throughput increasing exponentially, Disk slowly 
BICube min-kyung Kim 48 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Motive 
Different jobs sharing data 
[Diagram: two Spark tasks, each with its own in-memory block manager holding 
blocks 1 and 3, sharing data only through HDFS on disk (blocks 1-4)] 
→ Slow writes to disk 
BICube min-kyung Kim 49 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Motive 
Different frameworks sharing data 
[Diagram: a Spark task (in-memory block manager with blocks 1 and 3) and a 
Hadoop MR job on YARN, sharing data only through HDFS on disk (blocks 1-4)] 
→ Slow writes to disk 
BICube min-kyung Kim 50 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Solution 
Execution engine & storage engine in the same JVM process 
(no GC & duplication) 
[Diagram: two Spark tasks sharing blocks 1-4 through Tachyon's in-memory 
store, backed by HDFS on disk] 
Different job tasks avoid GC overhead 
BICube min-kyung Kim 51 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Solution 
Execution engine & storage engine in the same JVM process 
(no GC & duplication) 
[Diagram: a Spark task and a Hadoop MR job on YARN sharing blocks 1-4 through 
Tachyon's in-memory store, backed by HDFS on disk] 
Different frameworks avoid GC overhead 
BICube min-kyung Kim 52 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Architecture 
BICube min-kyung Kim 53 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Comparison with in-memory HDFS 
BICube min-kyung Kim 54 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Open Source Status 
• V0.4.0 (Feb 2014) 
• 20 Developers (7 from Berkeley, 13 from outside) 
– 11 Companies 
• Writes go synchronously to under filesystem 
(No lineage information in Developer Preview release) 
• MapReduce and Spark can run without any code change 
(ser/de becomes the new bottleneck) 
BICube min-kyung Kim 55 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Using: hdfs:// → tachyon:// 
• Spark 
val file = sc.textFile("tachyon://ip:port/path") 
• Shark (Spark SQL) 
CREATE TABLE orders_tachyon AS SELECT * FROM orders; 
• Hadoop MapReduce 
hadoop jar examples.jar wordcount tachyon://localhost/input tachyon://localhost/output 
BICube min-kyung Kim 56 daengky@naver.com
Bio Bigdata 
Spark 
Tachyon 
Thanks to Red Hat! 
BICube min-kyung Kim 57 daengky@naver.com
Bio Bigdata 
Spark 
BlinkDB 
• A data analysis (warehouse) system 
• Compatible with Apache Hive and Facebook's Presto 
(storage, serdes, UDFs, types, metadata) 
• Supports interactive SQL-like aggregate queries over massive data sets 
BICube min-kyung Kim 58 daengky@naver.com
Bio Bigdata 
Spark 
PySpark 
• The Spark Python API 
• All of PySpark's library dependencies, including Py4J, are 
bundled with PySpark and automatically imported. 
• Can run within an IPython notebook 
PySpark and IPython notebook setup manual: 
http://nbviewer.ipython.org/gist/fperez/6384491/00-Setup-IPython-PySpark.ipynb 
BICube min-kyung Kim 59 daengky@naver.com
Bio Bigdata 
Spark 
SparkR 
install.packages("rJava") 
library(devtools) 
install_github("amplab-extras/SparkR-pkg", subdir="pkg") 
library(SparkR) 
sc <- sparkR.init(master="local") 
./sparkR 
http://amplab-extras.github.io/SparkR-pkg/ 
BICube min-kyung Kim 60 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd Party Biobigdata 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 61 daengky@naver.com
Bio Bigdata 
3rd party 
Dumbo 
• A Python module that makes writing and running 
Hadoop Streaming programs very easy 
http://klbostee.github.io/dumbo/ 
https://github.com/klbostee/dumbo/wiki 
BICube min-kyung Kim 62 daengky@naver.com
Bio Bigdata 
3rd party 
Happy 
Hadoop + Python = Happy 
A framework for writing map-reduce programs for 
Hadoop using Jython 
https://code.google.com/p/happy/ 
BICube min-kyung Kim 63 daengky@naver.com
Bio Bigdata 
3rd party 
Happy 
import sys, happy, happy.log 

happy.log.setLevel("debug") 
log = happy.log.getLogger("wordcount") 

class WordCount(happy.HappyJob): 
    def __init__(self, inputpath, outputpath): 
        happy.HappyJob.__init__(self) 
        self.inputpaths = inputpath 
        self.outputpath = outputpath 
        self.inputformat = "text" 

    def map(self, records, task): 
        for _, value in records: 
            for word in value.split(): 
                task.collect(word, "1") 

    def reduce(self, key, values, task): 
        count = 0 
        for _ in values: count += 1 
        task.collect(key, str(count)) 
        log.debug(key + ":" + str(count)) 
        happy.results["words"] = happy.results.setdefault("words", 0) + count 
        happy.results["unique"] = happy.results.setdefault("unique", 0) + 1 

if __name__ == "__main__": 
    if len(sys.argv) < 3: 
        print "Usage: <inputpath> <outputpath>" 
        sys.exit(-1) 
    wc = WordCount(sys.argv[1], sys.argv[2]) 
    results = wc.run() 
    print str(sum(results["words"])) + " total words" 
    print str(sum(results["unique"])) + " unique words" 
BICube min-kyung Kim 64 daengky@naver.com
Bio Bigdata 
3rd party 
Octopy: Google-style MapReduce in Python 

### triangular.py 
# server 
source = dict(zip(range(100), range(100))) 

def final(key, value): 
    print key, value 

# client 
def mapfn(key, value): 
    for i in range(value + 1): 
        yield key, i 

def reducefn(key, value): 
    return sum(value) 
https://code.google.com/p/octopy/ 
BICube min-kyung Kim 65 daengky@naver.com
Bio Bigdata 
3rd party 
Disco 
open-source framework for distributed computing based 
on the MapReduce paradigm 
Disco Distributed Filesystem 
http://discoproject.org/ 
BICube min-kyung Kim 66 daengky@naver.com
Bio Bigdata 
3rd party 
Pydoop 
• Python API for Hadoop 
• Access to most MR components, including 
RecordReader,RecordWriter and Partitioner 
• Get configuration, set counters and report status via 
context objects 
http://pydoop.sourceforge.net/docs/index.html 
BICube min-kyung Kim 67 daengky@naver.com
Bio Bigdata 
3rd party 
Pydoop 
C/C++ APIs for both MR and HDFS are supported 
via Hadoop Pipes 
BICube min-kyung Kim 68 daengky@naver.com
Bio Bigdata 
3rd party 
Pydoop 
from pydoop.pipes import Mapper, Reducer, Factory, runTask 

class WordCountMapper(Mapper): 
    def map(self, context): 
        words = context.getInputValue().split() 
        for w in words: 
            context.emit(w, "1") 

class WordCountReducer(Reducer): 
    def reduce(self, context): 
        s = 0 
        while context.nextValue(): 
            s += int(context.getInputValue()) 
        context.emit(context.getInputKey(), str(s)) 

runTask(Factory(WordCountMapper, WordCountReducer)) 
BICube min-kyung Kim 69 daengky@naver.com
Bio Bigdata 
3rd party 
Pydoop 
– Performance: 
BICube min-kyung Kim 70 daengky@naver.com
Bio Bigdata 
3rd party 
Pydoop 
Status 
BICube min-kyung Kim 71 daengky@naver.com
Bio Bigdata 
3rd party 
Hadoop-BAM 
• A Java library for the manipulation of files using the 
Hadoop MapReduce framework with the Picard SAM JDK 
• BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF 
• Used by ADAM and SeqPig 
• Develop processing pipeline that enables efficient, 
scalable use of cluster/cloud 
• Provide data format that has efficient 
parallel/distributed access across platforms 
http://sourceforge.net/projects/hadoop-bam/ 
BICube min-kyung Kim 72 daengky@naver.com
Bio Bigdata 
3rd party 
ADAM 
A genomics processing engine and specialized file 
format built using Apache Avro, Apache Spark and 
Parquet. 
BICube min-kyung Kim 73 daengky@naver.com
Bio Bigdata 
3rd party 
ADAM 
Parquet 
• Apache Parquet is a columnar storage format available to any project in the 
Hadoop ecosystem, regardless of the choice of data processing framework, data 
model or programming language. 
• OSS Created by Twitter and Cloudera, based on Google Dremel, 
just entered Apache Incubator 
• Columnar File Format: 
• Limits I/O to only data that is needed 
• Compresses very well - ADAM files are 5-25% 
smaller than BAM files without loss of data 
• Fast scans - load only the columns you need, e.g. 
scan a read flag on a whole-genome, high-coverage 
file in less than a minute 
BICube min-kyung Kim 74 daengky@naver.com
Bio Bigdata 
3rd party 
ADAM 
Avocado 
• A distributed pipeline for calling variants, built 
on top of Apache Spark and the ADAM API 
BICube min-kyung Kim 75 daengky@naver.com
Bio Bigdata 
3rd party 
SeqPig 
A library of user-defined functions (UDFs) for Apache Pig 
for the distributed analysis of sequencing data 
Supports BAM/SAM, FastQ and Qseq input and output 
http://sourceforge.net/projects/seqpig/ 
BICube min-kyung Kim 76 daengky@naver.com
Bio Bigdata 
3rd party 
BioPig 
• https://github.com/JGI-Bioinformatics/biopig 
• http://www.slideshare.net/ZhongWang3/biopig-for-large- 
sequencing-data 
BICube min-kyung Kim 77 daengky@naver.com
Bio Bigdata 
3rd party 
Hadoop Seal 
A suite of distributed applications for aligning short 
DNA reads, and for manipulating and analyzing short 
read alignments. 
Bcl2Qseq 
Extract reads in qseq format from an Illumina run directory. 
Demux 
Separate sample data in the qseq file format produced by a multiplexed Illumina 
run. 
PairReadsQSeq 
Convert files in the Illumina qseq file format to prq format to be processed by the 
alignment program, Seqal. 
Seqal 
Distributed short read mapping and duplicate removal tool. 
ReadSort 
Distributed sorting of read mappings. 
RecabTable 
Distributed calculation of a covariates table to estimate empirical base qualities. 
BICube min-kyung Kim 78 daengky@naver.com
Bio Bigdata 
3rd party 
CloudBurst 
• Highly Sensitive Short Read Mapping with 
MapReduce 
• SNP discovery, genotyping, and personal genomics 
July 8, 2010 - CloudBurst 1.1.0 released with support for Hadoop 0.20.X 
BICube min-kyung Kim 79 daengky@naver.com
Bio Bigdata 
3rd party 
Crossbow 
• a scalable, portable, and automatic Cloud Computing 
tool for finding SNPs from short read data 
• Genotyping from short reads using cloud computing 
• combines Bowtie and SoapSNP 
• Crossbow scripts require Perl 5.6 or later. 
• Johns Hopkins Univ. 
BICube min-kyung Kim 80 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd party 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 81 daengky@naver.com
Bio Bigdata 
Machine Learning 
Apache Mahout 
• Scalable machine learning library. 
• Goodbye MapReduce ! 
• Mahout's Spark Shell Binding 
• Algorithms 
• User and Item based recommenders 
• Matrix factorization based recommenders 
• K-Means, Fuzzy K-Means clustering 
• Latent Dirichlet Allocation 
• Singular Value Decomposition 
• Logistic regression classifier 
• (Complementary) Naive Bayes classifier 
• Random forest classifier 
• High performance java collections 
BICube min-kyung Kim 82 daengky@naver.com
Bio Bigdata 
Machine Learning 
H2O 
• The world’s fastest in-memory platform for machine 
learning and predictive analytics on big data 
• Fine-Grain Distributed Processing 
• Support Deep Learning Neural Network using R 
• Support for Java, Scala, Python, and R. 
The solution's interface is driven by JSON APIs 
BICube min-kyung Kim 83 daengky@naver.com
Bio Bigdata 
Machine Learning 
H2O 
Fine-Grain Distributed Processing 
combines the responsiveness of in-memory processing with the 
ability to run fast serialization between nodes and clusters 
BICube min-kyung Kim 84 daengky@naver.com
Bio Bigdata 
Machine Learning 
Vowpal Wabbit 
• a fast out-of-core learning system. 
• Two ways to get a fast learning algorithm: 
start with a slow algorithm and speed it up, or 
build an intrinsically fast learning algorithm 
• by Microsoft Research and (previously) Yahoo! Research. 
https://github.com/JohnLangford/vowpal_wabbit/wiki 
BICube min-kyung Kim 85 daengky@naver.com
Bio Bigdata 
Machine Learning 
GraphLab 
• Started at Carnegie Mellon University in 2009 to develop a new 
parallel computation abstraction tailored to machine 
learning 
• Toolkits 
• Topic Modeling 
• graph_algorithms 
• Graph Analytics 
• Clustering 
• Collaborative Filtering 
• Graphical Models 
• Factor Graph Toolkit 
• Linear iterative solver 
• Computer Vision 
BICube min-kyung Kim 86 daengky@naver.com
Bio Bigdata 
Machine Learning 
GraphLab 
– Automatically distributes computation over data on HDFS 
BICube min-kyung Kim 87 daengky@naver.com
Bio Bigdata 
Machine Learning 
GraphLab 
GraphChi 
Disk-based large-scale graph computation 
BICube min-kyung Kim 88 daengky@naver.com
Bio Bigdata 
Machine Learning 
Apache Giraph 
An iterative graph processing system built for high 
scalability 
Giraph originated as the open-source counterpart to 
Google's Pregel; both are inspired by the Bulk Synchronous 
Parallel model of distributed computation introduced by 
Leslie Valiant. 
BICube min-kyung Kim 89 daengky@naver.com
Bio Bigdata 
Machine Learning 
Apache Hama 
A general BSP framework on top of Hadoop 
BICube min-kyung Kim 90 daengky@naver.com
Bio Bigdata 
Machine Learning 
Apache Hama 
http://www.slideshare.net/udanax/apache-hama-at-samsung- 
open-source-conference 
BICube min-kyung Kim 91 daengky@naver.com
Bio Bigdata 
Machine Learning 
Apache Flink(Stratosphere) 
A platform for efficient, distributed, general-purpose data 
processing, with programming APIs for different languages (Java, Scala) and 
paradigms (record-oriented, graph-oriented). 
Incubating.. 
BICube min-kyung Kim 92 daengky@naver.com
Bio Bigdata 
Machine Learning 
Tajo 
A big data warehouse system on Hadoop 
Compatible 
• ANSI/ISO SQL standard compliance 
• Hive MetaStore access support 
• JDBC driver support 
• Various file formats support, such as CSV, RCFile, 
RowFile, SequenceFile and Parquet 
BICube min-kyung Kim 93 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd party 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 94 daengky@naver.com
Bio Bigdata 
Streaming 
Kafka 
• A high-throughput distributed messaging system. 
• Producers 
publish data to the topics of their choice. 
• Consumers 
subscribe to topics and consume the published messages. 
Messaging traditionally has two models: queuing and 
publish-subscribe. A producer sketch follows below. 
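A minimal sketch of a producer using the Kafka 0.8-era Scala API (the broker address and topic name are illustrative): 

import java.util.Properties 
import kafka.producer.{KeyedMessage, Producer, ProducerConfig} 

val props = new Properties() 
props.put("metadata.broker.list", "localhost:9092") 
props.put("serializer.class", "kafka.serializer.StringEncoder") 
val producer = new Producer[String, String](new ProducerConfig(props)) 
// publish one message to the chosen topic 
producer.send(new KeyedMessage[String, String]("my-topic", "hello kafka")) 
producer.close() 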
BICube min-kyung Kim 95 daengky@naver.com
Bio Bigdata 
Streaming 
Kafka 
Ecosystem: 
• Storm - A stream-processing framework. 
• Samza - A YARN-based stream processing framework. 
• Storm Spout - Consume messages from Kafka and emit as 
Storm tuples 
• Kafka-Storm - Kafka 0.8, Storm 0.9, Avro integration 
• SparkStreaming - Kafka receiver supports Kafka 0.8 and 
above 
• Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used 
for all data at LinkedIn, and works great. 
• Kafka Hadoop Loader A different take on Hadoop loading 
functionality from what is included in the main distribution. 
• Flume - Contains Kafka Source (consumer) and Sink 
(producer) 
BICube min-kyung Kim 96 daengky@naver.com
Bio Bigdata 
Streaming 
Flume 
A distributed, reliable, and available service for 
efficiently collecting, aggregating, and moving large 
amounts of log data. 
Sources include: Avro, Thrift, Syslog, Netcat 
(a sample agent configuration follows below) 
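A minimal sketch of a Flume agent configuration (the agent and component names are illustrative), wiring a netcat source through a memory channel to a logger sink: 

# properties file for agent "a1" 
a1.sources = r1 
a1.channels = c1 
a1.sinks = k1 
a1.sources.r1.type = netcat 
a1.sources.r1.bind = localhost 
a1.sources.r1.port = 44444 
a1.sources.r1.channels = c1 
a1.sinks.k1.type = logger 
a1.sinks.k1.channel = c1 
a1.channels.c1.type = memory 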
BICube min-kyung Kim 97 daengky@naver.com
Bio Bigdata 
Streaming 
Storm 
a free and open source distributed realtime computation 
system 
Topologies 
To do realtime computation on Storm, you create what 
are called "topologies". A topology is a graph of 
computation. A minimal topology sketch follows below. 
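A minimal sketch of wiring a topology with the 0.9-era Storm API (TestWordSpout is a test spout bundled with Storm; UpperBolt is a hypothetical user-defined bolt): 

import backtype.storm.{Config, LocalCluster} 
import backtype.storm.testing.TestWordSpout 
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder} 
import backtype.storm.topology.base.BaseBasicBolt 
import backtype.storm.tuple.{Fields, Tuple, Values} 

// a trivial bolt that upper-cases each incoming word 
class UpperBolt extends BaseBasicBolt { 
  override def execute(input: Tuple, collector: BasicOutputCollector) { 
    collector.emit(new Values(input.getString(0).toUpperCase)) 
  } 
  override def declareOutputFields(declarer: OutputFieldsDeclarer) { 
    declarer.declare(new Fields("word")) 
  } 
} 

val builder = new TopologyBuilder() 
builder.setSpout("words", new TestWordSpout(), 1) 
// shuffle-group the bolt's input across the spout's output stream 
builder.setBolt("upper", new UpperBolt(), 2).shuffleGrouping("words") 
val cluster = new LocalCluster() 
cluster.submitTopology("demo", new Config(), builder.createTopology()) 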
BICube min-kyung Kim 98 daengky@naver.com
Bio Bigdata 
Streaming 
MQTT(MQ Telemetry Transport) 
A machine-to-machine (M2M)/"Internet of Things" connectivity 
protocol. 
• publish/subscribe messaging transport 
• used in sensors communicating to a broker via satellite link, over 
occasional dial-up connections with healthcare providers 
Three qualities of service for message delivery: 
• "At most once", where messages are delivered according to the best efforts of 
the underlying TCP/IP network. Message loss or duplication can occur. This level 
could be used, for example, with ambient sensor data where it does not matter if 
an individual reading is lost as the next one will be published soon after. 
• "At least once", where messages are assured to arrive but duplicates may occur. 
• "Exactly once", where message are assured to arrive exactly once. This level 
could be used, for example, with billing systems where duplicate or lost messages 
could lead to incorrect charges being applied. 
BICube min-kyung Kim 99 daengky@naver.com
Bio Bigdata 
Streaming 
Akka 
• Scalable real-time transaction processing 
• A unified runtime and programming model 
• Scale up (Concurrency) 
• Scale out (Remoting) 
• Fault tolerance 
• Actor 
• Simple and high-level abstractions for concurrency and 
parallelism. 
• Asynchronous, non-blocking and highly performant event-driven 
programming model. 
• Very lightweight event-driven processes (several million actors per 
GB of heap memory). 
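A minimal sketch of the actor model for the bullets above, in 2.3-era Akka Scala (the actor and system names are illustrative): 

import akka.actor.{Actor, ActorSystem, Props} 

class Greeter extends Actor { 
  def receive = { 
    case name: String => println(s"Hello, $name") 
  } 
} 

val system = ActorSystem("demo") 
val greeter = system.actorOf(Props[Greeter], "greeter") 
greeter ! "BICube"   // asynchronous, non-blocking message send 
system.shutdown() 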
BICube min-kyung Kim 100 daengky@naver.com
Bio Bigdata 
Streaming 
Akka 
Remote Deployment 
BICube min-kyung Kim 101 daengky@naver.com
Bio Bigdata 
Streaming 
Esper 
• A component for complex event processing (CEP) and event series analysis 
• Event Processing Language (EPL): 
highly scalable, memory-efficient, in-memory computing, SQL-standard, 
minimal latency, real-time streaming-capable 
• Offers a Domain Specific Language (DSL) for processing events; an EPL sketch follows below. 
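A minimal sketch of registering a Map-based event type and creating an EPL statement (the StockTick event type and its fields are assumptions for illustration): 

import java.util.Properties 
import com.espertech.esper.client.{Configuration, EPServiceProviderManager} 

// register a Map-based event type: field name -> class name 
val props = new Properties() 
props.put("symbol", "java.lang.String") 
props.put("price", "java.lang.Double") 
val config = new Configuration() 
config.addEventType("StockTick", props) 

val ep = EPServiceProviderManager.getDefaultProvider(config) 
// EPL: average price per symbol over a 30-second sliding window 
ep.getEPAdministrator.createEPL( 
  "select symbol, avg(price) from StockTick.win:time(30 sec) group by symbol") 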
BICube min-kyung Kim 102 daengky@naver.com
Bio Bigdata 
Outline 
Bigdata 
Hadoop 
Spark 
3rd party 
Machine Learning 
Streaming 
Future 
BICube min-kyung Kim 103 daengky@naver.com
Bio Bigdata 
Future 
Bigdata and Machine Learning 
Very Difficult ! 
BICube min-kyung Kim 104 daengky@naver.com
Bio Bigdata 
Future 
Solution 
• For End User 
• For ML Developer 
• Interactive Advanced Machine Learning 
Development Platform 
• DSL-style interactive ML modeling (like MATLAB) 
• Domain-specific libraries and tools 
BICube min-kyung Kim 105 daengky@naver.com
Bio Bigdata 
Future 
Conclusion 
• Bigdata is very important 
• ML needs Computation power (Graphical model) 
• Real time streaming and computation 
• High throughput Streaming 
• Larger data sizes will improve the ML 
• Biobigdata needs scalable ML 
• High-throughput sequence analysis 
• De novo genome assembly is the base step 
BICube min-kyung Kim 106 daengky@naver.com
Bio Bigdata 
Demo 
[BICube platform architecture diagram: 
Manipulation / Bioinfo layer: Preprocess, Classify, Cluster, Associate, Visual, Mashup 
Machine Learning Engine: Mahout, Spark (Stream ML, GraphX, Spark SQL) 
Processing on YARN (2.0): MapReduce, Hive, Pig 
Storage: Tachyon over HDFS (Hadoop Distributed File System) 
Ingestion: Sqoop, Flume, Storm, Kafka, Akka; RDB / Mashup / SNS Crawler / Log Crawler 
Cluster services: Mesos, Zookeeper, Authentication, CubeManager] 
BICube min-kyung Kim 107 daengky@naver.com
Bio Bigdata 
Demo 
Spark Interactive Shell 
BICube min-kyung Kim 108 daengky@naver.com
Bio Bigdata 
Demo 
View GPU Cluster3D using Spark Interactive Shell 
BICube min-kyung Kim 109 daengky@naver.com
Bio Bigdata 
The End 
Thanks ! 
Question ? 
and Demo... 
BICube min-kyung Kim 110 daengky@naver.com
Bio Bigdata 
import com.chimpler.example.kmeans._ 
import com.bicube.akka.visualization.VisualDeploy 

val args2 = Array( 
  "/user/root/spark/data/twitter.location" 
  ,"/user/root/spark/data/kmean.image3.png" 
) 
KMeansAppHadoop.run(args2, sc) 

val v = new VisualDeploy() 
v.deploy("{" + 
  "\"name\":\"ImageViewer\"" + 
  ", \"module\":\"image\"" + 
  ", \"path\":\"spark/data/kmean.image3.png\"" + 
  "}"); 
BICube min-kyung Kim 111 daengky@naver.com
Bio Bigdata 
Spark Shell 
Demo 
import com.bicube.akka.visualization._ 

val v = new VisualDeploy() 
v.deploy("{" + 
  "\"name\":\"PDBViewer\"" + 
  ", \"module\":\"pdb3d\"" + 
  ", \"path\":\"hdfs:/user/root/spark/data/1CRN.pdb.gz\"" + 
  "}"); 

val v2 = new VisualDeploy() 
v2.deploy("{" + 
  "\"name\":\"Cluster3D\"" + 
  ", \"module\":\"cluster3d\"" + 
  ", \"path\":\"hdfs:/user/root/cluster/kmeans-iris/output-iris\"" + 
  "}"); 
BICube min-kyung Kim 112 daengky@naver.com

  • 9. Bio Bigdata Hadoop What is Hadoop? • A open-source software for reliable, scalable, distributed computing. • A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • scale up from single servers to thousands of machines, each offering local computation and storage. BICube min-kyung Kim 9 daengky@naver.com
  • 10. Bio Bigdata Hadoop Included modules(Hadoop 2) • Hadoop Common: Utilities that support the other Hadoop modules. • Hadoop Distributed File System (HDFS™): A distributed file system • Hadoop YARN: A framework for job scheduling and cluster resource management. • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. BICube min-kyung Kim 10 daengky@naver.com
  • 11. Bio Bigdata Hadoop Hadoop Distributed Filesystem(HDFS) Overviewing • Run on Commodity Server • The Idea of “Write onece, Read many times” • Large Streaming Reads • High Throughtput is more important than low latency > more important than BICube min-kyung Kim 11 daengky@naver.com
  • 12. Bio Bigdata Hadoop HDFS Architecture • Stored as Block • Provide Reliability through Replication • NameNode stores Metadata and Manage Access • No data caching(Large dataset) BICube min-kyung Kim 12 daengky@naver.com
  • 13. Bio Bigdata Hadoop HDFS Federation – Block management BICube min-kyung Kim 13 daengky@naver.com
  • 14. Bio Bigdata Hadoop File Storage • NameNode Metadata: filename, Location of block, file attribute keep metadata in RAM Metadata Size is limited to the amount of the Name Node RAM size • DataNode Store file contents as block Same file is stored on Different DataNode Same Block is replicated across Several DataNode Periodically sends a report to the NameNode # exchange heartbeats with each nodes # Client always read from nearest node BICube min-kyung Kim 14 daengky@naver.com
Bio Bigdata
Hadoop
Internals of file read
• Client is guided by NameNode
• Client directly contacts DataNode
• Scale to large number of clients
BICube min-kyung Kim 15 daengky@naver.com
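To make the read path concrete, here is a minimal sketch of an HDFS client read using the standard org.apache.hadoop.fs API, written in Scala for consistency with the later Spark examples; the file path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()                   // picks up core-site.xml / hdfs-site.xml
val fs = FileSystem.get(conf)                    // the client asks the NameNode for metadata
val in = fs.open(new Path("/user/hadoop/file"))  // block locations come from the NameNode...
val buf = new Array[Byte](4096)
val n = in.read(buf)                             // ...but bytes stream directly from the nearest DataNode
in.close()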
Bio Bigdata
Hadoop
MapReduce
• Method for distributing a task
• Each node processes the data stored on that node
• Two phases: map, reduce
Map: (K1, V1) → (K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)
• Automatic parallelization
• Fault tolerance
• Usually written in Java
BICube min-kyung Kim 16 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce: Google
BICube min-kyung Kim 17 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce
shuffling: sorts map output and groups it into reducer input
BICube min-kyung Kim 18 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static void main(String[] args) {
  }
}

BICube min-kyung Kim 19 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

BICube min-kyung Kim 20 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

BICube min-kyung Kim 21 daengky@naver.com
Bio Bigdata
MapReduce

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

BICube min-kyung Kim 22 daengky@naver.com
Bio Bigdata
Hadoop
FS Shell Options
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz

hadoop fs -cat /user/hadoop/file
hadoop fs -put localfile /user/hadoop/hadoopfile

BICube min-kyung Kim 23 daengky@naver.com
Bio Bigdata
Hadoop
MapReduce (Hadoop 1)
• JobTracker
• TaskTracker
BICube min-kyung Kim 24 daengky@naver.com
Bio Bigdata
Hadoop
NextGen MapReduce (YARN: Hadoop 2)
• JobTracker → ResourceManager
• TaskTracker → NodeManager
BICube min-kyung Kim 25 daengky@naver.com
Bio Bigdata
Hadoop
Hadoop Streaming Utility
• Non-Java programming
• Uses stdin, stdout
• Text only
BICube min-kyung Kim 26 daengky@naver.com
Bio Bigdata
Hadoop
Python MapReduce using Hadoop Streaming
Use sys.stdin to read data and sys.stdout to print data

$ cat data | mapper.py | sort | reducer.py

cat mobydick.txt | ./mapper.py | sort | ./reducer.py > out/mobydick_cooccur_local.txt
head out/mobydick_cooccur_local.txt

hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS \
  -mapper mapper.py -reducer reducer.py \
  -input cooccur_example/mobydick.txt \
  -output cooccur_example/mobydick_out \
  -file mapper.py -file reducer.py

BICube min-kyung Kim 27 daengky@naver.com
Bio Bigdata
Hadoop
Python Streaming Mapper

#!/usr/bin/env python
# mapper.py
import sys, re

for line in sys.stdin:
    # remove non alpha-numeric characters
    line = re.sub("[^a-zA-Z0-9]", " ", line)
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

BICube min-kyung Kim 28 daengky@naver.com
Bio Bigdata
Hadoop
Python Streaming Reducer

#!/usr/bin/env python
# reducer.py
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the mapper output
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count wasn't a number, so we (silently) discard the line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

BICube min-kyung Kim 29 daengky@naver.com
Bio Bigdata
Outline
Bigdata
Hadoop
Spark
3rd party
Machine Learning
Streaming
Future
BICube min-kyung Kim 30 daengky@naver.com
Bio Bigdata
Spark
What is Spark
• A fast and general engine for large-scale data processing
• Spark 1.1.0 (latest release)
BICube min-kyung Kim 31 daengky@naver.com
Bio Bigdata
Spark
Process

val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val textFile = sc.textFile("README.md")

BICube min-kyung Kim 32 daengky@naver.com
Bio Bigdata
Spark
Components:
BICube min-kyung Kim 33 daengky@naver.com
Bio Bigdata
Spark
RDD: Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Standard RDD operations: map, countByValue, reduce, join
• Stateful operations: window, countByValueAndWindow, …
BICube min-kyung Kim 34 daengky@naver.com
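A minimal sketch of these properties in the Spark shell (values and names are illustrative):

val nums = sc.parallelize(1 to 1000000)                 // a collection spread across the cluster
val evens = nums.filter(_ % 2 == 0)                     // built through parallel transformations (lazy)
val total = evens.map(n => n.toLong * n).reduce(_ + _)  // an action triggers the actual computation
evens.cache()                                           // keep partitions in RAM; lost partitions are
                                                        // rebuilt from lineage, not from replicas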
Bio Bigdata
Spark
Language Support
Standalone Programs
• Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

BICube min-kyung Kim 35 daengky@naver.com
Bio Bigdata
Spark
Spark SQL
• Unifies access to structured data
• Integrated APIs in Python, Scala and Java

sqlCtx = new HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)

BICube min-kyung Kim 36 daengky@naver.com
Bio Bigdata
Spark
Spark SQL / Shark: Architecture
BICube min-kyung Kim 37 daengky@naver.com
Bio Bigdata
Spark
Spark SQL / Shark: Partial DAG Execution (PDE)
BICube min-kyung Kim 38 daengky@naver.com
Bio Bigdata
Spark
Spark SQL / Shark: Compared to Hive
BICube min-kyung Kim 39 daengky@naver.com
Bio Bigdata
Spark
Spark SQL / Shark: Compared to Impala
BICube min-kyung Kim 40 daengky@naver.com
Bio Bigdata
Spark
Spark Streaming
• Makes it easy to build scalable fault-tolerant streaming applications
• Powerful interactive applications
• Provides results in near-real-time
BICube min-kyung Kim 41 daengky@naver.com
Bio Bigdata
Spark
Spark Streaming

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

Batch to memory: mini batch
BICube min-kyung Kim 42 daengky@naver.com
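The snippet above stops after creating the DStream; the canonical continuation of this NetworkWordCount example from the Spark Streaming documentation counts words per one-second mini batch:

val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)
wordCounts.print()     // emit the counts of each mini batch
ssc.start()            // start receiving and processing
ssc.awaitTermination()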
Bio Bigdata
Spark
Spark Streaming
• Kafka: spark-streaming-kafka_2.10
• Flume: spark-streaming-flume_2.10
• Kinesis: spark-streaming-kinesis-asl_2.10 [Apache Software License]
• Twitter: spark-streaming-twitter_2.10
• ZeroMQ: spark-streaming-zeromq_2.10
• MQTT: spark-streaming-mqtt_2.10
BICube min-kyung Kim 43 daengky@naver.com
Bio Bigdata
Spark
MLlib: scalable machine learning library
Usable in Java, Scala and Python; 100x faster than MapReduce.
• Linear models (SVMs, logistic regression, linear regression)
• Decision trees
• Naive Bayes
• Alternating least squares (ALS)
• k-means
• Singular value decomposition (SVD)
• Principal component analysis (PCA)
• Feature extraction and transformation
• Stochastic gradient descent
• Limited-memory BFGS (L-BFGS)
BICube min-kyung Kim 44 daengky@naver.com
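As a taste of the API, a minimal k-means run with MLlib; the input path is a placeholder, and the file is assumed to hold whitespace-separated numeric vectors, one per line:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/kmeans_data.txt")
val points = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
val wssse = model.computeCost(points)     // within-set sum of squared errors
println("WSSSE = " + wssse)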
Bio Bigdata
Spark
GraphX: API for graphs and graph-parallel computation
• Graph-parallel primitives on Spark
• Currently slower than GraphLab, but no need for specialized systems
• Easier ETL, and easier consumption of output
• Interactive graph data mining
• Future work will bring performance closer to specialized engines
BICube min-kyung Kim 45 daengky@naver.com
Bio Bigdata
Spark
GraphX
[Figure: inferring unknown Liberal/Conservative labels across a graph of users and their posts, using a Conditional Random Field with Belief Propagation]
BICube min-kyung Kim 46 daengky@naver.com
Bio Bigdata
Spark
GraphX: More Graph algorithms
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization, SVD
Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
Semi-supervised ML: Graph SSL, CoEM
Graph Analytics: PageRank, Single Source Shortest Path, Triangle-Counting, Graph Coloring, K-core Decomposition, Personalized PageRank
Classification: Neural Networks, Lasso, …
BICube min-kyung Kim 47 daengky@naver.com
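A minimal GraphX session using the built-in PageRank, assuming an edge-list file of "srcId dstId" pairs (the path is a placeholder):

import org.apache.spark.graphx._

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")
val ranks = graph.pageRank(0.0001).vertices   // run PageRank to convergence
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)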
Bio Bigdata
Spark
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Further improves performance for Spark and Hadoop
• Growing community with 10+ organizations contributing
RAM throughput is increasing exponentially, disk only slowly
BICube min-kyung Kim 48 daengky@naver.com
Bio Bigdata
Spark
Tachyon Motive: different jobs share data
[Diagram: two Spark tasks, each with its own in-memory block manager, exchange blocks 1 and 3 through HDFS disk]
→ Slow writes to disk
BICube min-kyung Kim 49 daengky@naver.com
Bio Bigdata
Spark
Tachyon Motive: different frameworks share data
[Diagram: a Spark task and Hadoop MR on YARN exchange blocks through HDFS disk]
→ Slow writes to disk
BICube min-kyung Kim 50 daengky@naver.com
Bio Bigdata
Spark
Tachyon Solution: execution engine & storage engine in the same JVM process (no GC & duplication)
[Diagram: tasks from different jobs share blocks through Tachyon's in-memory layer backed by HDFS disk, avoiding GC overhead]
BICube min-kyung Kim 51 daengky@naver.com
Bio Bigdata
Spark
Tachyon Solution: execution engine & storage engine in the same JVM process (no GC & duplication)
[Diagram: a Spark task and Hadoop MR on YARN share blocks through Tachyon's in-memory layer, avoiding GC overhead]
BICube min-kyung Kim 52 daengky@naver.com
Bio Bigdata
Spark
Tachyon Architecture
BICube min-kyung Kim 53 daengky@naver.com
Bio Bigdata
Spark
Tachyon: Comparison with in-memory HDFS
BICube min-kyung Kim 54 daengky@naver.com
Bio Bigdata
Spark
Tachyon Open Source Status
• V0.4.0 (Feb 2014)
• 20 developers (7 from Berkeley, 13 from outside) – 11 companies
• Writes go synchronously to the under-filesystem (no lineage information in the Developer Preview release)
• MapReduce and Spark can run without any code change (ser/de becomes the new bottleneck)
BICube min-kyung Kim 55 daengky@naver.com
Bio Bigdata
Spark
Tachyon Using: hdfs → tachyon
• Spark
val file = sc.textFile("tachyon://ip:port/path")
• Shark (Spark SQL)
CREATE TABLE orders_tachyon AS SELECT * FROM orders;
• Hadoop MapReduce
hadoop jar examples.jar wordcount tachyon://localhost/input tachyon://localhost/output
BICube min-kyung Kim 56 daengky@naver.com
Bio Bigdata
Spark
Tachyon
Thanks to Red Hat!
BICube min-kyung Kim 57 daengky@naver.com
Bio Bigdata
Spark
BlinkDB
• A data analysis (warehouse) system
• Compatible with Apache Hive and Facebook's Presto (storage, serdes, UDFs, types, metadata)
• Supports interactive SQL-like aggregate queries over massive sets of data
BICube min-kyung Kim 58 daengky@naver.com
Bio Bigdata
Spark
PySpark: the Spark Python API
• All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.
• Can run with the IPython notebook
PySpark and IPython notebook setup manual:
http://nbviewer.ipython.org/gist/fperez/6384491/00-Setup-IPython-PySpark.ipynb
BICube min-kyung Kim 59 daengky@naver.com
Bio Bigdata
Spark
SparkR

install.packages("rJava")
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")

library(SparkR)
sc <- sparkR.init(master="local")

./sparkR

http://amplab-extras.github.io/SparkR-pkg/
BICube min-kyung Kim 60 daengky@naver.com
Bio Bigdata
Outline
Bigdata
Hadoop
Spark
3rd Party Biobigdata
Machine Learning
Streaming
Future
BICube min-kyung Kim 61 daengky@naver.com
Bio Bigdata
3rd party
Dumbo
• Python module that makes writing and running Hadoop Streaming programs very easy
http://klbostee.github.io/dumbo/
https://github.com/klbostee/dumbo/wiki
BICube min-kyung Kim 62 daengky@naver.com
Bio Bigdata
3rd party
Happy: Hadoop + Python = Happy
A framework for writing map-reduce programs for Hadoop using Jython
https://code.google.com/p/happy/
BICube min-kyung Kim 63 daengky@naver.com
Bio Bigdata
3rd party
Happy

import sys, happy, happy.log

happy.log.setLevel("debug")
log = happy.log.getLogger("wordcount")

class WordCount(happy.HappyJob):
    def __init__(self, inputpath, outputpath):
        happy.HappyJob.__init__(self)
        self.inputpaths = inputpath
        self.outputpath = outputpath
        self.inputformat = "text"

    def map(self, records, task):
        for _, value in records:
            for word in value.split():
                task.collect(word, "1")

    def reduce(self, key, values, task):
        count = 0
        for _ in values:
            count += 1
        task.collect(key, str(count))
        log.debug(key + ":" + str(count))
        happy.results["words"] = happy.results.setdefault("words", 0) + count
        happy.results["unique"] = happy.results.setdefault("unique", 0) + 1

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage: <inputpath> <outputpath>"
        sys.exit(-1)
    wc = WordCount(sys.argv[1], sys.argv[2])
    results = wc.run()
    print str(sum(results["words"])) + " total words"
    print str(sum(results["unique"])) + " unique words"

BICube min-kyung Kim 64 daengky@naver.com
Bio Bigdata
3rd party
Octopy: Google?

### triangular.py
# server
source = dict(zip(range(100), range(100)))

def final(key, value):
    print key, value

# client
def mapfn(key, value):
    for i in range(value + 1):
        yield key, i

def reducefn(key, value):
    return sum(value)

https://code.google.com/p/octopy/
BICube min-kyung Kim 65 daengky@naver.com
Bio Bigdata
3rd party
Disco
• Open-source framework for distributed computing based on the MapReduce paradigm
• Disco Distributed Filesystem
http://discoproject.org/
BICube min-kyung Kim 66 daengky@naver.com
Bio Bigdata
3rd party
Pydoop
• Python API for Hadoop
• Access to most MR components, including RecordReader, RecordWriter and Partitioner
• Get configuration, set counters and report status via context objects
http://pydoop.sourceforge.net/docs/index.html
BICube min-kyung Kim 67 daengky@naver.com
Bio Bigdata
3rd party
Pydoop
Using C/C++: APIs for both MR and HDFS are supported by Hadoop Pipes
BICube min-kyung Kim 68 daengky@naver.com
Bio Bigdata
3rd party
Pydoop

from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):
    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class WordCountReducer(Reducer):
    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))

BICube min-kyung Kim 69 daengky@naver.com
Bio Bigdata
3rd party
Pydoop – Performance:
BICube min-kyung Kim 70 daengky@naver.com
Bio Bigdata
3rd party
Pydoop Status
BICube min-kyung Kim 71 daengky@naver.com
Bio Bigdata
3rd party
Hadoop-BAM
• Java library for the manipulation of files using the Hadoop MapReduce framework with the Picard SAM JDK
• BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF
• Used by ADAM and SeqPig
• Develop processing pipelines that enable efficient, scalable use of cluster/cloud
• Provide a data format that has efficient parallel/distributed access across platforms
http://sourceforge.net/projects/hadoop-bam/
BICube min-kyung Kim 72 daengky@naver.com
Bio Bigdata
3rd party
ADAM
A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet.
BICube min-kyung Kim 73 daengky@naver.com
Bio Bigdata
3rd party
ADAM / Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
• OSS created by Twitter and Cloudera, based on Google Dremel; just entered the Apache Incubator
• Columnar file format:
• Limits I/O to only the data that is needed
• Compresses very well – ADAM files are 5–25% smaller than BAM files without loss of data
• Fast scans – load only the columns you need, e.g. scan a read flag on a whole-genome, high-coverage file in less than a minute
BICube min-kyung Kim 74 daengky@naver.com
Bio Bigdata
3rd party
ADAM / Avocado
• A distributed pipeline for calling variants, built on top of Apache Spark and the ADAM API
BICube min-kyung Kim 75 daengky@naver.com
Bio Bigdata
3rd party
SeqPig
A library of user-defined functions (UDFs) for Apache Pig for distributed analysis of sequencing data
Supports BAM/SAM, FastQ and Qseq input and output
http://sourceforge.net/projects/seqpig/
BICube min-kyung Kim 76 daengky@naver.com
Bio Bigdata
3rd party
BioPig
• https://github.com/JGI-Bioinformatics/biopig
• http://www.slideshare.net/ZhongWang3/biopig-for-large-sequencing-data
BICube min-kyung Kim 77 daengky@naver.com
Bio Bigdata
3rd party
Hadoop Seal
A suite of distributed applications for aligning short DNA reads, and manipulating and analyzing short read alignments.
• Bcl2Qseq: extract reads in qseq format from an Illumina run directory
• Demux: separate sample data in the qseq file format produced by a multiplexed Illumina run
• PairReadsQSeq: convert files in the Illumina qseq file format to prq format to be processed by the alignment program, Seqal
• Seqal: distributed short read mapping and duplicate removal tool
• ReadSort: distributed sorting of read mappings
• RecabTable: distributed calculation of the covariates table to estimate empirical base qualities
BICube min-kyung Kim 78 daengky@naver.com
Bio Bigdata
3rd party
CloudBurst
• Highly sensitive short read mapping with MapReduce
• SNP discovery, genotyping, and personal genomics
July 8, 2010 – CloudBurst 1.1.0 released with support for Hadoop 0.20.X
BICube min-kyung Kim 79 daengky@naver.com
Bio Bigdata
3rd party
Crossbow
• A scalable, portable, and automatic cloud computing tool for finding SNPs from short read data
• Genotyping from short reads using cloud computing
• Combines Bowtie and SoapSNP
• Crossbow scripts require Perl 5.6 or later
• Johns Hopkins Univ.
BICube min-kyung Kim 80 daengky@naver.com
Bio Bigdata
Outline
Bigdata
Hadoop
Spark
3rd party
Machine Learning
Streaming
Future
BICube min-kyung Kim 81 daengky@naver.com
Bio Bigdata
Machine Learning
Apache Mahout
• Scalable machine learning library
• Goodbye MapReduce!
• Mahout's Spark Shell Binding
• Algorithms:
• User- and item-based recommenders
• Matrix factorization based recommenders
• K-Means, Fuzzy K-Means clustering
• Latent Dirichlet Allocation
• Singular Value Decomposition
• Logistic regression classifier
• (Complementary) Naive Bayes classifier
• Random forest classifier
• High performance Java collections
BICube min-kyung Kim 82 daengky@naver.com
Bio Bigdata
Machine Learning
H2O
• The world's fastest in-memory platform for machine learning and predictive analytics on big data
• Fine-grain distributed processing
• Supports deep learning neural networks using R
• Support for Java, Scala, Python, R. The solution's interface is driven by JSON APIs
BICube min-kyung Kim 83 daengky@naver.com
Bio Bigdata
Machine Learning
H2O Fine-Grain Distributed Processing
Combines the responsiveness of in-memory processing with the ability to run fast serialization between nodes and clusters
BICube min-kyung Kim 84 daengky@naver.com
Bio Bigdata
Machine Learning
Vowpal Wabbit
• A fast out-of-core learning system
• Two ways to have a fast learning algorithm:
start with a slow algorithm and speed it up, or
build an intrinsically fast learning algorithm
• By Microsoft Research and (previously) Yahoo! Research
https://github.com/JohnLangford/vowpal_wabbit/wiki
BICube min-kyung Kim 85 daengky@naver.com
Bio Bigdata
Machine Learning
GraphLab
• Started at Carnegie Mellon University in 2009 to develop a new parallel computation abstraction tailored to machine learning
• Toolkits:
• Topic Modeling
• Graph Algorithms
• Graph Analytics
• Clustering
• Collaborative Filtering
• Graphical Models
• Factor Graph Toolkit
• Linear Iterative Solver
• Computer Vision
BICube min-kyung Kim 86 daengky@naver.com
Bio Bigdata
Machine Learning
GraphLab – automatically distributes computation over data on HDFS
BICube min-kyung Kim 87 daengky@naver.com
Bio Bigdata
Machine Learning
GraphLab / GraphChi
Disk-based large-scale graph computation
BICube min-kyung Kim 88 daengky@naver.com
Bio Bigdata
Machine Learning
Apache Giraph
An iterative graph processing system built for high scalability.
Giraph originated as the open-source counterpart to Pregel; both are inspired by the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant.
BICube min-kyung Kim 89 daengky@naver.com
Bio Bigdata
Machine Learning
Apache Hama
A general BSP framework on top of Hadoop
BICube min-kyung Kim 90 daengky@naver.com
Bio Bigdata
Machine Learning
Apache Hama
http://www.slideshare.net/udanax/apache-hama-at-samsung-open-source-conference
BICube min-kyung Kim 91 daengky@naver.com
Bio Bigdata
Machine Learning
Apache Flink (Stratosphere)
A platform for efficient, distributed, general-purpose data processing.
Programming APIs for different languages (Java, Scala) and paradigms (record-oriented, graph-oriented).
Incubating…
BICube min-kyung Kim 92 daengky@naver.com
Bio Bigdata
Machine Learning
Tajo
A big data warehouse system on Hadoop
Compatible:
• ANSI/ISO SQL standard compliance
• Hive MetaStore access support
• JDBC driver support
• Various file formats support, such as CSV, RCFile, RowFile, SequenceFile and Parquet
BICube min-kyung Kim 93 daengky@naver.com
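Because Tajo ships a JDBC driver, querying it from the JVM looks like any other JDBC source. A sketch, assuming the driver class org.apache.tajo.jdbc.TajoDriver is on the classpath and the master listens on its default client port; host, database and table names are placeholders:

import java.sql.DriverManager

Class.forName("org.apache.tajo.jdbc.TajoDriver")   // assumed driver class
val conn = DriverManager.getConnection("jdbc:tajo://tajo-master:26002/default")
val rs = conn.createStatement().executeQuery("SELECT count(*) FROM reads")
while (rs.next()) println(rs.getLong(1))
conn.close()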
Bio Bigdata
Outline
Bigdata
Hadoop
Spark
3rd party
Machine Learning
Streaming
Future
BICube min-kyung Kim 94 daengky@naver.com
Bio Bigdata
Streaming
Kafka
• A high-throughput distributed messaging system
• Producers publish data to the topics of their choice
• Consumers subscribe to and read from topics; messaging traditionally has two models: queuing and publish-subscribe
BICube min-kyung Kim 95 daengky@naver.com
Bio Bigdata
Streaming
Kafka Ecosystem:
• Storm – a stream-processing framework
• Samza – a YARN-based stream processing framework
• Storm Spout – consume messages from Kafka and emit as Storm tuples
• Kafka-Storm – Kafka 0.8, Storm 0.9, Avro integration
• Spark Streaming – the Kafka receiver supports Kafka 0.8 and above (see the sketch below)
• Camus – LinkedIn's Kafka => HDFS pipeline. This one is used for all data at LinkedIn, and works great.
• Kafka Hadoop Loader – a different take on Hadoop loading functionality from what is included in the main distribution
• Flume – contains Kafka Source (consumer) and Sink (producer)
BICube min-kyung Kim 96 daengky@naver.com
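Consuming a Kafka topic from Spark Streaming, per the receiver mentioned in the list, takes a few lines with spark-streaming-kafka_2.10; the ZooKeeper quorum, consumer group and topic name are placeholders:

import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
// topic name -> number of receiver threads
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "demo-group", Map("reads" -> 1))
stream.map(_._2).count().print()   // messages arrive as (key, value) pairs
ssc.start()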
Bio Bigdata
Streaming
Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Sources: Avro, Thrift, Syslog, Netcat
BICube min-kyung Kim 97 daengky@naver.com
Bio Bigdata
Streaming
Storm
A free and open source distributed realtime computation system.
Topologies: to do realtime computation on Storm, you create what are called "topologies". A topology is a graph of computation.
BICube min-kyung Kim 98 daengky@naver.com
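A minimal topology sketch in Scala against Storm 0.9's Java API, using the TestWordSpout from Storm's testing utilities; the printing bolt below is a trivial class written for this example:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.testing.TestWordSpout
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.Tuple

// a trivial bolt that prints each word it receives and emits nothing
class PrintBolt extends BaseBasicBolt {
  def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    println(tuple.getString(0))
  def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {}
}

val builder = new TopologyBuilder()
builder.setSpout("words", new TestWordSpout(), 1)
builder.setBolt("printer", new PrintBolt(), 2).shuffleGrouping("words")

// wire the graph of computation and run it in-process
new LocalCluster().submitTopology("demo", new Config(), builder.createTopology())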
Bio Bigdata
Streaming
MQTT (MQ Telemetry Transport)
A machine-to-machine (M2M) / "Internet of Things" connectivity protocol.
• Publish/subscribe messaging transport
• Used in sensors communicating to a broker via satellite link, over occasional dial-up connections with healthcare providers
Three qualities of service for message delivery:
• "At most once", where messages are delivered according to the best efforts of the underlying TCP/IP network. Message loss or duplication can occur. This level could be used, for example, with ambient sensor data where it does not matter if an individual reading is lost, as the next one will be published soon after.
• "At least once", where messages are assured to arrive but duplicates may occur.
• "Exactly once", where messages are assured to arrive exactly once. This level could be used, for example, with billing systems where duplicate or lost messages could lead to incorrect charges being applied.
BICube min-kyung Kim 99 daengky@naver.com
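For illustration, publishing a sensor reading with the Eclipse Paho Java client; Paho is not named on the slide and is an assumption here, and the broker URL and topic are placeholders:

import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}

val client = new MqttClient("tcp://broker.example.org:1883", MqttClient.generateClientId())
client.connect()
val msg = new MqttMessage("23.5".getBytes)
msg.setQos(1)                                // 0 = at most once, 1 = at least once, 2 = exactly once
client.publish("sensors/temperature", msg)
client.disconnect()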
Bio Bigdata
Streaming
Akka
• Scalable real-time transaction processing
• A unified runtime and programming model for:
• Scale up (concurrency)
• Scale out (remoting)
• Fault tolerance
Actor
• Simple and high-level abstractions for concurrency and parallelism
• Asynchronous, non-blocking and highly performant event-driven programming model
• Very lightweight event-driven processes (several million actors per GB of heap memory)
BICube min-kyung Kim 100 daengky@naver.com
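A minimal actor sketch with Akka's Scala API (names are illustrative):

import akka.actor.{Actor, ActorSystem, Props}

// a lightweight event-driven process: its state is touched by one
// message at a time, so no locks are needed
class Counter extends Actor {
  var count = 0
  def receive = {
    case "inc" => count += 1
    case "get" => sender() ! count
  }
}

val system = ActorSystem("demo")
val counter = system.actorOf(Props[Counter], "counter")
counter ! "inc"   // asynchronous, non-blocking send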
Bio Bigdata
Streaming
Akka Remote Deployment
BICube min-kyung Kim 101 daengky@naver.com
Bio Bigdata
Streaming
Esper
• A component for complex event processing (CEP) and event series analysis
• Event Processing Language (EPL): highly scalable, memory-efficient, in-memory computing, SQL-standard, minimal latency, real-time streaming-capable
• Offers a Domain Specific Language (DSL) for processing events
BICube min-kyung Kim 102 daengky@naver.com
Bio Bigdata
Outline
Bigdata
Hadoop
Spark
3rd party
Machine Learning
Streaming
Future
BICube min-kyung Kim 103 daengky@naver.com
Bio Bigdata
Future
Bigdata and Machine Learning
Very Difficult!
BICube min-kyung Kim 104 daengky@naver.com
Bio Bigdata
Future
Solution
• For end users
• For ML developers
• Interactive advanced machine learning development platform
• DSL-style interactive ML modeling (like MATLAB)
• Domain-specific libraries and tools
BICube min-kyung Kim 105 daengky@naver.com
Bio Bigdata
Future
Conclusion
• Bigdata is very important
• ML needs computation power (graphical models)
• Real-time streaming and computation
• High-throughput streaming
• Data size will improve the ML
• Biobigdata needs scalable ML
• High-throughput sequence analysis
• De novo genome assembly is the base step
BICube min-kyung Kim 106 daengky@naver.com
Bio Bigdata
Demo
[Architecture diagram: BICube platform — Mashup UI (Manipul, Bioinfo, Preprocess, Classify, Cluster, Associate, Visual); Machine Learning Engine (Mahout, Spark, Stream ML, GraphX, Spark SQL); MapReduce, Hive, Pig on YARN (2.0); Tachyon over HDFS (Hadoop Distributed File System); Sqoop, Flume, Mesos, Zookeeper, Storm, Kafka, Akka; RDB Mashup, SNS Crawler, Log Crawler, Authentication, CubeManager]
BICube min-kyung Kim 107 daengky@naver.com
Bio Bigdata
Demo
Spark Interactive Shell
BICube min-kyung Kim 108 daengky@naver.com
Bio Bigdata
Demo
View GPU Cluster3D using Spark Interactive Shell
BICube min-kyung Kim 109 daengky@naver.com
Bio Bigdata
The End
Thanks! Question? and Demo…
BICube min-kyung Kim 110 daengky@naver.com
Bio Bigdata

import com.chimpler.example.kmeans._
import com.bicube.akka.visualization.VisualDeploy

val args2 = Array(
  "/user/root/spark/data/twitter.location",
  "/user/root/spark/data/kmean.image3.png"
)
KMeansAppHadoop.run(args2, sc)

import com.bicube.akka.visualization.VisualDeploy
val v = new VisualDeploy()
v.deploy("{" +
  "\"name\":\"ImageViewer\"" +
  ", \"module\":\"image\"" +
  ", \"path\":\"spark/data/kmean.image3.png\"" +
  "}");

BICube min-kyung Kim 111 daengky@naver.com
Bio Bigdata
Spark Shell Demo

import com.bicube.akka.visualization._
val v = new VisualDeploy()
v.deploy("{" +
  "\"name\":\"PDBViewer\"" +
  ", \"module\":\"pdb3d\"" +
  ", \"path\":\"hdfs:/user/root/spark/data/1CRN.pdb.gz\"" +
  "}");

import com.bicube.akka.visualization._
val v = new VisualDeploy()
v.deploy("{" +
  "\"name\":\"Cluster3D\"" +
  ", \"module\":\"cluster3d\"" +
  ", \"path\":\"hdfs:/user/root/cluster/kmeans-iris/output-iris\"" +
  "}");

BICube min-kyung Kim 112 daengky@naver.com