4. 4
Headlines
Data-driven business
Data democratization
Data scientists
5. 5
The White House
+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system
www.whitehouse.gov
11. 11
What is big data
+ Big data:
+ “Data you can’t process by traditional tools”
+ “A phenomenon defined by the rapid acceleration in the
expanding volume of high velocity, complex and diverse
types of data.”
+ “Refers to a collection of tools, techniques and technologies
for working with data productively, at any scale.”
12. 12
What is Big data
+ 3V
+ Volume: petabytes (1000TB) to exabytes (1000PB)
+ Variety: structured, semi-structured, unstructured
+ Velocity: Tb/s data streams
+ Requires distributed processing
+ Big data = storage + processing
+ Big data = Hadoop (but not only)
19. 19
GFS/HDFS
+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Files can’t be modified after they are written
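
A minimal sketch of the block and replica idea above, not actual HDFS code. The 64 MB block size comes from the slide; the replication factor of 3, the node names, and the round-robin placement are assumptions made only for illustration.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size quoted on the slide
REPLICATION = 3                 # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block of a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy name-node table: block index -> list of distinct data nodes.

    Round-robin placement is an illustrative assumption, not the real policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    table = {}
    for block in range(num_blocks):
        table[block] = [data_nodes[(block + r) % len(data_nodes)]
                        for r in range(replication)]
    return table


nodes = ["dn1", "dn2", "dn3", "dn4"]           # hypothetical data node names
blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = place_replicas(len(blocks), nodes)
for block, replicas in placement.items():
    print(block, blocks[block], replicas)
```

The 200 MB file becomes three full blocks plus one 8 MB block, and losing any single node still leaves two copies of every block.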
20. 20
MapReduce
+ Scalable:
+ no file IO
+ no networking
+ no synchronization
+ Master-slave architecture
+ Master: divide, schedule, monitor work
+ Slave: actual processing
+ MapReduce programming model:
+ functional programming
+ like UNIX pipeline
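
The programming model can be sketched in a single process as a word count. This is an illustration of the map, shuffle, and reduce steps only, not the Hadoop API; the function names and input lines are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


lines = ["big data is big", "data is data"]   # stand-in for input splits
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The user writes only `map_phase` and `reduce_phase`; in a real cluster the framework handles the grouping, file I/O, networking, and synchronization between them.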
21. 21
Data movement
+ store and process data on the same nodes
+ bring code to data, data “locality”
www.cloudera.com
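
A rough sketch of locality-aware scheduling: each task is sent to a node that already stores its block, and only falls back to a remote node when no local slot is free. The `schedule` function, node names, and slot counts are hypothetical and much simpler than a real scheduler.

```python
def schedule(block_locations, free_slots):
    """Assign each block's task to a node, preferring nodes holding the block.

    block_locations: block id -> nodes storing a replica of that block
    free_slots:      node -> number of tasks the node can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)            # work on a copy
    plan = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No free slot next to the data: fall back to any free node,
            # which means the block must travel over the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots")
            node, is_local = remote[0], False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


locations = {"b0": ["dn1", "dn2"], "b1": ["dn1", "dn2"], "b2": ["dn1"]}
plan = schedule(locations, {"dn1": 1, "dn2": 1, "dn3": 1})
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```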
24. 24
NoSQL Data Base Revolution
+ Needed:
+ fast read/write time
+ high concurrency
+ easy horizontal scalability
+ Flat data structure
+ Sacrificed:
+ DB Schema
+ SQL
+ Transactions
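
The trade-off above can be illustrated with a toy sharded key-value store: flat records, no schema, no SQL, no transactions, and keys spread over shards by hash. The class and its method names are hypothetical and do not correspond to any particular NoSQL product.

```python
import zlib


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, num_shards):
        # Each dict stands in for one server.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKeyValueStore(num_shards=4)
# Records need not share the same fields: there is no schema to enforce.
store.put("user:42", {"name": "Ann", "city": "Moscow"})
store.put("user:43", {"name": "Bob"})
print(store.get("user:42"))
```

Reads and writes touch exactly one shard, which is what makes adding servers cheap; the price is that queries across keys have to be built by the application.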
27. 27
Hadoop tools
+ Pig
+ high level scripting language (PigLatin)
+ converts to MapReduce jobs
+ Hive
+ SQL-like queries on data in HDFS
+ converts to MapReduce jobs
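
To give a feel for what "converts to MapReduce jobs" means, here is a hand-written translation of one SQL-like aggregate into a map and a reduce function. The table `visits`, its columns, and the query are hypothetical, and this is not the plan Hive or Pig would actually generate.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical query: SELECT page, COUNT(*) FROM visits GROUP BY page
visits = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/docs"},
    {"user": "u1", "page": "/docs"},
    {"user": "u3", "page": "/home"},
    {"user": "u2", "page": "/docs"},
]


def mapper(row):
    """The GROUP BY column becomes the key; COUNT(*) contributes 1 per row."""
    return row["page"], 1


def reducer(key, values):
    """COUNT(*) becomes a sum over the values grouped under one key."""
    return key, sum(values)


# Sorting by key stands in for the shuffle step between map and reduce.
mapped = sorted(map(mapper, visits), key=itemgetter(0))
result = [reducer(key, [v for _, v in group])
          for key, group in groupby(mapped, key=itemgetter(0))]
print(result)
```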
34. 34
Cloudera
+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141M
+ Employees: 230
+ Products:
+ CDH 4 (Cloudera's Distribution Including Apache Hadoop)
+ Impala
+ Consulting and training
www.cloudera.com
35. 35
MapR
+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-changing
MapReduce-related technologies
+ Products:
+ M3,M5,M7
+ NFS access, no single point of failure
+ NOT open source!
www.mapr.com
40. 40
Datameer
+ Founded 2009, funding $17.8M
+ Big data:
+ Data integration
+ Data Analytics
+ Data Visualization
www.datameer.com
41. 41
Datasift
+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data
www.datasift.com
42. 42
Infochimps
+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time
www.infochimps.com
44. 44
Big data startups 2012
+ Platfora, in-memory BI on Hadoop
+ Sumo Logic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
45. 45
Big data startups 2013!
+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ The Climate Corporation, predictive analytics
47. 47
Big data Processing
                    Batch processing    Interactive                Stream
Query time          minutes to hours    milliseconds to seconds    continuous
Data volume         TB to PB            GB to PB                   continuous
Programming model   MapReduce           Queries                    DAG
Users               Developers          Analysts                   Developers
Open Source         Hadoop MapReduce    Drill, Impala              Storm, Kafka
48. 48
New technologies
+ Real-time querying
+ Drill (based on Google Dremel)
+ Impala (Cloudera)
+ Data stream processing
+ Storm (Twitter), real time analytics
+ Kafka (LinkedIn), messaging system
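
Stream processing differs from batch in that results are updated per event instead of per job. A minimal sketch, assuming timestamped events arriving in order: a sliding-window counter in plain Python. It does not use the Storm or Kafka APIs, and the function name and sample timestamps are invented for the example.

```python
from collections import deque


def sliding_window_counts(events, window_seconds):
    """Yield (timestamp, events seen in the last window_seconds) per event.

    events: iterable of timestamps in seconds, in non-decreasing order.
    """
    window = deque()
    for ts in events:
        window.append(ts)
        # Drop events that fell out of the window ending at ts.
        while window[0] <= ts - window_seconds:
            window.popleft()
        yield ts, len(window)


stream = [0, 1, 2, 10, 11, 30]      # hypothetical event timestamps
counts = list(sliding_window_counts(stream, window_seconds=10))
for ts, n in counts:
    print(ts, n)
```

Because the function is a generator, it keeps only the current window in memory and can consume an unbounded stream.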
49. 49
Machine learning
+ Predictive analytics
+ Pattern discovery
+ Data mining
+ Tools:
+ Mahout
+ R
51. 51
Observations
+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
52. 52
Data scientist
+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL
“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”
54. 54
Contacts
+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
Higher School of Economics, NRU-HSE
+ lzhukov@hse.ru
+ www.leonidzhukov.ru