What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that investing in big data costs far less than not having it, this presentation, given at a tech media industry event, explores the nuances of Big Data and Data Science and how they combine into Big Data Science. It highlights the benefits of investing in them and outlines a path for their evolution within most organisations.
1. TITLE and title
BIG DATA SCIENCE
Chandan Rajah – CEO, Parallel AI
“The price of light is far less than the cost of darkness”
2. TITLE and title
BENEFITS OF BIG DATA
COST SPEED
AGILITY CAPABILITY
3. TITLE and title
BIG DATA JOURNEY
WHERE
WHAT WHY
HOW
4. TITLE and title
What is Big Data ?
Big Data ≠ Data Volume
Big Data = Crude Oil
Think of data like ‘Crude Oil’
Big Data is about extracting ‘crude oil’; transporting it in ‘pipelines’; storing it in ‘mega tanks’
Source: Data Science London
5. TITLE and title
What is Data Science ?
Data Science ≠ Statistical Analysis
Data Science = Oil Refinery
Data science is about ‘treating’ data; applying ‘science’ to the data;
Refine the data ‘results’; and combine to form ‘insight’
Source: Data Science London
6. TITLE and title
What is the Big Data Science Toolkit ?
• Scala, Java, Python, R… (bonus: Clojure, Haskell, Erlang)
• Hadoop, HDFS, MapReduce… (bonus: Spark, Storm, Tez)
• Scalding, HBase, Hive… (bonus: Shark, Titan, Giraph)
• Flume, Sqoop, ETL, Webscrapers… (bonus: Hume)
• SQL, RDBMS, DW, OLAP… (bonus: SOLR, ElasticSearch)
• Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, Pandas)
• D3.js, Kibana, ggplot2, Flare… (bonus: Shiny, Flare, Datameer)
• NoSQL, MongoDB, Cassandra, CouchDB
• And sometimes… MS Excel
Source: Data Science London
7. TITLE and title
Knowns, Unknowns & DIKUW FTW!
known knowns
we know we know
known unknowns
we know we don’t know
unknown unknowns
we don’t know we don’t know
D I K U W
DATA (raw): numbers, letters, symbols, signals
INFORMATION (what): description, context, relationship, reports
KNOWLEDGE (how to): experience, tested, instruction, programs
UNDERSTANDING (why): cause & effect, proven, models
WISDOM (when): prediction, what’s best
PAST FUTURE
Data Engineer Data Analyst Data Miner Data Scientist
known knowns
known unknowns unknown unknowns
Source: Data Science London
8. TITLE
Business Intelligence to Data Discovery ?
data you know
data you don’t know
questions you’re asking
questions you’re not asking
Data Analyst
Data Scientist
Business
Intelligence
Data Discovery
DATA MODELLING
Y = F(X, random noise, parameters)
ALGORITHMIC MODELLING
Y ← [ BLACK BOX ] ← X
Source: Applied Data Labs & Leo Breiman
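The contrast between the two modelling cultures can be sketched in a few lines. This is only an illustration on made-up synthetic data, not the presenter's method: `fit_linear` stands in for data modelling (it assumes a formula and estimates its parameters), while `knn_predict` stands in for algorithmic modelling (it assumes no formula and predicts from nearby observations). Both function names, the nearest-neighbour choice and the data are assumptions for this example.

```python
import random


def fit_linear(xs, ys):
    """Data modelling: assume y = a + b*x + noise, estimate a and b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def knn_predict(xs, ys, x_new, k=3):
    """Algorithmic modelling: no assumed formula; average the k nearest observed outputs."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Synthetic data (hypothetical): the true relationship is y = 2 + 3x plus small noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.1) for x in xs]

a, b = fit_linear(xs, ys)
linear_guess = a + b * 5.05             # prediction from the fitted formula
knn_guess = knn_predict(xs, ys, 5.05)   # prediction from the "black box"

print(f"data model:        y = {a:.2f} + {b:.2f}x, prediction {linear_guess:.2f}")
print(f"algorithmic model: prediction {knn_guess:.2f}")
```

The fitted formula yields interpretable parameters; the nearest-neighbour predictor yields only predictions, which is the trade-off the slide points at.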
9. TITLE and title
BIG DATA JOURNEY
WHERE
WHAT WHY
HOW
10. TITLE
Why is Big Data needed ?
VOLUME VELOCITY VARIETY
Exponential growth; 2x in 2 yrs
PB (1000 TB) is now common
Event streams; never at rest
640k GB per internet minute
100s of data sources
85% not in a table
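A quick back-of-the-envelope check of what the volume and velocity figures imply, taking the slide's numbers at face value (doubling every 2 years, 640k GB per minute, 1 PB = 1000 TB). The function name and the 1 PB starting volume are assumptions chosen for illustration.

```python
def projected_volume(initial_tb, years, doubling_period_years=2.0):
    """Volume after `years` if data doubles every `doubling_period_years`."""
    return initial_tb * 2 ** (years / doubling_period_years)


# Doubling every 2 years is equivalent to growing by sqrt(2) each year.
annual_growth = 2 ** (1 / 2.0) - 1

# Hypothetical starting point: 1 PB (1000 TB) today, projected 10 years ahead.
ten_year_tb = projected_volume(1000, 10)

# 640k GB per minute, expressed per day in PB (1 PB = 1,000,000 GB).
gb_per_day = 640_000 * 60 * 24
pb_per_day = gb_per_day / 1_000_000

print(f"annual growth: {annual_growth:.1%}")
print(f"1 PB today becomes {ten_year_tb / 1000:.0f} PB in 10 years")
print(f"640k GB per minute is {pb_per_day:.1f} PB per day")
```

In other words, "2x in 2 yrs" means roughly 41% growth per year and a 32-fold increase over a decade.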
11. TITLE and title
BIG DATA JOURNEY
WHERE
WHAT WHY
HOW
18. TITLE and title
BIG DATA JOURNEY
WHERE
WHAT WHY
HOW
19. TITLE
How will Big Data Evolve?
EXTERNAL ALIGNMENT INTERNAL COHERENCE
Align with Existing BI; Maximise Value
Exploit Capability; Respond Rapidly
Focus; Innovate; Stay Ahead
Repeat; Stabilize; Governance
20. TITLE and title
RECAP OF BENEFITS
COST SPEED
AGILITY CAPABILITY
COST – 20x less per TB vs Teradata, Netezza, Oracle
– 75% less average marginal cost per capacity
SPEED – 10x faster than Teradata, Netezza
AGILITY – 115% lower average cost per data source vs Oracle
SCIENCE – Machine learning, prediction
WHAT - What is Big Data Science?
WHY - Why is it needed?
WHERE - Where is it being used?
HOW - How will it evolve?
TIME VALUE
- Yesterday’s data is less valuable than today’s data
- Historical data is more valuable than current data alone
POWER - Getting from unknown unknowns to known unknowns or known knowns is powerful
LEAD TO ROME - Exploring with no direct business impact is not a bad thing
INDIVIDUAL
- Treat every customer as an individual, not an aggregate, and analyse
- Aggregate only individual insights