The main objective of “Big Data intelligence” is to understand all of us better in order to predict the future. Be it 4 billion Google queries a day or 1 billion Facebook users, we need smarter AI algorithms to learn and connect the dots from the ocean of data. With massive parallelism and Map-Reduce techniques, millions of servers take us one step closer to Turing’s intelligent machine. Near-AI success stories include Google, Facebook, Twitter, YouTube and Amazon. Let's begin our journey with the big hype, the big dreams of the 1950s, big laws, big growth and the basic operations used to extract big data intelligence. For more information on Big Data training in Coimbatore, please visit https://bigzettab.wordpress.com/. - Prof. Ashok.R, +91-9943900101, ashok@zettab.com.
BDI - The Beginning (Big Data training in Coimbatore)
1. Big Data Intelligence
The Beginning
Prof.Ashok.R | +91-9943900101 | ashok@zettab.com
ZettaB.com
Big Data Training in Coimbatore
Ref: Ullman et al., Mining of Massive Datasets
2. Caution
• The grass is always greener on the other side
Be inspired!
• Stories.. and more stories…
Be informed!
• The devil is in the details
Be challenged!
Hsuan-Tien Lin
3. The Dream in 1945
• A dream machine of Vannevar Bush (1945)
• An extended supplement to human memory
• A device that stores an individual's library, such as books, records and communications
• Microfilms can be searched, copied and shared
• Useful to store and share information among lawyers, patent attorneys, doctors and chemists
• The base concept from which the WWW evolved
4. Leads to Web and Web Scale Data
• Data Volume: doubles every 1.2 years
5 EB – Total data produced in 5000 years till 2003
20 EB – Data now collected by Google alone in a day
• Data Variety: structured, semi-structured and unstructured
XML, JSON, doc, pdf, html, email body, .mp4, .jpeg…
• Data Velocity: a lot happens in a minute
72 hours of new video uploaded to YouTube
3 million searches on Google
200 million emails sent
350 thousand tweets | 1 million searches on Twitter
690 thousand shares | 420 GB of data handled on Facebook
20 million photo views on Flickr
Source: Qmee, Wikibon
https://blog.kissmetrics.com/facebook-statistics/
5. Desktop | Hobbyist | Internet | Big Data (PB/EB/ZB)
Byte (1): one grain of rice
Kilobyte (2^10): cup of rice
Megabyte (2^20): 8 bags of rice
Gigabyte (2^30): 3 trucks of rice
Terabyte (2^40): 2 container ships
Petabyte (2^50): fills half the area of Tirupur
Exabyte (2^60): fills the area of South India
Zettabyte (2^70): fills the Indian Ocean twice
Source: What is big data, Slideshare.net
6. Big Data Intelligence (BDI)
The ability to understand all of us better by connecting the dots from massive data sets (with TB/PB volume, streaming velocity and variety in sources) to predict the future.
8. To Survive
With the largest neural-network brain to store and process big volumes of data: 100 billion neurons and 2.5 PB of equivalent memory at 100 million MIPS (33K i7 cores)
Vision | Touch | Hearing | Smell | Taste
1250 MB/s | 125 MB/s | 12.5 MB/s | 1.25 MB/s
You only feel 0.7% of what you sense
Scientificamerican.com, Storagecraft.com
9. The Prediction Power
• 10,000 hours (7-8 years) of rigorous practice is required to become a world-class expert in anything (Daniel Levitin, neuroscientist)
• This enables the ability to predict 2 seconds before others: the “Two Second Advantage” that wins over the competitors
– Wayne Gretzky, the greatest ice-hockey player of all time, was able to predict where the puck was going to be an instant before it arrived
– Sachin Tendulkar
– Warren Buffett
– Viswanathan Anand
10. Can Machines Think (to Predict)?
Alan Turing asked this question in 1950 and proposed a test to validate it.
11. Turing Test
Can Machine Imitate Brain?
Which is man, and which is woman?
Which is machine, and which is woman?
12. Did any machine pass?
No.
Any machine nearer? (near AI)
Jeopardy! quiz contest: the challenge is to predict the question and bet with reasonable confidence.
“William Wilkinson’s ‘An Account of the Principalities of Wallachia and Moldavia’ inspired this author’s most famous novel.”
14. Near AI Solutions
• Natural language processing
• Machine learning
• Predictive analytics
• Face recognition
• Language translation
• Speech recognition
15. Can machine predict?
• Share price in the stock market the next day
• Top 5 products consumers want to buy next week
• Price of tomatoes (1 kg) next month
• No. of cars to be sold next quarter
• Potential criminals in the city or at a mega event
• When a machine or human will become sick
• Best matched course/school to study
• Best matched job/company to work
16. Google Story: Where it all began
• 50 billion indexed pages
• Thousands/millions/billions of pages may match each search query
• How to rank them so that the most relevant (important) pages are displayed at the top?
• Predict what you want to see, not what you asked.
Do You Know?
4 billion searches happen in a day
Each query uses 1000 nodes
Result returned in 0.2 seconds.
20 billion pages crawled per day
20 Exabytes of data collected in a day
17. Page Rank
• Give pages ranks (scores) based on links to them
– Links from many pages → high rank
– Link from a high-rank page → high rank
“Rank” r_j for page j (d_i is the number of out-links of page i):
$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
Parallel Programming With Spark, Matei Zaharia
18. Matrix-Vector Multiplication
• PageRank equation in a practical form, where A is the Google matrix
(the rank vector r is an eigenvector of A)
For iteration t+1,
$$r^{(t+1)} = A \cdot r^{(t)}$$
The iteration is repeated until the rank vector converges
(or the maximum number of iterations is reached)
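The iteration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the three-page graph, the function name `pagerank`, the tolerance and the iteration cap are all assumptions made for the example, and the sketch leaves out the teleport (damping) step that production PageRank adds.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], tol: float = 1e-12,
             max_iter: int = 200) -> Dict[str, float]:
    """Repeat r_j(t+1) = sum over i -> j of r_i(t) / d_i until r stops changing.

    Assumes every page has at least one out-link (no dead ends).
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform vector
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for i, outlinks in links.items():
            share = rank[i] / len(outlinks)  # r_i / d_i, d_i = out-degree of i
            for j in outlinks:
                new_rank[j] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # rank vector has converged
            break
    return rank


# Hypothetical 3-page web: y links to y and a, a links to y and m, m links to a.
toy_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items()):
    print(f"{page}: {score:.3f}")
```

On this toy graph the ranks settle at about y = 0.4, a = 0.4 and m = 0.2, which is the eigenvector of the link matrix with eigenvalue 1.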
19. RAM is not Enough
• Won’t be a problem for small dimensions (N x N)
• Consider N = 1 billion (pages that match a query)
• Dimension is now in billions
– A is a billion x billion matrix
– r is a rank vector with a billion entries
– r (old and new) has two billion entries (16 GB for 8-byte double values)
– A has billion x billion entries (8 exabytes)
In practice, sparse-matrix representations reduce the storage needed.
RAM size of a highly configured server node: 128-512 GB
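The sizes on this slide follow from simple arithmetic, sketched below. The dense figures use only the slide's numbers. The sparse estimate is an assumption made for illustration (about 10 out-links per page, each stored as a row index, a column index and a value of 8 bytes each) and is not taken from the slides.

```python
BYTES_PER_DOUBLE = 8
N = 10**9  # one billion pages

rank_vectors_bytes = 2 * N * BYTES_PER_DOUBLE  # r_old and r_new
dense_matrix_bytes = N * N * BYTES_PER_DOUBLE  # every entry of A stored

# Assumption for illustration: ~10 out-links per page, 3 fields of 8 bytes each.
AVG_LINKS_PER_PAGE = 10
sparse_matrix_bytes = N * AVG_LINKS_PER_PAGE * 3 * 8


def human(n_bytes: float) -> str:
    """Format a byte count with decimal units (1 KB = 1000 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print("rank vectors :", human(rank_vectors_bytes))   # 16.0 GB
print("dense matrix :", human(dense_matrix_bytes))   # 8.0 EB
print("sparse matrix:", human(sparse_matrix_bytes))  # 240.0 GB under the assumption
```

Even under the sparse assumption, the matrix alone is in the range of a large server's RAM, which is why the computation is spread over many nodes.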
21. Disk is slowest and not Enough
• 50 billion web pages x 20 KB = 1 PB
• 1 computer reads 30-35 MB/sec from disk
~11-13 months (roughly a year) to read it all
• Also, it requires 1,000 hard drives to store it all
Can you wait that long for one search?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
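The read-time estimate can be checked with a short calculation. The page count, page size and disk speeds come from the slide. The 1 TB drive size and the 1,000-machine comparison are assumptions added for illustration.

```python
PAGES = 50 * 10**9
PAGE_BYTES = 20 * 10**3
total_bytes = PAGES * PAGE_BYTES  # 10**15 bytes = 1 PB

SECONDS_PER_DAY = 86_400


def read_days(n_bytes: int, mb_per_sec: float, machines: int = 1) -> float:
    """Days needed to read n_bytes when `machines` disks read in parallel."""
    seconds = n_bytes / (mb_per_sec * 10**6 * machines)
    return seconds / SECONDS_PER_DAY


# Assumed drive size of 1 TB, used to check the "1,000 hard drives" figure.
drives_needed = total_bytes // 10**12

for speed in (30, 35):
    one = read_days(total_bytes, speed)
    many = read_days(total_bytes, speed, machines=1000)
    print(f"{speed} MB/s: {one:.0f} days on 1 machine, "
          f"{many * 24:.1f} hours on 1000 machines")
print("drives needed:", drives_needed)
```

A single disk needs on the order of a year for one pass, while a thousand disks reading in parallel bring it down to hours, which motivates the cluster approach that follows.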
22. Parallelism using Cluster
• 8-64 nodes/rack, 4-16 racks in a cluster
• 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
• Node specs: 8-16 cores, 128-512 GB RAM, 10 × 1 TB disks
Figure: cluster network (aggregation switch, rack switches, ToR)
23. But Nodes Fail at Scale
• One server may last up to 3 years (1,000 days)
• If you have 1,000 servers, expect to lose 1/day
• Google has 1 million servers
– Hence 1,000 machines will fail every day.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
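The failure figures are expected values and can be reproduced as below. This sketch assumes each server fails independently with probability 1/1,000 per day. That simplification is not stated on the slide, and the function names are hypothetical.

```python
def expected_failures_per_day(servers: int, lifetime_days: float = 1000.0) -> float:
    """Expected number of failures per day if each server lasts ~lifetime_days."""
    return servers / lifetime_days


def prob_at_least_one_failure(servers: int, lifetime_days: float = 1000.0) -> float:
    """Chance that at least one server fails today, assuming independent failures."""
    p_single = 1.0 / lifetime_days
    return 1.0 - (1.0 - p_single) ** servers


print(expected_failures_per_day(1_000))      # 1.0
print(expected_failures_per_day(1_000_000))  # 1000.0
print(round(prob_at_least_one_failure(1_000), 3))
```

With 1,000 servers there is roughly a 63% chance of at least one failure on any given day under this assumption, so failure handling has to be routine rather than exceptional.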
24. Traditional RDBMS Fail
• Not designed for a variety of data types (text, video, images)
• Not capable of handling big volumes (PB/EB/ZB)
• Not designed for parallelism
• Poor fault tolerance at scale (a million servers)
• Slow down due to joins, volume, ACID checks and high-velocity requests
• Designed for transaction processing, not for deep analytics (intensive computing)
25. Google Solution: DFS
• Distributed File System
• Divide the bigger data file into smaller chunks of 16-64 MB and store them on different nodes in different racks.
• Chunks are replicated (2-3 times) for fault tolerance
Figure: chunks (C0, C1, C2, C3, C5, D0, D1) replicated across chunk servers 1, 2, 3, …, N
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
http://www.mmds.org
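The chunk-and-replicate idea can be sketched as follows. This is an illustrative toy, not the placement policy of GFS or HDFS: the round-robin scheme, the rack and node names, and the helper names `split_into_chunks` and `place_replicas` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # upper end of the 16-64 MB range on the slide


def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the whole file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(num_chunks: int, racks: Dict[str, List[str]],
                   replicas: int = 3) -> Dict[int, List[str]]:
    """Round-robin placement: each replica of a chunk lands on a different rack."""
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement: Dict[int, List[str]] = {}
    for chunk_id in range(num_chunks):
        nodes = []
        for k in range(replicas):
            rack = rack_names[(chunk_id + k) % len(rack_names)]
            rack_nodes = racks[rack]
            nodes.append(rack_nodes[chunk_id % len(rack_nodes)])
        placement[chunk_id] = nodes
    return placement


# Hypothetical cluster: 3 racks with 2 chunk servers each.
demo_racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
demo_chunks = split_into_chunks(200 * MB)  # 64 + 64 + 64 + 8 MB
demo_placement = place_replicas(len(demo_chunks), demo_racks)
for chunk_id, nodes in demo_placement.items():
    print(f"C{chunk_id} -> {nodes}")
```

Because every chunk lives on three different racks in this sketch, losing a single node or a whole rack still leaves at least two copies readable.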
26. Google Solution: Map-Reduce
The Map-Reduce environment (master) takes care of:
• Handling machine failures (with replica nodes)
• Partitioning the input data
• Scheduling workers
• Performing the group by key step
• Managing inter-machine communication
J. Leskovec, A. Rajaraman, J. Ullman: Mining of
Massive Datasets, http://www.mmds.org
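What the programmer writes and what the environment does can be seen in a single-process sketch. Word count is used here as the usual teaching example and is not taken from the slides. The `map_reduce` driver below stands in for the framework, with hypothetical names, and shows only the group-by-key step. Partitioning, scheduling and failure handling are left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(records: Iterable[str],
               mapper: Callable[[str], List[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Toy driver: run map, group the pairs by key, then reduce each group."""
    intermediate: List[Tuple[str, int]] = []
    for record in records:  # map step
        intermediate.extend(mapper(record))
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in intermediate:  # group-by-key step, done by the framework
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce step


def word_count_mapper(line: str) -> List[Tuple[str, int]]:
    """User code: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word: str, counts: List[int]) -> int:
    """User code: add up the counts collected for one word."""
    return sum(counts)


lines = ["big data big dreams", "big laws"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
print(counts)
```

Only the mapper and the reducer are application code. Everything inside `map_reduce` is the part a real cluster framework would distribute across machines.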
27. Big Data Platform
M-R App
MapReduce Stack
(Hadoop & Spark)
Distributed File Systems
(HDFS/ GFS)
29. DFS is useful, only when
• Size is big (> 1 TB)
• Files are rarely updated
– Works for Google (to store indexed pages)
– Will not be effective for an airline reservation system (where data is updated frequently)
30. M-R is useful, only when
• Dimension in billions
– Matrix-vector multiplication in Google Pagerank
• Graph with millions of nodes and billions of edges
– FB Network Graph
• Deep analytical applications with intensive computing
– Useful for finding users with similar buying patterns for product recommendations on Amazon
– But not useful for managing Amazon's online retail sales (frequent data updates, transactions)
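The matrix-vector case can be written as a map step and a reduce step. The sketch below is a single-process illustration with hypothetical names. It assumes the matrix is stored as (row, column, value) triples and that the whole vector r is available to every mapper, which holds only while r fits in memory.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[int, int, float]  # (row j, column i, value a_ji)


def map_phase(entries: List[Entry], r: List[float]) -> List[Tuple[int, float]]:
    """Map: each non-zero a_ji emits the pair (j, a_ji * r_i)."""
    return [(j, a_ji * r[i]) for j, i, a_ji in entries]


def group_by_key(pairs: List[Tuple[int, float]]) -> Dict[int, List[float]]:
    """Shuffle: collect all partial products that belong to the same row."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[int, List[float]], n: int) -> List[float]:
    """Reduce: sum the partial products of each row to get the new vector."""
    return [sum(groups.get(j, [])) for j in range(n)]


# Hypothetical 3 x 3 link matrix with non-zero entries only; pages 0, 1, 2.
entries = [(0, 0, 0.5), (1, 0, 0.5), (0, 1, 0.5), (2, 1, 0.5), (1, 2, 1.0)]
r_old = [1 / 3, 1 / 3, 1 / 3]
r_new = reduce_phase(group_by_key(map_phase(entries, r_old)), len(r_old))
print(r_new)
```

One call computes a single PageRank iteration. Repeating it with `r_new` as the input is the power iteration, with each pass being one Map-Reduce job.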
34. SCALA
• Runs on the Java Virtual Machine
• Yet simpler to write (more succinct) than Java
– Strong type inference (statically typed)
– Less code
• Functional programming (+ OOP)
– First-class functions
• Used to develop the Spark stack (which can run on Hadoop 2.0)
• Well suited for Map-Reduce applications
– Traits, collections, nested classes
– Immutable datasets
– Scalable
35. Mining
• Link Analysis
• Classification
• Content based recommendation
• Collaborative Filtering
• Finding similar items
• Clustering, Decomposition…
Machine Learning (Supervised/ Unsupervised)
36. Cloud
• Amazon AWS
• Google Cloud Platform
• IBM Bluemix
• OpenStack
• Databricks, Cloudera, Hortonworks, MapR,…
• SAP, Oracle…
Spark as a service, Hadoop as a service
39. Big Market
• $16.9 billion in 2015
• $50 billion by 2017
• 90 percent of the Fortune 500 have already initiated big data projects
• Big data spending: $8M per company
• 200 TB of stored data per company
– with >1000 employees
Ref: McKinsey 2011
40. Big Players
• Leaders
– IBM, HP, Dell, SAP, Teradata, Oracle, SAS, Accenture
(>$400 Million)
• Pure players (100% revenue from Big data)
– Palantir, Pivotal, Splunk, Mu Sigma, Actian, Opera Solutions
(>$100 Million)
• Indian Players
– TCS, CapGemini
(>$10 million)
Source: Wikibon 2013
41. Big Jobs
J. Leskovec, A. Rajaraman, J. Ullman: Mining of
Massive Datasets, http://www.mmds.org
43. Smart phones
• 1.2 billion sold in 2014
– 23.1 % increase over 2013
• Account for 27% of global handsets
– but consume 95% of global traffic (1.5 EB/month)
• Daily SMS count exceeded the world population
Ref: Znet, Fool.com
46. Kryder's Law
In 2020, a 2.5-inch disk drive would store ~40 TB and cost about $40.
Storage capacity (doubles every 12 months) grows faster than Moore’s law (processing capacity doubles every 18-24 months).
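The gap between the two doubling rates compounds quickly, as a short calculation shows. The doubling periods are the ones quoted on the slide. The 10-year horizon and the function name are assumptions chosen for illustration.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """How many times larger a quantity becomes if it doubles every doubling_months."""
    return 2.0 ** (years * 12.0 / doubling_months)


YEARS = 10  # illustrative horizon
storage = growth_factor(YEARS, 12)          # Kryder's law: doubles every 12 months
processing_fast = growth_factor(YEARS, 18)  # Moore's law, optimistic end
processing_slow = growth_factor(YEARS, 24)  # Moore's law, conservative end

print(f"storage:    x{storage:,.0f}")
print(f"processing: x{processing_slow:,.0f} to x{processing_fast:,.0f}")
```

Over ten years storage grows about a thousandfold while processing grows between 32 and roughly 100 times, so data accumulates faster than a single machine's ability to process it.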
50. Facebook
Ref: Chassis-plans.com, Wikibon
60 million posts per day
2.6 billion likes per day
375 million photos uploaded per day
15 TB data uploaded per day
600 TB data handled per day
700 TB Graph search DB
300 PB user data
http://allfacebook.com/orcfile b130817
52. Youtube
• 100 hours of new video every minute
• 53% of mobile traffic is video
• Avg human vision input: ½ million hours/life
• YouTube new uploads: 15 million hours/year
54. House of Cards
Big data analytics picked up on the success of the British version of House of Cards, and on the popularity of movies directed by David Fincher or starring Kevin Spacey.
Netflix then made a major decision to commit $100 million for two 13-episode seasons of its remake (US version) with the above team, streamed online.
It became the first Emmy-winning streaming show.
Netflix earned $1 billion in that quarter.
The Atlantic: May 2012
https://gigaom.com/2013/04/22/netflix-q1-2014-earnings/
55. Lumiata
Creates personalized treatment recommendations based on patients' health data, using 170 million data points
Raised US$10 million from VCs
Ash Damle, Founder & CEO
56. MedAware
Avoids prescription errors due to:
Drug mix-up
Patient mix-up
Unawareness of clinical data
Dosage mix-up
Example: Chlorambucil (chemotherapy) prescribed to a patient without cancer, instead of Chloramphenicol (antibiotic)
Uses a mathematical model derived from millions of EMRs, which represents real-world treatment patterns
Raised US$1 million in funding
57. Windward
• Only platform to analyze maritime data from ships and the ocean to maintain ship history, predict threats and help make huge financial decisions on shipping and commodity flows
• Before 2010, it was impossible to know a vessel's location once it sailed more than 30 miles offshore. Commercial satellites were then introduced, but the big data collected from ships gave a corrupted picture.
Raised $15.8 million.
58. mnubo
• Analytics of IoT data
• Analytics of data from connected cars for driving habits, vehicle failure patterns, inventory management, usage-based insurance, etc.
(36M connected cars will be on the road in 2020)
Raised $6 million
59. rocana
How many of your servers are talking to blacklisted IPs?
How long has your business been hacked?
Rocana helps IT identify the root cause of performance or security issues at any scale and complexity and resolve underlying issues in real time.
Instead of employing “brute force” searches against millions of log entries, advanced analytics identifies anomalies for investigation.
Raised $19.4 million
60. Whetlab
• Only 5 data scientists worked there
• Twitter acquired it in an undisclosed deal to improve its ability to show users the kinds of tweets and content they actually want to see.
61. Applied Predictive Technologies
Cloud-based cause-and-effect analytics platform to accurately measure the profit impact of pricing, marketing, merchandising, operations, and capital initiatives, tailoring investments in these areas to maximize ROI.
Acquired by MasterCard for $600 million.
62. Netflix Challenge
• Data: How users have rated movies
– 100.5 million ratings by 5 lakh (500,000) users for 18K movies
• Goal: Predict how a user would rate an unrated movie
– A recommender system problem
– 10% improvement: 1 million dollar prize
Hsuan-Tien Lin
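The 10% target was measured as a reduction in root-mean-square error (RMSE) of predicted ratings against the existing system. The sketch below shows how such an improvement is computed. The four ratings and both sets of predictions are made-up values for illustration, not data from the contest.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Fractional reduction in RMSE relative to the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Made-up ratings on a 1-5 scale.
actual = [4, 3, 5, 2]
baseline = [3.5, 3.5, 3.5, 3.5]  # always predict the global mean
candidate = [4, 3, 4, 3]         # a hypothetical better model

base_err = rmse(baseline, actual)
cand_err = rmse(candidate, actual)
print(f"baseline RMSE {base_err:.3f}, candidate RMSE {cand_err:.3f}, "
      f"improvement {improvement(base_err, cand_err):.1%}")
```

On a toy set a large improvement is easy. On 100 million real ratings, each additional percent took teams months of work.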
63. KDD Cup Challenge
• Data: How users rated songs
– 252.8 million ratings by 1 million users for 650K songs (Yahoo!)
• Goal: Recommend new songs that the user would like
Hsuan-Tien Lin
64. BDI for National Security
• TIA after the September 11 attacks (11/9)
• NATGRID after the Mumbai attack (26/11)
– We could have stopped both if we had connected the pieces of intel from all security agencies with the information tracked from suspects.
65. More Applications
• Building a Stock Investment Strategy Model
• Predicting Customer Transaction Behavior
• Failure Prediction
• Opinion Mining to Determine User Sentiments
• Financial Loss Prediction
• Insurance Claim Prediction Model
• Bond Trade Price Prediction
• Prediction of Number of Days in the Hospital
• Accelerating Discovery of Drugs for Mutants of H1N1
• Molecular Activity Prediction
• Job Recommendation Engine
https://insofeprojects.wordpress.com/insofe-projects/
67. A first course on BDI
Day | Topics
Day 1 FN | BDI: The Beginning; DFS and Map-Reduce; Distributed Graph (Pregel); Page Rank algorithm
Day 1 AN | BDI Tools Landscape; Dremel and Big Query; Naïve Bayes Classifier
Day 2 FN | TF-IDF, Jaccard and Cosine; Collaborative filtering; Shingling, Minhashing; Locality Sensitive Hashing
Day 2 AN | Scala Basics for MR apps; Practice session; More fun with Scala
Day 3 FN | Spark projects using Scala
Day 3 AN | Student Projects ideas; Q&A
68. M.S. Options in USA
University | Program
Stanford University | M.S-CS, Specialization in Information Management and Analytics; four-course graduate certificate in Mining Massive Datasets
Northwestern University | Master of Science in Analytics
DePaul University | Master of Science in Predictive Analytics
North Carolina State University | Master of Science in Analytics
University of Ottawa, Canada | M.Sc in Analytics
University of Connecticut | MS in Business Analytics and Project Management
Source: informationweek.com, IBM Director Dr. Spohrer's short list
69. PG options in India
Institute | Program
Indian School of Business | Certified Program in Business Analytics (CBA)
Great Lakes Institute of Management | PGP in Business Analytics
IIM Bangalore | Analytics Essentials, BAI
IIM Ahmedabad | Advanced Analytics for Management
Source: AnalyticsVidya.com, analyticsindiamag.com
70. Road Ahead
“The ultimate search engine would understand exactly what you mean and give back exactly what you want.”
- Larry Page
Watson was developed by 25 researchers over four years. The software runs on a supercomputer with 2,880 IBM Power750 cores, or computing brains, and 15 terabytes of memory. One of Watson’s advantages is that it can hit the buzzer to answer a question faster than any human possibly can: in six to 10 milliseconds. Watson won $1 million, and all of its winnings will be donated to charity. Watson is an analytical computing system that specializes in natural human language and provides specific answers to complex questions at rapid speeds. Watson cannot respond to video or audio clues, so these were omitted by the Jeopardy! producers.
An Osborne Executive portable computer from 1982, with a Zilog Z80 4 MHz CPU, and a 2007 Apple iPhone with a 412 MHz ARM11 CPU: the Executive weighs 100 times as much, has nearly 500 times as much volume, cost approximately 10 times as much (adjusted for inflation), and has about 1/100th the clock frequency of the smartphone.
“House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. For almost a year, Netflix executives have told us that their detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.