In the last five years, we have created more data than in all of prior human history. We now produce so much data that it is becoming difficult to manage. This is what is called Big Data. In this workshop we will discuss the challenges of Big Data and its concrete applications in our society.
5. STRICTLY CONFIDENTIAL - FOR INTERNAL USE
Contents
I. Concepts & Definitions
II. Applications
III. How will it change our life?
IV. Data Lifecycle
V. A little bit of Machine Learning
Big Data vs Machine Learning
• Big Data does not mean Machine Learning!
• Big Data relates to computer science: cloud computing, storage techniques, and processing tools (Cassandra, Hadoop, etc.).
• Big Data -> technologies, new tools and software.
• Machine Learning means “intelligence”: predictive methods that learn from experience. It is part of Data Science (a very broad concept).
• Machine Learning -> artificial intelligence, algorithms and techniques.
• But together they can be a perfect match! They form a duo: we perform Machine Learning ON Big Data.
Buzzwords
What is Big Data?
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.”
The McKinsey Global Institute
“Big data is data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.”
Wikipedia
History of Big Data
ENIAC: the first general-purpose electronic computer, 1946
IBM Roadrunner: 2008
→ First supercomputer to reach 1 petaFLOPS (10^15 floating-point operations per second)
History of Big Data
Google Server in 1997
36 data centers containing > 800K servers
40 servers/rack
How Big is Big Data?
“For 2017, 90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!”
IBM Marketing
→ More data was created in the last two years than in the previous 5,000 years of humanity.
→ Yet recent research has found that less than 0.5 percent of that data is actually analyzed for operational decision making.
How Big is Big Data?
In 2010, the digital universe was 1.2 zettabytes.
In a decade, the digital universe will be 35 zettabytes.
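To get a feel for what growing from 1.2 to 35 zettabytes in ten years means, the implied yearly growth can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes constant compound growth over the decade, and the function names are illustrative.

```python
import math


def annual_growth_rate(start, end, years):
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


def doubling_time(rate):
    """Number of years needed to double at a constant yearly growth `rate`."""
    return math.log(2) / math.log(1 + rate)


# Figures from the slide: 1.2 ZB in 2010, 35 ZB a decade later.
rate = annual_growth_rate(1.2, 35.0, 10)
print(f"implied annual growth: {rate:.0%}")
print(f"implied doubling time: {doubling_time(rate):.1f} years")
```

Under that assumption the volume grows by roughly 40% per year, which corresponds to a doubling about every two years.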
Sources of Big Data
Science
Ex: Large Hadron Collider (LHC)
• 40 million collisions per second
• After filtering, 100 collisions of interest per second
• One megabyte of data digitized for each collision = a recording rate of 0.1 gigabytes/sec
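The LHC figures above can be checked with simple arithmetic. The sketch below uses decimal (SI) units; the 24-hour total is an illustration that assumes non-stop recording, which the slide does not state.

```python
MB = 10**6   # decimal (SI) units
GB = 10**9
TB = 10**12

collisions_per_second = 40_000_000   # raw collision rate from the slide
kept_per_second = 100                # collisions of interest after filtering
bytes_per_collision = 1 * MB         # one megabyte digitized per collision

rate_bytes_per_second = kept_per_second * bytes_per_collision
filter_ratio = collisions_per_second // kept_per_second

# Assumption for illustration only: the detector records non-stop for 24 hours.
bytes_per_day = rate_bytes_per_second * 86_400

print(f"recording rate: {rate_bytes_per_second / GB} GB/s")
print(f"filtering keeps 1 collision in {filter_ratio:,}")
print(f"one full day of recording: {bytes_per_day / TB} TB")
```

The filter keeps one collision in 400,000, and 100 MB/s is indeed 0.1 GB/s.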
Ex: Astronomical instruments
SKA (Square Kilometre Array) is the world's largest radio telescope
→ 15 PB / year
→ 400 PB / year
Sources of Big Data
Web
Twitter, Facebook → 15 TB / day each
Google → 20 PB / day
Industry
A single airplane engine generates more than 10 TB of data every 30 minutes.
Sources of Big Data
Finance
The New York Stock Exchange produces 1 TB of data every day.
Telecoms, credit card companies, recommendation systems, airlines, GPS systems, etc.
Big Data is characterized by the 3 Vs
Volume
The total amount of data stored in the world is expected to double every two years.
Ex: Twitter and Facebook each generate about 15 terabytes of data per day (about 60 standard PC hard disks).
→ Scalability requires distributed storage and horizontal computation.
Variety
New kinds of data, no longer only linear, classical data: click streams, Internet of Things, connected devices, tweets, Facebook posts, text, images, video analysis, geolocation, etc.
→ We need to develop the ability to analyze and exploit these new types of data -> a new kind of intelligence.
Velocity
Initially, companies analyzed data using batch processes. With new data sources such as social and mobile applications, batch processing breaks down. Data now streams into the server continuously, in real time, and the result is only useful if the delay is very short.
→ Specialized software is needed to collect data streams and produce complex analyses in real time.
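The difference between batch and stream processing described under Velocity can be illustrated with a sliding-window counter, which keeps an always-current answer instead of recomputing over the full history. This is a minimal single-process sketch; the class name and the timestamps are hypothetical, and a production system would rely on a dedicated streaming platform.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()  # event times, oldest first

    def _evict(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()

    def add(self, timestamp):
        """Record one event; timestamps are assumed to arrive in order."""
        self.timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        """Number of events in the window ending at `now`."""
        self._evict(now)
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
for t in [0, 1, 5, 12]:   # hypothetical event times, in seconds
    counter.add(t)
print(counter.count(now=12))
```

Each event is added and evicted once, so the work per event stays small however long the stream runs.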
V1: Volume
10^n   Prefix  Symbol  Since  Decimal number                     Name
10^24  yotta   Y       1991   1 000 000 000 000 000 000 000 000  Septillion
10^21  zetta   Z       1991   1 000 000 000 000 000 000 000      Sextillion
10^18  exa     E       1975   1 000 000 000 000 000 000          Quintillion
10^15  peta    P       1975   1 000 000 000 000 000              Quadrillion
10^12  tera    T       1960   1 000 000 000 000                  Trillion
10^9   giga    G       1960   1 000 000 000                      Billion
16 GB
500 GB
10 PB
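The prefixes above can be turned into a small helper that expresses a raw byte count in the largest fitting unit. This is a sketch using decimal prefixes only; `format_bytes` is an illustrative name, not a standard library function.

```python
# (symbol, power of ten), largest first
PREFIXES = [("Y", 24), ("Z", 21), ("E", 18), ("P", 15), ("T", 12), ("G", 9)]


def format_bytes(n):
    """Express a byte count using the largest decimal prefix that fits."""
    for symbol, exponent in PREFIXES:
        if n >= 10**exponent:
            return f"{n / 10**exponent:g} {symbol}B"
    return f"{n:g} B"


print(format_bytes(25 * 10**17))   # 2.5 quintillion bytes
print(format_bytes(35 * 10**21))
print(format_bytes(16 * 10**9))
```

For example, 2.5 quintillion bytes is 2.5 EB.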
V2: Variety
The goal is to link everything together and extract some knowledge…
V3: Velocity
Data is being generated very quickly and needs to be processed fast! → real-time data
Late decisions lead to missed opportunities!
(In advertising, but also in medicine, finance, etc.)
Example: Criteo
9,000 targeted ads per second
2.5 billion ad banners per day
< 100 milliseconds to decide
Estimate in real time the probability for a visitor to click on a banner from a given brand
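One very simple way to estimate a click probability from past counts is a smoothed click-through rate, where pseudo-counts stop a banner with little history from getting an extreme estimate. This is a generic sketch, not Criteo's method; the function name and the prior values are assumptions chosen for illustration.

```python
def smoothed_ctr(clicks, impressions, prior_clicks=1.0, prior_impressions=100.0):
    """Estimate click probability with additive smoothing.

    The prior acts like `prior_impressions` imaginary past impressions,
    `prior_clicks` of which were clicked (hypothetical values).
    """
    return (clicks + prior_clicks) / (impressions + prior_impressions)


print(smoothed_ctr(0, 0))       # no history: falls back to the prior rate
print(smoothed_ctr(3, 100))     # some history: pulled between prior and data
print(smoothed_ctr(300, 10_000))  # lots of history: close to the raw rate
```

The estimate is a single division, which fits easily inside a budget of a few milliseconds per decision.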
Technological impacts of the 3 Vs
Volume
Cost per byte stored becomes critical.
Scalability requires distributed storage & horizontal computation.
Variety
SQL organization & structure do not fit new data types.
Various data formats: list of values, text, image.
Real-time change…
Velocity
Real time collection…
Collecting data stream requires specialized software solutions.
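The idea of distributed storage with horizontal computation can be sketched as a map/reduce word count: the data is split into shards, each shard is processed independently, and the partial results are merged. The example below simulates this in a single process; on a real cluster each shard would live on a different machine. The function names and sample records are illustrative and are not the Hadoop API.

```python
from collections import Counter
from functools import reduce


def split_into_shards(records, n_shards):
    """Distribute records round-robin, as if across n_shards machines."""
    return [records[i::n_shards] for i in range(n_shards)]


def map_shard(shard):
    """Map step: count words locally, looking only at one shard."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


records = ["big data", "machine learning on big data", "data lake"]
shards = split_into_shards(records, n_shards=2)
partial_counts = [map_shard(shard) for shard in shards]
total = reduce(merge_counts, partial_counts, Counter())
print(total.most_common(2))
```

Because each map step touches only its own shard, adding machines adds capacity without changing the code.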
Other Vs you can hear about…
Veracity
Data quality is not guaranteed: data can be incomplete, inconsistent (between many sources), ambiguous, etc.
Managing data quality is a required process.
Value
How fast can data be analyzed and acted on to provide business value?
Variability
The meaning of data can change over time (e.g. text interpretation).
This requires reprocessing data with “new rules” of understanding.
Visualisation
Visualizing data to understand, explore and communicate is part of the Big Data approach.
Representing huge volumes of data requires specific tools.
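The veracity problems listed here (incomplete and inconsistent data) can be detected with very simple checks. The sketch below is only an illustration: the schema, the two sources and the records are all hypothetical.

```python
REQUIRED_FIELDS = ("customer_id", "email", "country")  # hypothetical schema


def is_blank(value):
    return value is None or value == ""


def missing_fields(record):
    """Incompleteness: required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if is_blank(record.get(f))]


def conflicts(record_a, record_b):
    """Inconsistency: fields filled in both sources but with different values."""
    return {
        f: (record_a[f], record_b[f])
        for f in REQUIRED_FIELDS
        if not is_blank(record_a.get(f))
        and not is_blank(record_b.get(f))
        and record_a[f] != record_b[f]
    }


# The same customer as seen by two made-up sources.
crm = {"customer_id": 1, "email": "a@example.com", "country": "FR"}
billing = {"customer_id": 1, "email": "", "country": "BE"}

print(missing_fields(billing))
print(conflicts(crm, billing))
```

Checks like these only flag problems; deciding which source to trust is a separate business rule.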
All data streams feed the data lake
Illustration: Xebia TechLabs
Application
GAFA business models changed the world of Big Data!
Application in Medicine
UCLA is using Big Data analysis to prevent complications from brain injuries.
Skin cancer detection thanks to image recognition, Stanford University.
Application in Politics and Sport
• In the 2012 presidential election, the Obama campaign created a Big Data team to perform data modelling and used voter models on a scale never seen before.
Ex: “the Johnson family on Maple Lane in Columbus, Ohio will vote for us if they know our stance on social security.”
• Oakland Athletics baseball team and its general manager Billy Beane.
• Oakland's front office looks at a whole range of nontraditional baseball stats and uses them to make player comparisons and predict player performance.
• Moneyball had a huge impact on other teams in MLB (Major League Baseball).
Application in Artificial Intelligence
Artificial intelligence is the simulation of human intelligence by machines.
• Chatbots
• Robots
• Siri
• Autonomous cars
Application in Finance
• Data Science is playing an increasingly important role in calibrating trading decisions in real time → decision-making
• One field of algorithmic trading is almost entirely based on Machine Learning algorithms: ‘high-frequency trading’ (HFT).
• Price discovery process
• Profiling, e.g. ‘robo-advisors’
• Sentiment analysis and text mining
• Fraud detection
An example of data collection
Airline ticket
Restaurant check
Grocery bill
Hotel bill
Credit card companies collect more information than we think…
Limits
Decisions such as your credit score and your insurance rates may be based on the analysis of big data, for good or bad -> Alipay is a worrying example…
After Haiti’s 2010 earthquake, Columbia University tracked the movements of 2 million refugees.
The real challenge: are you willing to accept some loss of privacy in exchange for better value and more innovation?
• Image risks
• Legal risks
• Privacy risks
But how can we avoid Big Data?
Personal Data Protection
• What is technically possible is not necessarily legal or ethical!
• Be careful with the massive amount of personal data available on the Internet, much of which the user is not aware of…
Ex: https://www.google.com/Settings/Dashboard
• But anonymization is not a complete safeguard…
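One way to see why removing names is not enough is to measure k-anonymity: the size of the smallest group of records sharing the same combination of indirect identifiers. With k = 1, at least one person is unique and can potentially be re-identified. The sketch below uses made-up records and hypothetical field names; the generalization rule is just one possible choice.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalize(record):
    """Coarsen identifiers: keep 3 digits of the zip code, round birth year to the decade."""
    return {
        **record,
        "zip": record["zip"][:3] + "**",
        "birth_year": record["birth_year"] // 10 * 10,
    }


# Made-up "anonymized" records: names removed, indirect identifiers kept.
records = [
    {"zip": "75001", "birth_year": 1980, "gender": "F", "diagnosis": "A"},
    {"zip": "75002", "birth_year": 1984, "gender": "F", "diagnosis": "B"},
    {"zip": "75003", "birth_year": 1975, "gender": "M", "diagnosis": "C"},
    {"zip": "75004", "birth_year": 1971, "gender": "M", "diagnosis": "D"},
]
quasi = ("zip", "birth_year", "gender")

print(k_anonymity(records, quasi))
print(k_anonymity([generalize(r) for r in records], quasi))
```

In this toy data every raw record is unique (k = 1), while the generalized version reaches k = 2, at the cost of less precise data.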
Data Lifecycle
Data mining
Data acquisition
Data visualisation
Data archiving
Data analysis (Machine Learning)
Data selection
Data storage
Machine Learning
Extraction of knowledge from data and use of this knowledge to find solutions for previously unseen observations -> generalization.
But isn't it just statistics? Yes and no.
In practice:
When we deal with high-dimensional data (more than 100 features) -> Machine Learning,
When variables are correlated -> Machine Learning,
-> Machine Learning improved classical statistical methods, but above all it introduced new models able to:
deal with very large datasets
deal with non-parametric situations!
Machine Learning
A supervised learning model is composed of:
• The variable to predict: $Y$
• The explanatory variables: $X_1, \dots, X_n$, called predictors or features
• A learning function $f$ that best maps the input variables $X$ to the target $Y$
• A noise component $\varepsilon$
Our goal is to find the best estimate of the function $f$:
$Y = f(X) + \varepsilon$
We would like to predict future values of $Y$ given new examples of the input variables $X$.
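The relation Y = f(X) + ε can be made concrete with a tiny example: generate noisy data from a known function, estimate f from the data alone, then predict for a new input. This is a minimal sketch with one feature and ordinary least squares; the "true" function, the noise level and the sample size are assumptions chosen for illustration.

```python
import random


def true_f(x):
    """The function we pretend not to know (hypothetical: 2x + 1)."""
    return 2.0 * x + 1.0


def make_data(n, noise_std, seed=0):
    """Draw n examples of Y = f(X) + eps with Gaussian noise eps."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_std) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs, ys = make_data(n=200, noise_std=0.1)
slope, intercept = fit_line(xs, ys)


def predict(x):
    """Estimated f, applied to a new input."""
    return slope * x + intercept


print(f"estimated f(x) = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 4: {predict(4.0):.2f}")
```

The estimate is close to the true function but not identical, because the noise term cannot be learned.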
Conclusion
Progress and innovation are no longer driven by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner.