Data Science
Bootcamp
8 weeks of part-time training
Our Speaker
Victoria Galano,
Data Scientist at Air France
Introduction to Big Data
Victoria GALANO
Data Scientist @ Air France
STRICTLY CONFIDENTIAL - FOR INTERNAL USE
Who am I?
Contents
I. Concepts & Definitions
II. Applications
III. How will it change our lives?
IV. Data Lifecycle
V. A little bit of Machine Learning
Many concepts around Data
Big Data vs Machine Learning
• Big Data does not mean Machine Learning!
• Big Data relates to computer science: cloud computing, storage techniques and processing tools (Cassandra, Hadoop, etc.).
• Big Data -> technologies, new tools and software.
• Machine Learning means “intelligence”: predictive methods that can learn from experience. It is part of Data Science (a very broad field).
• Machine Learning -> artificial intelligence, algorithms and techniques.
• Together they can be a perfect match: we perform Machine Learning ON Big Data.
Buzzwords
What is Big Data?
“Big data refers to data sets whose size is beyond the ability of typical database software tools
to capture, store, manage and analyze.”
The McKinsey Global Institute
“Big data is data sets that are so voluminous and complex that traditional data processing
application software are inadequate to deal with them. Big data challenges include capturing
data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating
and information privacy.”
Wikipedia
History of Big Data
ENIAC: one of the first general-purpose electronic computers, unveiled in 1946
IBM Roadrunner: 2008
→ First supercomputer to reach a speed of 1 petaFLOPS
(10^15 floating-point operations per second)
History of Big Data
Google Server in 1997
36 data centers containing > 800K servers
40 servers/rack
How Big is Big Data?
“For 2017, 90% of the data in the world today has been created in the last
two years alone, at 2.5 quintillion bytes of data a day!”
IBM Marketing
→ More data was created in the last two years than in the previous 5,000 years of human history.
→ Yet recent research has found that less than 0.5 percent of that data is actually analyzed for operational decision making.
How Big is Big Data?
In 2010, the digital universe was 1.2 zettabytes.
In a decade, the digital universe will be 35 zettabytes.
How Big is Big Data?
Sources of Data
Sources of Big Data
Science
Ex: Large Hadron Collider (LHC)
• 40 million collisions per second
• After filtering, 100 collisions of interest per second
• A Megabyte of data digitized for each collision =
recording rate of 0.1 Gigabytes/sec
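The LHC figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), and the constant and function names are illustrative, not taken from the slides.

```python
# Back-of-the-envelope check of the LHC recording rate quoted on the slide.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
COLLISIONS_PER_SECOND = 40_000_000  # raw collision rate
KEPT_PER_SECOND = 100               # collisions of interest after filtering
BYTES_PER_COLLISION = 1_000_000     # one megabyte digitized per kept collision


def recording_rate_gb_per_s(kept_per_second: int, bytes_per_collision: int) -> float:
    """Return the recording rate in gigabytes per second."""
    return kept_per_second * bytes_per_collision / 1e9


def kept_fraction(kept_per_second: int, collisions_per_second: int) -> float:
    """Return the share of collisions that survives filtering."""
    return kept_per_second / collisions_per_second


rate = recording_rate_gb_per_s(KEPT_PER_SECOND, BYTES_PER_COLLISION)
fraction = kept_fraction(KEPT_PER_SECOND, COLLISIONS_PER_SECOND)
print(f"Recording rate: {rate} GB/s")
print(f"Fraction of collisions kept: {fraction:.1e}")
```

The filtering step is what makes the volume manageable: only a tiny fraction of the raw collisions is ever written to storage.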
Ex: Astronomical instruments
SKA (Square Kilometer Array) is the
world's largest radio telescope
→ 15 PB / year
→ 400 PB / year
Sources of Big Data
Web
Twitter
Facebook
Google
Industry
→ 15 TB / day
A single airplane engine generates more than 10 TB of data every 30 minutes.
→ 20 PB / day
Sources of Big Data
Finance
The New York Stock Exchange produces 1 TB of data every day.
Telecoms, Credit Card companies, Recommendations Systems,
Airlines, GPS Systems, etc.
Big Data is characterized by the 3 Vs
Volume
Total data stored in the world is expected to double every two years.
Ex: Twitter and Facebook are both generating about 15 terabytes of data per day (roughly 60 standard PC hard disks).
→ Scalability requires distributed storage and horizontal computation.
Variety
New kinds of data, no longer only classical structured data: click streams, the Internet of Things, connected devices, tweets, Facebook posts, text, images, video, geolocation, etc.
→ We need to develop the ability to analyze and exploit these new types of data -> a new kind of intelligence.
Velocity
Initially, companies analyzed data in batch processes. With new data sources such as social and mobile applications, batch processing breaks down: data now streams into the server continuously, in real time, and the result is only useful if the delay is very short.
→ Specialized software is needed to collect data streams and produce complex analyses in real time.
V1: Volume
10^n   Prefix  Symbol  Since  Decimal number                     Name
10^24  yotta   Y       1991   1 000 000 000 000 000 000 000 000  Septillion
10^21  zetta   Z       1991   1 000 000 000 000 000 000 000      Sextillion
10^18  exa     E       1975   1 000 000 000 000 000 000          Quintillion
10^15  peta    P       1975   1 000 000 000 000 000              Quadrillion
10^12  tera    T       1960   1 000 000 000 000                  Trillion
10^9   giga    G       1960   1 000 000 000                      Billion
16 GB
500 GB
10 PB
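To make the prefixes concrete, the sketch below converts a raw byte count into the largest prefix listed in the table. It assumes decimal (powers of ten) prefixes as in the table, and `format_bytes` is a hypothetical helper name introduced here for illustration.

```python
# Decimal prefixes from the table, largest first.
SI_PREFIXES = [
    (10**24, "YB"),
    (10**21, "ZB"),
    (10**18, "EB"),
    (10**15, "PB"),
    (10**12, "TB"),
    (10**9, "GB"),
]


def format_bytes(n_bytes: int) -> str:
    """Express a byte count with the largest table prefix that fits."""
    for factor, symbol in SI_PREFIXES:
        if n_bytes >= factor:
            return f"{n_bytes / factor:g} {symbol}"
    # Below one gigabyte the table has no entry, so fall back to plain bytes.
    return f"{n_bytes} B"


print(format_bytes(16 * 10**9))   # a 16 GB device
print(format_bytes(10 * 10**15))  # a 10 PB store
print(format_bytes(25 * 10**17))  # 2.5 quintillion bytes
```

Seen this way, the "2.5 quintillion bytes a day" quoted earlier is 2.5 exabytes per day.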
V2: Variety
The goal is to link everything together and extract some knowledge…
V3: Velocity
Data is being generated very quickly and needs to be processed fast! → real-time data
Late decisions lead to missed opportunities!
(In advertising, but also in medicine, finance, etc.)
Example of Criteo:
• 9,000 targeted ads per second
• 2.5 billion ad banners per day
• < 100 milliseconds to decide
• Estimate in real time the probability that a visitor clicks on a banner from a given brand
Technological impacts of the 3 Vs
Volume
Cost per byte stored becomes critical.
Scalability requires distributed storage & horizontal computation.
Variety
SQL organization & structure do not fit new data types.
Various data formats: list of values, text, image.
Real-time change…
Velocity
Real time collection…
Collecting data streams requires specialized software solutions.
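The difference between batch and stream processing can be sketched in a few lines. This is a toy illustration, not a real streaming platform: the event values are made up, and `batch_mean` and `streaming_mean` are hypothetical names.

```python
from typing import Iterable, Iterator


def batch_mean(values: list[float]) -> float:
    """Batch style: wait until all the data is available, then compute once."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream style: update the result incrementally as each event arrives."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history stored
        yield mean


events = [12.0, 15.0, 9.0, 14.0]  # made-up measurements
running = list(streaming_mean(events))
print(running)             # one up-to-date answer per incoming event
print(batch_mean(events))  # a single answer, only after the last event
```

The streaming version keeps constant memory and always has a current answer, which is the property that matters when the result is only useful with a very short delay.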
Other Vs you can hear about…
Veracity
Data quality is not guaranteed: data can be incomplete, inconsistent (between sources), ambiguous, etc.
Managing data quality is a required process.
Value
How fast can data be analyzed and acted on to provide business value?
Variability
The meaning of data can change over time (e.g. text interpretation).
This requires reprocessing data with "new rules" of understanding.
Visualisation
Visualizing data to understand, explore and communicate is part of the Big Data approach.
Representing huge volumes of data requires specific tools.
All data streams feed the data lake
Illustration: Xebia TechLabs
Application
GAFA business models changed the world of Big Data!
Application in Medicine
UCLA is using Big Data analysis to prevent complications from brain injuries.
Skin cancer detection thanks to image recognition, Stanford University.
Application in Politics and Sport
• In the 2012 presidential election, the Obama campaign created a Big Data team to perform data modelling and used voter models on a scale never seen before.
Ex: “the Johnson family on Maple Lane in Columbus, Ohio will vote for us if they know our stance on social security.”
• Oakland Athletics baseball team and its general manager Billy Beane.
• Oakland's front office looks at a wide range of nontraditional baseball stats and uses them to compare players and predict their performance.
• Moneyball had a huge impact on other teams in MLB (Major League Baseball).
Application in Artificial Intelligence
Artificial intelligence is the simulation of human intelligence by machines.
• Chatbots
• Robots
• Siri
• Autonomous cars
Application in Finance
• Data Science is playing an increasingly important role in calibrating trading decisions in real time → decision-making
• One field of algorithmic trading is almost entirely based on Machine Learning algorithms: ‘high-frequency trading’ (HFT).
• Price discovery process
• Profiling, e.g. ‘robo-advisors’
• Sentiment analysis and text mining
• Fraud detection
An example of data collection
Airline ticket
Restaurant check
Grocery bill
Hotel bill
Credit card companies collect more information than we think…
Limits
Decisions such as your credit score and your insurance rates may be based on the analysis of big data, for better or worse -> Alipay is a worrying example.
After Haiti’s 2010 earthquake, Columbia University tracked the movements of 2 million refugees.
The real challenge: are you willing to trade some loss of privacy for better value and more innovation?
• Reputational risks
• Legal risks
• Privacy risks
But how can we avoid Big Data?
Personal Data Protection
• What is technically possible is not always legally or ethically acceptable!
• Be aware of the massive amount of personal data available on the Internet, much of it without the user's knowledge.
Ex: https://www.google.com/Settings/Dashboard
• But anonymization is not foolproof…
Data Lifecycle
Data mining
Data acquisition
Data visualisation
Data archiving
Data analysis (Machine Learning)
Data selection
Data storage
Machine Learning
Extraction of knowledge from data and use of this knowledge to find solutions for previously unseen observations -> generalization.
But isn’t it just statistics? Yes and no.
In practice:
When we deal with high-dimensional data (more than 100 features) -> Machine Learning,
When variables are correlated -> Machine Learning.
-> Machine Learning improved classical statistical methods, but above all it introduced new models able to:
• deal with very large datasets
• deal with non-parametric situations!
Machine Learning
Source: Data Science: fondamentaux et études de cas, E. Biernat & M. Lutz
Machine Learning
A supervised learning model is composed of:
• The variable to predict: Y
• The explanatory variables X_1, …, X_n, called predictors or features
• A learning function f that best maps the input variables X to the target Y
• A noise component ε
Our goal is to find the best estimate of the function f:
Y = f(X) + ε
We would like to predict future values of Y given new examples of the input variables X.
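A minimal sketch of this setup, using only the Python standard library. Everything specific here is an assumption made for illustration: the "true" function `true_f`, the noise level, the sample size and the choice of a straight-line model fitted by ordinary least squares.

```python
import random


def true_f(x: float) -> float:
    """Hypothetical true function f; in practice it is unknown."""
    return 2.0 * x + 1.0


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [true_f(x) + rng.gauss(0.0, 1.0) for x in xs]  # Y = f(X) + epsilon

slope, intercept = fit_line(xs, ys)  # the estimate of f learned from data


def predict(x_new: float) -> float:
    """Predict Y for a new, previously unseen value of X."""
    return slope * x_new + intercept


print(f"estimated f(x) = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 4.0: {predict(4.0):.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, but not exactly on them, because the noise term ε cannot be learned.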
Machine Learning
Data analysis process
error(X) = noise(X) + bias^2(X) + variance(X)
Bias-Variance trade off
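For reference, the decomposition can be written out for squared-error loss. This is the standard textbook form, stated under assumptions not spelled out on the slide: the noise ε has zero mean and variance σ_ε², it is independent of the training sample, and the expectation is taken over training sets and noise at a fixed input X. The symbol f̂ denotes the fitted model.

```latex
\begin{aligned}
\mathbb{E}\big[(Y - \hat{f}(X))^2\big]
  &= \underbrace{\sigma_\varepsilon^2}_{\text{noise}}
   + \underbrace{\big(\mathbb{E}[\hat{f}(X)] - f(X)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\big[\big(\hat{f}(X) - \mathbb{E}[\hat{f}(X)]\big)^2\big]}_{\text{variance}}
\end{aligned}
```

A very simple model tends to have high bias and low variance, a very flexible one the opposite; the noise term is a floor that no model can go below.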
Conclusion
Progress and innovation are no longer driven by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize and discover
knowledge from the collected data in a timely and scalable fashion.
Thank you for your attention!
vgalano92@gmail.com

 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 

Kürzlich hochgeladen (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 

What is Big Data? With Victoria Galano, Data Scientist at Air France

  • 1. Data Science Bootcamp: 8 weeks of part-time training
  • 2. Our Speaker: Victoria Galano, Data Scientist at Air France
  • 3. Introduction to Big Data. Victoria GALANO, Data Scientist @ Air France
  • 4. Who am I?
  • 5. Contents: I. Concepts & Definitions; II. Applications; III. How will it change our life?; IV. Data Lifecycle; V. A little bit of Machine Learning
  • 6. Many concepts around Data
  • 7. Buzzwords: Big Data vs Machine Learning
      • Big Data does not mean Machine Learning!
      • Big Data relates more to computer science: cloud computing, storage techniques, and processing tools (Cassandra, Hadoop, etc.).
      • Big Data -> technologies, new tools and software.
      • Machine Learning means “intelligence”: predictive methods that introduce a capacity to learn from experience. It is part of Data Science (a very broad concept).
      • Machine Learning -> artificial intelligence, algorithms and techniques.
      • But together they can be a perfect match! It is a duo: we perform Machine Learning ON Big Data.
  • 8. What is Big Data?
      • “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” (The McKinsey Global Institute)
      • “Big data is data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.” (Wikipedia)
  • 9. History of Big Data. ENIAC: first general-purpose electronic computer, in 1946. IBM Roadrunner, in 2008: first supercomputer to reach a speed of 1 petaFLOPS (10^15 operations/second).
  • 10. History of Big Data. Google server in 1997. 36 data centers containing > 800K servers; 40 servers/rack.
  • 11. How Big is Big Data? “For 2017, 90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!” (IBM Marketing) → More data was created in the last two years than in the previous 5,000 years of humanity. → Yet recent research has found that less than 0.5 percent of that data is actually analyzed for operational decision making.
  • 12. How Big is Big Data? In 2010, the digital universe was 1.2 Zettabytes. In a decade, the digital universe will be 35 Zettabytes.
  • 13. How Big is Big Data?
  • 14. Sources of Data
  • 15. Sources of Big Data: Science
      • Ex: Large Hadron Collider (LHC): 40 million collisions per second. After filtering, 100 collisions of interest per second. A Megabyte of data digitized for each collision = recording rate of 0.1 Gigabytes/sec.
      • Ex: Astronomical instruments: SKA (Square Kilometer Array) is the world's largest radio telescope.
      • Figures shown on the slide: → 15 PB / year, → 400 PB / year
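The LHC recording rate quoted on slide 15 follows from simple arithmetic. The sketch below only replays the slide's own figures; it assumes decimal units (1 GB = 1,000 MB), which the slide does not state, and the function name is illustrative.

```python
# Figures from the slide: 40 million collisions per second, of which about
# 100 per second are kept after filtering, at roughly 1 MB per collision.
# Assumption (not stated on the slide): decimal units, 1 GB = 1,000 MB.
MB_PER_GB = 1_000

def recording_rate_gb_per_s(collisions_per_s: int, mb_per_collision: float) -> float:
    """Return the recording rate in gigabytes per second."""
    return collisions_per_s * mb_per_collision / MB_PER_GB

# How strongly the filter reduces the raw collision stream.
reduction_factor = 40_000_000 // 100

rate = recording_rate_gb_per_s(100, 1)
print(f"filter keeps 1 collision in {reduction_factor:,}")
print(f"recording rate: {rate} GB/s")
```

The point of the example is that filtering, not storage, does most of the work: only one collision in several hundred thousand is ever recorded.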
  • 16. Sources of Big Data
      • Web: Twitter and Facebook (→ 15 TB / day), Google (→ 20 PB / day)
      • Industry: A single airplane engine generates more than 10 TB of data every 30 minutes.
  • 17. Sources of Big Data. Finance: the New York Stock Exchange produces 1 TB of data every day. Others: telecoms, credit card companies, recommendation systems, airlines, GPS systems, etc.
  • 18. Big Data is characterized by the 3 Vs
      • Volume: The total data stored in the world is going to double every two years. Ex: Twitter and Facebook each generate about 15 Terabytes of data per day (60 standard PC hard disks). → Scalability requires distributed storage and horizontal computation.
      • Variety: New kinds of data, no longer only linear and classical data: click streams, Internet of Things, connected devices, tweets, Facebook posts, texts, images, video analysis, geolocation, etc. → We need to develop the ability to analyze and exploit these new types of data -> a new kind of intelligence.
      • Velocity: Initially, companies analyzed data using a batch process. With new sources of data such as social and mobile applications, the batch process breaks down. Data now streams into the server in real time, continuously, and the result is only useful if the delay is very short. → This requires specialized software solutions to collect data streams and produce complex analyses in real time.
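The contrast between batch and streaming processing described under Velocity can be illustrated with a toy example. This is a minimal sketch, not how any production streaming system works: the function names and the sample values are hypothetical, and a running mean stands in for the "real time analysis".

```python
from typing import Iterable, Iterator, List

def batch_mean(values: List[float]) -> float:
    """Batch style: wait until all the data is available, then compute once."""
    return sum(values) / len(values)

def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each value arrives.

    Only the current count and mean are kept in memory, so the input
    can be arbitrarily long.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update of the mean
        yield mean

# Hypothetical measurements arriving one at a time.
measurements = [120.0, 80.0, 100.0]

for current in running_mean(measurements):
    print(f"mean so far: {current}")

print(f"batch result: {batch_mean(measurements)}")
```

The streaming version has an answer available after every new value, whereas the batch version has nothing to report until the whole data set has been collected.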
  • 19. V1: Volume

      10^n    Prefix  Symbol  Since  Decimal number                       Name
      10^24   yotta   Y       1991   1 000 000 000 000 000 000 000 000    Septillion
      10^21   zetta   Z       1991   1 000 000 000 000 000 000 000        Sextillion
      10^18   exa     E       1975   1 000 000 000 000 000 000            Quintillion
      10^15   peta    P       1975   1 000 000 000 000 000                Quadrillion
      10^12   tera    T       1960   1 000 000 000 000                    Trillion
      10^9    giga    G       1960   1 000 000 000                        Billion

      Examples shown on the slide: 16 GB, 500 GB, 10 PB.
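The prefix table on slide 19 can be turned into a small helper that expresses a byte count with the largest fitting prefix. This is an illustrative sketch using the decimal (powers of ten) values from the table; the helper name `format_bytes` is hypothetical.

```python
# Decimal prefixes from the slide, largest first.
SI_PREFIXES = [
    ("Y", 10**24),  # yotta
    ("Z", 10**21),  # zetta
    ("E", 10**18),  # exa
    ("P", 10**15),  # peta
    ("T", 10**12),  # tera
    ("G", 10**9),   # giga
]

def format_bytes(n_bytes: int) -> str:
    """Express a byte count using the largest prefix that fits."""
    for symbol, factor in SI_PREFIXES:
        if n_bytes >= factor:
            return f"{n_bytes / factor:g} {symbol}B"
    return f"{n_bytes} B"  # smaller than 1 GB: left in plain bytes

# Quantities quoted elsewhere in the talk.
print(format_bytes(25 * 10**17))  # 2.5 quintillion bytes per day
print(format_bytes(35 * 10**21))  # projected size of the digital universe
print(format_bytes(16 * 10**9))   # one of the example sizes on the slide
```

Read this way, the "2.5 quintillion bytes a day" from slide 11 is 2.5 exabytes.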
  • 20. V2: Variety. The goal is to link everything together and extract some knowledge…
  • 21. V3: Velocity. Data is being generated very quickly and needs to be processed fast! → real-time data. Late decisions lead to missed opportunities! (In advertising, but also in medicine, finance, etc.)
      Example of Criteo:
      • 9,000 targeted ads per second
      • 2.5 billion ad banners per day
      • < 100 milliseconds to decide
      • Estimate in real time the probability that a visitor clicks on a banner from a given brand
  • 22. Technological impacts of the 3 Vs
      • Volume: Cost per byte stored becomes critical. Scalability requires distributed storage & horizontal computation.
      • Variety: SQL organization & structure do not fit new data types. Various data formats: list of values, text, image. Real-time change…
      • Velocity: Real-time collection… Collecting data streams requires specialized software solutions.
  • 23. Other Vs you can hear about…
      • Veracity: Data quality is not guaranteed: data is incomplete, inconsistent (between many sources), ambiguous, etc. Managing data quality is a required process.
      • Value: How fast can data be analyzed and acted on to provide business value?
      • Variability: Data meaning can change over time (e.g. text interpretation). This requires reprocessing data with “new rules” of understanding.
      • Visualisation: Visualizing data to understand, explore & communicate is part of the Big Data approach. Representing huge volumes of data requires specific tools.
  • 24. All data streams feed the data lake. (Illustration: Xebia TechLabs)
  • 25. Application: GAFA business models changed the world of Big Data!
  • 26. Application in Medicine. UCLA is using Big Data analysis to prevent complications from brain injuries. Skin cancer detection through image recognition at Stanford University.
  • 27. Application in Politics and Sport
      • In the 2012 presidential election, the Obama campaign created a Big Data team to perform data modelling and made use of voter models on a scale never seen before. Ex: “the Johnson family on Maple Lane in Columbus, Ohio will vote for us if they know our stance on social security.”
      • Oakland Athletics baseball team and its general manager Billy Beane.
      • The Oakland Athletics' front office looks at a whole range of nontraditional baseball stats and uses them to compare players and predict player performance.
      • Moneyball had a huge impact on other teams in MLB (Major League Baseball).
  • 28. Application in Artificial Intelligence. Artificial intelligence is the simulation of human intelligence by machines. Examples: chatbots, robots, Siri, autonomous cars.
  • 29. Application in Finance
      • Data Science is playing an increasingly important role in calibrating trading decisions in real time → decision-making
      • One field of algorithmic trading is almost entirely based on Machine Learning algorithms: ‘high-frequency trading’ (HFT).
      • Price discovery process
      • Profiling. Ex: ‘robo-advisors’
      • Sentiment analysis and text mining
      • Fraud detection
  • 30. An example of data collection: airline ticket, restaurant check, grocery bill, hotel bill. Credit card companies collect more information than we think…
  • 31. Limits
      • Decisions like your credit score and your insurance rates may be based on the analysis of big data, for good or bad -> Alipay is a worrying example…
      • After Haiti’s 2010 earthquake, Columbia University tracked the movements of 2 million refugees.
      • The real challenge: are you willing to trade some loss of privacy for better value and more innovation?
      • Risks: image risks, legal risks, privacy risks.
      • But how to avoid Big Data???
  • 32. Personal Data Protection
      • What is technically possible is not always legally and ethically acceptable!
      • Be careful about the massive amount of personal data available on the Internet, of which the user is often unaware. Ex: https://www.google.com/Settings/Dashboard
      • But anonymization is not foolproof…
  • 33. Data Lifecycle (diagram labels): data mining, data acquisition, data visualisation, data archiving, data analysis (Machine Learning), data selection, data storage
  • 34. Machine Learning. Extraction of knowledge from data and use of this knowledge to find solutions for previously unseen observations -> generalization. But isn’t it just statistics? Yes and no. In practice: when we deal with high-dimensional data (more than 100 features) -> Machine Learning; when variables are correlated -> Machine Learning. -> Machine Learning improved classical statistical methods but, above all, it introduced new models able to deal with very large datasets and with non-parametric situations!
  • 35. Machine Learning. Source: Data Science: fondamentaux et études de cas, E. Biernat & M. Lutz
  • 36. Machine Learning. A supervised learning model is composed of:
      • The variable to predict: Y
      • The explanatory variables X_1, …, X_n, called predictors or features
      • A learning function f that best maps the input variables X to the target Y
      • A noise component ε
      Our goal is to find the best estimate of the function f: Y = f(X) + ε. We would like to predict future values of Y given new examples of the input variables X.
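The relation Y = f(X) + ε from slide 36 can be made concrete with a small simulation. This is a sketch under stated assumptions, not the method used in the talk: the "true" function (a straight line), the noise level, the single predictor and the least-squares fit are all illustrative choices.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def true_f(x):
    """Assumed ground truth, unknown to the learner in a real problem."""
    return 2.0 * x + 1.0

rng = random.Random(0)  # fixed seed so the run is reproducible

# Training data: Y = f(X) + noise
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [true_f(x) + rng.gauss(0, 0.5) for x in xs]

# Learning step: estimate f from the examples.
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Estimated f, used to predict Y for a new input X."""
    return slope * x + intercept

print(f"estimated f(x) = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for a new example x = 4: {predict(4):.2f}")
```

The learner never sees `true_f` directly, only noisy examples, yet the fitted line ends up close to it; that recovered function is what is then applied to new inputs.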
  • 39. Bias-variance trade-off: error(X) = noise(X) + bias^2(X) + variance(X)
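The decomposition on slide 39 can be checked numerically. The sketch below is an illustration under assumptions that do not come from the talk: the true function is x², the model is deliberately too simple (it always predicts the mean of the training targets), and all parameter values are arbitrary.

```python
import random
import statistics

def true_f(x):
    """Assumed ground truth for the simulation."""
    return x * x

def simulate(x0=0.8, noise_sd=0.3, n_train=20, n_repeats=5000, seed=0):
    """Estimate each term of the decomposition at the point x0."""
    rng = random.Random(seed)
    predictions = []
    squared_errors = []
    for _ in range(n_repeats):
        # Draw a fresh training set and fit the constant model.
        xs = [rng.uniform(0, 1) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        y_hat = statistics.fmean(ys)
        # Compare the prediction with a new noisy observation at x0.
        y_new = true_f(x0) + rng.gauss(0, noise_sd)
        predictions.append(y_hat)
        squared_errors.append((y_new - y_hat) ** 2)
    error = statistics.fmean(squared_errors)
    noise = noise_sd ** 2
    bias_sq = (statistics.fmean(predictions) - true_f(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return error, noise, bias_sq, variance

error, noise, bias_sq, variance = simulate()
print(f"measured error          : {error:.4f}")
print(f"noise + bias^2 + variance: {noise + bias_sq + variance:.4f}")
```

With such a rigid model the bias term dominates. A more flexible model would typically lower the bias and raise the variance, which is the trade-off the slide refers to.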
  • 40. Conclusion. Progress and innovation are no longer driven by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
  • 41. Thank you for your attention! vgalano92@gmail.com
  • 42. Our Next Workshops: June 6, Kent Aquereburu, Data Scientist at Société Générale
  • 43. Our Next Sessions: Intensive, August 6 to 17, every day 9:30 - 15:30
  • 44. Our Next Sessions: Weekday, Sept 4 - Oct 25, Tuesdays / Thursdays 18:30 - 21:00; Weekend, Sept 1 - Oct 25, Saturdays 9:30 - 15:30
  • 45. Data Science Bootcamp. Thank you! See you next time :)