SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Make the elephant fly,
once again
Luangsay Sourygna
sluangsay@bol.com
• Bol.com
• Data Quality
• Hive_compared_bq
• DataPrep
• PII data
• Should we lift & shift?
Agenda
Bol.com
• Number 1 webshop in Netherlands & Belgium
• Yes, in Spain you’ve also heard about us:
Winner of the Entrepreneurial Award
This year, the winner is Bol.com from The
Netherlands
(Barcelona, 2014)
It all began in 2008…
S0 …
How big is
bol.com?
Really?
> 16 million products
for sale
> 50 million in catalog
Hadoop in production
since 2010
> 6 million active
customers
> 40 million visits per
month
> 5000 million
pageviews/ye
ar
Hadoop at bol.com
• On Premise Production cluster = 35 nodes
• 30+ IT teams
• Several Business Teams
But lot of challenges…
• Lack of flexibility:
• Version HDP
• Christmas’ peak
• Security issues
• Who likes Kerberos?
• No PII
• We’re overloaded:
• Sysops
• YARN
Let’s go to the Cloud
Moving data seems easy
Then: migrating ETL…
Difficult in Big Data…
Same challenge in 2015
• Hortonworks’ project (audit):
• SQL server -> Hive
• Huge ETL process
• To test: 500MB – 60GB databases
• 1st tests:
• #rows
• Manually reading few lines
• Developed 3 validation tools
• Ex: https://
community.hortonworks.com/articles/1283/hive-script-to-validate-tabl
es-compare-one-with-an.html
hive_compared_bq
• Improved validation tool:
• No Sqoop
• Totally scalable (moving only aggregated checksums)
• Now: & (also …)
• https://github.com/bolcom/hive_compared_bq
Improving test coverage
• Not only for ETL migration
• Also for “integration” test
• Testing tools:
• Unit tests: quick, automated
• Hive_compared_bq: huge tests, capturing outliers
Data Quality with DataPrep
Yet another BigData processing tool…
• input:
• “Intelligent, visual & serverless data preparation”
• Business People focused
• “Excell” like
• See data + statistics
• Just push “Run” button…
• Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
Why Data Quality?
• For IT: better understanding of data
• See & “feel” data
• Help before developing
• Stats: outliers, skew
• For Business:
• Before: 1st
analysis  better Jiras
• After: validate with stats
Beware of the dog
Security in our Hadoop
• Kerberos issues:
• Integration Java libraries
• REST interfaces not secured
• Not encrypted
• No strong audit
• No strict HDFS, HBase, Hive permissions
PII in BigQuery!
• Always encrypted: disk + network
• Serverless: patching on time
• No Kerberos
• Central access control
• Hide PII columns with views
• Advanced logging
Example of logs
Migrating: which strategy?
Initial idea: many Dataproc
• “a Hadoop cluster in 90 seconds”
• 1 cluster for each team (per job?)
• Capacity issue
• Configuration issue
• Easy migration: no rewrite of MR, Pig, Hive, Flink…
HBase  BigTable
• HBase replication hooked to BigTable
Finally, no lift & shift
• Use of HBase is not optimal
• Better using native Google Tools:
• Cheaper
• Faster
Hive vs BigQuery
• “Now with BigQuery, I start to think that a query is
slow when it takes more than 30 seconds”
Alternatives to Pig
• BigQuery
• Beam
Flexibility of Beam
• Easy for future migration:
• In our Hadoop
• In Google Cloud
• Or maybe Beam in Cloud = +
• Overhead of DataFlow = 2 mn, similar to Dataproc
• Flink = better metrics + debugging
Netherlands…
• Want to discover what it feels like living 4 meters
under the sea level?
• Yes… We’re hiring!
Thanks
till next bol.com
Sourygna Luangsay
sluangsay@bol.com

Weitere ähnliche Inhalte

Ähnlich wie Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

Ähnlich wie Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 (20)

To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT To
 
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing MeetupReal Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup
 
Store, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged ConferenceStore, Extract, Transform, Load, Visualize. Untagged Conference
Store, Extract, Transform, Load, Visualize. Untagged Conference
 
2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics2015 WritersUA Sourcing Graphics
2015 WritersUA Sourcing Graphics
 
Retail & CPG
Retail & CPGRetail & CPG
Retail & CPG
 
Real Time BI with Hadoop
Real Time BI with HadoopReal Time BI with Hadoop
Real Time BI with Hadoop
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
Visualizing Library Data
Visualizing Library DataVisualizing Library Data
Visualizing Library Data
 
Building a Data Driven Company
Building a Data Driven CompanyBuilding a Data Driven Company
Building a Data Driven Company
 
Hadoop and SAP BI
Hadoop and SAP BI   Hadoop and SAP BI
Hadoop and SAP BI
 
HP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big DataHP Discover: Real Time Insights from Big Data
HP Discover: Real Time Insights from Big Data
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Big data at zulily
Big data at zulilyBig data at zulily
Big data at zulily
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 
Advanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scaleAdvanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scale
 
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headacheTips in migrating to SharePoint 2016 or O365, to avoid a migration headache
Tips in migrating to SharePoint 2016 or O365, to avoid a migration headache
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 

Mehr von Big Data Spain

Mehr von Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
 
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

  • 1.
  • 2. Make the elephant fly, once again Luangsay Sourygna sluangsay@bol.com
  • 3. • Bol.com • Data Quality • Hive_compared_bq • DataPrep • PII data • Should we lift & shift? Agenda
  • 4. Bol.com • Number 1 webshop in Netherlands & Belgium • Yes, in Spain you’ve also heard about us: Winner of the Entrepreneurial Award This year, the winner is Bol.com from The Netherlands (Barcelona, 2014)
  • 5. It all began in 2008…
  • 6. S0 … How big is bol.com? Really?
  • 7. > 16 million products for sale > 50 million in catalog Hadoop in production since 2010 > 6 million active customers > 40 million visits per month > 5000 million pageviews/ye ar
  • 8. Hadoop at bol.com • On Premise Production cluster = 35 nodes • 30+ IT teams • Several Business Teams
  • 9. But lot of challenges… • Lack of flexibility: • Version HDP • Christmas’ peak • Security issues • Who likes Kerberos? • No PII • We’re overloaded: • Sysops • YARN
  • 10. Let’s go to the Cloud
  • 13. Difficult in Big Data…
  • 14. Same challenge in 2015 • Hortonworks’ project (audit): • SQL server -> Hive • Huge ETL process • To test: 500MB – 60GB databases • 1st tests: • #rows • Manually reading few lines • Developed 3 validation tools • Ex: https:// community.hortonworks.com/articles/1283/hive-script-to-validate-tabl es-compare-one-with-an.html
  • 15. hive_compared_bq • Improved validation tool: • No Sqoop • Totally scalable (moving only aggregated checksums) • Now: & (also …) • https://github.com/bolcom/hive_compared_bq
  • 16. Improving test coverage • Not only for ETL migration • Also for “integration” test • Testing tools: • Unit tests: quick, automated • Hive_compared_bq: huge tests, capturing outliers
  • 17. Data Quality with DataPrep
  • 18. Yet another BigData processing tool… • input: • “Intelligent, visual & serverless data preparation” • Business People focused • “Excell” like • See data + statistics • Just push “Run” button… • Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
  • 19.
  • 20. Why Data Quality? • For IT: better understanding of data • See & “feel” data • Help before developing • Stats: outliers, skew • For Business: • Before: 1st analysis  better Jiras • After: validate with stats
  • 22. Security in our Hadoop • Kerberos issues: • Integration Java libraries • REST interfaces not secured • Not encrypted • No strong audit • No strict HDFS, HBase, Hive permissions
  • 23. PII in BigQuery! • Always encrypted: disk + network • Serverless: patching on time • No Kerberos • Central access control • Hide PII columns with views • Advanced logging
  • 26. Initial idea: many Dataproc • “a Hadoop cluster in 90 seconds” • 1 cluster for each team (per job?) • Capacity issue • Configuration issue • Easy migration: no rewrite of MR, Pig, Hive, Flink…
  • 27. HBase  BigTable • HBase replication hooked to BigTable
  • 28. Finally, no lift & shift • Use of HBase is not optimal • Better using native Google Tools: • Cheaper • Faster
  • 29. Hive vs BigQuery • “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”
  • 30. Alternatives to Pig • BigQuery • Beam
  • 31. Flexibility of Beam • Easy for future migration: • In our Hadoop • In Google Cloud • Or maybe Beam in Cloud = + • Overhead of DataFlow = 2 mn, similar to Dataproc • Flink = better metrics + debugging
  • 32. Netherlands… • Want to discover what it feels like living 4 meters under the sea level? • Yes… We’re hiring!
  • 33. Thanks till next bol.com Sourygna Luangsay sluangsay@bol.com