Bol.com has been an early Hadoop user since 2008, when its first cluster was built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
2. Make the elephant fly, once again
Luangsay Sourygna
sluangsay@bol.com
3. Agenda
• Bol.com
• Data Quality
• Hive_compared_bq
• DataPrep
• PII data
• Should we lift & shift?
4. Bol.com
• Number 1 webshop in the Netherlands & Belgium
• Yes, in Spain you’ve also heard about us:
Winner of the Entrepreneurial Award
This year, the winner is Bol.com from The Netherlands (Barcelona, 2014)
7. • > 16 million products for sale
• > 50 million in catalog
• Hadoop in production since 2010
• > 6 million active customers
• > 40 million visits per month
• > 5000 million pageviews/year
8. Hadoop at bol.com
• On Premise Production cluster = 35 nodes
• 30+ IT teams
• Several Business Teams
9. But a lot of challenges…
• Lack of flexibility:
• HDP version
• Christmas peak
• Security issues
• Who likes Kerberos?
• No PII
• We’re overloaded:
• Sysops
• YARN
16. Improving test coverage
• Not only for ETL migration
• Also for “integration” tests
• Testing tools:
• Unit tests: quick, automated
• Hive_compared_bq: huge tests, capturing outliers
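The idea of a "huge test" that compares a Hive table with its BigQuery copy can be sketched as follows. This is only an illustration of the general technique (fingerprint every row, group fingerprints into buckets, compare per-bucket summaries); it is not the actual Hive_compared_bq code. All names (`row_fingerprint`, `bucket_checksums`, `differing_buckets`) and the sample rows are hypothetical, and the two tables are plain in-memory lists rather than real Hive or BigQuery query results.

```python
import hashlib


def row_fingerprint(row):
    """Stable integer hash of one row (values stringified, joined by a separator)."""
    joined = "\x1f".join("" if value is None else str(value) for value in row)
    return int(hashlib.sha256(joined.encode("utf-8")).hexdigest(), 16)


def bucket_checksums(rows, buckets=16):
    """Map bucket id -> (row count, sum of fingerprints modulo 2**64)."""
    summary = {}
    for row in rows:
        fingerprint = row_fingerprint(row)
        bucket = fingerprint % buckets
        count, total = summary.get(bucket, (0, 0))
        summary[bucket] = (count + 1, (total + fingerprint) % 2**64)
    return summary


def differing_buckets(rows_a, rows_b, buckets=16):
    """Return the bucket ids whose summaries differ between the two tables."""
    a = bucket_checksums(rows_a, buckets)
    b = bucket_checksums(rows_b, buckets)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Hypothetical sample data: same rows in a different order, then one altered value.
hive_rows = [(1, "book", 9.99), (2, "toy", 4.50), (3, "lamp", 19.00)]
bq_rows = [(3, "lamp", 19.00), (1, "book", 9.99), (2, "toy", 4.50)]
bq_rows_bad = [(3, "lamp", 19.00), (1, "book", 9.99), (2, "toy", 4.55)]

print(differing_buckets(hive_rows, bq_rows))      # identical content, no differences
print(differing_buckets(hive_rows, bq_rows_bad))  # the altered row shows up
```

Because the summaries are order-independent, row order does not matter, and only the small per-bucket summaries need to be exchanged between the two systems. Buckets that differ point to where the outlier rows live.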
18. Yet another BigData processing tool…
• input:
• “Intelligent, visual & serverless data preparation”
• Business People focused
• “Excel”-like
• See data + statistics
• Just push “Run” button…
• Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
20. Why Data Quality?
• For IT: better understanding of data
• See & “feel” data
• Help before developing
• Stats: outliers, skew
• For Business:
• Before: 1st analysis, better Jiras
• After: validate with stats
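The "Stats: outliers, skew" point can be made concrete with a small column profile. This is a minimal sketch using only the Python standard library, not what DataPrep computes internally; the function name `profile`, the 1.5 × IQR outlier rule and the sample `order_totals` values are assumptions chosen for illustration.

```python
import statistics


def profile(values):
    """Basic data-quality statistics for one numeric column."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Population skewness: third standardized moment (0.0 for a constant column).
    skew = 0.0
    if stdev > 0:
        skew = sum((v - mean) ** 3 for v in values) / (len(values) * stdev ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "skew": skew,
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical column: one order total is clearly off.
order_totals = [12.0, 15.5, 14.0, 13.2, 16.1, 15.0, 14.8, 950.0]
report = profile(order_totals)
print(report["outliers"], report["skew"] > 0)
```

A mean far from the median together with a large positive skew is the kind of signal worth seeing before development starts, and worth checking again afterwards.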
22. Security in our Hadoop
• Kerberos issues:
• Integration Java libraries
• REST interfaces not secured
• Not encrypted
• No strong audit
• No strict HDFS, HBase, Hive permissions
23. PII in BigQuery!
• Always encrypted: disk + network
• Serverless: patching on time
• No Kerberos
• Central access control
• Hide PII columns with views
• Advanced logging
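"Hide PII columns with views" can be illustrated with a view that simply leaves the sensitive columns out. The sketch below uses an in-memory SQLite database as a stand-in so that it runs anywhere; it is not BigQuery code, and the table, column and view names (`orders`, `email`, `postcode`, `orders_no_pii`) are hypothetical. In BigQuery the access-control side (who may read the view versus the underlying table) would be configured separately and is not shown here.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, email TEXT, postcode TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "a@example.com", "1234AB", 9.99), (2, "b@example.com", "5678CD", 4.50)],
)

# The view exposes only the non-PII columns; analysts would query the view,
# never the underlying table.
conn.execute("CREATE VIEW orders_no_pii AS SELECT order_id, total FROM orders")

cursor = conn.execute("SELECT * FROM orders_no_pii ORDER BY order_id")
visible_columns = [description[0] for description in cursor.description]
visible_rows = cursor.fetchall()

print(visible_columns)
print(visible_rows)
```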
31. Flexibility of Beam
• Easy for future migration:
• In our Hadoop
• In Google Cloud
• Or maybe Beam in Cloud = +
• Overhead of Dataflow = 2 min, similar to Dataproc
• Flink = better metrics + debugging
32. Netherlands…
• Want to discover what it feels like to live 4 meters below sea level?
• Yes… We’re hiring!