I was a speaker at the Big Data World conference in London on 18 September 2012.
http://www.terrapinn.com/2012/big-data-world-europe/
See the full text of the speech at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/
Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations
Understanding how Hadoop can provide insightful data analysis to the end user
Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends
Will Hadoop replace the need for relational data warehousing systems?
3. Key tasks for Wikimart
What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing
Who
• Core analytics team
• Analytics members in other departments
• IT site operations
6. Our idea
New platform for “Big Data” tasks only
• Start research on MapReduce software
• First pilot: recommendation engine
Difficulties
- no planned budget -> Hadoop is free
- no experts -> learn it
- no hardware -> virtual cluster
9. Accomplishments
Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
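To make the item-to-item idea concrete, here is a minimal sketch of co-occurrence counting over browsing sessions. It is written in plain Python for readability and is not the Pig script used at Wikimart; the function name, the session data, and the "viewed in the same session" similarity rule are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def item_to_item(sessions, top_n=3):
    """For each item, return the items most often viewed in the same session."""
    co_counts = defaultdict(lambda: defaultdict(int))
    for viewed in sessions:
        # Count each unordered pair once per session, ignoring repeat views.
        for a, b in combinations(sorted(set(viewed)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

    recommendations = {}
    for item, neighbours in co_counts.items():
        # Highest co-occurrence first; ties are broken by item id.
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[item] = [other for other, _ in ranked[:top_n]]
    return recommendations


# Hypothetical browsing sessions (lists of product ids).
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger", "case"],
]
recs = item_to_item(sessions)
print(recs["phone"])
```

On a cluster the same pair counting would typically be expressed as a group-and-count job, which is the kind of task Pig is well suited to.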
Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes are processed in 2 hours daily
• 1 crash only (cluster lost power)
Decision: invest in a hardware cluster
10. End user
Internal high-level languages
• HiveQL
• Pig
Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and reporting software should support HiveQL
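The reporting pattern above (pre-aggregate offline, then serve from an RDBMS) can be sketched as follows. This is only an illustration under stated assumptions: the event tuples, the `daily_revenue` table, and the use of in-memory SQLite as a stand-in for the front-end RDBMS are all hypothetical, and in practice the roll-up step would run as a Hive or Pig job.

```python
import sqlite3
from collections import Counter


def pre_aggregate(events):
    """Roll raw (day, category, revenue) events up to one row per day and category."""
    totals = Counter()
    for day, category, revenue in events:
        totals[(day, category)] += revenue
    return sorted((day, category, total) for (day, category), total in totals.items())


# Hypothetical raw order lines, standing in for data stored in Hadoop.
events = [
    ("2012-09-01", "phones", 300),
    ("2012-09-01", "phones", 200),
    ("2012-09-01", "laptops", 900),
    ("2012-09-02", "phones", 100),
]
rows = pre_aggregate(events)

# Load the small aggregated result into the front-end database (SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT, category TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO daily_revenue VALUES (?, ?, ?)", rows)

# Reporting tools then query the small table with low latency.
phones_total = conn.execute(
    "SELECT SUM(revenue) FROM daily_revenue WHERE category = ?", ("phones",)
).fetchone()[0]
print(phones_total)
```

The point of the design is that only the compact aggregated table reaches the RDBMS, so reports stay fast while the heavy scan happens offline.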
11. Data Integration
• Sqoop
  - Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)
  - Incremental updates
  - HDFS, Hive, HBase
• Talend Open Studio
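As a rough illustration of the Sqoop points above (parallel exchange and incremental updates), the sketch below assembles a `sqoop import` command line in Python. The helper name, JDBC URL, table, column, and HDFS path are hypothetical; the flags used (`--incremental append`, `--check-column`, `--last-value`, `--num-mappers`) are standard Sqoop 1 import options, but the right values depend on your own schema and cluster.

```python
import shlex


def sqoop_incremental_import(jdbc_url, table, check_column, last_value,
                             target_dir, mappers=4):
    """Build the argument list for an incremental, parallel Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Only fetch rows whose check column is greater than the last value seen.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
        # Number of parallel map tasks reading from the database.
        "--num-mappers", str(mappers),
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_incremental_import(
    jdbc_url="jdbc:mysql://db.example.com/shop",
    table="orders",
    check_column="order_id",
    last_value=1000,
    target_dir="/data/raw/orders",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Sqoop is installed; after each run, the new maximum of the check column becomes the next `last_value`.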
12. Hadoop vs RDBMS
• Hadoop will never replace an RDBMS:
  - Latency
  - Weak capabilities of HiveQL vs SQL
• Suited only to some offline-processing tasks:
  - Machine learning
  - Queries on big tables
  - ….
• Real time: NoSQL
14. Conclusion
• Hadoop is not rocket science
• Intermediate data can be Big Data
Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources