Leveraging Hadoop to mine customer insights in a developing market

I was a speaker at the Big Data World conference in London on 18 September 2012: http://www.terrapinn.com/2012/big-data-world-europe/ The full text of the speech is at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/
• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations
• Understanding how Hadoop can provide insightful data analysis to the end user
• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends
• Will Hadoop replace the need for relational data warehousing systems?

Leveraging Hadoop in Wikimart
Roman Zykov
Head of analytics
http://wikimart.ru

London, Big Data World Europe, 20th September 2012

Key problem

To be or not to be….

Hadoop

Introduction

Key tasks for Wikimart

What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing

Who
• Core analytics team
• Analytics members in other departments
• IT site operations

Problem

Too time-consuming or too expensive?
• Data volume
• Number of data services

Map Reduce

[Diagram labels: Standalone, DATA, Map Reduce]
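
As a rough illustration of the Map Reduce model named on this slide, the sketch below runs the map, shuffle and reduce phases inside a single Python process. The record layout (user, product) and all function names are hypothetical and are not taken from the talk; on a real cluster Hadoop distributes these phases across machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (product_id, 1) for one page-view record (user_id, product_id)."""
    _user_id, product_id = record
    yield product_id, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one product."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical page views: (user_id, product_id)
views = [("u1", "tv"), ("u2", "tv"), ("u1", "phone")]
view_counts = run_job(views)
```

Because each call to `map_phase` and `reduce_phase` only sees its own record or key, the same logic can be split across many workers without changes.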

Our idea

New platform for “Big Data” tasks only

• Start research on Map Reduce software
• First patient - recommendation engine

Difficulties
- no planned budget -> Hadoop is free
- no experts -> learn it
- no hardware -> virtual cluster

Requirements for Hadoop

• Easily scalable
• Easy deployment
• Easy integration
• Less low-level Java coding
• SQL-like queries

Accomplishments

Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
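
The Pig scripts behind these recommendations are not shown in the slides. As a minimal sketch of one common form of item-to-item collaborative filtering on browsing history, the Python below counts how often two products are viewed in the same session and recommends the most frequent companions. The session data, the function name `item_to_item` and the `top_n` parameter are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def item_to_item(sessions, top_n=2):
    """Return, for each item, the items most often viewed in the same session."""
    co_views = defaultdict(Counter)
    for items in sessions:
        # Count each unordered pair of distinct items once per session.
        for a, b in combinations(sorted(set(items)), 2):
            co_views[a][b] += 1
            co_views[b][a] += 1
    return {
        item: [other for other, _count in counts.most_common(top_n)]
        for item, counts in co_views.items()
    }

# Hypothetical browsing sessions, one list of product ids per session.
sessions = [
    ["tv", "hdmi_cable", "wall_mount"],
    ["tv", "hdmi_cable"],
    ["phone", "phone_case"],
]
recommendations = item_to_item(sessions)
```

Raw co-occurrence counts favour popular products; a production job would normally normalise them, which this sketch leaves out.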

Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes are processed daily, in 2 hours
• 1 crash only (cluster lost power)

Decision: invest in a hardware cluster

End user

Internal high-level languages
• HiveQL
• Pig

Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and reporting software should support HiveQL
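
The reporting pattern above (pre-aggregate offline, serve from an RDBMS) can be sketched as follows. This is an assumption-laden illustration: the aggregation is done with a Python `Counter` in place of Hive or Pig, SQLite stands in for the front-end RDBMS, and the table and column names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw events: (day, product_id), one row per page view.
raw_events = [
    ("2012-09-01", "tv"), ("2012-09-01", "tv"),
    ("2012-09-01", "phone"), ("2012-09-02", "tv"),
]

# Offline step: collapse raw events to one row per day and product.
daily_views = Counter(raw_events)

# Front end: a small RDBMS table that reporting tools can query with low latency.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_views (day TEXT, product TEXT, views INTEGER)")
db.executemany(
    "INSERT INTO daily_views VALUES (?, ?, ?)",
    [(day, product, n) for (day, product), n in daily_views.items()],
)
db.commit()

# A report query now touches three aggregated rows instead of every raw event.
tv_total = db.execute(
    "SELECT SUM(views) FROM daily_views WHERE product = ?", ("tv",)
).fetchone()[0]
```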

Data Integration

• Sqoop
• Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)
• Incremental updates
• HDFS, Hive, HBase

• Talend Open Studio
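
The "incremental updates" item refers to importing only the rows added since the previous run. Sqoop offers this through its `--incremental append` mode together with `--check-column` and `--last-value`. The Python sketch below imitates the same watermark idea against an in-memory SQLite table; the `orders` table, its columns and the helper `fetch_new_rows` are hypothetical and are not part of Sqoop.

```python
import sqlite3

def fetch_new_rows(conn, last_value):
    """Return rows whose id is above last_value, plus the updated watermark."""
    rows = conn.execute(
        "SELECT id, product FROM orders WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value

# Hypothetical source table standing in for the production RDBMS.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "tv"), (2, "phone")])

# First run imports everything; later runs import only rows past the watermark.
first_batch, watermark = fetch_new_rows(source, 0)
source.execute("INSERT INTO orders VALUES (3, 'laptop')")
second_batch, watermark = fetch_new_rows(source, watermark)
```

The watermark has to be stored between runs; a monotonically increasing key or timestamp column is what makes this approach safe.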

Hadoop vs RDBMS

• Hadoop will never replace the RDBMS:
• Latency
• Weak capabilities of HiveQL vs SQL
• Suited only to some offline processing tasks:
• Machine learning
• Queries on big tables
• ….
• Real time: NoSQL

Hadoop myth

Terabytes?
Petabytes?

Big tasks!

Conclusion

• Hadoop is not rocket science
• Intermediate data can be Big Data
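
One way to see why intermediate data can be big even when the input is modest: a co-occurrence job emits one record per pair of items in a session, so the intermediate volume grows quadratically with session length. The figures below are purely hypothetical and are not taken from the talk.

```python
from math import comb

def intermediate_pairs(session_lengths):
    """Number of item pairs a co-occurrence job emits for the given sessions."""
    return sum(comb(n, 2) for n in session_lengths)

# Hypothetical workload: one million sessions of 10 product views each.
session_lengths = [10] * 1_000_000
input_records = sum(session_lengths)                 # raw page views
pair_records = intermediate_pairs(session_lengths)   # emitted item pairs
blow_up = pair_records / input_records
```

With these assumed numbers the job emits 4.5 intermediate records for every input record, and the ratio keeps growing as sessions get longer.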

Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources

Thank you!!!

rzykov@gmail.com
linkedin.com/in/romanzykov

Slide 8. Data flow (diagram): DWH, data feeds
