Leveraging Hadoop in Wikimart
             Roman Zykov
          Head of analytics
          http://wikimart.ru




London, Big Data World Europe, 20th September 2012
Key problem


To be or not to be….

Hadoop

Introduction
Key tasks for Wikimart

What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing

Who
• Core analytics team
• Analytics members in other departments
• IT site operations
Problem

Too time-consuming or too expensive?
• Data volume
• # of data services
Map Reduce



[Diagram labels: Standalone, DATA, Map Reduce]
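
As a rough illustration of the Map Reduce model named on this slide, here is a minimal single-process sketch in Python. It is not Hadoop code and not taken from the talk: the page-view records, the field layout, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are all assumptions made for the example. On a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (product_id, 1) pair per page-view record."""
    user_id, product_id = record
    yield product_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical page-view log: (user_id, product_id)
views = [("u1", "tv"), ("u2", "tv"), ("u1", "phone"), ("u3", "tv")]
view_counts = run_job(views)
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many machines without changing the logic.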
Our idea

New platform for “Big Data” tasks only

• Start research on Map Reduce software
• First patient - recommendation engine

Difficulties
- no planned budget ->   Hadoop is free
- no experts        ->   learn it
- no hardware       ->   virtual cluster
Requirements for Hadoop


•   Easily scalable
•   Easy deployment
•   Easy integration
•   Less low-level Java coding
•   SQL-like queries
Data flow




[Diagram labels: DWH, Data feeds]
Accomplishments

Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
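
The slide does not say how the item-to-item collaborative filtering was computed. One common and simple variant is co-occurrence counting over browsing sessions, sketched below in Python rather than Pig. The session data, the product names, and the functions `cooccurrence_counts` and `recommend` are hypothetical and serve only to show the idea.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(sessions):
    """Count how often each ordered pair of items appears in the same session."""
    counts = defaultdict(int)
    for items in sessions:
        # Deduplicate within a session so repeated views count once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, counts, top_n=3):
    """Return the items most often browsed together with `item`."""
    scored = [(other, n) for (src, other), n in counts.items() if src == item]
    # Highest count first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]


# Hypothetical browsing sessions (lists of viewed product ids)
sessions = [
    ["tv", "hdmi_cable", "soundbar"],
    ["tv", "hdmi_cable"],
    ["phone", "case"],
]
counts = cooccurrence_counts(sessions)
recs = recommend("tv", counts)
```

Production systems usually normalise these raw counts (for example by item popularity) so that best-sellers do not dominate every list; that step is left out here.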

Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes processed daily, in about 2 hours
• 1 crash only (cluster lost power)

Decision: invest in a hardware cluster
End user

Internal high-level languages
• HiveQL
• Pig

Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and Reporting software should
  support HiveQL
Data Integration

• Sqoop
  • Parallel data exchange with RDBMS
    (MS SQL, MySQL, Oracle, Teradata… )
  • Incremental updates
  • HDFS, Hive, HBase

• Talend Open Studio
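
The "incremental updates" point can be pictured as follows. Sqoop's append-style incremental import copies only rows whose check column exceeds the last value seen on the previous run (the `--incremental append`, `--check-column` and `--last-value` options). The Python below is a sketch of that idea only, with in-memory lists standing in for the source table and the HDFS target; the function name and row layout are assumptions, and it does not call Sqoop.

```python
def incremental_import(source_rows, target_rows, check_column, last_value):
    """Append rows newer than last_value to target_rows; return the new last value."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    target_rows.extend(new_rows)
    # If nothing is new, the stored last value stays unchanged.
    return max((row[check_column] for row in new_rows), default=last_value)


# Hypothetical source table and already-imported target
source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}]

last_seen = incremental_import(source, target, "id", last_value=1)
# A second run with the updated last value should import nothing further.
last_seen_again = incremental_import(source, target, "id", last_value=last_seen)
```

Storing the returned value between runs is what keeps each daily transfer small instead of re-copying the whole table.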
Hadoop vs RDBMS

• Hadoop never replaces an RDBMS:
   • Latency
   • Weak capabilities of HiveQL vs SQL
• Only some tasks with offline processing:
   • Machine learning
   • Queries to big tables
   • ….
• Real time: NoSQL
Hadoop myth


      Terabytes?
      Petabytes?

      Big tasks!
Conclusion

• Hadoop is not Rocket Science
• Intermediate data can be Big Data
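
One way to see why intermediate data can be large even when the input is modest: a job that pairs up items viewed in the same session produces a number of pairs that grows quadratically with session length. The figures below are made up for the illustration and are not from the talk.

```python
def pair_count(n_items):
    """Number of unordered item pairs in a session of n_items distinct items."""
    return n_items * (n_items - 1) // 2


# Hypothetical session sizes (distinct products viewed per session)
session_sizes = [5, 20, 100]

input_rows = sum(session_sizes)
intermediate_rows = sum(pair_count(n) for n in session_sizes)
```

Here 125 input rows expand to 5150 intermediate pairs, about 41 times more, and the ratio keeps growing as sessions get longer.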

Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources
Thank you!!!

     rzykov@gmail.com
linkedin.com/in/romanzykov

Leveraging Hadoop to mine customer insights in a developing market
