Spotting Hadoop in the wild
                         Practical use cases from Last.fm and Massive Media


                                             @klbostee



Thursday 12 January 12
• “Data scientist is a job title for an
                         employee who analyses data, particularly
                         large amounts of it, to help a business gain a
                         competitive edge” —WhatIs.com
                    • “Someone who can obtain, scrub, explore,
                         model and interpret data, blending
                         hacking, statistics and machine
                         learning” —Hilary Mason, bit.ly


• 2007: Started using Hadoop as PhD student
                    • 2009: Data & Scalability Engineer at Last.fm
                    • 2011: Data Scientist at Massive Media
                    • Created Dumbo, a Python API for Hadoop
                    • Contributed some code to Hadoop itself
                    • Organized several HUGUK meetups
What are those yellow things?




Core principles


                    • Distributed
                    • Fault tolerant
                    • Sequential reads and writes
                    • Data locality

Pars pro toto

                         [Diagram: the Hadoop ecosystem, showing Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS]

                         Hadoop itself is basically the kernel that
                         provides a file system and task scheduler


Hadoop file system

                         [Diagram: File A and File B are each stored as a series of Hadoop blocks spread over three DataNodes; labels contrast a Linux block with a Hadoop block]
                         No random writes!
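The block layout in this diagram can be sketched roughly as follows. This is a toy illustration, not how HDFS actually places blocks: the 64 MB block size, the replication factor of 2, the round-robin placement and the name `place_blocks` are all assumptions made for the example.

```python
from itertools import cycle

MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # assumed block size, for illustration only
REPLICATION = 2        # assumed number of copies per block, for illustration only


def place_blocks(file_sizes, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split each file into fixed-size blocks and assign replicas round-robin.

    Hypothetical helper: real HDFS placement is decided by the NameNode
    and takes more into account than this simple rotation.
    """
    placement = {}
    starts = cycle(range(len(datanodes)))
    for name, size in file_sizes.items():
        n_blocks = max(1, -(-size // block_size))  # ceiling division
        for index in range(n_blocks):
            first = next(starts)
            placement[(name, index)] = [
                datanodes[(first + r) % len(datanodes)] for r in range(replication)
            ]
    return placement


layout = place_blocks({"A": 150 * MB, "B": 64 * MB}, ["dn1", "dn2", "dn3"])
for block, nodes in sorted(layout.items()):
    print(block, nodes)
```

Files are only ever written block after block, which is why the slide stresses that there are no random writes.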




Hadoop task scheduler

                         [Diagram: Job A and Job B are each made up of tasks that run on three TaskTrackers, each TaskTracker sitting on top of a DataNode]




Some practical tips


                    • Install a distribution
                    • Use compression
                    • Consider increasing your block size
                    • Watch out for small files
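A back-of-the-envelope sketch of why block size and small files matter. It only counts blocks, on the assumption that the number of blocks is a reasonable proxy for bookkeeping overhead and for the number of tasks a job needs; the helper name `count_blocks` and the sizes are made up for the example.

```python
MB = 1024 * 1024
GB = 1024 * MB


def count_blocks(file_sizes, block_size):
    """Count the blocks needed to store the files; every file takes at least one."""
    return sum(max(1, -(-size // block_size)) for size in file_sizes)


one_big_file = [10 * GB]
many_small_files = [1 * MB] * 10240  # also 10 GB in total

print(count_blocks(one_big_file, 64 * MB))       # 160
print(count_blocks(one_big_file, 256 * MB))      # 40
print(count_blocks(many_small_files, 64 * MB))   # 10240
print(count_blocks(many_small_files, 256 * MB))  # 10240
```

A larger block size cuts the block count for big files, but it does nothing for small files, which is why they need separate attention.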

HBase

                         [Diagram: the Hadoop ecosystem, showing Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS]

                         HBase is a database on top of HDFS that
                         can easily be accessed from MapReduce


Data model

                    Row keys  | Column family A      | Column family B
                    (sorted)  | Column X | Column Y  | Column U | Column V
                    ----------+----------+-----------+----------+---------
                    ...       | ...      | ...       | ...      | ...

                    •    Configurable number of versions per cell
                    •    Each cell version has a timestamp
                    •    TTL can be specified per column family
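A toy model of one column family may make the bullets above more concrete. It is a sketch under assumptions, not HBase code: the class name `ColumnFamily` and its methods are hypothetical, and TTL handling is left out for brevity.

```python
import time


class ColumnFamily:
    """Toy column family: sparse cells with a bounded number of timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> {column -> [(timestamp, value), ...], newest first}
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the configured limit

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        versions = self.rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self):
        """Yield rows in sorted row key order with the newest value per column."""
        for row in sorted(self.rows):
            yield row, {col: v[0][1] for col, v in self.rows[row].items()}


family = ColumnFamily(max_versions=2)
family.put("b", "x", 1, timestamp=1)
family.put("b", "x", 2, timestamp=2)
family.put("b", "x", 3, timestamp=3)
family.put("a", "y", 9, timestamp=1)
print(list(family.scan()))
```

Empty cells simply do not appear in the nested dictionaries, so they cost nothing to store.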


Random becomes sequential

                         [Diagram: each incoming KeyValue is appended to the commit log (sequential write) and added to the sorted memstore; the memstore contents are then written out to HDFS (sequential write)]

                         High write throughput!
                         + efficient scans
                         + free empty cells
                         + no fragmentation
                         + ...
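The write path in this diagram can be sketched in a few lines. This is a simplified illustration, not HBase internals: the class `TinyStore`, the flush threshold and the use of plain lists in place of files on HDFS are assumptions, and the memstore here is only sorted at flush time.

```python
class TinyStore:
    """Toy write path: append to a log, buffer in memory, flush sorted runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical limit, in entries
        self.commit_log = []   # append-only, stands in for the commit log
        self.memstore = {}     # in-memory buffer of recent writes
        self.store_files = []  # immutable sorted runs, stand in for files on HDFS

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of a fully sorted run
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.commit_log = []  # flushed entries no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.store_files):  # newest run first
            for k, v in run:
                if k == key:
                    return v
        return None


store = TinyStore(flush_threshold=3)
for key, value in [("c", 1), ("a", 2), ("b", 3), ("a", 9)]:
    store.put(key, value)
print(store.store_files)  # [[('a', 2), ('b', 3), ('c', 1)]]
print(store.get("a"))     # 9
```

Keys arrive in random order, yet nothing is ever written to the middle of a file: the log only grows at the end and each flush writes one sorted run.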



Horizontal scaling

                         [Diagram: the sorted row key range is cut into Regions, and the Regions are distributed over three RegionServers]

                •        Each region has its own commit log and memstores
                •        Moving regions is easy since the data is all in HDFS
                •        Strong consistency as each region is served only once
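Because row keys are sorted, finding the region for a key is a binary search over region start keys. A minimal sketch, assuming each region is identified by its start key; the class `RegionMap` and the server names are hypothetical.

```python
import bisect


class RegionMap:
    """Toy region directory: sorted start keys, each served by exactly one server."""

    def __init__(self, start_keys, servers):
        # start_keys must be sorted; "" stands for the start of the key space
        self.start_keys = list(start_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (region start key, server) for the region holding row_key."""
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[index], self.servers[index]

    def move(self, start_key, new_server):
        """Reassign a region; no data is copied, since it all lives in HDFS."""
        self.servers[self.start_keys.index(start_key)] = new_server


regions = RegionMap(["", "g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("apple"))  # ('', 'rs1')
print(regions.locate("g"))      # ('g', 'rs2')
regions.move("g", "rs3")
print(regions.locate("hadoop")) # ('g', 'rs3')
```

Each key maps to exactly one region and each region to exactly one server, which is the property behind the strong consistency bullet.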

Some practical tips

                    • Restrict the number of regions per server
                    • Restrict the number of column families
                    • Use compression
                    • Increase file descriptor limits on nodes
                    • Use a large enough buffer when scanning

Look, a herd of Hadoops!




• “Last.fm lets you effortlessly keep a record
                         of what you listen to from any player. Based
                         on your taste, Last.fm recommends you
                         more music and concerts” —Last.fm
                    • Over 60 billion tracks scrobbled since 2003
                    • Started using Hadoop in 2006, before Yahoo

• “Massive Media is the social media
                         company behind the successful digital
                         brands Netlog.com and Twoo.com.
                         We enable members to meet nearby
                         people instantly” —MassiveMedia.eu
                    • Over 80 million users on web and mobile
                    • Using Hadoop for about a year now
Hadoop adoption

                                                     Last.fm   Massive Media
                    1. Business intelligence            √            √
                    2. Testing and experimentation      √            √
                    3. Fraud and abuse detection        √            √
                    4. Product features                 √            √
                    5. PR and marketing                 √



[Image-only slides, one per use case: Business intelligence, Testing and experimentation, Fraud and abuse detection (two slides), Product features, PR and marketing]

Let’s dive into the first use case!




Goals and requirements

                    • Timeseries graphs of 1000 or so metrics
                    • Segmented over about 10 dimensions
                    1. Scale with very large number of events
                    2. History for graphs must be long enough
                    3. Accessing the graphs must be instantaneous
                    4. Possibility to analyse in detail when needed


Attempt #1

                    • Log table in MySQL
                    • Generate graphs from this table on-the-fly
                    1. Large number of events      √
                    2. Long enough history         ✗
                    3. Instantaneous access        ✗
                    4. Analyse in detail           √

Attempt #2

                    • Counters in MySQL table
                    • Update counters on every event
                    1. Large number of events      ✗
                    2. Long enough history         √
                    3. Instantaneous access        √
                    4. Analyse in detail           ✗
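The trade-off between the first two attempts can be sketched without a database. Plain Python structures stand in for the MySQL tables here, and the event fields and names (`EVENTS`, `graph_from_log`, `CounterTable`) are made up for the example.

```python
from collections import Counter

# Hypothetical events: (day, metric, country)
EVENTS = [
    ("2012-01-10", "signup", "BE"),
    ("2012-01-10", "signup", "UK"),
    ("2012-01-11", "signup", "BE"),
    ("2012-01-11", "payment", "BE"),
]


def graph_from_log(events, metric):
    """Attempt #1 style: keep every event and scan the whole log per graph."""
    series = Counter()
    for day, event_metric, _country in events:
        if event_metric == metric:
            series[day] += 1
    return dict(series)


class CounterTable:
    """Attempt #2 style: one counter update per event, raw detail is discarded."""

    def __init__(self):
        self.counts = Counter()

    def record(self, day, metric, country):
        self.counts[(day, metric)] += 1  # one write per incoming event

    def graph(self, metric):
        return {day: n for (day, m), n in self.counts.items() if m == metric}


table = CounterTable()
for event in EVENTS:
    table.record(*event)

print(graph_from_log(EVENTS, "signup"))  # {'2012-01-10': 2, '2012-01-11': 1}
print(table.graph("signup"))             # {'2012-01-10': 2, '2012-01-11': 1}
```

Both give the same graph, but the log version rescans everything on every request, while the counter version pays a write on every event and can no longer answer questions about individual events.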

Attempt #3

                    • Put log files in HDFS through syslog-ng
                    • MapReduce on logs and write to HBase
                    1. Large number of events      √
                    2. Long enough history         √
                    3. Instantaneous access        √
                    4. Analyse in detail           √
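The "MapReduce on logs" step could look roughly like the sketch below. It simulates the map, shuffle and reduce phases in plain Python rather than running on Hadoop; the log line format, the hourly granularity and the function names are assumptions, not the actual job.

```python
from collections import defaultdict

# Hypothetical log format: "<ISO timestamp> <metric> <language> <country>"
LOG_LINES = [
    "2012-01-12T10:05:00 signup nl BE",
    "2012-01-12T10:40:00 signup nl BE",
    "2012-01-12T11:15:00 signup en UK",
]


def mapper(line):
    """Emit one count per event, keyed by metric, segment and hour."""
    timestamp, metric, language, country = line.split()
    hour = timestamp[:13]  # e.g. "2012-01-12T10"
    yield (metric, language, country, hour), 1


def reducer(key, values):
    """Sum the counts for one key."""
    yield key, sum(values)


def run_job(lines):
    """Local stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


for key, count in run_job(LOG_LINES).items():
    print(key, count)
```

The raw logs stay in HDFS for detailed analysis, while the aggregated counts are small enough to serve graphs from HBase.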

Architecture

                         Syslog-ng → HDFS → MapReduce → HBase

                         [Diagram labels: "Realtime processing" beside the HDFS and MapReduce steps, "Ad-hoc results" beside the MapReduce step]


HBase schema

                    • Separate table for each time granularity
                    • Global segmentations in row keys
                         •   <language>||<country>||...|||<timestamp>
                         •   * for “not specified”
                         •   trailing *s are omitted
                    • Further segmentations in column keys
                         •   e.g. payments_via_paypal, payments_via_sms
                    • Related metrics in same column family
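A sketch of how row keys following this schema might be built. It reads the pattern on the slide as `||` between segments and `|||` before the timestamp; the zero-padded timestamp and the helper name `row_key` are assumptions added so that keys sort chronologically as strings.

```python
SEGMENT_SEP = "||"  # between segment values, as in the pattern above
TIME_SEP = "|||"    # before the timestamp, as in the pattern above


def row_key(timestamp, *segments):
    """Build '<language>||<country>||...|||<timestamp>' with trailing *s omitted.

    None means "not specified" and becomes '*'. The timestamp is zero-padded
    (an assumption) so that string order matches time order.
    """
    values = ["*" if s is None else s for s in segments]
    while values and values[-1] == "*":
        values.pop()
    return SEGMENT_SEP.join(values) + TIME_SEP + "{:010d}".format(timestamp)


print(row_key(1326326400, "nl", "BE"))   # nl||BE|||1326326400
print(row_key(1326326400, "nl", None))   # nl|||1326326400
print(row_key(1326326400, None, "BE"))   # *||BE|||1326326400
print(row_key(1326326400, None, None))   # |||1326326400
```

With keys shaped like this, all rows for one segmentation sit next to each other in time order, so one graph corresponds to one contiguous range of rows.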
Questions?



 

Kürzlich hochgeladen (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Spotting Hadoop in the wild

  • 1. Spotting Hadoop in the wild Practical use cases from Last.fm and Massive Media @klbostee Thursday 12 January 12
  • 2. • “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” (WhatIs.com) • “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” (Hilary Mason, bit.ly)
  • 3. • 2007: Started using Hadoop as PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media
  • 4. • 2007: Started using Hadoop as PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media • Created Dumbo, a Python API for Hadoop • Contributed some code to Hadoop itself • Organized several HUGUK meetups
  • 5. What are those yellow things?
  • 6. Core principles • Distributed • Fault tolerant • Sequential reads and writes • Data locality
  • 7. Pars pro toto [Diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] Hadoop itself is basically the kernel that provides a file system and task scheduler
  • 8. Hadoop file system [Diagram: three DataNodes]
  • 9. Hadoop file system [Diagram: File A stored as blocks across three DataNodes]
  • 10. Hadoop file system [Diagram: File A and File B stored as blocks across three DataNodes]
  • 11. Hadoop file system [Diagram: File A and File B stored as blocks across three DataNodes; legend: Linux block, Hadoop block]
  • 12. Hadoop file system [Diagram: File A and File B stored as blocks across three DataNodes; legend: Linux block, Hadoop block] No random writes!
  • 13. Hadoop task scheduler [Diagram: three TaskTrackers, each alongside a DataNode]
  • 14. Hadoop task scheduler [Diagram: Job A spread over three TaskTrackers, each alongside a DataNode]
  • 15. Hadoop task scheduler [Diagram: Job A and Job B spread over three TaskTrackers, each alongside a DataNode]
  • 16. Some practical tips • Install a distribution • Use compression • Consider increasing your block size • Watch out for small files
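
The tips on block size and small files can be made concrete with a little arithmetic. The sketch below is illustrative only: the 64 MB and 128 MB block sizes and the 10 GB of data are assumptions chosen for the example, not figures from the talk. It counts how many blocks a set of files occupies, which is roughly what the file system has to keep track of.

```python
import math

MB = 1024 * 1024  # bytes per megabyte


def blocks_needed(file_size, block_size):
    """Number of blocks a file of file_size bytes occupies."""
    if file_size == 0:
        return 0
    return math.ceil(file_size / block_size)


def total_blocks(file_sizes, block_size):
    """Total number of blocks for a collection of files."""
    return sum(blocks_needed(size, block_size) for size in file_sizes)


# Assumed example data: the same 10 GB stored in two different ways.
one_big_file = [10 * 1024 * MB]        # a single 10 GB file
many_small_files = [1 * MB] * 10240    # 10240 files of 1 MB each

print(total_blocks(one_big_file, 64 * MB))       # 160
print(total_blocks(one_big_file, 128 * MB))      # 80
print(total_blocks(many_small_files, 64 * MB))   # 10240
```

Doubling the block size halves the number of blocks for the large file, while the small files each occupy a block of their own regardless of the block size.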
  • 17. HBase [Diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] HBase is a database on top of HDFS that can easily be accessed from MapReduce
  • 18. Data model [Table: row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V]
  • 19. Data model [Table: sorted row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V]
  • 20. Data model [Table: sorted row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V] • Configurable number of versions per cell • Each cell version has a timestamp • TTL can be specified per column family
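
As a rough illustration of this data model, here is a minimal in-memory sketch. It is a toy, not the HBase API: the class name `ToyTable` and its methods are invented for this example, and TTL is left out for brevity. It shows sorted row keys, cells addressed by column family and column, and a bounded number of timestamped versions per cell.

```python
from collections import defaultdict


class ToyTable:
    """In-memory toy mimicking the data model on the slide (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> (family, column) -> list of (timestamp, value), newest first
        self.rows = defaultdict(dict)

    def put(self, row, family, column, value, timestamp):
        versions = self.rows[row].setdefault((family, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest timestamp first
        del versions[self.max_versions:]                 # drop the oldest versions

    def get(self, row, family, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        versions = self.rows.get(row, {}).get((family, column))
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order with start_row <= key < stop_row."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key


table = ToyTable(max_versions=2)
table.put("user1", "A", "X", "old", timestamp=1)
table.put("user1", "A", "X", "mid", timestamp=2)
table.put("user1", "A", "X", "new", timestamp=3)
table.put("user0", "B", "U", "hello", timestamp=1)

print(table.get("user1", "A", "X"))        # new
print(list(table.scan("user0", "user2")))  # ['user0', 'user1']
```

Empty cells simply have no entry in the dictionary, so they take up no space in this toy either.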
  • 21. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS]
  • 22. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS]
  • 23. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS, sequential write]
  • 24. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS, sequential write]
  • 25. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS, sequential write (shown twice)]
  • 26. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS, sequential write (shown twice)] High write throughput!
  • 27. Random becomes sequential [Diagram labels: KeyValue, commit log, memstore (sorted), HDFS, sequential write (shown twice)] High write throughput! + efficient scans + free empty cells + no fragmentation + ...
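
The idea on these slides can be sketched in a few lines. This is a simplified toy under stated assumptions, not HBase's implementation: `ToyRegionStore` and `flush_threshold` are invented names, plain Python lists stand in for files on HDFS, and details such as log truncation and merging of flushed files are omitted. Every write is appended to a log and buffered in memory, and the buffer is later written out in one go as a sorted, immutable file.

```python
class ToyRegionStore:
    """Toy write path: append-only log + in-memory buffer + sorted flushes."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []    # append-only; stands in for the commit log
        self.memstore = {}      # in-memory buffer of recent writes
        self.store_files = []   # immutable sorted files; stand in for files on HDFS

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write: append to the log
        self.memstore[key] = value            # the random write only touches memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sequential write: dump the whole memstore as one sorted file.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest file first
            for k, v in store_file:
                if k == key:
                    return v
        return None


store = ToyRegionStore(flush_threshold=3)
for key, value in [("c", 1), ("a", 2), ("b", 3)]:
    store.put(key, value)          # the third put triggers a flush
store.put("a", 9)                  # newer value for "a" stays in the memstore

print(store.store_files)           # [[('a', 2), ('b', 3), ('c', 1)]]
print(store.get("a"), store.get("c"))  # 9 1
```

Writes arrive in random key order, yet the only things written to storage are log appends and whole sorted files.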
  • 28. Horizontal scaling [Diagram: sorted row keys]
  • 29. Horizontal scaling [Diagram: sorted row keys]
  • 30. Horizontal scaling [Diagram: sorted row keys; one Region on a RegionServer]
  • 31. Horizontal scaling [Diagram: sorted row keys split into Regions, spread over three RegionServers]
  • 32. Horizontal scaling [Diagram: sorted row keys split into Regions, spread over three RegionServers] • Each region has its own commit log and memstores • Moving regions is easy since the data is all in HDFS • Strong consistency as each region is served only once
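
A minimal sketch of how sorted row keys can be cut into regions and routed to servers follows. The names `ToyRegionMap`, `split_keys`, and the round-robin assignment are assumptions made for this illustration and do not reflect how HBase actually assigns regions. The point is that a binary search over the split keys finds the single region (and hence the single server) responsible for any row key.

```python
import bisect


class ToyRegionMap:
    """Toy routing table: contiguous row-key ranges mapped to servers."""

    def __init__(self, split_keys, servers):
        # Sorted row keys at which the key space is cut into regions.
        self.split_keys = sorted(split_keys)
        # n split keys give n + 1 regions; assign them round-robin (an assumption).
        self.assignment = [
            servers[i % len(servers)] for i in range(len(self.split_keys) + 1)
        ]

    def region_for(self, row_key):
        """Index of the one region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def server_for(self, row_key):
        return self.assignment[self.region_for(row_key)]

    def move_region(self, region, server):
        # Only the routing entry changes; in this toy no data is copied.
        self.assignment[region] = server


regions = ToyRegionMap(["g", "n", "t"], ["rs1", "rs2", "rs3"])
print(regions.region_for("apple"), regions.server_for("apple"))  # 0 rs1
print(regions.region_for("hadoop"), regions.server_for("hadoop"))  # 1 rs2
regions.move_region(3, "rs2")
print(regions.server_for("zebra"))  # rs2
```

Because each row key maps to exactly one region and each region to exactly one server, there is never more than one server answering for a given row.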
  • 33. Some practical tips • Restrict the number of regions per server • Restrict the number of column families • Use compression • Increase file descriptor limits on nodes • Use a large enough buffer when scanning
  • 34. Look, a herd of Hadoops!
  • 35. • “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” (Last.fm) • Over 60 billion tracks scrobbled since 2003 • Started using Hadoop in 2006, before Yahoo
  • 36. • “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” (MassiveMedia.eu) • Over 80 million users on web and mobile • Using Hadoop for about a year now
  • 37. Hadoop adoption 1. Business intelligence 2. Testing and experimentation 3. Fraud and abuse detection 4. Product features 5. PR and marketing
  • 38. Hadoop adoption
        Use case                           Last.fm
        1. Business intelligence           √
        2. Testing and experimentation     √
        3. Fraud and abuse detection       √
        4. Product features                √
        5. PR and marketing                √
  • 39. Hadoop adoption
        Use case                           Last.fm   Massive Media
        1. Business intelligence           √         √
        2. Testing and experimentation     √         √
        3. Fraud and abuse detection       √         √
        4. Product features                √         √
        5. PR and marketing                √
  • 42. Fraud and abuse detection
  • 43. Fraud and abuse detection
  • 45. PR and marketing
  • 46. Let’s dive into the first use case!
  • 47. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions
  • 48. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions 1. Scale with very large number of events 2. History for graphs must be long enough 3. Accessing the graphs must be instantaneous 4. Possibility to analyse in detail when needed
  • 49. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly
  • 50. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly 1. Large number of events √ 2. Long enough history ⁄ 3. Instantaneous access ⁄ 4. Analyse in detail √
  • 51. Attempt #2 • Counters in MySQL table • Update counters on every event
  • 52. Attempt #2 • Counters in MySQL table • Update counters on every event 1. Large number of events ⁄ 2. Long enough history √ 3. Instantaneous access √ 4. Analyse in detail ⁄
  • 53. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase
  • 54. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase 1. Large number of events √ 2. Long enough history √ 3. Instantaneous access √ 4. Analyse in detail √
  • 55. Architecture [Diagram: Syslog-ng, HDFS, MapReduce, HBase]
  • 56. Architecture [Diagram: Syslog-ng, HDFS, MapReduce, HBase; additional label: Realtime processing]
  • 57. Architecture [Diagram: Syslog-ng, HDFS, MapReduce, HBase; additional labels: Realtime processing, Ad-hoc results]
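
The MapReduce step of this pipeline might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the log line format, the metric names, and the hourly granularity are invented, and `run_local` is a tiny in-process stand-in for Hadoop's shuffle so the example runs without a cluster. The final write to HBase is left out.

```python
from collections import defaultdict


def mapper(line):
    """Emit ((hour, metric), 1) for one log line.

    Assumed (hypothetical) line format:
    '<ISO timestamp> <metric> <language> <country>'
    """
    timestamp, metric, _language, _country = line.split()
    hour = timestamp[:13]  # e.g. '2012-01-12T10' for hourly granularity
    yield (hour, metric), 1


def reducer(key, values):
    """Sum the counts emitted for one (hour, metric) key."""
    yield key, sum(values)


def run_local(lines):
    """Group mapper output by key, then reduce each group (local stand-in)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Made-up sample log lines.
logs = [
    "2012-01-12T10:01:00 signup en GB",
    "2012-01-12T10:15:30 signup nl BE",
    "2012-01-12T11:02:10 signup en US",
    "2012-01-12T10:59:59 payment nl BE",
]
counts = run_local(logs)
for key in sorted(counts):
    print(key, counts[key])
```

The raw log lines stay available for detailed analysis, while the aggregated counts are small enough to serve graphs quickly.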
  • 58. HBase schema • Separate table for each time granularity • Global segmentations in row keys (<language>||<country>||...|||<timestamp>; * for “not specified”; trailing *s are omitted) • Further segmentations in column keys (e.g. payments_via_paypal, payments_via_sms) • Related metrics in same column family
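
One possible reading of the row key format is sketched below. It follows the slide literally, with `||` between segmentation values and `|||` before the timestamp, but that interpretation, the helper name `make_row_key`, and the sample values are assumptions rather than the author's code.

```python
def make_row_key(segments, timestamp):
    """Build a row key from segmentation values and a timestamp.

    segments: values in a fixed order (e.g. language, country);
              None means "not specified" and becomes '*'.
    """
    values = ["*" if s is None else s for s in segments]
    while values and values[-1] == "*":  # trailing *s are omitted
        values.pop()
    return "||".join(values) + "|||" + str(timestamp)


ts = 1326326400  # example timestamp in epoch seconds

print(make_row_key(["en", "GB"], ts))   # en||GB|||1326326400
print(make_row_key(["en", None], ts))   # en|||1326326400
print(make_row_key([None, "BE"], ts))   # *||BE|||1326326400
```

Under this reading, the distinct `|||` delimiter keeps the timestamp unambiguous even when trailing segmentation values have been dropped.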