SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Scaling Big Data
Mining Infrastructure


            Jimmy Lin   Dmitriy Ryaboy
            @lintool    @squarecog

            Hadoop Summit Europe
            Thursday, March 21, 2013
From the Ivory Tower…

Source: Wikipedia (All Souls College, Oxford)
… to building sh*t that works.
Source: Wikipedia (Factory)
IMHO
     Represents personal opinion. Not official position of Twitter.
Management not responsible for misuse. Void where prohibited. YMMV.




              (If someone asks, I probably wasn’t here)
“Yesterday”



        ~150 people total
       ~60 Hadoop nodes
~6 people use analytics stack daily
“Today”


           ~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
 10s of PBs total Hadoop DW capacity
          ~100 TB ingest daily
   dozens of teams use Hadoop daily
     10s of Ks of Hadoop jobs daily
Why?


         actionable insights
data     data products
Big data mining
Cool, you get to work on new algorithms!



        No, not really…
Big data mining isn’t mainly
 about data mining per se!
It’s impossible to overstress this: 80%
                                of the work in any data project is in
                                 cleaning the data. – DJ Patil ―Data
                                               Jujitsu‖




Source: Wikipedia (Jujitsu)
Reality
 Your boss says something vague
 You think very hard on how to move the needle
 Where’s the data?
 What’s in this dataset?
 What’s all the f#$!* crap in the data?
 Clean the data
 Run some off-the-shelf data mining algorithm
 …
 Productionize, act on the insight
 Rinse, repeat
Data science is less
     glamorous that you think!


How do we make data scientists’ lives a bit easier?
Gathering
Source: Wikipedia (Logging)
Moving




Source: Wikipedia (Timber rafting)
Organizing




Source: Wikipedia (Logging)
Log directly into a database!

Source: http://www.flickr.com/photos/snukkel/3206405352/
create table `my_audit_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime,
  `user_id` int(11),
  `action` varchar(256),
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


           Don’t do this!
             —   Workload mismatch
             —   Scaling challenges
             —   Overkill
             —   Schema changes
Main Datacenter


                                Scribe
                              Aggregators



                                                HDFS                      Main Hadoop
                                                                              DW

                                   Staging Hadoop Cluster
Datacenter
         Scribe Daemons                     Datacenter
        (Production Hosts)
                           Scribe
                         Aggregators                                   Scribe
                                                                     Aggregators

                                            HDFS
                                                                                      HDFS

                              Staging Hadoop Cluster
                                                                          Staging Hadoop Cluster
     Scribe Daemons
    (Production Hosts)                           Scribe Daemons
                                                (Production Hosts)




                             Use Scribe. or Flume.or Kafka.
Scribe solves log transport only…




        System.out.println
             LOG.info
^(w+s+d+s+d+:d+:d+)s+
([^@]+?)@(S+)s+(S+):s+(S+)s+(S+)
s+((?:S+?,s+)*(?:S+?))s+(S+)s+(S+)
s+[([^]]+)]s+"(w+)s+([^"]*
(?:.[^"]*)*)s+(S+)"s+(S+)s+
(S+)s+"([^"]*(?:.[^"]*)*)
"s+"([^"]*(?:.[^"]*)*)"s*
(d*-[d-]*)?s*(d+)?s*(d*.[d.]*)?
(s+[-w]+)?.*$

  An actual Java regular expression used to
  parse log message at Twitter circa 2010

     Plain-text log messages suck
             Don’t do this!
userid
CamelCase

smallCamelCase    user_id

snake_case

camel_Snake

dunder__snake




             Naming things is hard!
JSON to the Rescue!

Source: http://www.flickr.com/photos/snukkel/3206405352/
This should really be a list…
Remember the camelSnake!
        {
            "token": 945842,
            "feature_enabled": "super_special",
            "userid": 229922,
            "page": "null",             Is this really an integer?
            "info": { "email": "my@place.com" }
        }


                         Is this really null?

    What keys? What values?


                 This does not scale.
struct MessageInfo {
   1: optional string name
   2: optional string email
 }

struct LogMessage {
   1: required i64 token
   2: required string user_id
   3: optional list<Feature> enabled_features
   4: optional i64 page = 0
   5: optional MessageInfo info
 }
                              + DDL provides type safety
 enum Feature {
   super_special,             + Auto codgen
   less_special
 }                            + Efficient serialization

                              +   Sane schema migration
                              +   Separate logical from
                                  physical
            Use Thrift. or Protobufs.or Avro.
Schemas aren’t enough!


We need a data discovery service!
          Where’s the data?
          How do I read it?
          Who produces it?
         Who consumes it?
      When was it last generated?
                  …
Where to find data?

 Old way:
                Hard-coded partitioning scheme, path, format
  A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’
      USING LzoStatusProtobufBlockPigLoader();
                                          Custom loader

 How do people know?            1.) Ask around 2.) Cargo-cult

 New way:
                   Nice UI for browsing
  A = LOAD ‘tables.statuses’
      USING TwadoopLoader(); Same loader each time

  B = FILTER A BY year == ‘2011’
      AND month == ‘11’
      AND day == ‘01’
      AND hour >= ’05’
      AND hour <= ‘07’;
                         Filters are pushed into the loader.
                            No need to understand partitioning
                            scheme.
Data Access Layer (DAL)
―All problems in computer science can be solved by another level
of indirection... Except for the problem of too many layers of
indirection.‖      – David Wheeler


        All data accesses go through DAL
           Thin layer on top of HCatalog
Data Access Layer (DAL)
          Who wrote what data, when?


                      #win
Automatically construct data/job dependency graph
        Automatically figure out ownership

       Hooks into alerting, auditing, repos,
              deploy systems, etc.
Plumbing
                               Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
                               SIGMOD 2012.
Source: Wikipedia (Plumbing)
Classification




Source: Wikipedia (Sorting)
label
Given
                    feature vector



                                       ?   features


    features
     features
      features      learner                    classifier
       features
        features



Induce                                            ?   ?




                      Training       Predicting
Stone age machine learning…




Source: Wikipedia (Stonehenge)
upload results


data munging
 Joining multiple dataset
 Feature extraction
 …                                  download
                      down-sample   test data



                            train    predict
What doesn’t work…
 1.   Down-sampling for training on single-processor
         Defeats the whole point of big data!
 2.   Ad hoc productionizing
         Disconnected from rest of production workflow
Production considerations:
    dependency management
           scheduling
       resource allocation
           monitoring
         error reporting
             alerting
                …


        We need…
Seamless scaling




Source: Wikipedia (Galaxy)
Integration with production workflows




Source: Wikipedia (Oil refinery)
Stochastic Gradient Descent
             Conceptually, classifier training is a like
               user-defined aggregate function!


             AVG                     SGD

initialize   sum = 0                 initialize
             count = 0               weights

 update      add to sum
             increment
             count
terminate    return sum / count      return weights
previous Pig dataflow                previous Pig dataflow



             map

Classifier
Training
                    reduce
                                label, feature vector
             Pig storage
                function

                           model                           model     model



                                    feature vector                 feature vector
  Making           model        UDF              model         UDF
Predictions                         prediction                     prediction
                             Just like any other parallel Pig dataflow
It’s just Pig!




                 For ―free‖: dependency management,
                 scheduling, resource allocation,
                 monitoring, error reporting, alerting, …
Source: Wikipedia (Road surface)
Takeaway messages
How do we make data scientists’ lives a bit easier?




     Adding a bit of structure goes a long way
Getting the plumbing right makes all the difference




 ―In theory, there is no difference between
 theory and practice. But, in practice, there
 is.‖
                                                Questions?
                 - Jan L.A. van de Snepscheut

Weitere ähnliche Inhalte

Was ist angesagt?

Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-time
DataWorks Summit
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 

Was ist angesagt? (20)

Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Taxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA TasksTaxonomy 101: Classifying DITA Tasks
Taxonomy 101: Classifying DITA Tasks
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-time
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 

Andere mochten auch (6)

Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to Name
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 

Ähnlich wie Scaling Big Data Mining Infrastructure Twitter Experience

SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
Jesse Vincent
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
Dmitry Makarchuk
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
Richard McDougall
 

Ähnlich wie Scaling Big Data Mining Infrastructure Twitter Experience (20)

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys Admins
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 

Mehr von DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Scaling Big Data Mining Infrastructure Twitter Experience

  • 1. Scaling Big Data Mining Infrastructure Jimmy Lin Dmitriy Ryaboy @lintool @squarecog Hadoop Summit Europe Thursday, March 21, 2013
  • 2. From the Ivory Tower… Source: Wikipedia (All Souls College, Oxford)
  • 3. … to building sh*t that works. Source: Wikipedia (Factory)
  • 4. IMHO Represents personal opinion. Not official position of Twitter. Management not responsible for misuse. Void where prohibited. YMMV. (If someone asks, I probably wasn’t here)
  • 5.
  • 6. “Yesterday” ~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily
  • 7. “Today” ~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily
  • 8. Why? actionable insights data data products
  • 9. Big data mining Cool, you get to work on new algorithms! No, not really…
  • 10. Big data mining isn’t mainly about data mining per se!
  • 11. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil ―Data Jujitsu‖ Source: Wikipedia (Jujitsu)
  • 12. Reality Your boss says something vague You think very hard on how to move the needle Where’s the data? What’s in this dataset? What’s all the f#$!* crap in the data? Clean the data Run some off-the-shelf data mining algorithm … Productionize, act on the insight Rinse, repeat
  • 13. Data science is less glamorous that you think! How do we make data scientists’ lives a bit easier?
  • 14.
  • 18. Log directly into a database! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 19. create table `my_audit_log` ( `id` int(11) NOT NULL AUTO_INCREMENT, `created_at` datetime, `user_id` int(11), `action` varchar(256), ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Don’t do this! — Workload mismatch — Scaling challenges — Overkill — Schema changes
  • 20. Main Datacenter Scribe Aggregators HDFS Main Hadoop DW Staging Hadoop Cluster Datacenter Scribe Daemons Datacenter (Production Hosts) Scribe Aggregators Scribe Aggregators HDFS HDFS Staging Hadoop Cluster Staging Hadoop Cluster Scribe Daemons (Production Hosts) Scribe Daemons (Production Hosts) Use Scribe. or Flume.or Kafka.
  • 21. Scribe solves log transport only… System.out.println LOG.info
  • 23. userid CamelCase smallCamelCase user_id snake_case camel_Snake dunder__snake Naming things is hard!
  • 24. JSON to the Rescue! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 25. This should really be a list… Remember the camelSnake! { "token": 945842, "feature_enabled": "super_special", "userid": 229922, "page": "null", Is this really an integer? "info": { "email": "my@place.com" } } Is this really null? What keys? What values? This does not scale.
  • 26. struct MessageInfo { 1: optional string name 2: optional string email } struct LogMessage { 1: required i64 token 2: required string user_id 3: optional list<Feature> enabled_features 4: optional i64 page = 0 5: optional MessageInfo info } + DDL provides type safety enum Feature { super_special, + Auto codgen less_special } + Efficient serialization + Sane schema migration + Separate logical from physical Use Thrift. or Protobufs.or Avro.
  • 27. Schemas aren’t enough! We need a data discovery service! Where’s the data? How do I read it? Who produces it? Who consumes it? When was it last generated? …
  • 28. Where to find data? Old way: Hard-coded partitioning scheme, path, format A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’ USING LzoStatusProtobufBlockPigLoader(); Custom loader How do people know? 1.) Ask around 2.) Cargo-cult New way: Nice UI for browsing A = LOAD ‘tables.statuses’ USING TwadoopLoader(); Same loader each time B = FILTER A BY year == ‘2011’ AND month == ‘11’ AND day == ‘01’ AND hour >= ’05’ AND hour <= ‘07’; Filters are pushed into the loader. No need to understand partitioning scheme.
  • 29. Data Access Layer (DAL) ―All problems in computer science can be solved by another level of indirection... Except for the problem of too many layers of indirection.‖ – David Wheeler All data accesses go through DAL Thin layer on top of HCatalog
  • 30. Data Access Layer (DAL) Who wrote what data, when? #win Automatically construct data/job dependency graph Automatically figure out ownership Hooks into alerting, auditing, repos, deploy systems, etc.
  • 31. Plumbing Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Source: Wikipedia (Plumbing)
  • 33. label Given feature vector ? features features features features learner classifier features features Induce ? ? Training Predicting
  • 34. Stone age machine learning… Source: Wikipedia (Stonehenge)
  • 35. upload results data munging Joining multiple dataset Feature extraction … download down-sample test data train predict
  • 36. What doesn’t work… 1. Down-sampling for training on single-processor  Defeats the whole point of big data! 2. Ad hoc productionizing  Disconnected from rest of production workflow
  • 37. Production considerations: dependency management scheduling resource allocation monitoring error reporting alerting … We need…
  • 39. Integration with production workflows Source: Wikipedia (Oil refinery)
  • 40. Stochastic Gradient Descent Conceptually, classifier training is a like user-defined aggregate function! AVG SGD initialize sum = 0 initialize count = 0 weights update add to sum increment count terminate return sum / count return weights
  • 41. previous Pig dataflow previous Pig dataflow map Classifier Training reduce label, feature vector Pig storage function model model model feature vector feature vector Making model UDF model UDF Predictions prediction prediction Just like any other parallel Pig dataflow
  • 42. It’s just Pig! For ―free‖: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …
  • 44. Takeaway messages How do we make data scientists’ lives a bit easier? Adding a bit of structure goes a long way Getting the plumbing right makes all the difference ―In theory, there is no difference between theory and practice. But, in practice, there is.‖ Questions? - Jan L.A. van de Snepscheut