Hadoop and Vertica
The Data Analytics Platform at Twitter
               Bill Graham - @billgraham
     Data Systems Engineer, Analytics Infrastructure
              Hadoop Summit, June 2012
About that pony giveaway...




Outline
  • Architecture
  • Data flow
  • Job coordination
  • Resource management
  • Vertica integration
  • Gotchas
  • Future work




We count things

  • 140 characters
  • 140M active users
  • 340M tweets per day
  • 80-100 TB ingested daily (uncompressed)
  • 10s of Ks of Hadoop jobs daily




Heterogeneous stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc.
    • PMs, Sales!


Data flow: Analytics

  [Diagram: data flow across the analytics platform. Components shown:]
  • Sources: Production Hosts (log events, application data), Third Party Imports
  • Log collection: Scribe Aggregators, HDFS on the Staging Hadoop Cluster, Log Mover
  • Application data: MySQL/Gizzard (social graph, tweets, user profiles)
  • Stores: Main Hadoop DW, HBase, Vertica, MySQL
  • Job frameworks shown: Crane, Oink, Rasvelg
  • Also shown: Distributed Crawler, HCatalog
  • Consumers: Analytics Web Tools; Analysts, Engineers, PMs, Sales
Chaotic? Actually, no.




System concepts


  • Loose coupling
  • Job coordination as a service
  • Resource management as a service
  • Idempotence




Loose coupling


  • Multiple job frameworks
  • Right tool for the job
  • Common dependency management




Job coordination

  • Shared batch table for job state
    • batch table: (id, description, state, start_time, end_time,
      job_start_time, job_end_time)
  • Access via client libraries
  • Jobs & data are time-based
  • 3 types of preconditions
    1. other job success (e.g., predecessor job complete)
    2. existence of data (e.g., HDFS input exists)
    3. user-defined (e.g., MySQL slave lag)
  • Failed jobs get retried (usually)
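The three precondition types could be evaluated roughly as follows. This is only a sketch: the `BatchTable` class, the `ready_to_run` function, and the callables standing in for the HDFS and MySQL checks are all hypothetical, since the slides do not show the real client library.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Iterable, Set, Tuple


@dataclass
class BatchTable:
    """In-memory stand-in for the shared batch table (hypothetical)."""
    # (job description, start_time, end_time) of every completed batch
    completed: Set[Tuple[str, datetime, datetime]] = field(default_factory=set)

    def mark_complete(self, job: str, start: datetime, end: datetime) -> None:
        self.completed.add((job, start, end))

    def is_complete(self, job: str, start: datetime, end: datetime) -> bool:
        return (job, start, end) in self.completed


def ready_to_run(
    table: BatchTable,
    start: datetime,
    end: datetime,
    predecessors: Iterable[str],
    input_paths: Iterable[str],
    path_exists: Callable[[str], bool],
    user_checks: Iterable[Callable[[], bool]],
) -> bool:
    """Return True only if all three kinds of precondition hold for this time window."""
    # 1. Other job success: every predecessor finished the same window.
    if not all(table.is_complete(job, start, end) for job in predecessors):
        return False
    # 2. Existence of data: every input path is present (e.g. on HDFS).
    if not all(path_exists(path) for path in input_paths):
        return False
    # 3. User-defined: arbitrary checks such as "MySQL slave lag is acceptable".
    return all(check() for check in user_checks)
```

Because jobs and data are time-based, the window `(start, end)` is part of the lookup key, so a predecessor that finished yesterday's batch does not unblock today's.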
Resource management

  • Analytics Resource Manager - ARM!
  • Library above Zookeeper
  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M mappers may write to Vertica at once
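The throttling rules above amount to counted permits per named resource. The sketch below is an in-process stand-in only: the real ARM coordinates through ZooKeeper across machines, and the `ResourceManager` class, its method names, and the resource names are assumptions made for illustration.

```python
import threading
from collections import defaultdict


class ResourceManager:
    """In-process stand-in for ARM (hypothetical API, no ZooKeeper involved)."""

    def __init__(self, limits):
        self._limits = dict(limits)      # resource name -> max concurrent holders
        self._held = defaultdict(int)    # resource name -> current holders
        self._lock = threading.Lock()

    def try_acquire(self, resource):
        """Take one permit if available; return False instead of blocking."""
        with self._lock:
            if self._held[resource] >= self._limits[resource]:
                return False
            self._held[resource] += 1
            return True

    def release(self, resource):
        """Give one permit back; releasing an unheld resource is a no-op."""
        with self._lock:
            if self._held[resource] > 0:
                self._held[resource] -= 1
```

A job would call `try_acquire` before starting (or before opening a Vertica connection) and `release` in a `finally` block, so "only 1 job of this name" is simply a limit of 1 on a resource named after the job.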




Job DAG & state transition

  “Local View”
  • Is it time for me to run yet?
  • Are my dependencies satisfied?
  • Any resource constraints?

  [State diagram: Idle, Execution, Completion]
  • Idle: request denied, stay Idle; request granted, move to Execution
  • Execution: Execution complete? no, keep executing; yes, move to Completion
  • Completion: insert entry into batch table

  batch table:
  (id, description, state,
   start_time, end_time,
   job_start_time, job_end_time)
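The transitions in the state diagram can be written as a small function. The names `State` and `step` and the boolean parameters are illustrative assumptions; the slides name only the states and the three "local view" questions.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def step(state, *, time_to_run=False, deps_satisfied=False,
         resources_granted=False, execution_complete=False):
    """Advance one job by one transition, using only its local view."""
    if state is State.IDLE:
        # All three local-view questions must be answered "yes" to start.
        if time_to_run and deps_satisfied and resources_granted:
            return State.EXECUTION
        return State.IDLE  # denied: stay idle and ask again later
    if state is State.EXECUTION:
        # On reaching COMPLETION the caller would insert the batch table entry.
        return State.COMPLETION if execution_complete else State.EXECUTION
    return State.COMPLETION  # terminal state
```

Each job evaluates only its own conditions, so the DAG emerges from the preconditions rather than from a central scheduler walking a graph.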
Example: active users

  [Diagram: pipeline with its job DAG]
  1. Scribe collects web_events and sms_events from Production Hosts
  2. Log mover (via staging cluster) moves them to the Main Hadoop DW
  3. Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
  4. Oink: user_sessions from Main Hadoop DW to Vertica
  5. Crane: user_profiles from MySQL/Gizzard to Vertica
  6. Rasvelg in Vertica: Join, Group, Count
     Aggregations:
     - active_by_geo
     - active_by_device
     - active_by_client
     - ...
  7. Crane: active_by_* from Vertica to MySQL for Analytics Dashboards

  Job DAG nodes: Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...
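To make the Union, Distinct, Join, Group, Count steps concrete, here is a toy in-memory version of an `active_by_*` aggregation. The record layout (`user_id` keys, a profile dict per user) and the function name `active_by` are assumptions for illustration; the real pipeline runs these steps in Pig and in Vertica SQL.

```python
from collections import Counter


def active_by(field, web_events, sms_events, user_profiles):
    """Count distinct active users grouped by one profile field."""
    # Union + Distinct: a user seen on both channels is counted once.
    active_ids = {e["user_id"] for e in web_events} | {e["user_id"] for e in sms_events}
    # Join with profiles, then Group + Count on the chosen field.
    return Counter(
        user_profiles[uid][field] for uid in active_ids if uid in user_profiles
    )
```

Calling it with `field="geo"` corresponds to `active_by_geo`; swapping the field gives the device and client variants.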
Vertica or Hadoop?
  • Vertica
    • Loads 100s of Ks of rows/second
    • Aggregates 100s of Ms of rows in seconds
    • Used for low latency queries and aggregations
    • Keep a sliding window of data
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data structures and unstructured data
    • Used for complex functions and ML



Vertica import options
  • Direct import via Crane
    • Load into dest table, single thread, atomic
  • Atomic import via Crane/Rasvelg
    • Crane loads to temp table, single thread
    • Rasvelg moves to dest table
  • Parallel import via Oink/Pig
    • Pig job via VerticaStorer
    • ARM throttles active DB connections

  [Diagram: import paths into Vertica from MySQL/Gizzard and Main Hadoop DW, labeled Crane, Rasvelg, Oink]
Vertica imports - pros/cons
  • Crane & Rasvelg
    • Good for smaller datasets, DB to DB transfers
    • Single threaded
    • Easy on Vertica
    • Hadoop not required
  • Pig
    • Great for larger datasets
    • More complex, not atomic
    • DDoS potential
VerticaStorer
  • PigStorage implementation
  • From Vertica’s Hadoop connector suite
  • Out of the box
    • Easy to get Hello World working
    • Well documented
    • Pig/Vertica data bindings work well
    • Fast!
    • Transaction-aware tasks
    • No bugs found
    • Open source?



                                            17
Pig VerticaStorage
  • Our enhancements
    • Connection credential management
    • Truncate before load option
    • Throttle concurrent writers via ZK
  • Future features
    • Counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • Atomic load via temp table




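         -- Disable speculative map execution so duplicate task attempts
         -- do not write the same rows twice
         SET mapred.map.tasks.speculative.execution false

         user_sessions = LOAD '/processed/user_sessions/2012/06/14';

         -- Store into Vertica through the throttled VerticaStorage wrapper
         STORE user_sessions INTO '{db_schema.user_sessions}' USING
               com.twitter.twadoop.pig.store.VerticaStorage(
               'config/db.yml', 'db_name', 'arm_resource_name');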
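
The "throttle concurrent writers" enhancement could look something like the sketch below. It is an in-process stand-in only: the deck coordinates writers across tasks via ZK, whereas this example uses a `threading.BoundedSemaphore`, and `MAX_WRITERS` and `write_partition` are hypothetical names.

```python
import threading
import time

MAX_WRITERS = 2  # hypothetical limit; the deck manages this via ZK/ARM
writer_slots = threading.BoundedSemaphore(MAX_WRITERS)
lock = threading.Lock()
active = 0  # writers currently holding a slot
peak = 0    # highest number of simultaneous writers observed


def write_partition(task_id):
    """Simulate one task writing to the database while holding a slot."""
    global active, peak
    with writer_slots:  # blocks until a writer slot is free
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # a real task would write its rows here
        with lock:
            active -= 1


threads = [threading.Thread(target=write_partition, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many tasks the job launches, no more than `MAX_WRITERS` hold a database connection at once, which is the property that protects Vertica from a flood of parallel writers.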
                                                                       18
Gotcha #1


  • MR data load is not atomic
    • Avoid partial reads
    • Option 1: load to temp table, then insert direct
    • Option 2: add job dependency concept
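
Option 2 might be sketched as below: jobs that read a table run only after the load job they depend on has completed. The job names and the `run_in_order` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used purely for illustration; it is not how Oink or Crane schedule jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "load_user_sessions": set(),
    "daily_sessions_report": {"load_user_sessions"},
    "weekly_rollup": {"daily_sessions_report"},
}


def run_in_order(deps, run):
    """Run each job only after every job it depends on has completed."""
    completed = []
    for job in TopologicalSorter(deps).static_order():
        run(job)  # an exception here stops all downstream jobs
        completed.append(job)
    return completed


order = run_in_order(dependencies, run=lambda job: None)
```

Because a failed load raises before its dependents start, a reader never runs against a partially loaded table.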




                                                         19
Gotcha #2



  • Speculative execution is not always your friend
    • Launch more tasks than needed, just in case
    • For non-idempotent jobs, extra tasks == BAD




                                                      20
Gotcha #3


  • isIdempotent() must be a first-class concept
    • Loader jobs will fail
    • Failure after first task success == not good
    • Can’t automate retry without cleanup
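
One way to picture idempotency as a first-class concept is a retry loop that asks the job before retrying. Everything here is an assumption for illustration (the `LoadJob` class, `is_idempotent`, `cleanup`, and `run_with_retry` are hypothetical names, not the deck's API).

```python
class LoadJob:
    """Hypothetical loader job used only for this illustration."""

    def __init__(self, name, idempotent, fail_times=0):
        self.name = name
        self.idempotent = idempotent
        self.fail_times = fail_times  # attempts that fail before one succeeds
        self.attempts = 0
        self.cleanups = 0

    def is_idempotent(self):
        return self.idempotent

    def cleanup(self):
        # A real loader would delete partially written rows here.
        self.cleanups += 1

    def run(self):
        self.attempts += 1
        if self.attempts <= self.fail_times:
            raise RuntimeError(f"{self.name} failed on attempt {self.attempts}")


def run_with_retry(job, max_attempts=3):
    """Retry a job, cleaning up first when the job is not idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.run()
            return True
        except RuntimeError:
            if attempt == max_attempts:
                return False
            if not job.is_idempotent():
                job.cleanup()  # undo partial writes before trying again
    return False


loader = LoadJob("vertica_load", idempotent=False, fail_times=1)
succeeded = run_with_retry(loader)
```

An idempotent job would skip the cleanup step entirely, so the scheduler can retry it blindly; a loader cannot, which is the point of the gotcha.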




                                                    21
Gotcha #4

  • Vendor code only gets you so far
    • Nice to haves == have to write
    • Favor the decorator pattern
    • Pig’s StoreFuncWrapper can help
    • Vendor open sourcing is ideal
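
The decorator approach can be illustrated with a small sketch in which nice-to-have features wrap a vendor class instead of modifying it. The classes below are hypothetical stand-ins written in Python, not Pig's StoreFuncWrapper or the Vertica connector.

```python
class VendorStorer:
    """Stand-in for vendor code that we cannot modify (hypothetical)."""

    def __init__(self):
        self.table = []

    def store(self, row):
        self.table.append(row)


class TruncatingStorer:
    """Decorator that empties the target once, before the first write."""

    def __init__(self, inner):
        self.inner = inner
        self.truncated = False

    def store(self, row):
        if not self.truncated:
            self.inner.table.clear()
            self.truncated = True
        self.inner.store(row)


class CountingStorer:
    """Decorator that counts rows passed through to the wrapped storer."""

    def __init__(self, inner):
        self.inner = inner
        self.rows_stored = 0

    def store(self, row):
        self.inner.store(row)
        self.rows_stored += 1


vendor = VendorStorer()
vendor.table.append(("stale", 0))  # leftover row from an earlier load
storer = CountingStorer(TruncatingStorer(vendor))
storer.store(("alice", 3))
storer.store(("bob", 5))
```

Each feature lives in its own wrapper, so a new vendor release only has to keep the `store` interface stable for the additions to keep working.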




                                        22
Future work
  • More VerticaStorer features
  • Multiple Vertica clusters
  • Atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events


                                              23
Acknowledgements




                   24
Questions?

 Bill Graham - @billgraham




                             25

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Kürzlich hochgeladen (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Hadoop and Vertica: Data Analytics Platform at Twitter

  • 1. Hadoop and Vertica: The Data Analytics Platform at Twitter
    • Bill Graham - @billgraham
    • Data Systems Engineer, Analytics Infrastructure
    • Hadoop Summit, June 2012
  • 2. About that pony giveaway...
  • 3. Outline
    • Architecture
    • Data flow
    • Job coordination
    • Resource management
    • Vertica integration
    • Gotchas
    • Future work
  • 4. We count things
    • 140 characters
    • 140M active users
    • 340M tweets per day
    • 80-100 TB ingested daily (uncompressed)
    • 10s of Ks daily Hadoop jobs
  • 5. Heterogeneous stack
    • Many job execution applications
      • Crane - Java ETL
      • Oink - Pig scheduler
      • Rasvelg - SQL aggregations
      • Scalding - Cascading via Scala
      • PyCascading - Cascading via Python
      • Indexing jobs
    • Our users
      • Analytics, Revenue, Growth, Search, Recommendations, etc.
      • PMs, Sales!
  • 6-12. Data flow: Analytics (architecture diagram, built up over seven slides)
    • Production Hosts: log events (via Scribe Aggregators), application data
    • MySQL/Gizzard: social graph, Tweets, user profiles
    • Third Party Imports
    • Clusters and stores: Staging Hadoop Cluster (HDFS), Main Hadoop DW, HBase, Vertica, MySQL
    • Jobs and services: Distributed Crawler, Log Mover, Crane, Oink, Rasvelg, HCatalog
    • Consumers: Analytics Web Tools; Analysts, Engineers, PMs, Sales
  • 14. System concepts
    • Loose coupling
    • Job coordination as a service
    • Resource management as a service
    • Idempotence
  • 15. Loose coupling
    • Multiple job frameworks
    • Right tool for the job
    • Common dependency management
  • 16-20. Job coordination
    • Shared batch table for job state
      • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
    • Access via client libraries
    • Jobs & data are time-based
    • 3 types of preconditions
      1. Other job success (e.g., predecessor job complete)
      2. Existence of data (e.g., HDFS input exists)
      3. User-defined (e.g., MySQL slave lag)
    • Failed jobs get retried (usually)
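
The three precondition types can be sketched roughly as follows. This is a minimal illustration, not the client library described in the talk: the `BatchRow` fields mirror the batch table columns on the slide, while the state values, the reading of `start_time`/`end_time` as the data window, and the helper names `predecessor_succeeded` and `can_run` are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass
class BatchRow:
    """One row of the shared batch table (columns as listed on the slide)."""
    id: int
    description: str
    state: str                      # assumed values: "running", "success", "failed"
    start_time: datetime            # assumed: start of the data window the job covers
    end_time: datetime              # assumed: end of that data window
    job_start_time: Optional[datetime] = None
    job_end_time: Optional[datetime] = None


def predecessor_succeeded(rows: Iterable[BatchRow], description: str,
                          start: datetime, end: datetime) -> bool:
    """Precondition 1: another job succeeded for a window covering [start, end]."""
    return any(
        r.description == description and r.state == "success"
        and r.start_time <= start and r.end_time >= end
        for r in rows
    )


def can_run(rows: Iterable[BatchRow], predecessor: str,
            start: datetime, end: datetime,
            input_exists: Callable[[], bool],
            user_check: Callable[[], bool]) -> bool:
    """Combine the three precondition types; all must hold before the job runs."""
    return (
        predecessor_succeeded(rows, predecessor, start, end)  # 1. other job success
        and input_exists()                                    # 2. existence of data
        and user_check()                                      # 3. user-defined check
    )
```

Passing the data-existence and user-defined checks as callables keeps the sketch independent of HDFS or MySQL; a real check would query those systems.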
  • 21. Resource management
    • Analytics Resource Manager - ARM!
    • Library above Zookeeper
    • Throttles jobs and workers
      • Only 1 job of this name may run at once
      • Only N jobs may be run by this app at once
      • Only M mappers may write to Vertica at once
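
The throttling rules ("only M mappers may write to Vertica at once") amount to a counted lease per named resource. The sketch below shows that idea with an in-memory counter standing in for ZooKeeper; the class name `ResourceManager`, its methods, and the resource name `vertica_writers` are hypothetical and are not ARM's actual API.

```python
import threading
from contextlib import contextmanager


class ResourceManager:
    """In-memory stand-in for a ZooKeeper-backed throttle (hypothetical API)."""

    def __init__(self, limits):
        self._limits = dict(limits)                 # resource name -> max concurrent holders
        self._held = {name: 0 for name in limits}   # resource name -> current holders
        self._lock = threading.Lock()

    def try_acquire(self, name):
        """Take one slot of the resource if any is free; never blocks."""
        with self._lock:
            if self._held[name] >= self._limits[name]:
                return False
            self._held[name] += 1
            return True

    def release(self, name):
        """Give back one slot of the resource."""
        with self._lock:
            if self._held[name] > 0:
                self._held[name] -= 1

    @contextmanager
    def lease(self, name):
        """Yield True if a slot was granted, and release it on exit."""
        granted = self.try_acquire(name)
        try:
            yield granted
        finally:
            if granted:
                self.release(name)
```

A writer would wrap its work in `with arm.lease("vertica_writers") as granted:` and back off when `granted` is false. A distributed version would also need the lease to expire when its holder dies, which this sketch does not model.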
  • 24-26. Job DAG & state transition
    • "Local view"
      • Is it time for me to run yet?
      • Are my dependencies satisfied?
      • Any resource constraints?
    • State diagram labels: Idle; granted / denied; Insert entry into batch table; Execution; Complete? (yes / no); Completion
    • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
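
One plausible reading of the state diagram is a small state machine driven by the three "local view" questions. The transitions below are an assumption reconstructed from the diagram labels (in particular, that an incomplete execution returns the job to Idle so it can be retried); the names `State` and `next_state` are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def next_state(state, *, time_to_run=False, deps_satisfied=False,
               resources_granted=False, complete=False):
    """Return the next state of a job given its local view of the world."""
    if state is State.IDLE:
        # All three questions must be answered "yes" before the job starts.
        if time_to_run and deps_satisfied and resources_granted:
            # Assumed: the job inserts its batch table entry at this point.
            return State.EXECUTION
        return State.IDLE
    if state is State.EXECUTION:
        # Assumed: an incomplete run goes back to Idle and may be retried.
        return State.COMPLETION if complete else State.IDLE
    return state  # Completion is terminal in this sketch
```

Because each job only evaluates its own conditions, no central scheduler has to hold the whole DAG; the DAG emerges from the dependencies each job declares.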
  • 27-33. Example: active users (job DAG diagram, built up over seven slides)
    • Systems: Production Hosts, Main Hadoop DW, MySQL/Gizzard, Vertica, MySQL, Analytics Dashboards
    • Log mover (via staging cluster): Scribe, web_events, sms_events
    • Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
    • Oink: user_sessions
    • Crane: user_profiles
    • Rasvelg: Join, Join Group, Count
      • Aggregations: active_by_geo, active_by_device, active_by_client, ...
    • Crane: active_by_*
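
The join, group, and count steps behind aggregations such as active_by_geo can be illustrated in a few lines. In the talk these run as SQL via Rasvelg; the sketch below only mirrors the logic in plain Python, and the column names (`user_id`, `geo`, `device`) and the function name `active_by` are assumptions, not the actual schema.

```python
from collections import defaultdict


def active_by(sessions, profiles, key):
    """Count distinct active users per value of `key`.

    `sessions` and `profiles` are lists of dicts sharing a `user_id` field;
    `key` may be a column from either side of the join.
    """
    profile_by_user = {p["user_id"]: p for p in profiles}
    users_per_value = defaultdict(set)
    for session in sessions:
        profile = profile_by_user.get(session["user_id"])
        if profile is None:
            continue  # inner join: skip sessions without a profile
        row = {**profile, **session}              # joined row
        users_per_value[row[key]].add(row["user_id"])
    # Distinct count per group.
    return {value: len(users) for value, users in users_per_value.items()}
```

For example, `active_by(sessions, profiles, "geo")` would correspond to an active_by_geo table and `active_by(sessions, profiles, "device")` to active_by_device.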
  • 34. Vertica or Hadoop?
    • Vertica
      • Loads 100s of Ks rows/second
      • Aggregate 100s of Ms rows in seconds
      • Used for low latency queries and aggregations
      • Keep a sliding window of data
    • Hadoop
      • Excels when data size is massive
      • Flexible and powerful
      • Great with nested data structures and unstructured data
      • Used for complex functions and ML
  • 35. Vertica import options
    • Direct import via Crane
      • Load into dest table, single thread, atomic
    • Atomic import via Crane/Rasvelg
      • Crane loads to temp table, single thread
      • Rasvelg moves to dest table
    • Parallel import via Oink/Pig
      • Pig job via VerticaStorer
      • ARM throttles active DB connections
    • Diagram labels: MySQL/Gizzard, Main Hadoop DW, Vertica; Crane, Rasvelg, Oink
  • 36. Vertica imports - pros/cons
    • Crane & Rasvelg
      • Good for smaller datasets, DB to DB transfers
      • Single threaded
      • Easy on Vertica
      • Hadoop not required
    • Pig
      • Great for larger datasets
      • More complex, not atomic
      • DDoS potential
    • Diagram labels: MySQL/Gizzard, Main Hadoop DW, Vertica; Crane, Rasvelg, Oink
  • 37. VerticaStorer
    • PigStorage implementation
    • From Vertica's Hadoop connector suite
    • Out of the box
      • Easy to get Hello World working
      • Well documented
      • Pig/Vertica data bindings work well
      • Fast!
      • Transaction-aware tasks
      • No bugs found
    • Open source?
  • 38-39. Pig VerticaStorage
    • Our enhancements
      • Connection credential management
      • Truncate before load option
      • Throttle concurrent writers via ZK
    • Future features
      • Counters for rows inserted/rejected
      • Name-based tuple-column bindings
      • Atomic load via temp table
    • Example:

        SET mapred.map.tasks.speculative.execution false;
        user_sessions = LOAD '/processed/user_sessions/2012/06/14';
        STORE user_sessions INTO '{db_schema.user_sessions}' USING
            com.twitter.twadoop.pig.store.VerticaStorage(
                'config/db.yml', 'db_name', 'arm_resource_name');
  • 40. Gotcha #1
    • MR data load is not atomic
    • Avoid partial reads
      • Option 1: load to temp table, then insert direct
      • Option 2: add job dependency concept
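
Option 1 (load to a temp table, then insert into the destination) can be sketched as below. SQLite is used purely as a runnable stand-in for Vertica, and the table names `user_sessions` and `user_sessions_staging` and the function name `atomic_load` are assumptions for illustration; the point is only that readers of the destination table never see a partially loaded batch.

```python
import sqlite3


def atomic_load(conn, rows):
    """Load rows into a staging table, then publish them in one transaction."""
    # Step 1: the slow, possibly multi-writer load goes to the staging table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_sessions_staging (user_id INTEGER, day TEXT)"
    )
    conn.execute("DELETE FROM user_sessions_staging")
    conn.executemany("INSERT INTO user_sessions_staging VALUES (?, ?)", rows)
    conn.commit()

    # Step 2: move everything to the destination table atomically.
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT INTO user_sessions SELECT user_id, day FROM user_sessions_staging"
        )
        conn.execute("DELETE FROM user_sessions_staging")
```

If the load in step 1 fails partway, the destination table is untouched and the staging table can simply be cleared and reloaded.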
  • 41. Gotcha #2
    • Speculative execution is not always your friend
      • Launch more tasks than needed, just in case
      • For non-idempotent jobs, extra tasks == BAD
  • 42. Gotcha #3
    • isIdempotant() must be a first-class concept
      • Loader jobs will fail
      • Failure after first task success == not good
      • Can't automate retry without cleanup
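
A retry loop that treats idempotence as a first-class property might look like the sketch below. It is an illustration under assumptions rather than the coordination code from the talk: the `Job` fields and the function `run_with_retry` are hypothetical names. Idempotent jobs are simply re-run; non-idempotent jobs are re-run only after their cleanup step, and fail immediately if they have none.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    run: Callable[[], object]
    is_idempotent: bool
    cleanup: Optional[Callable[[], None]] = None  # undoes partial output


def run_with_retry(job, max_attempts=3):
    """Run a job, retrying on failure only when a retry is safe."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job.run()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts
            if not job.is_idempotent:
                if job.cleanup is None:
                    raise  # cannot automate a retry without cleanup
                job.cleanup()  # remove partial output before re-running
```

The same flag could also be used to decide whether speculative execution is safe to leave enabled for a job.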
  • 43. Gotcha #4
    • Vendor code only gets you so far
    • Nice to haves == have to write
    • Favor the decorator pattern
      • Pig's StoreFuncWrapper can help
    • Vendor open sourcing is ideal
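
The decorator pattern mentioned here wraps vendor code behind the same interface and adds behavior around each call, without modifying the vendor class. The sketch below shows the shape of the idea in Python rather than Pig's Java API; `VendorStorer`, `CountingStorer`, and the `put_next` method are hypothetical stand-ins, and the counters echo the "rows inserted/rejected" feature listed as future work.

```python
class VendorStorer:
    """Stand-in for vendor-supplied store code that we cannot modify."""

    def __init__(self):
        self.rows = []

    def put_next(self, row):
        if row is None:
            raise ValueError("cannot store an empty row")
        self.rows.append(row)


class CountingStorer:
    """Decorator: same interface as the wrapped storer, plus row counters."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.inserted = 0
        self.rejected = 0

    def put_next(self, row):
        try:
            self._wrapped.put_next(row)  # delegate the real work
        except ValueError:
            self.rejected += 1           # count the failure instead of aborting
        else:
            self.inserted += 1
```

Because the wrapper exposes the same method as the vendor class, callers need no changes, and further wrappers (throttling, credential handling) can be stacked the same way.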
  • 44. Future work
    • More VerticaStorer features
    • Multiple Vertica clusters
    • Atomic DB loads with Pig/Oink
    • Better DAG visibility
    • Better job history visibility
    • MR job optimizations via historic stats
    • HCatalog data registry
    • Job push events
  • 46. Questions?
    • Bill Graham - @billgraham