Analyzing Twitter Data with Hadoop
    DevIgnition Conference, December 2012
    Joey Echeverria | Principal Solutions Architect
    joey@cloudera.com | @fwiffo




1                               ©2012 Cloudera, Inc.
About Joey

    • Principal Solutions Architect
    • 18 months
    • 4+ years
    • Local





     BUILDING A BIG DATA SOLUTION

Big Data

    •   Big
        •   Larger volume than you’ve handled before
              •   No litmus test
        •   High value, underutilized
    •   Data
        •   Structured
        •   Unstructured
        •   Semi-structured
    •   Hadoop
        • Distributed file system
        • Distributed, batch computation


Data Management Systems




       Data Source -> Data Ingestion -> Data Storage -> Data Processing




Relational Data Management Systems




       Data Source -> ETL -> RDBMS -> Reporting




A Canonical Hadoop Architecture




       Data Source -> Flume -> HDFS -> Hive (Impala)





     AN EXAMPLE USE CASE

Analyzing Twitter

    • Social media popular with marketing teams
    • Twitter is an effective tool for promotion
    • Who is influential?
        •   Tweets
        •   Followers
        •   Retweets
             •   Similar to e-mail forwarding
    • Which Twitter user gets the most retweets?
    • Who is influential in our industry?



      HOW DO WE ANSWER THESE QUESTIONS?

Techniques

     •   SQL
         •   Filtering
         •   Aggregation
         •   Sorting
     •   Complex data
         •   Deeply nested
         •   Variable schema




Architecture



     Twitter -> Flume (custom Flume source) -> sink to HDFS -> HDFS -> JSON SerDe parses data -> Hive
     Oozie -> adds partitions hourly -> Hive





      TWITTER SOURCE

Flume

     • Streaming data flow
     • Sources
         •   Push or pull
     • Sinks
     • Event based
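
The source, sink, and event ideas above can be sketched in plain Java. This is only an illustration of the flow, assuming an in-memory queue as the channel between the two ends. MiniFlow and its method are hypothetical names and are not part of Flume's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Stand-in for a source -> channel -> sink flow; none of these names are Flume classes.
class MiniFlow {
    // Sentinel event telling the sink to stop; compared by identity, not content.
    private static final String STOP = new String("stop");

    static List<String> run(List<String> incoming) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);
        List<String> delivered = new ArrayList<>();

        // Sink: drains the channel one event at a time.
        Thread sink = new Thread(() -> {
            try {
                for (String event = channel.take(); event != STOP; event = channel.take()) {
                    delivered.add(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sink.start();

        // Source: pushes each incoming item into the channel as a discrete event.
        try {
            for (String event : incoming) {
                channel.put(event);
            }
            channel.put(STOP);
            sink.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while moving events", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("tweet-1", "tweet-2", "tweet-3")));
    }
}
```

The bounded queue is the design point: a slow sink makes the source block instead of letting events pile up without limit.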




Pulling Data From Twitter

• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS

• HDFS Sink comes stock with Flume
• Easily separate files by creation time
    •   hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
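
A rough sketch of how the %Y/%m/%d/%H escapes in the path above turn an event timestamp into an hourly directory. It assumes the timestamp is epoch milliseconds (as in the "timestamp" header set by the source) and uses UTC for reproducibility; the time zone a real sink uses depends on its configuration. HourlyPath is a hypothetical helper, not Flume code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only: mimics filling in %Y/%m/%d/%H from an event timestamp.
class HourlyPath {
    // Year/month/day/hour directories; UTC is an assumption made for this sketch.
    private static final DateTimeFormatter HOUR_DIRS =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // baseDir is expected to end with a slash, as in the path on the slide.
    static String resolve(String baseDir, long timestampMillis) {
        return baseDir + HOUR_DIRS.format(Instant.ofEpochMilli(timestampMillis)) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets/";
        // 1354366800000 ms is 2012-12-01 13:00:00 UTC.
        System.out.println(resolve(base, 1354366800000L));
    }
}
```

Every tweet created in the same hour lands in the same directory, which is what later makes one Hive partition per hour possible.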
Flume Source
     public class TwitterSource extends AbstractSource
         implements EventDrivenSource, Configurable {
       ...
       // The initialization method for the Source. The context contains all
       // the Flume configuration info
       @Override
       public void configure(Context context) {
         ...
       }
       ...
       // Start processing events. Uses the Twitter Streaming API to sample
       // Twitter, and process tweets.
       @Override
       public void start() {
         ...
       }
       ...
       // Stops Source's event processing and shuts down the Twitter stream.
       @Override
       public void stop() {
         ...
       }
     }

Twitter API

     •   Callback mechanism for catching new tweets
     /** The actual Twitter stream. It's set up to collect raw JSON data */
     private final TwitterStream twitterStream = new TwitterStreamFactory(
       new ConfigurationBuilder().setJSONStoreEnabled(true).build())
         .getInstance();
     ...
     // The StatusListener is a twitter4j API that can be added to a stream,
     // and will call a method every time a message is sent to the stream.
     StatusListener listener = new StatusListener() {
       // The onStatus method is executed every time a new tweet comes in.
       public void onStatus(Status status) {
         ...
       }
     };
     ...
     // Set up the stream's listener (defined above), and set any necessary
     // security information.
     twitterStream.addListener(listener);
     twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
     AccessToken token = new AccessToken(accessToken, accessTokenSecret);
     twitterStream.setOAuthAccessToken(token);

JSON Data

     •     JSON data is processed as an event and written to HDFS
     public void onStatus(Status status) {
      // The EventBuilder is used to build an event using the headers and
      // the raw JSON of a tweet

         headers.put("timestamp", String.valueOf(
          status.getCreatedAt().getTime()));
         Event event = EventBuilder.withBody(
          DataObjectFactory.getRawJSON(status).getBytes(), headers);

         channel.processEvent(event);
     }





      FLUME DEMO

      HIVE

What is Hive?

     • Created at Facebook
     • HiveQL
         •   SQL-like interface
     • Hive interpreter converts HiveQL to MapReduce code
     • Returns results to the client


Hive Details

     • Schema on read
     • Scalar types (int, float, double, boolean, string)
     • Complex types (struct, map, array)
     • Metastore contains table definitions
         •   Stored in a relational database
         •   Similar to catalog tables in other DBs




Complex Data

     SELECT
       t.retweeted_screen_name,
       sum(retweets) AS total_retweets,
       count(*) AS tweet_count
     FROM (SELECT
             retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             max(retweet_count) AS retweets
           FROM tweets
           GROUP BY
             retweeted_status.user.screen_name,
             retweeted_status.text) t
     GROUP BY t.retweeted_screen_name
     ORDER BY total_retweets DESC
     LIMIT 10;
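
To make the two levels of grouping in the query above easier to follow, here is the same logic as a small in-memory Java sketch. It is not how Hive executes the query. Tweet and UserTotal are hypothetical classes standing in for flattened table rows and result rows.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopRetweeted {
    // Hypothetical flattened row: retweeted user's screen name, tweet text, retweet count.
    static final class Tweet {
        final String screenName;
        final String text;
        final int retweetCount;
        Tweet(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // One output row of the outer query.
    static final class UserTotal {
        final String screenName;
        final long totalRetweets;
        final long tweetCount;
        UserTotal(String screenName, long totalRetweets, long tweetCount) {
            this.screenName = screenName;
            this.totalRetweets = totalRetweets;
            this.tweetCount = tweetCount;
        }
    }

    static List<UserTotal> top(List<Tweet> tweets, int limit) {
        // Inner query: one entry per (screen name, text), keeping max(retweet_count).
        Map<List<String>, Integer> maxPerTweet = tweets.stream().collect(
            Collectors.toMap(t -> Arrays.asList(t.screenName, t.text),
                             t -> t.retweetCount,
                             Integer::max));

        // Outer query: sum(retweets) and count(*) per screen name.
        Map<String, long[]> perUser = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            long[] acc = perUser.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }

        // ORDER BY total_retweets DESC LIMIT n.
        return perUser.entrySet().stream()
            .map(e -> new UserTotal(e.getKey(), e.getValue()[0], e.getValue()[1]))
            .sorted(Comparator.comparingLong((UserTotal u) -> u.totalRetweets).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

The inner max matters because the same original tweet appears once per retweet seen in the stream, each copy carrying the retweet count at that moment; only the largest count per tweet should be summed.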




      JSON INTERLUDE

What is JSON?

     • Complex, semi-structured data
     • Based on JavaScript’s data syntax
     • Rich, nested data types:
         •   number
         •   string
         •   array
         •   object
         •   true, false
         •   null


What is JSON?
     {
       "retweeted_status": {
         "contributors": null,
         "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
         "retweeted": false,
         "entities": {
           "hashtags": [
             {
               "text": "Crowdsourcing",
               "indices": [0, 14]
             },
             {
               "text": "bigdata",
               "indices": [129,137]
             }
           ],
           "user_mentions": []
         }
       }
     }
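
As a sketch of how the nesting above is navigated, the snippet below rebuilds part of the sample by hand as Java maps and lists (objects become maps, arrays become lists) and pulls out the hashtag text values. A real pipeline would rely on a JSON parser or the Hive SerDe; NestedTweet and its methods are hypothetical. In HiveQL the same walk would look roughly like retweeted_status.entities.hashtags[0].text, assuming the table declares those fields as structs and arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NestedTweet {
    // One element of the "hashtags" array.
    static Map<String, Object> hashtag(String text, int start, int end) {
        Map<String, Object> tag = new HashMap<>();
        tag.put("text", text);
        tag.put("indices", Arrays.asList(start, end));
        return tag;
    }

    // Hand-built stand-in for part of the JSON sample (the long "text" field is omitted).
    static Map<String, Object> sample() {
        Map<String, Object> entities = new HashMap<>();
        entities.put("hashtags", Arrays.asList(
            hashtag("Crowdsourcing", 0, 14), hashtag("bigdata", 129, 137)));
        entities.put("user_mentions", Collections.emptyList());

        Map<String, Object> status = new HashMap<>();
        status.put("contributors", null);
        status.put("retweeted", false);
        status.put("entities", entities);

        Map<String, Object> tweet = new HashMap<>();
        tweet.put("retweeted_status", status);
        return tweet;
    }

    // Walks retweeted_status -> entities -> hashtags and collects each "text".
    @SuppressWarnings("unchecked")
    static List<String> hashtagTexts(Map<String, Object> tweet) {
        Map<String, Object> status = (Map<String, Object>) tweet.get("retweeted_status");
        Map<String, Object> entities = (Map<String, Object>) status.get("entities");
        List<Map<String, Object>> hashtags = (List<Map<String, Object>>) entities.get("hashtags");
        List<String> texts = new ArrayList<>();
        for (Map<String, Object> tag : hashtags) {
            texts.add((String) tag.get("text"));
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(hashtagTexts(sample()));
    }
}
```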



Hive Serializers and Deserializers

     • Instructs Hive on how to interpret data
     • JSONSerDe





      HIVE DEMO

      IT’S A TRAP

Not a Database

                       RDBMS                        Hive
 Language              Generally >= SQL-92          Subset of SQL-92 plus Hive-specific extensions
 Update Capabilities   INSERT, UPDATE, DELETE       INSERT OVERWRITE; no UPDATE, DELETE
 Transactions          Yes                          No
 Latency               Sub-second                   Minutes
 Indexes               Yes                          Yes
 Data size             Terabytes                    Petabytes




      IMPALA ASIDE

Cloudera Impala
     Real-Time Query for Data Stored in Hadoop.

     • Supports Hive SQL
     • 4-30X faster than Hive over MapReduce
     • Supports multiple storage engines & file formats
     • Uses existing drivers, integrates with existing metastore, works with leading BI tools
     • Flexible, cost-effective, no lock-in
     • Deploy & operate with Cloudera Enterprise RTQ

Benefits of Cloudera Impala
     Real-Time Query for Data Stored in Hadoop

                            • Real-time queries run directly on source data
                            • No ETL delays
                            • No jumping between data silos

                            •   No double storage with EDW/RDBMS
                            •   Unlock analysis on more data
                            •   No need to create and maintain complex ETL between systems
                            •   No need to preplan schemas

                            • All data available for interactive queries
                            • No loss of fidelity from fixed data schemas


                            • Single metadata store from origination through analysis
                            • No need to hunt through multiple data silos



Cloudera Impala Details
     • Unified metadata and scheduler: Hive Metastore, YARN, HDFS NN
     • Common Hive SQL and interface: SQL App, ODBC
     • State Store: low-latency scheduler and cache (low-impact failures)
     • Fully MPP, distributed: each node runs a Query Planner, Query Coordinator, and Query Exec Engine
     • Local direct reads: each node reads from its own HDFS DN and HBase



      OOZIE AUTOMATION

Oozie: Everything in its Right Place
Oozie for Partition Management

• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
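
A hedged sketch of what the hourly job might generate: the HiveQL statement that registers one hour's directory as a partition. It assumes the table is named tweets (as in the earlier query), that it is partitioned by an integer column hypothetically called datehour, and that files sit under /user/flume/tweets/ in year/month/day/hour directories. The real workflow may differ; AddPartition is an illustrative helper only.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class AddPartition {
    // Partition key format, e.g. 2012120113 (the "datehour" column is an assumption).
    private static final DateTimeFormatter KEY = DateTimeFormatter.ofPattern("yyyyMMddHH");
    // Directory layout matching the hourly paths written by the HDFS sink.
    private static final DateTimeFormatter DIRS = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Builds the statement an hourly job could submit to Hive for the given hour.
    static String statement(LocalDateTime hour) {
        return "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = "
            + hour.format(KEY) + ") LOCATION '/user/flume/tweets/"
            + hour.format(DIRS) + "'";
    }

    public static void main(String[] args) {
        System.out.println(statement(LocalDateTime.of(2012, 12, 1, 13, 0)));
    }
}
```

IF NOT EXISTS keeps the statement safe to re-run if the scheduler retries an hour.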

      OOZIE DEMO

      PUTTING IT ALL TOGETHER

Complete Architecture



     Twitter -> Flume (custom Flume source) -> sink to HDFS -> HDFS -> JSON SerDe parses data -> Hive
     Oozie -> adds partitions hourly -> Hive





      MORE DEMOS

What next?

• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
    •   https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
•   Clone the source repo
    •   https://github.com/cloudera/cdh-twitter-example
My personal preference

•   Cloudera Manager
    •   https://ccp.cloudera.com/display/SUPPORT/Downloads
•   Free up to 50 nodes
Shout Out

• Jon Natkins
• @nattybnatkins
• Blog posts
    •   http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
    •   http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
    •   http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
Questions?

• Contact me!
• Joey Echeverria
• joey@cloudera.com
• @fwiffo




•   We’re hiring!