2. Session Agenda
   - The Problem
   - Data Streams
   - Databases
   - Job Invocation and Workflow
3. The Problem
   - Tons of data (otherwise you're in the wrong room)
   - Tons of existing systems: RDBMS, caches, EDW, messaging, reporting, scheduling and job control, management and monitoring
4. The Problem (continued)
   - Same landscape as above; we'll focus on streams, databases, and job control
7. Other Uses of Streams
   - Streams are more than just logs
   - JMS and AMQP messages: wire-tap them and send a copy to Flume (sketch below)
   - Turn incremental updates into stream data to avoid the DBMS "middleman" (or send to both)
   - Many existing problems can be turned into asynchronous streams
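As a rough illustration of the wire-tap idea, here is a minimal sketch that forwards each incoming message to a Flume Avro source using the Flume NG client SDK. The agent hostname, port, and class names are assumptions for illustration, not part of the original deck.

   // A minimal sketch of a JMS/AMQP "wire tap": on each message received,
   // forward a copy to a Flume Avro source via the Flume NG client SDK.
   // "flume-agent.example.com" and port 41414 are placeholders; point them
   // at whatever Avro source your agent actually exposes.
   import java.nio.charset.StandardCharsets;
   import org.apache.flume.Event;
   import org.apache.flume.api.RpcClient;
   import org.apache.flume.api.RpcClientFactory;
   import org.apache.flume.event.EventBuilder;

   public class WireTap {
     private final RpcClient client =
         RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);

     // Call this from your JMS/AMQP message listener.
     public void onMessage(String body) throws Exception {
       Event event = EventBuilder.withBody(body, StandardCharsets.UTF_8);
       client.append(event);   // send a copy of the message to Flume
     }

     public void close() {
       client.close();
     }
   }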
10. Relational Databases
   - Basic approach: queries (out), inserts (in)
     - Works, but slow
     - Beware the Hadoop -> RDBMS DDoS attack
     - Manage transactions on inserts
   - Smarter: lower-level export/import tools
     - Go to text formats
     - Think in batches (a MapReduce job to convert text <-> SequenceFiles)
   - Use Sqoop! (example below)
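To make the Sqoop bullet concrete, a hypothetical import/export pair might look like the following; the connection string, credentials, table names, and HDFS directories are placeholders.

   # Hypothetical bulk import: pull the "orders" table into HDFS as
   # SequenceFiles using 8 parallel map tasks.
   sqoop import \
     --connect jdbc:mysql://db.example.com/sales \
     --username etl_user -P \
     --table orders \
     --as-sequencefile \
     --target-dir /incoming/orders \
     --num-mappers 8

   # Hypothetical bulk export of results back into the warehouse.
   sqoop export \
     --connect jdbc:mysql://db.example.com/sales \
     --username etl_user -P \
     --table order_summaries \
     --export-dir /output/order_summaries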
11. Pattern: Incremental Merge
   - Batch the incoming edits
   - Run a MapReduce job to apply the updates:
     - Input: original dataset + incoming batch(es)
     - Group by record ID
     - Secondary sort on timestamp, descending
     - The reducer selects the newest record, or merges changes over time
   - Represent delete operations with a DELETE surrogate record, i.e. <timestamp>, <record id>, DELETE
   (A reducer sketch follows below.)
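A minimal sketch of the reduce side of this merge, assuming the job is already configured for a secondary sort so each record's edits arrive newest-first, and assuming a tab-separated <timestamp>, <record id>, <payload or DELETE> layout; both are illustrative choices, not something the deck specifies.

   // Reduce side of the incremental merge. Assumes a partitioner/grouping
   // comparator on record id and a sort that orders each group's values
   // newest-first, so the first value seen is the latest edit. (A real
   // secondary sort would use a composite (id, timestamp) key; plain Text
   // is used here to keep the sketch short.)
   import java.io.IOException;
   import org.apache.hadoop.io.NullWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Reducer;

   public class IncrementalMergeReducer
       extends Reducer<Text, Text, Text, NullWritable> {

     @Override
     protected void reduce(Text recordId, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
       for (Text value : values) {
         // First value is the newest version of this record.
         String[] fields = value.toString().split("\t", 3);
         boolean isDelete = fields.length > 2 && "DELETE".equals(fields[2]);
         if (!isDelete) {
           context.write(value, NullWritable.get());  // keep the newest record
         }
         return;  // drop all older versions (or merge them here instead)
       }
     }
   }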
12. Job Invocation
   - Jobs are usually triggered by: time, data arrival, a service interface, or an external event
   - Production systems must monitor for successful completion; jobs can fail (just like tasks)
   - Build for job atomicity and clean recovery (sketch below)
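One common way to get that atomicity is to write job output to a temporary directory and only publish it with a rename once the job has succeeded. The sketch below uses the newer MapReduce API; the paths, job name, and class are placeholders.

   // Write to a temp directory and rename it to the final location only if
   // the whole job succeeded; otherwise delete it so a retry starts clean.
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class AtomicJobRunner {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       Path input = new Path("/work/current-batch");
       Path tmpOutput = new Path("/output/_tmp/daily-report");
       Path finalOutput = new Path("/output/daily-report");  // must not already exist

       Job job = Job.getInstance(conf, "daily-report");
       FileInputFormat.addInputPath(job, input);
       FileOutputFormat.setOutputPath(job, tmpOutput);
       // ... set mapper, reducer, and key/value classes here ...

       FileSystem fs = FileSystem.get(conf);
       if (job.waitForCompletion(true)) {
         fs.rename(tmpOutput, finalOutput);  // publish with a single rename
       } else {
         fs.delete(tmpOutput, true);         // discard partial output
         System.exit(1);
       }
     }
   }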
13. Workflow
   - For complex chains of jobs, use a workflow engine
   - Most systems support different types of steps (e.g. Java MapReduce, Pig, Hive, HDFS commands, shell scripts)
   - Don't write your own
   - Hadoop-specific: Oozie, Cascading, Azkaban, ...
   - General ETL: Spring Batch, Kettle, ...
   (A minimal Oozie sketch follows below.)
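To make the Oozie option concrete, a minimal single-action workflow definition might look roughly like this; the workflow name, ${...} parameters, and property values are placeholders you would define per job.

   <!-- Minimal Oozie workflow sketch: one map-reduce action with
        success/failure transitions. -->
   <workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.2">
     <start to="ingest"/>
     <action name="ingest">
       <map-reduce>
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <configuration>
           <property>
             <name>mapred.input.dir</name>
             <value>${inputDir}</value>
           </property>
           <property>
             <name>mapred.output.dir</name>
             <value>${outputDir}</value>
           </property>
         </configuration>
       </map-reduce>
       <ok to="end"/>
       <error to="fail"/>
     </action>
     <kill name="fail">
       <message>Ingestion failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
     </kill>
     <end name="end"/>
   </workflow-app>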
14. Really, don't write your own!
   "What was most interesting is that of the people using a homegrown system, only one said they were at all happy with it, and none would recommend their system."
   -- Kevin D. Peterson, http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
15. Example: Ingestion
   Every N minutes, process new data from /incoming:
   - Move the data into a working directory named by timestamp
   - Transform the records
   - Split the input data by day into separate outputs

16. Example: Ingestion
   What's so special here?

17. Example: Ingestion
   Moving the data into a timestamped working directory:
   - Allows jobs to run concurrently
   - Isolates the input, so there is no duplicate processing
   - On failure, simply move the data back into /incoming
   (A sketch of this move step appears after slide 19.)

18. Example: Ingestion
   Transforming the records is nothing special. This is your logic. (OK, your logic is special... you know what I mean.)

19. Example: Ingestion
   Splitting the input data by day into separate outputs:
   - If we make no assumptions about the input, we recover from previous failures
   - Daily rollover no longer matters
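A minimal sketch of the "claim the input" step from this example: move whatever is currently in /incoming into a timestamped working directory before processing it. The /incoming path comes from the slides; the /work prefix, timestamp format, and class name are assumptions.

   // Claim this run's input by moving files from /incoming into /work/<ts>.
   // New arrivals keep landing in /incoming untouched, and concurrent runs
   // never share files. On job failure, rename the files back into /incoming.
   import java.text.SimpleDateFormat;
   import java.util.Date;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class ClaimIncoming {
     public static Path claim(Configuration conf) throws Exception {
       FileSystem fs = FileSystem.get(conf);
       Path incoming = new Path("/incoming");
       String ts = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
       Path workDir = new Path("/work/" + ts);
       fs.mkdirs(workDir);

       for (FileStatus file : fs.listStatus(incoming)) {
         fs.rename(file.getPath(), new Path(workDir, file.getPath().getName()));
       }
       return workDir;  // hand this isolated directory to the job as its input
     }
   }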
20. Recap
   - Data streams are everywhere; use Flume
   - For bulk relational database import and export, use Sqoop
   - Consider asynchronous updates to large data stores
   - Incremental merges are possible
   - Use a workflow system for complex jobs
   - Job atomicity is critical
   - ETL best practices still apply