Offline Processing with
        Hadoop

     Chris K Wensel
     Concurrent, Inc.
Introduction
                Chris K Wensel
               chris@wensel.net

• Cascading, Lead Developer
    • http://cascading.org/
• Concurrent, Inc., Founder
    • Hadoop/Cascading support and tools
    • http://concurrentinc.com/
Computing Systems


[Diagram: data, info, value]

• Exist to create value out of data
• Everything else is an implementation
  detail
In Today's Computing Environment

• Lots of relevant medium-large data sets
  – that individually could fit in an RDBMS
• Lots of applications touching that data
  – where do you think Perl came from?
• Underutilized hardware owning
  (intermediate) data
  – xen/vmware add complexity (sprawl)
continued...
• Raw data continuously arriving (and in
  bursts)
  – we mostly care about the new stuff
• Raw data is dirty
  – bots and bugs
• Demands on timely/predictable result
  availability
  – downstream systems must be fed
• The ‘Cloud’ is enabling an on-demand
  model
Data Warehousing != Data Processing

[Diagram: ETL as hub and spoke [monolithic] versus process streams [distributed]]



• Data Warehousing
  – monolithic systems and data schema
  – distribution through manual federation/
    sharding
• Data Processing
  – cluster of peer systems
  – dynamic even distribution of data and
    processing
Data Warehousing
[Diagram: raw data (loggers), ETL, data warehouse [cache], ETL, reporting [BI, KPI, etc], data mining, product, Consumer; an Analyst works on some data with R, SAS, Excel, etc]


• Agility, no “one size fits all” schema,
  resistant to change
• Complex Analytics, cannot be represented
  by SQL
• Massive Data Sets, won’t fit or too
Production Data Processing
[Diagram: raw data (loggers), data processing, valuable data, Consumer]

• Online / Real-Time
  – low latency (milliseconds to seconds for
    results)
  – smaller datasets - streams
• Offline / Batch
  – high latency (minutes to days for results)
  – larger datasets - files
Hadoop Adoption
[Diagram: a Cluster of Racks, each containing Nodes, unified by a Global Compute-space and a Global Namespace]

• Distributed replicated storage for large
  files
• Distributed fault-tolerant execution of batch
  processes
• Scale out vs (legacy) scale up
• Java API allows complex analysis
But Stuffed into Legacy Roles
[Diagram: raw data (loggers), ETL, data warehouse (Hadoop + pig / hive), ETL, data mining, Analyst]

• Hadoop deployments mirror legacy
  architectures
  – ETL into cached “structured storage”
• Pig/Hive are syntaxes for Data Mining
  “Big” data
  – SQL like, but hard to customize and not
    “advanced”
Hadoop for Data Processing
[Diagram: Value Creation, Scalability, Simplicity]

• More Value through Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
Simplicity
[Diagram: a Cluster of Racks and Nodes; the nodes' cpus form a Global Compute-space and their disks a Global Namespace]

• Virtualization across resources, not
  within (PaaS)
  – A single FileSystem across disks - no DBA
  – A single Execution System across CPUs -
    less IT
Scalability
[Diagram: Users' Clients submit jobs to a Cluster of Racks and Nodes]

• Scalability - continued reliability and met
  expectations as demand changes
• Application Scalability - data grows, app/
  infra expand
• Organizational Scalability - simpler infra
Creating Value
[Diagram: a Producer's loggers supply raw data to data processing (etl and analytics, each on Hadoop + Cascading); its outputs (events, reporting, product, operational) reach the Consumer as Value]

• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
Consequences
• Improved reliability of production
  processes
  – “we had a failed disk yet jobs never
    failed”
• Greater utilization of hardware
  resources
  – dynamically moves code to available
    cores
• Increased rate of innovation
  – diverse analytics over larger sets, less
    bureaucracy
• Fewer staff
Hadoop MapReduce
[Diagram: two chained jobs, a Count Job and a Sort Job. In each job, [ k, v ] pairs read from a File enter Map, Reduce receives [ k, [v] ] groups, and Reduce writes [ k, v ] pairs to the next File.]

[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection

• Nearly impossible to “think in”
• Apps are many dependent MR jobs
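
To make the Count Job and Sort Job chain concrete, here is a minimal in-memory sketch in plain Python. It does not use the Hadoop API: `run_job`, the mapper and reducer names, and the sample lines are all hypothetical, and sorting keys inside `run_job` only imitates the ordering a real shuffle provides.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass over (key, value) pairs."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # build [k, [v]]
    output = []
    for key in sorted(groups):                  # keys reach reducers in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Count job: (line number, line) -> (word, count)
def count_map(_line_no, line):
    for word in line.split():
        yield word.lower(), 1


def count_reduce(word, ones):
    yield word, sum(ones)


# Sort job: re-key by negated count so the key ordering ranks the words
def sort_map(word, count):
    yield -count, word


def sort_reduce(neg_count, words):
    for word in sorted(words):
        yield word, -neg_count


lines = [(0, "the quick fox"), (1, "the lazy dog"), (2, "the fox")]
counts = run_job(lines, count_map, count_reduce)   # first job
ranked = run_job(counts, sort_map, sort_reduce)    # second job depends on the first
print(ranked)
```

Even this small task needs two dependent jobs and a re-keying trick in between, which is the sense in which the model is hard to "think in".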
Cascading
[Diagram: Word Count/Sort Flow. Data, Parse, Group, Count, Sort, Data, with [ f1, f2, ... ] tuples passed between operations; the flow is laid out over two Map/Reduce stages.]

[ f1, f2, ... ] = tuples with field names


• Alternative model & API to MapReduce
  – pipe/filters of re-usable operations
• For rapidly implementing Data Processing
  Systems
• Open-Source
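
The pipe/filter idea can be sketched in plain Python as a chain of reusable operations over tuples with named fields. This is not the Cascading API: the functions `parse`, `group`, `count`, `sort_by`, and `flow`, the field names, and the sample data are assumptions made for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def parse(records, source_field, target_field):
    """Split a text field into words, emitting one tuple per word."""
    for record in records:
        for word in record[source_field].split():
            yield {target_field: word.lower()}


def group(records, field):
    """Collect tuples that share the same value of `field`."""
    ordered = sorted(records, key=itemgetter(field))
    for key, members in groupby(ordered, key=itemgetter(field)):
        yield key, list(members)


def count(groups, field, count_field):
    """Replace each group with a single tuple holding its size."""
    for key, members in groups:
        yield {field: key, count_field: len(members)}


def sort_by(records, field, descending=False):
    """Order tuples by one named field (stable for ties)."""
    return sorted(records, key=itemgetter(field), reverse=descending)


def flow(source, *stages):
    """Connect the stages into one pipeline and run it."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


source = [{"line": "the quick fox"}, {"line": "the lazy dog"}, {"line": "the fox"}]
result = flow(
    source,
    lambda rs: parse(rs, "line", "word"),
    lambda rs: group(rs, "word"),
    lambda rs: count(rs, "word", "count"),
    lambda rs: sort_by(rs, "count", descending=True),
)
print(result)
```

The stages are described by what they do to named fields, not by which map or reduce phase they run in; in this sketch nothing is distributed, so the mapping onto jobs is left out entirely.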
Emerging Tool Support
• Karmasphere IDE (soon)
  – Developing and Debugging
• Bixo (Bixo Labs) Data Mining Toolkit
  – Apache Nutch replacement
  – Easier to customize to meet new business
    models
• Clojure & JRuby Domain Specific
  Languages (DSL)
  – Machine Learning
  – Simple/Complex Ad-Hoc queries
Practical Applications
• Log/event analysis, device and system
  monitoring
• Web crawling and content mining
• Behavior ad-targeting segmentation
• Ad campaign ROI
• Demand and event prediction
• POS analytics for product demand pricing
Successes
• Publicis/RazorFish - Behavioral Ad-
  Targeting
  – Cascading + AWS (Elastic MapReduce)
  – Daily automated User Behavior
    Segmentation
  – 6wks dev, 3T/day, $13k/mo
  – 500% increase in return on ad spend
    from a similar campaign a year before
continued...
• FlightCaster - Predicting flight delays
  – Clojure + Cascading + AWS
  – Machine learning and production
    processing
  – 3mos dev, 10G/day, <1T total currently,
    <$2k/mo
• Etsy - Online Marketplace
  – JRuby + Cascading
  – Data mining (Hadoop as a DW!)
  – 750M page-views/mo, 60G/day of logs
Resources
• Chris K Wensel
  – chris@wensel.net
  – @cwensel
• Cascading
  – an API for optimizing production data
    processing
  – http://cascading.org
• Concurrent, Inc.
  – Support and Mentoring
  – http://concurrentinc.com

Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Processing Big Data

  • 1. Offline Processing with Hadoop Chris K Wensel Concurrent, Inc.
  • 2. Introduction Chris K Wensel chris@wensel.net • Cascading, Lead Developer • http://cascading.org/ • Concurrent, Inc., Founder • Hadoop/Cascading support and tools • http://concurrentinc.com/
  • 3. Computing Systems [diagram: data, info, value] • Exist to create value out of data • Everything else is an implementation detail
  • 4. In Today's Computing Environment • Lots of relevant medium-large data sets – that individually could fit in an RDBMS • Lots of applications touching that data – where do you think PERL came from? • Underutilized hardware owning (intermediate) data – xen/vmware add complexity (sprawl)
  • 5. continued... • Raw data continuously arriving (and in bursts) – we mostly care about the new stuff • Raw data is dirty – bots and bugs • Demands on timely/predictable result availability – downstream systems must be fed • The ‘Cloud’ is enabling an on-demand model
  • 6. Data Warehousing != Data Processing [diagram: ETL, hub and spoke (monolithic) vs. process streams (distributed)] • Data Warehousing – monolithic systems and data schema – distribution through manual federation/sharding • Data Processing – cluster of peer systems – dynamic even distribution of data and processing
  • 7. Data Warehousing [diagram: loggers, raw data, ETL, data warehouse [cache], reporting [BI, KPI, etc], data mining (R, SAS, Excel, etc; Analyst), some data, product, Consumer] • Agility, no “one size fits all” schema, resistant to change • Complex Analytics, cannot be represented by SQL • Massive Data Sets, won’t fit or too
  • 8. Production Data Processing [diagram: loggers, raw data, data processing, valuable data, Consumer] • Online / Real-Time process – low latency (milliseconds to seconds for results) – smaller datasets - streams • Offline / Batch – high latency (minutes to days for results) – larger datasets - files
  • 9. Hadoop Adoption [diagram: Cluster of Racks of Nodes; Global Compute-space, Global Namespace] • Distributed replicated storage for large files • Distributed fault-tolerant execution of batch processes • Scale out vs (legacy) scale up • Java API allows complex analysis
  • 10. But Stuffed into Legacy Roles [diagram: loggers, raw data, ETL, Hadoop + pig / hive in the data warehouse and data mining roles, Analyst] • Hadoop deployments mirror legacy architectures – ETL into cached “structured storage” • Pig/Hive are syntaxes for Data Mining “Big” data – SQL-like, but hard to customize and not “advanced”
  • 11. Hadoop for Data Processing [diagram: Value Creation, Scalability, Simplicity] • More Value through Innovation • Scalability, Not Performance • Simplifies Infrastructure
  • 12. Simplicity [diagram: Cluster of Racks of Nodes; cpus: Global Compute-space; disks: Global Namespace] • Virtualization across resources, not within (PaaS) – A single FileSystem across disks - no DBA – A single Execution System across CPUs - less IT
  • 13. Scalability [diagram: Users, Clients, jobs, Cluster of Racks of Nodes] • Scalability - continued reliability and met expectations as demand changes • Application Scalability - data grows, app/infra expand • Organizational Scalability - simpler infra
  • 14. Creating Value [diagram: Producer, events, loggers, raw data, data processing (Hadoop + Cascading: etl, analytics), reporting, product, operational, Consumer, Value] • Unconstrained processing model • Data processing requires integration • Processing must not fail or fall behind
  • 15. Consequences • Improved reliability of production processes – “we had a failed disk yet jobs never failed” • Greater utilization of hardware resources – dynamically moves code to available cores • Increased rate of innovation – diverse analytics over larger sets, less bureaucracy • Fewer staff
  • 16. Hadoop MapReduce [diagram: Count Job and Sort Job, each a Map followed by a Reduce, reading and writing Files; maps emit [ k, v ], reduces receive [ k, [v] ]] [ k, v ] = key and value pair; [ k, [v] ] = key and associated values collection • Nearly impossible to “think in” • Apps are many dependent MR jobs
  • 17. Cascading [diagram: Word Count/Sort Flow; Data, Parse, Group, Count, Sort, Data, laid over two Map/Reduce pairs and passing [ f1,f2,.. ] between steps] [ f1, f2,... ] = tuples with field names • Alternative model & API to MapReduce – pipes/filters of re-usable operations • For rapidly implementing Data Processing Systems • Open-Source
  • 18. Emerging Tool Support • Karmasphere IDE (soon) – Developing and Debugging • Bixo (Bixo Labs) Data Mining Toolkit – Apache Nutch replacement – Easier to customize to meet new business models • Clojure & JRuby Domain Specific Languages (DSL) – Machine Learning – Simple/Complex Ad-Hoc queries
  • 19. Practical Applications • Log/event analysis, device and system monitoring • Web crawling and content mining • Behavior ad-targeting segmentation • Ad campaign ROI • Demand and event prediction • POS analytics for product demand pricing
  • 20. Successes • Publicis/RazorFish - Behavioral Ad-Targeting – Cascading + AWS (Elastic MapReduce) – Daily automated User Behavior Segmentation – 6wks dev, 3T/day, $13k/mo – 500% increase in return on ad spend from a similar campaign a year before
  • 21. continued... • FlightCaster - Predicting flight delays – Clojure + Cascading + AWS – Machine learning and production processing – 3mos dev, 10G/day, <1T total currently, <$2k/mo • Etsy - Online Marketplace – JRuby + Cascading – Data mining (Hadoop as a DW!) – 750M page-views/mo, 60G/day of logs
  • 22. Resources • Chris K Wensel – chris@wensel.net – @cwensel • Cascading – an API for optimizing production data processing – http://cascading.org • Concurrent, Inc. – Support and Mentoring – http://concurrentinc.com
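
The two-job chain from slide 16 (a Count Job feeding a Sort Job) can be sketched in plain Python. This is an in-memory illustration of the [ k, v ] and [ k, [v] ] shapes only, not Hadoop or Cascading code; `run_job`, `count_mapper`, `count_reducer`, `sort_mapper`, `sort_reducer` and `word_count_sorted` are hypothetical names chosen for the example, and the sorted-by-key grouping step stands in for Hadoop's shuffle.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map emits [k, v] pairs
            groups[key].append(value)          # grouping builds [k, [v]]
    output = []
    for key in sorted(groups):                 # keys reach reducers in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Count Job: line -> (word, 1) -> (word, total)
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Sort Job: (word, total) -> (-total, word) so the largest totals sort first
def sort_mapper(pair):
    word, total = pair
    yield -total, word


def sort_reducer(neg_total, words):
    for word in sorted(words):                 # break ties alphabetically
        yield word, -neg_total


def word_count_sorted(lines):
    """Chain the two dependent jobs: the Sort Job reads the Count Job's output."""
    counts = run_job(lines, count_mapper, count_reducer)
    return run_job(counts, sort_mapper, sort_reducer)


if __name__ == "__main__":
    for word, total in word_count_sorted(["the quick fox", "the lazy dog", "the fox"]):
        print(word, total)
```

Even in this toy form the application logic is spread over four small functions and two job invocations, which is the "many dependent MR jobs" pattern the slide describes; slide 17 presents the same flow as a single Parse, Group, Count, Sort pipeline.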

Editor's Notes