Hadoop Talk
Brief background on me

    Phil has over 16 years of experience in data-centric system
    development. His work has moved from simulation and
    video-game-like systems, to high-performance computing (HPC), to
    traditional database (Oracle, SQL Server, Postgres, MySQL)
    and CRM (warehouse/analytical) systems, and most recently to
    the Hadoop stack. Recently, as an employee at TripAdvisor, he
    led the research into Hadoop/Hive that resulted in the
    successful migration from the traditional RDBMS platform to a
    system that is based on Hadoop/Hive and integrated with
    MS SQL Server/SSAS. Currently, he is focused on the Hadoop
    stack and is creating a solution that integrates
    Hadoop into a more traditional enterprise environment.
Agenda

    To make you as excited about Hadoop as I am


   What is Hadoop (high-level) ?

   What have we actually done with it?
  
    How does “it” (HDFS, M/R, Hive, and HBase) work?
  
    Future of Hadoop
What is Hadoop?
Q: What is Hadoop?
   A#1 - The thing that empowers
      Yahoo, FB, and others
         Yahoo has >25k Hadoop nodes…wow…
Q: What is Hadoop?
   A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
Q: What is Hadoop?
A#3 – the revolution of 5+ years ago
“Success has many fathers”
And you can look them up, because it’s FOSS !
People are fighting to contribute, and to get credit… be a contributor…
(http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
Q: What is Hadoop?
A#4 – the wave everyone is riding

 Nearly all the big players (and many smaller ones) are on board…
In fact, beware of this




  http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
What have we actually done with it?
Hadoop projects performed by BlueMetal Architects




  
    Hadoop at a Web 2.0 company (prior to BMA)
    
      Ported traditional 30TB Warehouse to Hive
    
      Big transform jobs in Hive
      
        e.g., joining 50M rows to 12B rows
      
        Big Data jobs, e.g. Social Graph processing with
        many “Cartesians” to empower emails
  
     Hadoop in HealthCare (at BMA)
    
      Applied HBase as part of a new system
    
      Feeds data (via WS) to:
      
        E.D.
      
        Patient Web Portal
      
        Other HealthCare affiliates
  Note: Both projects include Hadoop as part of larger systems.
Warehouse Goals





   Use the right tool for the right job
  –Hadoop (M/R, Hive) is a batch system
   • Inherently high-latency
  –RDBMS (& other tools) are still needed

   Empower users
  –Minimize complexity
   • Eliminate joins (almost)
   • Eliminate “dimensions” (maybe)
  –Expose *all* data
  –Provide low-latency options
  –Provide self-service options
A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
Focus back to Hadoop …
High-level descriptions are good,
but not enough. How does it work?
    (From: http://blog.nahurst.com/visual-guide-to-nosql-systems)
Here we go…
Map-Reduce (M/R) example
Note: this job is not optimized
Take home message: “Simple API - Mappers read the
input and emit K/V pairs. Framework sends Reducers
K/V pairs partitioned and ordered* by Key”
    (From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
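The take-home message can be sketched in a few lines of plain Python. This is only a single-process simulation of the programming model, not Hadoop's actual Java API; the names `mapper`, `reducer`, and `run_job` are made up for illustration, and word count is assumed as the example job.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Sum all counts that the framework grouped under one key."""
    return key, sum(values)


def run_job(lines):
    # Map phase: every input record is turned into K/V pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle (done by the framework in Hadoop): group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: keys reach the reducer in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

The point of the sketch is that the user only writes `mapper` and `reducer`; everything inside `run_job` is what the framework does on your behalf, across many machines.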
Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
                (From: http://www.lecturemaker.com/2011/02/rhipe/)
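A rough sketch of where Partition, Combine, and Shuffle sit in the flow, again as a single-process Python simulation. The number of reducers, the CRC32-based partitioner, and all function names are assumptions for illustration; Hadoop's own partitioner and combiner are configured through its Java API.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # hypothetical setting for this toy job


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_with_combiner(split):
    """Map one input split, then combine locally to shrink shuffle traffic."""
    local = Counter()
    for line in split:
        for word in line.split():
            local[word] += 1  # combiner: pre-aggregate on the map side
    return list(local.items())


def shuffle(map_outputs):
    """Route each combined pair to its partition and group values by key."""
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key)][key].append(value)
    return partitions


def reduce_partition(grouped):
    """Each reducer sees only its own keys, in sorted order."""
    return [(key, sum(grouped[key])) for key in sorted(grouped)]


splits = [["a b a", "c a"], ["b b c"]]
outputs = [map_with_combiner(split) for split in splits]
results = [reduce_partition(part) for part in shuffle(outputs)]
totals = dict(kv for part in results for kv in part)
print(totals)
```

Note that keys are ordered within each partition, not across the whole job, which is one reading of the asterisk on "ordered*" in the take-home message.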
Hadoop M/R Primer
Let’s discuss HDFS: (blocks, replication) and how that helps “data local tasks”
(From: Yahoo)
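To make "blocks, replication, and data-local tasks" concrete, here is a toy sketch. The block size, node names, round-robin placement, and the `schedule` function are all assumptions for illustration; real HDFS blocks are tens of megabytes or more, and real placement and scheduling are rack-aware.

```python
BLOCK_SIZE = 64   # toy block size in bytes
REPLICATION = 3   # copies kept of each block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Round-robin placement of each block on `replication` distinct nodes."""
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def schedule(placement, free_slots):
    """Run each block's map task on a node that holds a replica, if one is free."""
    assignment = {}
    for block_id, holders in placement.items():
        local = [n for n in holders if free_slots.get(n, 0) > 0]
        candidates = local or [n for n, slots in free_slots.items() if slots > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        free_slots[node] -= 1
        assignment[block_id] = (node, node in holders)  # (node, is_data_local)
    return assignment


nodes = ["n1", "n2", "n3", "n4"]
blocks = split_into_blocks(b"x" * 200)
placement = place_replicas(len(blocks), nodes)
assignment = schedule(placement, {n: 2 for n in nodes})
print(assignment)
```

Because every block lives on several nodes, the scheduler usually has a choice of machines where the task can read its input from local disk, and losing one node does not lose the data.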
Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
                (From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs
with many failures look similar. This also shows improvement in Hadoop.
                (From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
Hadoop M/R executive summary

Distributed storage system, with distributed processing
capability, on commodity hardware (or in the cloud).

Moves the computation to the data !
That, in turn, saves network bandwidth, which is often the limiting
factor in distributed apps.

The same code can run on data of any size. The cluster is
scaled with the data, not the code.
Hadoop Stack Key Components
(http://hortonworks.com/technology/hortonworksdataplatform/)
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
                Hadoop is not just about non/semi structured data !
Hive
= HDFS
+ Metadata
+ HQL-> (efficient) M/R
+ more
= RDBMS
- low-latency (usually)
- (row-level) updates
- other (e.g. constraints)
+ HUGE scalability
+ POWERFUL distributed processing
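A rough idea of what "HQL -> M/R" means for a query shaped like `select key, count(*) ... group by key order by cnt desc limit N`. This is a simplified Python sketch, not Hive's actual query plan; the two-stage split, the sample rows, and the function names are assumptions for illustration.

```python
import heapq
from collections import Counter

# Hypothetical rows of a table with one column of interest.
rows = [
    {"ip_address": "10.0.0.1"},
    {"ip_address": "10.0.0.2"},
    {"ip_address": "10.0.0.1"},
    {"ip_address": "10.0.0.3"},
    {"ip_address": "10.0.0.1"},
    {"ip_address": "10.0.0.2"},
]


def stage1_group_count(rows, column):
    """Stage 1 (group by): map emits (value, 1); reduce sums per key."""
    counts = Counter()
    for row in rows:
        counts[row[column]] += 1
    return list(counts.items())


def stage2_top_n(counted, n):
    """Stage 2 (order by ... limit): keep the n pairs with the largest count."""
    return heapq.nlargest(n, counted, key=lambda kv: kv[1])


top = stage2_top_n(stage1_group_count(rows, "ip_address"), 2)
print(top)
```

The user writes one SQL-like statement; the engine turns it into one or more batch jobs like these and runs them on the cluster.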
Common RDBMS warehouse query




select top 10
  t.*
from (
  select ip_address, count(*) as cnt
  from f_pageviews pv
  join d_ipaddress ip on (pv.ip_key = ip.id)
  where date_key = 2992
  group by ip_address
)t
order by cnt desc

- wait a few minutes
- time is usually 1-4x the nominal time, depending on load
- … assumes the job can succeed at all !
Hive Version…
The luxury of Hadoop space/power means dimensional processing might not be
required
NOTE: Hive does support “column-oriented” storage, which is very efficient.


select t.*
from (
  select ip_address, count(*) as cnt
  from f_lookback
  where ds = '2011-03-11'
  group by ip_address
)t
order by cnt desc
limit 10

– BUT – runtime is trickier
Time to run your job = HQL parse + M/R Job Submit + [ wait
in the queue for availability ] + M/R Job Runtime
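The runtime formula is simple addition, but plugging in numbers shows why the queue matters. The figures below are invented purely for illustration; they are not measurements from any cluster.

```python
def job_wall_time(parse_s, submit_s, queue_wait_s, runtime_s):
    """Total seconds = HQL parse + M/R job submit + queue wait + M/R job runtime."""
    return parse_s + submit_s + queue_wait_s + runtime_s


# Same query, same data: once on an idle cluster, once behind other jobs.
idle = job_wall_time(parse_s=2, submit_s=15, queue_wait_s=0, runtime_s=120)
busy = job_wall_time(parse_s=2, submit_s=15, queue_wait_s=300, runtime_s=120)
print(idle, busy)
```

With these made-up numbers the job itself is unchanged, yet the user waits about three times longer on the busy cluster. The fixed overhead (parse plus submit) is also why even tiny queries are not low-latency.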
What else can Hadoop do?



   FB: Invented Cassandra but went with HBase for their new messaging system.
   Does that mean HBase is “better”? – no, it’s about using the right tool for the job.
   http://www.facebook.com/note.php?note_id=454991608919



That’s to hold 135B messages per month !
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html


Scale is relative (to your hardware and load),
but when you want a consistent “OLTP” solution that doesn’t require redesign to scale,
consider HBase.
HBase Architecture
Not shown: HMaster (HM), ZooKeeper (ZK), and HDFS
      (From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
HBase: a more detailed view
 (http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
     (From: http://java.dzone.com/news/bigtable-model-cassandra-and)
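The "LSM" part of that equation can be sketched as a toy store: writes land in an in-memory buffer, full buffers are flushed as sorted immutable files, and reads check the newest data first. The class and method names here are hypothetical, and the sketch leaves out the write-ahead log, the read cache (the "memcached" part), and HDFS itself.

```python
class TinyLSM:
    """Toy log-structured merge store: memory buffer plus sorted, immutable files."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}       # recent writes, held in memory
        self.store_files = []    # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted file and start a fresh buffer."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Newest data wins: check memory first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            for k, v in store_file:  # linear scan for brevity
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all files into one, letting newer values overwrite older ones."""
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []


db = TinyLSM(flush_threshold=2)
db.put("a", 1)
db.put("b", 2)   # buffer is full, so it is flushed to a file
db.put("a", 9)   # newer value for "a" stays in memory for now
latest_a = db.get("a")
db.put("c", 3)   # second flush
db.compact()
print(db.store_files)
```

Writes never modify a file in place, which is what makes this layout a good fit for an append-oriented file system underneath.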
HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g.
selective compression) via column families, etc.
       The most important of these is distributed processing.
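Two of those features, versioned cells and range scans, both fall out of keeping rows sorted by key and keeping several timestamped values per cell. The following is a minimal in-memory sketch, not the HBase client API; `TinyTable`, its methods, and the `info:email` style column names are illustrative assumptions.

```python
import bisect


class TinyTable:
    """Toy sorted table with timestamped cell versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.row_keys = []   # kept sorted so range scans are cheap
        self.rows = {}       # row key -> {column: [(timestamp, value), ...]}

    def put(self, row, column, timestamp, value):
        if row not in self.rows:
            bisect.insort(self.row_keys, row)
            self.rows[row] = {}
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)        # newest first
        del versions[self.max_versions:]   # drop versions beyond the limit

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.rows.get(row, {}).get(column, [])[:versions]

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        for key in self.row_keys[lo:hi]:
            latest = {col: vers[0][1] for col, vers in self.rows[key].items()}
            yield key, latest


table = TinyTable(max_versions=2)
table.put("user#002", "info:email", 1, "a@x")
table.put("user#002", "info:email", 2, "b@x")
table.put("user#002", "info:email", 3, "c@x")
table.put("user#001", "info:email", 1, "z@x")
table.put("visit#001", "info:page", 1, "/home")

history = table.get("user#002", "info:email", versions=2)
users = list(table.scan("user#", "user#~"))
print(history, users)
```

Row-key design matters a lot here: choosing a key prefix such as `user#` is what turns "all rows for users" into one contiguous range scan.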
Hadoop in (pre*) action
                    Hadoop indexed “THE DATA” for Watson
  http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/

                        *Runtime processing used Apache JMS + UIMA .
Future of Hadoop
Overlapping Ecosystems

Hadoop (usage and contributions) will be
“shared” between FOSS and Closed Source
communities.




        Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
False Conflicts, with Solutions
             Sodium(explosive) + Chlorine(poison) =>
                           Salt(vital)




                                        From http://strangetimes.lastsuperpower.net/?p=1663




Closed Source + Open Source =>
Free + Enterprise + Support
+ Integration
Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
IMO, an important message from a
brilliant man
    Anant Jhingran Hadoop Summit 2011 IBM Watson & Big Data with Q&A

 http://www.youtube.com/watch?v=IVS__xF3Byg


Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
Hadoop “option” (MapR) that plays nicely
MS embraced Hadoop despite having developed
technology similar to NextGen Hadoop. Wow.
Hadoop release on Azure is 3/12.
 BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please
                      contact us as we’ll be blogging about it.
Hadoop NextGen:
 NN-HA, performance gains, more
Hadoop NextGen:
A Brave New (!?) world
Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
Hadoop >> (un)structured data store.
Why do this (except ad-hoc) …?
RDBMS and Hadoop each have strengths; use them rather than negating both.
See the above Warehouse Architecture diagram…
       (From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
Q&A
Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
http://www.ehow.com/info_8208930_isnt-yahoo-crawling-website.html
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
http://jobs.climber.com/jobs/Media-Communication/-CA-US/MS-SQL-SSAS-SSIS-Engineer/22735283

http://hadoop.apache.org/

http://www.docstoc.com/docs/66356954/Advanced-HBase

https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

http://wiki.apache.org/hadoop/WordCount

https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s
Additional Slides
Fun Links
http://www.youtube.com/watch?v=tIrBVjVfjNY
