SlideShare ist ein Scribd-Unternehmen logo
1 von 33
above the cloud:
                                    Big Data and BI




A Primer on Hadoop and Microsoft’s Hadoop on Azure and Windows
Denny Lee | Principal Program Manager (Big Data, BI)
@dennylee | dennyglee.com | sqlcat.com
Large        Complex   Unstructured


What is Big Data?
Scale Up!
With the power of the Hubble telescope, we can take amazing pictures 45M light years away
Amazing image of the Antennae Galaxies (NGC 4038-4039)
Analogous with scale up:
• non-commodity
• specialized equipment
• single point of failure*
Scale Out | Commoditized Distribution
Hubble can provide an amazing view Giant Galactic Nebula (NGC 3503) but how about radio waves?
• Not just from one area but from all areas viewed by observatories
• SETI @ Home: 5.2M participants, 1021 floating point operations1, 769 teraFLOPS2
Analogous with commoditized distributed computing
• Distributed and calculated locally
• Engage with hundreds, thousands, + machines
• Many points of failure, auto-replication prevents this from being a problem
Task                  Task
                                                                      tracker               tracker


                                     Map Reduce                         Job
                                       Layer                          tracker


                                           HDFS                       Name
                                           Layer                      node


                                                                      Data                   Data
                                                                      node                   node

          Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png


What is Hadoop?
•   Synonymous with the Big Data movement
•   Infrastructure to automatically distribute and replicate data across multiple nodes and execute and track map
    reduce jobs across all of those nodes
•   Inspired by Google’s Map Reduce and GFS papers
•   Components are: Hadoop Distributed File System (HDFS), Map Reduce, Job Tracker, and Task Tracker
•   Based on the Yahoo! “Nutch” project in 2003, became Hadoop in 2005 named after Doug Cutting’s son’s toy
    elephant
Traditional RDBMS          MapReduce
    Data Size                  Gigabytes (Terabytes)      Petabytes (Hexabytes)
    Access                     Interactive and Batch      Batch
    Updates                    Read / Write many times    Write once, Read many times
    Structure                  Static Schema              Dynamic Schema
    Integrity                  High (ACID)                Low
    Scaling                    Nonlinear                  Linear
    DBA Ratio                  1:40                       1:3000
    Reference: Tom White’s Hadoop: The Definitive Guide




Comparing RDBMS and MapReduce
Hadoop in action
•   WordCount demo
•   Map/Reduce
•   HiveQL on a weblog sample
Sample Java MapReduce WordCount Function

// Map Reduce is broken out into a Map function and reduce function
// ------------------------------------------------------------------

// Sample Map function: tokenizes string and establishes the tokens
// e.g. “a btcnd” is now an key value pairs representing [a, b, c, d]

public void map(Object key, Text value, Context context
                              ) throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
         }
}

// Sample Reduce function: does the count of these key value pairs

public void reduce(Text key, Iterable<IntWritable> values,
                                       Context context
                                      ) throws IOException,
                                    InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
                  sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
}
Executing WordCount against sample file

The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
header without written permission.

Please read the "legal small print," and other information about the
eBook and Project Gutenberg at the bottom of this file. Included is
important information about your specific rights and restrictions in
how the file may be used. You can also find out about how to make a
donation to Project Gutenberg, and how to get involved.

                                       Code to execute:
**Welcome To The World of Free Plain   hadoop jar AcmeWordCount.jar AcmeWordCount
                                       Vanilla Electronic Texts**

**eBooks Readable By Both Humans and   /test/davinci.txt /test/davinci_wordcount
                                       By Computers, Since 1971**

*****These eBooks Were Prepared By Thousands of Volunteers!*****
                                       Purpose: To perform count of number of words
                                       within the said davinci.txt
Title: The Notebooks of Leonardo Da Vinci, Complete

Author: Leonardo Da Vinci                  laws          2
                                           Project       5
...
                                       …
Query a Sample WebLog using HiveQL

// Sample Generated Log
588.891.552.388,-,08/05/2011,11:00:02,W3SVC1,CTSSVR14,-,-,0,-
,200,-,GET,/c.gif,Mozilla/5.0 (Windows NT 6.1; rv:5.0)
Gecko/20100101 Firefox/5.0,http://foo.live.com/cid-
4985109174710/blah?fdkjafdf,[GUID],-,MSFT,
&PageID=1234&Region=89191&IsoCy=BR&Lang=1046&Referrer=hotmail.c
om&ag=2385105&Campaign=&Event=12034



GUID       Parameters
[GUID]     &PageID=1234&Region=89191&IsoCy=BR&Lang=104
           6&Referrer=hotmail.com&ag=2385105&Campaign=
           &Event=12034



select
   GUID,
   str_to_map("param", "&", "=")["IsoCy"],
                                   HiveQL: SQL-like language
   str_to_map("param", "&", "=")["Lang"] SQL-like query which becomes
                                   • Write
from                                  MapReduce functions
weblog;                            • Includes functions like str_to_map so one can
                                      perform parsing functions in HiveQL
Traditional RDBMS: Move Data to Compute
As you process more and more data, and you want interactive response
• Typically need more expensive hardware
• Failures at the points of disk and network can be quite problematic
It’s all about ACID: atomicity, consistency, isolation, durability
Can work around this problem with more expensive HW and systems
• Though distribution problem becomes harder to do
Hadoop: Move Compute to the Data
Hadoop (and NoSQL in general) follows the Map Reduce framework
• Developed initially by Google -> Map Reduce and Google File system
• Embraced by community to develop MapReduce algorithms that are very robust
• Built Hadoop Distributed File System (HDFS) to auto-replicate data to multiple nodes
• And execute a single MR task on all/many nodes available on HDFS
Use commodity HW: no need for specialized and expensive network and disk
Not so much ACID, but BASE (basically available, soft state, eventually consistent)
Hadoop in Distributed Action
•   Distributed Queries
•   Hadoop Jobs and Task Tracker (50030/50070)
Distributed File System > Replication
http://server:50070
Job Tracker > Task mapping
http://server:50030
Task List > Tasks
http://server:50030
HiveQL Query                                       TeraSort
    14000   12577.498                                   300      270
    12000                                               250
    10000
                                                        200
                             7650.019
    8000
                                        5771.231        150                    123
    6000
                                                        100
    4000                                                                                      53
    2000                                                 50

        0                                                 0
               1                2          3                      8             16            32




Distribution, Linear Scalability
•    Key facet is that as more nodes are added, the more can be calculated, faster (minus overhead)
•    Designed for Batch systems (though this will change in the near future)
•    Roughly linear scalability for HiveQL query as well as TeraSort
Cassandra                Hadoop                     BackType          MR/GFS                  SimpleDB
   Hive                     Oozie                      Hadoop            Bigtable                Dynamo
   Scribe                   Pig (-latin)               Pig / Hbase       Dremel                  EC2 / S3
   Hadoop                                              Cassandra         …                       …




                                     Hadoop | Azure | Excel | BI | SQL DW | PDW | F# | Viola James




Hadoop ecosystem | open source, commodity
     Mahout    | Scalable machine learning and data mining
    MongoDB    | Document-oriented database (C++)
   Couchbase   | CouchDB (doc dB) + Membase (memcache protocol)
      Hbase    | Hadoop column-store database
          R    | Statistical computing and graphics
     Pegasus   | Peta-scale graph mining system
      Lucene   | full-featured text search engine library
15 out of 17
sectors in the US have more data
stored per company than the
US Library of Congress
                                                                        140,000-190,000
                                                                        more deep analytical talent positions
                             1.5 million                                                                   50-60%
                             more data saavy managers
                                                                         increase in the number of Hadoop developers
                             in the US alone                                within organizations already using Hadoop
                                                                                                          within a year
   €250 billion
   Potential annual value to
   Europe’s public sector                                  $300 billion
                                                           Potential annual value to US healthcare

 NoSQL ecosystem | business value

                         How Facebook moved 30 petabytes of Hadoop data.
                        … the migration proved that disaster recovery is possible with Hadoop clusters.
                        This could be an important capability for organizations considering relying on Hadoop
                        (by running Hive atop the Hadoop Distributed File System) as a data warehouse,
                        like Facebook does. As Yang notes, “Unlike a traditional warehouse using SAN/NAS storage,
                        HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was
                        possible to efficiently keep an active multi-petabyte cluster properly replicated,
                        with only a small amount of lag.”
This illustrates a new thesis or collective wisdom emerging from the
Valley: If a technology is not your core value-add, it should be open-
sourced because then others can improve it, and potential future
employees can learn it. This rising tide has lifted all boats, and is just
getting started.
                                                 Kovas Boguta: Hadoop & Startups: Where Open
                                                 Source Meets Business Data.
Hadoop JS: Our VB moment for Big Data
•   Harness the power of millions of Javascript developers
•   We give them a path to Big Data
•   Just like we did for VB for .NET / Enterprise market
Hadoop.JS executing AcmeWordCount.JS

// Map Reduce function in JavaScript
// ------------------------------------------------------------------

var map = function (key, value, context) {
         var words = value.split(/[^a-zA-Z]/);
         for (var i = 0; i < words.length; i++) {
                  if (words[i] !== "") {
                           context.write(words[i].toLowerCase(), 1);
                  }
         }
};

var reduce = function (key, values, context) {
         var sum = 0;
         while (values.hasNext()) {
                  sum += parseInt(values.next());
         }
         context.write(key, sum);
};
• AD integration
                                               • Kerberos
                                               • Cluster
                                                 Management and
                                                 Monitoring
                                Operations
                                                                  • Visual Studio Tools
                                                                    Integration
                                                                  • Hadoop JS
             Information                                          • HiveQL
                  Worker
                                                                  Developer
     • Excel Integration   Hadoop on Azure and
     • HiveODBC:
       PowerPivot, Cresc   Windows
       ent
                                                Data
                                                Scientist
                            Hadoop Analytics
                            • R
                            • Lucene
                            • Mahout
                            • Pegasus




How Microsoft address Big Data
StreamInsight
       Java MR                        HiveQL      Pig Latin     CQL
                        API




Hadoop on
                              Ocean of Data
Azure and Windows




EIS / ERP        Database          File System   OData / RSS   Azure Storage
Azure + Hadoop = Tier-1 in the Cloud

            In short, Hadoop has the potential to make the
            enterprise compatible with the entire rest of
            the open-source and startup world…
                                             Kovas Boguta: Hadoop & Startups: Where Open
                                             Source Meets Business Data.
BI + Tier-1 Big Data = Above the Cloud


                                 "The sun always shines
                                 above the clouds."
                                 — Paul F. Davis
Hadoop on Azure Interactive Hive Console
Excel Hive Add-In
PowerPivot and Hadoop

Connecting PowerPivot to Hadoop on Azure
Power View and Hadoop

Connecting Power View to Hadoop on Azure
APPENDIX
A Definition of Big Data - 4Vs: volume, velocity, variability, variety
Big data: techniques and technologies that make handling data at extreme scale economical.
Hadoop: Auto-replication
•   Hadoop processes data in 64MB chunks and then replicates to different servers
•   Replication value set at the hdfs-site.xml, dfs.replication node

Weitere ähnliche Inhalte

Was ist angesagt?

Windows Azure Design Patterns
Windows Azure Design PatternsWindows Azure Design Patterns
Windows Azure Design PatternsDavid Pallmann
 
Seeding The Cloud
Seeding The CloudSeeding The Cloud
Seeding The CloudTed Leung
 
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012Amazon Web Services
 
The Essentials of Building Cloud-Based Web Apps with Azure
The Essentials of Building Cloud-Based Web Apps with AzureThe Essentials of Building Cloud-Based Web Apps with Azure
The Essentials of Building Cloud-Based Web Apps with AzureIdo Flatow
 
Migrating Customers to Microsoft Azure: Lessons Learned From the Field
Migrating Customers to Microsoft Azure: Lessons Learned From the FieldMigrating Customers to Microsoft Azure: Lessons Learned From the Field
Migrating Customers to Microsoft Azure: Lessons Learned From the FieldIdo Flatow
 
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance DatabaseDay 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance DatabaseAmazon Web Services
 
What's New + The Lean Methodology: Introduction to AWS, Cambridge
What's New + The Lean Methodology: Introduction to AWS, CambridgeWhat's New + The Lean Methodology: Introduction to AWS, Cambridge
What's New + The Lean Methodology: Introduction to AWS, CambridgeAmazon Web Services
 
Microsoft PaaS Cloud Windows Azure Platform
Microsoft PaaS Cloud Windows Azure PlatformMicrosoft PaaS Cloud Windows Azure Platform
Microsoft PaaS Cloud Windows Azure PlatformEsri
 
Integrating sps 2010 and windows azure
Integrating sps 2010 and windows azureIntegrating sps 2010 and windows azure
Integrating sps 2010 and windows azureManish Corriea
 
Developing and deploying windows azure applications
Developing and deploying windows azure applicationsDeveloping and deploying windows azure applications
Developing and deploying windows azure applicationsManish Corriea
 
Lap around windows azure
Lap around windows azureLap around windows azure
Lap around windows azureManish Corriea
 
Azure Cloud Dev Camp - Introduction
Azure Cloud Dev Camp - IntroductionAzure Cloud Dev Camp - Introduction
Azure Cloud Dev Camp - Introductiongiventocode
 
Azure Data services
Azure Data servicesAzure Data services
Azure Data servicesRajesh Kolla
 
AWS RDS Presentation - DOAG Conference
AWS RDS Presentation - DOAG Conference AWS RDS Presentation - DOAG Conference
AWS RDS Presentation - DOAG Conference Amazon Web Services
 
Windows Azure & How to Deploy Wordress
Windows Azure & How to Deploy WordressWindows Azure & How to Deploy Wordress
Windows Azure & How to Deploy WordressGeorge Kanellopoulos
 
Migrating Data and Databases to Azure
Migrating Data and Databases to AzureMigrating Data and Databases to Azure
Migrating Data and Databases to AzureKaren Lopez
 
Windows Azure Platform: Articles from the Trenches, Volume One
Windows Azure Platform: Articles from the Trenches, Volume OneWindows Azure Platform: Articles from the Trenches, Volume One
Windows Azure Platform: Articles from the Trenches, Volume OneEric Nelson
 

Was ist angesagt? (20)

Windows Azure Design Patterns
Windows Azure Design PatternsWindows Azure Design Patterns
Windows Azure Design Patterns
 
Seeding The Cloud
Seeding The CloudSeeding The Cloud
Seeding The Cloud
 
Move to azure
Move to azureMove to azure
Move to azure
 
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
DAT103 Introducing Amazon RedShift - AWS re: Invent 2012
 
SharePoint on Azure
SharePoint on Azure SharePoint on Azure
SharePoint on Azure
 
The Essentials of Building Cloud-Based Web Apps with Azure
The Essentials of Building Cloud-Based Web Apps with AzureThe Essentials of Building Cloud-Based Web Apps with Azure
The Essentials of Building Cloud-Based Web Apps with Azure
 
Migrating Customers to Microsoft Azure: Lessons Learned From the Field
Migrating Customers to Microsoft Azure: Lessons Learned From the FieldMigrating Customers to Microsoft Azure: Lessons Learned From the Field
Migrating Customers to Microsoft Azure: Lessons Learned From the Field
 
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance DatabaseDay 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database
Day 2 - Amazon RDS - Letting AWS run your Low Admin, High Performance Database
 
What's New + The Lean Methodology: Introduction to AWS, Cambridge
What's New + The Lean Methodology: Introduction to AWS, CambridgeWhat's New + The Lean Methodology: Introduction to AWS, Cambridge
What's New + The Lean Methodology: Introduction to AWS, Cambridge
 
Spring in the Cloud
Spring in the CloudSpring in the Cloud
Spring in the Cloud
 
Microsoft PaaS Cloud Windows Azure Platform
Microsoft PaaS Cloud Windows Azure PlatformMicrosoft PaaS Cloud Windows Azure Platform
Microsoft PaaS Cloud Windows Azure Platform
 
Integrating sps 2010 and windows azure
Integrating sps 2010 and windows azureIntegrating sps 2010 and windows azure
Integrating sps 2010 and windows azure
 
Developing and deploying windows azure applications
Developing and deploying windows azure applicationsDeveloping and deploying windows azure applications
Developing and deploying windows azure applications
 
Lap around windows azure
Lap around windows azureLap around windows azure
Lap around windows azure
 
Azure Cloud Dev Camp - Introduction
Azure Cloud Dev Camp - IntroductionAzure Cloud Dev Camp - Introduction
Azure Cloud Dev Camp - Introduction
 
Azure Data services
Azure Data servicesAzure Data services
Azure Data services
 
AWS RDS Presentation - DOAG Conference
AWS RDS Presentation - DOAG Conference AWS RDS Presentation - DOAG Conference
AWS RDS Presentation - DOAG Conference
 
Windows Azure & How to Deploy Wordress
Windows Azure & How to Deploy WordressWindows Azure & How to Deploy Wordress
Windows Azure & How to Deploy Wordress
 
Migrating Data and Databases to Azure
Migrating Data and Databases to AzureMigrating Data and Databases to Azure
Migrating Data and Databases to Azure
 
Windows Azure Platform: Articles from the Trenches, Volume One
Windows Azure Platform: Articles from the Trenches, Volume OneWindows Azure Platform: Articles from the Trenches, Volume One
Windows Azure Platform: Articles from the Trenches, Volume One
 

Ähnlich wie Big Data and BI: A Primer on Hadoop and Microsoft’s Hadoop on Azure and Windows

Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerDenny Lee
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataDhanashri Yadav
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 

Ähnlich wie Big Data and BI: A Primer on Hadoop and Microsoft’s Hadoop on Azure and Windows (20)

Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop Primer
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 

Mehr von Denny Lee

Azure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceAzure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceDenny Lee
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connectorDenny Lee
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
SQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesSQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesDenny Lee
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesDenny Lee
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Denny Lee
 
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherDenny Lee
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarDenny Lee
 
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Denny Lee
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDenny Lee
 
SQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecuritySQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecurityDenny Lee
 
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008Denny Lee
 
SQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesSQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesDenny Lee
 
Deploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDeploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDenny Lee
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataDenny Lee
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger BrainsDenny Lee
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarDenny Lee
 
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Denny Lee
 

Mehr von Denny Lee (20)

Azure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceAzure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database Service
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
SQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesSQL Server Integration Services Best Practices
SQL Server Integration Services Best Practices
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best Practices
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinar
 
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
 
SQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecuritySQLCAT - Data and Admin Security
SQLCAT - Data and Admin Security
 
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
 
SQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesSQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best Practices
 
Deploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDeploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePoint
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger Brains
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery Webinar
 
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
 

Kürzlich hochgeladen

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Kürzlich hochgeladen (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Big Data and BI: A Primer on Hadoop and Microsoft’s Hadoop on Azure and Windows

  • 1. above the cloud: Big Data and BI A Primer on Hadoop and Microsoft’s Hadoop on Azure and Windows Denny Lee | Principal Program Manager (Big Data, BI) @dennylee | dennyglee.com | sqlcat.com
  • 2. Large Complex Unstructured What is Big Data?
  • 3. Scale Up! With the power of the Hubble telescope, we can take amazing pictures 45M light years away Amazing image of the Antennae Galaxies (NGC 4038-4039) Analogous with scale up: • non-commodity • specialized equipment • single point of failure*
  • 4. Scale Out | Commoditized Distribution Hubble can provide an amazing view Giant Galactic Nebula (NGC 3503) but how about radio waves? • Not just from one area but from all areas viewed by observatories • SETI @ Home: 5.2M participants, 1021 floating point operations1, 769 teraFLOPS2 Analogous with commoditized distributed computing • Distributed and calculated locally • Engage with hundreds, thousands, + machines • Many points of failure, auto-replication prevents this from being a problem
  • 5. Task Task tracker tracker Map Reduce Job Layer tracker HDFS Name Layer node Data Data node node Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png What is Hadoop? • Synonymous with the Big Data movement • Infrastructure to automatically distribute and replicate data across multiple nodes and execute and track map reduce jobs across all of those nodes • Inspired by Google’s Map Reduce and GFS papers • Components are: Hadoop Distributed File System (HDFS), Map Reduce, Job Tracker, and Task Tracker • Based on the Yahoo! “Nutch” project in 2003, became Hadoop in 2005 named after Doug Cutting’s son’s toy elephant
  • 6. Traditional RDBMS MapReduce Data Size Gigabytes (Terabytes) Petabytes (Hexabytes) Access Interactive and Batch Batch Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear DBA Ratio 1:40 1:3000 Reference: Tom White’s Hadoop: The Definitive Guide Comparing RDBMS and MapReduce
  • 7. Hadoop in action • WordCount demo • Map/Reduce • HiveQL on a weblog sample
  • 8. Sample Java MapReduce WordCount Function // Map Reduce is broken out into a Map function and reduce function // ------------------------------------------------------------------ // Sample Map function: tokenizes string and establishes the tokens // e.g. “a btcnd” is now an key value pairs representing [a, b, c, d] public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } // Sample Reduce function: does the count of these key value pairs public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); }
  • 9. Executing WordCount against sample file The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete by Leonardo Da Vinci (#3 in our series by Leonardo Da Vinci) Copyright laws are changing all over the world. Be sure to check the copyright laws for your country before downloading or redistributing this or any other Project Gutenberg eBook. This header should be the first thing seen when viewing this Project Gutenberg file. Please do not remove it. Do not change or edit the header without written permission. Please read the "legal small print," and other information about the eBook and Project Gutenberg at the bottom of this file. Included is important information about your specific rights and restrictions in how the file may be used. You can also find out about how to make a donation to Project Gutenberg, and how to get involved. Code to execute: **Welcome To The World of Free Plain hadoop jar AcmeWordCount.jar AcmeWordCount Vanilla Electronic Texts** **eBooks Readable By Both Humans and /test/davinci.txt /test/davinci_wordcount By Computers, Since 1971** *****These eBooks Were Prepared By Thousands of Volunteers!***** Purpose: To perform count of number of words within the said davinci.txt Title: The Notebooks of Leonardo Da Vinci, Complete Author: Leonardo Da Vinci laws 2 Project 5 ... …
  • 10. Query a Sample WebLog using HiveQL // Sample Generated Log 588.891.552.388,-,08/05/2011,11:00:02,W3SVC1,CTSSVR14,-,-,0,- ,200,-,GET,/c.gif,Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0,http://foo.live.com/cid- 4985109174710/blah?fdkjafdf,[GUID],-,MSFT, &PageID=1234&Region=89191&IsoCy=BR&Lang=1046&Referrer=hotmail.c om&ag=2385105&Campaign=&Event=12034 GUID Parameters [GUID] &PageID=1234&Region=89191&IsoCy=BR&Lang=104 6&Referrer=hotmail.com&ag=2385105&Campaign= &Event=12034 select GUID, str_to_map("param", "&", "=")["IsoCy"], HiveQL: SQL-like language str_to_map("param", "&", "=")["Lang"] SQL-like query which becomes • Write from MapReduce functions weblog; • Includes functions like str_to_map so one can perform parsing functions in HiveQL
  • 11. Traditional RDBMS: Move Data to Compute As you process more and more data, and you want interactive response • Typically need more expensive hardware • Failures at the points of disk and network can be quite problematic It’s all about ACID: atomicity, consistency, isolation, durability Can work around this problem with more expensive HW and systems • Though distribution problem becomes harder to do
  • 12. Hadoop: Move Compute to the Data Hadoop (and NoSQL in general) follows the Map Reduce framework • Developed initially by Google -> Map Reduce and Google File system • Embraced by community to develop MapReduce algorithms that are very robust • Built Hadoop Distributed File System (HDFS) to auto-replicate data to multiple nodes • And execute a single MR task on all/many nodes available on HDFS Use commodity HW: no need for specialized and expensive network and disk Not so much ACID, but BASE (basically available, soft state, eventually consistent)
  • 13. Hadoop in Distributed Action • Distributed Queries • Hadoop Jobs and Task Tracker (50030/50070)
  • 14. Distributed File System > Replication http://server:50070
  • 15. Job Tracker > Task mapping http://server:50030
  • 16. Task List > Tasks http://server:50030
  • 17. HiveQL Query TeraSort 14000 12577.498 300 270 12000 250 10000 200 7650.019 8000 5771.231 150 123 6000 100 4000 53 2000 50 0 0 1 2 3 8 16 32 Distribution, Linear Scalability • Key facet is that as more nodes are added, the more can be calculated, faster (minus overhead) • Designed for Batch systems (though this will change in the near future) • Roughly linear scalability for HiveQL query as well as TeraSort
  • 18. Cassandra Hadoop BackType MR/GFS SimpleDB Hive Oozie Hadoop Bigtable Dynamo Scribe Pig (-latin) Pig / Hbase Dremel EC2 / S3 Hadoop Cassandra … … Hadoop | Azure | Excel | BI | SQL DW | PDW | F# | Viola James Hadoop ecosystem | open source, commodity Mahout | Scalable machine learning and data mining MongoDB | Document-oriented database (C++) Couchbase | CouchDB (doc dB) + Membase (memcache protocol) Hbase | Hadoop column-store database R | Statistical computing and graphics Pegasus | Peta-scale graph mining system Lucene | full-featured text search engine library
  • 19. 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress 140,000-190,000 more deep analytical talent positions 1.5 million 50-60% more data saavy managers increase in the number of Hadoop developers in the US alone within organizations already using Hadoop within a year €250 billion Potential annual value to Europe’s public sector $300 billion Potential annual value to US healthcare NoSQL ecosystem | business value How Facebook moved 30 petabytes of Hadoop data. … the migration proved that disaster recovery is possible with Hadoop clusters. This could be an important capability for organizations considering relying on Hadoop (by running Hive atop the Hadoop Distributed File System) as a data warehouse, like Facebook does. As Yang notes, “Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.”
  • 20. This illustrates a new thesis or collective wisdom emerging from the Valley: If a technology is not your core value-add, it should be open- sourced because then others can improve it, and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started. Kovas Boguta: Hadoop & Startups: Where Open Source Meets Business Data.
  • 21. Hadoop JS: Our VB moment for Big Data • Harness the power of millions of Javascript developers • We give them a path to Big Data • Just like we did for VB for .NET / Enterprise market
  • 22. Hadoop.JS executing AcmeWordCount.JS // Map Reduce function in JavaScript // ------------------------------------------------------------------ var map = function (key, value, context) { var words = value.split(/[^a-zA-Z]/); for (var i = 0; i < words.length; i++) { if (words[i] !== "") { context.write(words[i].toLowerCase(), 1); } } }; var reduce = function (key, values, context) { var sum = 0; while (values.hasNext()) { sum += parseInt(values.next()); } context.write(key, sum); };
  • 23. • AD integration • Kerberos • Cluster Management and Monitoring Operations • Visual Studio Tools Integration • Hadoop JS Information • HiveQL Worker Developer • Excel Integration Hadoop on Azure and • HiveODBC: PowerPivot, Cresc Windows ent Data Scientist Hadoop Analytics • R • Lucene • Mahout • Pegasus How Microsoft address Big Data
  • 24. StreamInsight Java MR HiveQL Pig Latin CQL API Hadoop on Ocean of Data Azure and Windows EIS / ERP Database File System OData / RSS Azure Storage
  • 25. Azure + Hadoop = Tier-1 in the Cloud In short, Hadoop has the potential to make the enterprise compatible with the entire rest of the open-source and startup world… Kovas Boguta: Hadoop & Startups: Where Open Source Meets Business Data.
  • 26. BI + Tier-1 Big Data = Above the Cloud "The sun always shines above the clouds." — Paul F. Davis
  • 27. Hadoop on Azure Interactive Hive Console
  • 29. PowerPivot and Hadoop Connecting PowerPivot to Hadoop on Azure
  • 30. Power View and Hadoop Connecting Power View to Hadoop on Azure
  • 32. A Definition of Big Data - 4Vs: volume, velocity, variability, variety Big data: techniques and technologies that make handling data at extreme scale economical.
  • 33. Hadoop: Auto-replication • Hadoop processes data in 64MB chunks and then replicates to different servers • Replication value set at the hdfs-site.xml, dfs.replication node

Hinweis der Redaktion

  1. Root of this is processing of a lot of log files
  2. Taken by Hubble telescope:The Antennae Galaxies/NGC 4038-4039http://hubblesite.org/gallery/album/galaxy/pr2006046a/xlarge_web/
  3. Hubble Telescope view of Giant Galactic Nebula NGC 3503 -&gt; to -&gt; SETI @ HomeSETI @ Home Stats: 1021 floating point operations on 9/26/2001769 teraFLOPS on 11/14/2009
  4. Reference numbers from McKinsey Global Institute – Big Data: The next frontier for innovation competition (http://www.mckinsey.com/mgi/publications/big_data/index.asp)http://www.karmasphere.com/images/documents/Karmasphere-HadoopDeveloperResearch.pdf
  5. Hopewell Rocks, Hopewell Cape, Bay of Fundy, New Brunswick, Canada
  6. IsotopeJS.cmd –inputs /test/davinci.txt –output /test/resultsjs /IsotopeJS/AcmeWordCount.js
  7. The background is the Width Smasher 1.9 theme found at: http://wordpress.org/extend/themes/download/width-smasher.1.9.zip
  8. Taken by Brian Hedlund during Brian Hedlund, Kelly Hedlund, and Denny Lee ascent of Mount Rainier in August 2000
  9. http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  10. http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  11. http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  12. http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  13. http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data