SlideShare a Scribd company logo
1 of 43
Hadoop and its Ecosystem
    Components in Action
               Andrew J. Brust
                 CEO and Founder
               Blue Badge Insights
                   Level: Intermediate
Meet Andrew
 •   CEO and Founder, Blue Badge Insights
 •   Big Data blogger for ZDNet
 •   Microsoft Regional Director, MVP
 •   Co-chair VSLive! and 17 years as a speaker
 •   Founder, Microsoft BI User Group of NYC
     – http://www.msbinyc.com
 •   Co-moderator, NYC .NET Developers Group
     – http://www.nycdotnetdev.com
 •   “Redmond Review” columnist for
     Visual Studio Magazine and Redmond Developer
     News
 •   brustblog.com, Twitter: @andrewbrust
My New Blog (bit.ly/bigondata)
Read All About It!
Agenda
•   Understand:
    – MapReduce, Hadoop, and Hadoop stack elements
•   Review:
    – Microsoft Hadoop on Azure, Amazon Elastic
      MapReduce (EMR), Cloudera CDH4
•   Demo:
    – Hadoop, HBase, Hive, Pig
•   Discuss:
    – Sqoop, Flume
•   Demo
    – Mahout
MapReduce, in a Diagram


        Input   mapper   Output

                                  K1, K4…

        Input   mapper   Output   Input     reducer   Output

                                  K2, K5…
                mapper   Output                                Output
        Input                     Input     reducer   Output
Input
                                  K3, K6…
        Input   mapper   Output
                                  Input     reducer   Output


        Input   mapper   Output


        Input   mapper   Output
A MapReduce Example


          • Count by suite, on each floor

          • Send per-suite, per platform totals to lobby

          • Sort totals by platform

         • Send two platform packets to 10th, 20th, 30th floor

          •                                      Tally up
                                                 each
          • Collect the tallies                  platform

          • Merge tallies into one spreadsheet
What’s a Distributed File System?
•   One where data gets distributed over
    commodity drives on commodity servers
•   Data is replicated
•   If one box goes down, no data lost
    – Except the name node = SPOF!
•   BUT: HDFS is immutable
    – Files can only be written to once
    – So updates require drop + re-write (slow)
Hadoop = MapReduce + HDFS
•   Modeled after Google MapReduce + GFS
•   Have more data? Just add more nodes to
    cluster.
    – Mappers execute in parallel
    – Hardware is commodity
    – “Scaling out”
•   Use of HDFS means data may well be local
    to mapper processing
•   So, not just parallel, but minimal data
    movement, which avoids network
    bottlenecks
The Hadoop Stack
•   Hadoop
    – MapReduce, HDFS
•   Hbase
    – NoSQL Database
    – Lesser extent: Cassandra, HyperTable
•   Hive, Pig
    – SQL-like “data warehouse” system
    – Data transformation language
•   Sqoop
    – Import/export between HDFS, HBase,
      Hive and relational data warehouses
•   Flume
    – Log file integration
•   Mahout
    – Data Mining
Ways to work
•   Microsoft Hadoop on Azure
    – Visit www.hadooponazure.com
    – Request invite
    – Free, for now
•   Amazon Web Services Elastic MapReduce
    – Create AWS account
    – Select Elastic MapReduce in Dashboard
    – Cheap for experimenting, but not free
•   Cloudera CDH VM image
    –   Download as .tar.gz file
    –   “Un-tar” (can use WinRAR, 7zip)
    –   Run via VMWare Player or Virtual Box
    –   Everything’s free
Microsoft Hadoop on Azure
•   Much simpler than the others
•   Browser-based portal
    – Provisioning cluster, managing ports, MapReduce jobs
    – Gathering external data
                 Configure Azure Storage, Amazon S3
                 Import from DataMarket to Hive
•   Interactive JavaScript console
    – HDFS, Pig, light data visualization
•   Interactive Hive console
    – Hive commands and metadata discovery
•   From Portal page you can RDP directly to
    Hadoop head node
    – Double click desktop shortcut for CLI access
    – Certain environment variables may need to be set
Microsoft Hadoop on
Azure
Amazon Elastic MapReduce
•   Lots of steps!
•   At a high level:
    – Setup AWS account and S3 “buckets”
    – Generate Key Pair and PEM file
    – Install Ruby and EMR Command Line Interface
    – Provision the cluster using CLI
               A batch file can work very well here
    – Setup and run SSH/PuTTY
    – Work interactively at command line
Amazon EMR – Prep Steps
•   Create an AWS account
•   Create an S3 bucket for log storage
    – with list permissions for authenticated users
•   Create a Key Pair and save PEM file
•   Install Ruby
•   Install Amazon Web Services Elastic
    MapReduce Command Line Interface
    – aka AWS EMR CLI 
•   Create credentials.json in EMR CLI folder
    – Associate with same region as where key pair created
Amazon – Security and Startup
•   Security
    –   Download PuTTYgen and run it
    –   Click Load and browse to PEM file
    –   Save it in PPK format
    –   Exit PuTTYgen
•   In a command window, navigate to EMR CLI
    folder and enter command:
    – ruby elastic-mapreduce --create --alive [--num-instance xx]
      [--pig-interactive] [--hive-interactive] [--hbase --instance-type
      m1.large]
•   In AWS Console, go to EC2 Dashboard and
    click Instances on left nav bar
•   Wait until instance is running and get its
    Public DNS name
    – Use Compatibility View in IE or copy may not work
Connect!
•   Download and run PuTTY
•   Paste DNS name of EC2 instance into hostname
    field
•   In Treeview, drill down and navigate to
    ConnectionSSHAuth, browse to PPK file
•   Once EC2 instance(s) running, click Open
•   Click Yes to “The server’s host key is not cached
    in the registry…” PuTTY Security Alert
•   When prompted for user name, type “hadoop” and
    hit Enter
•   cd bin, then hive, pig, hbase shell
•   Right-click to paste from clipboard; option to go
    full-screen
•   (Kill EC2 instance(s) from Dashboard when done)
Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
•   Get it for free, in VMWare and Virtual Box
    versions.
    – VMWare player and Virtual Box are free too
•   Run it, and configure it to have its own IP on
    your network.
•   Assuming IP of 192.168.1.59, open browser on
    your own (host ) machine and navigate to:
    – http://192.168.1.59:8888
•   Can also use browser in VM and hit:
    – http://localhost:8888
•   Work in “Hue”…
Hue
•   Browser based UI,
    with front ends
    for:
    – HDFS (w/ upload &
      download)
    – MapReduce job
      creation and
      monitoring
    – Hive (“Beeswax”)
•   And in-browser
    command line
    shells for:
    – HBase
    – Pig (“Grunt”)
Cloudera CDH4
Hadoop commands
•   HDFS
    – hadoop fs filecommand
    – Create and remove directories:
       mkdir, rm, rmr
    – Upload and download files to/from HDFS
       get, put
    – View directory contents
       ls, lsr
    – Copy, move, view files
       cp, mv, cat
•   MapReduce
    – Run a Java jar-file based job
       hadoop jar jarname params
Hadoop (directly)
HBase
•   Concepts:
    – Tables, column families
    – Columns, rows
    – Keys, values
•   Commands:
    –   Definition: create, alter, drop, truncate
    –   Manipulation: get, put, delete, deleteall, scan
    –   Discovery: list, exists, describe, count
    –   Enablement: disable, enable
    –   Utilities: version, status, shutdown, exit
    –   Reference: http://wiki.apache.org/hadoop/Hbase/Shell
•   Moreover,
    – Interesting HBase work can be done in MapReduce, Pig
HBase Examples
•   create 't1', 'f1', 'f2', 'f3'
•   describe 't1'
•   alter 't1', {NAME => 'f1',
    VERSIONS => 5}
•   put 't1', 'r1', 'c1:f1', 'value'
•   get 't1', 'r1'
•   count 't1'
HBase
Hive
•   Used by most BI products which connect
    to Hadoop
•   Provides a SQL-like abstraction over
    Hadoop
    – Officially HiveQL, or HQL
•   Works on own tables, but also on HBase
•   Query generates MapReduce job, output
    of which becomes result set
•   Microsoft has Hive ODBC driver
    – Connects Excel, Reporting Services, PowerPivot,
      Analysis Services Tabular Mode (only)
Hive, Continued
•   Load data from flat HDFS files
    – LOAD DATA [LOCAL] INPATH 'myfile'
      INTO TABLE mytable;
•   SQL Queries
    – CREATE, ALTER, DROP
    – INSERT OVERWRITE (creates whole tables)
    – SELECT, JOIN, WHERE, GROUP BY
    – SORT BY, but ordering data is tricky!
    – MAP/REDUCE/TRANSFORM…USING allows for custom
      map, reduce steps utilizing Java or streaming code
Hive
Pig
•   Instead of SQL, employs a language (“Pig
    Latin”) that accommodates data flow
    expressions
    – Do a combo of Query and ETL
•   “10 lines of Pig Latin ≈ 200 lines of Java.”
•   Works with structured or unstructured data
•   Operations
    – As with Hive, a MapReduce job is generated
    – Unlike Hive, output is only flat file to HDFS or text at
      command line console
    – With MS Hadoop, can easily convert to JavaScript array,
      then manipulate
•   Use command line (“Grunt”) or build scripts
Example
•   A = LOAD 'myfile'
      AS (x, y, z);
    B = FILTER A by x > 0;
    C = GROUP B BY x;
    D = FOREACH A GENERATE
      x, COUNT(B);
    STORE D INTO 'output';
Pig Latin Examples
•   Imperative, file system commands
    – LOAD, STORE
        Schema specified on LOAD
•   Declarative, query commands (SQL-like)
    – xxx = file or data set
    – FOREACH xxx GENERATE (SELECT…FROM xxx)
    – JOIN (WHERE/INNER JOIN)
    – FILTER xxx BY (WHERE)
    – ORDER xxx BY (ORDER BY)
    – GROUP xxx BY / GENERATE COUNT(xxx)
      (SELECT COUNT(*) GROUP BY)
    – DISTINCT (SELECT DISTINCT)
•   Syntax is assignment statement-based:
    – MyCusts = FILTER Custs BY SalesPerson eq 15;
•   Access Hbase
    – CpuMetrics = LOAD 'hbase://SystemMetrics' USING
      org.apache.pig.backend.hadoop.hbase.HBaseStorage('cp
      u:','-loadKey -returnTuple');
Pig
Sqoop
sqoop import
 --connect
  "jdbc:sqlserver://<servername>.
   database.windows.net:1433;
   database=<dbname>;
   user=<username>@<servername>;
   password=<password>"
 --table <from_table>
 --target-dir <to_hdfs_folder>
 --split-by <from_table_column>
Sqoop
sqoop export
 --connect
  "jdbc:sqlserver://<servername>.
   database.windows.net:1433;
   database=<dbname>;
   user=<username>@<servername>;
   password=<password>"
 --table <to_table>
 --export-dir <from_hdfs_folder>
 --input-fields-terminated-by
  "<delimiter>"
Flume NG
•   Source
    – Avro (data serialization system – can read json-
      encoded data files, and can work over RPC)
    – Exec (reads from stdout of long-running process)
•   Sinks
    – HDFS, HBase, Avro
•   Channels
    – Memory, JDBC, file
Flume NG (next generation)
•     Setup conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1


•     From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
•   Recommendation
    – Your info + community info
    – Give users/items/ratings; get user-user/item-item
    – itemsimilarity
•   Classification/Categorization
    – Drop into buckets
    – Naïve Bayes, Complementary Naïve Bayes, Decision
      Forests
•   Clustering
    – Like classification, but with categories unknown
    – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-
      Shift
Workflow, Syntax
•   Workflow
    – Run the job
    – Dump the output
    – Visualize, predict
•   mahout algorithm
      -- input folderspec
      -- output folderspec
      -- param1 value1
      -- param2 value2
    …
•   Example:
    – mahout itemsimilarity
        --input <input-hdfs-path>
        --output <output-hdfs-path>
        --tempDir <tmp-hdfs-path>
        -s SIMILARITY_LOGLIKELIHOOD
Mahout
The Truth About Mahout
•   Mahout is really just an algorithm engine
•   Its output is almost unusable by non-
    statisticians/non-data scientists
•   You need a staff or a product to visualize, or
    make into a usable prediction model
•   Investigate Predixion Software
    – CTO, Jamie MacLennan, used to lead SQL Server Data
      Mining team
    – Excel add-in can use Mahout remotely, visualize its output,
      run predictive analyses
    – Also integrates with SQL Server, Greenplum, MapReduce
    – http://www.predixionsoftware.com
Resources
•   Big On Data blog
    – http://www.zdnet.com/blog/big-data
•   Apache Hadoop home page
    – http://hadoop.apache.org/
•   Hive & Pig home pages
    – http://hive.apache.org/
    – http://pig.apache.org/
•   Hadoop on Azure home page
    – https://www.hadooponazure.com/
•   Cloudera CDH 4 download
    – https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
•   SQL Server 2012 Big Data
    – http://bit.ly/sql2012bigdata
Thank you



•   andrew.brust@bluebadgeinsights.com
•   @andrewbrust on twitter
•   Get Blue Badge’s free briefings
    – Text “bluebadge” to 22828

More Related Content

What's hot

Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and dockerFabio Fumarola
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudAndrei Savu
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with DockerFabio Fumarola
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Yahoo Developer Network
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
 
April 2013 HUG: HBase as a Service at Yahoo!
April 2013 HUG: HBase as a Service at Yahoo!April 2013 HUG: HBase as a Service at Yahoo!
April 2013 HUG: HBase as a Service at Yahoo!Yahoo Developer Network
 
Session 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic CommandsSession 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic CommandsAnandMHadoop
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementDataWorks Summit
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4caizer_x
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopAllen Wittenauer
 

What's hot (20)

Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and docker
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the Cloud
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Hdfs java api
Hdfs java apiHdfs java api
Hdfs java api
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
April 2013 HUG: HBase as a Service at Yahoo!
April 2013 HUG: HBase as a Service at Yahoo!April 2013 HUG: HBase as a Service at Yahoo!
April 2013 HUG: HBase as a Service at Yahoo!
 
Session 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic CommandsSession 03 - Hadoop Installation and Basic Commands
Session 03 - Hadoop Installation and Basic Commands
 
Unit ii sem-v-hadoop
Unit ii  sem-v-hadoopUnit ii  sem-v-hadoop
Unit ii sem-v-hadoop
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4
 
HBase lon meetup
HBase lon meetupHBase lon meetup
HBase lon meetup
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache Hadoop
 

Viewers also liked

Town of Ladysmith Economic Development Plan 2013
Town of Ladysmith Economic Development Plan 2013Town of Ladysmith Economic Development Plan 2013
Town of Ladysmith Economic Development Plan 2013ladysmithdowntown
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIAndrew Brust
 
Azure ml screen grabs
Azure ml screen grabsAzure ml screen grabs
Azure ml screen grabsAndrew Brust
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
NoSQL: An Analysis
NoSQL: An AnalysisNoSQL: An Analysis
NoSQL: An AnalysisAndrew Brust
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 

Viewers also liked (7)

Town of Ladysmith Economic Development Plan 2013
Town of Ladysmith Economic Development Plan 2013Town of Ladysmith Economic Development Plan 2013
Town of Ladysmith Economic Development Plan 2013
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
 
Azure ml screen grabs
Azure ml screen grabsAzure ml screen grabs
Azure ml screen grabs
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
NoSQL: An Analysis
NoSQL: An AnalysisNoSQL: An Analysis
NoSQL: An Analysis
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 

Similar to Brust hadoopecosystem

Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Hadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionHadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionAndrew Brust
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User GroupCsaba Toth
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...NashvilleTechCouncil
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 

Similar to Brust hadoopecosystem (20)

Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
מיכאל
מיכאלמיכאל
מיכאל
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Hadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in ActionHadoop and its Ecosystem Components in Action
Hadoop and its Ecosystem Components in Action
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 

More from Andrew Brust

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackAndrew Brust
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
 
NoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooNoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooAndrew Brust
 
A Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooA Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooAndrew Brust
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandAndrew Brust
 
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012Andrew Brust
 
Power View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataPower View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataAndrew Brust
 
Evolved BI with SQL Server 2012
Evolved BIwith SQL Server 2012Evolved BIwith SQL Server 2012
Evolved BI with SQL Server 2012Andrew Brust
 
Grasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmGrasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmAndrew Brust
 
SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms Andrew Brust
 
Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Andrew Brust
 
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisCloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisAndrew Brust
 

More from Andrew Brust (13)

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
NoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooNoSQL and The Big Data Hullabaloo
NoSQL and The Big Data Hullabaloo
 
A Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data HullabalooA Practical Look at the NOSQL and Big Data Hullabaloo
A Practical Look at the NOSQL and Big Data Hullabaloo
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-Land
 
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012SQL Server Workshop for Developers - Visual Studio Live! NY 2012
SQL Server Workshop for Developers - Visual Studio Live! NY 2012
 
Power View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s DataPower View: Analysis and Visualization for Your Application’s Data
Power View: Analysis and Visualization for Your Application’s Data
 
Evolved BI with SQL Server 2012
Evolved BIwith SQL Server 2012Evolved BIwith SQL Server 2012
Evolved BI with SQL Server 2012
 
Grasping The LightSwitch Paradigm
Grasping The LightSwitch ParadigmGrasping The LightSwitch Paradigm
Grasping The LightSwitch Paradigm
 
SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms SQL Server Denali: BI on Your Terms
SQL Server Denali: BI on Your Terms
 
Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis Microsoft and its Competition: A Developer-Friendly Market Analysis
Microsoft and its Competition: A Developer-Friendly Market Analysis
 
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth AnalysisCloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
Cloud Computing and the Microsoft Developer - A Down-to-Earth Analysis
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Brust hadoopecosystem

  • 1. Hadoop and its Ecosystem Components in Action Andrew J. Brust CEO and Founder Blue Badge Insights Level: Intermediate
  • 2. Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust
  • 3. My New Blog (bit.ly/bigondata)
  • 5. Agenda • Understand: – MapReduce, Hadoop, and Hadoop stack elements • Review: – Microsoft Hadoop on Azure, Amazon Elastic MapReduce (EMR), Cloudera CDH4 • Demo: – Hadoop, HBase, Hive, Pig • Discuss: – Sqoop, Flume • Demo – Mahout
  • 6. MapReduce, in a Diagram Input mapper Output K1, K4… Input mapper Output Input reducer Output K2, K5… mapper Output Output Input Input reducer Output Input K3, K6… Input mapper Output Input reducer Output Input mapper Output Input mapper Output
  • 7. A MapReduce Example • Count by suite, on each floor • Send per-suite, per platform totals to lobby • Sort totals by platform • Send two platform packets to 10th, 20th, 30th floor • Tally up each • Collect the tallies platform • Merge tallies into one spreadsheet
  • 8. What’s a Distributed File System? • One where data gets distributed over commodity drives on commodity servers • Data is replicated • If one box goes down, no data lost – Except the name node = SPOF! • BUT: HDFS is immutable – Files can only be written to once – So updates require drop + re-write (slow)
  • 9. Hadoop = MapReduce + HDFS • Modeled after Google MapReduce + GFS • Have more data? Just add more nodes to cluster. – Mappers execute in parallel – Hardware is commodity – “Scaling out” • Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks
  • 10. The Hadoop Stack • Hadoop – MapReduce, HDFS • Hbase – NoSQL Database – Lesser extent: Cassandra, HyperTable • Hive, Pig – SQL-like “data warehouse” system – Data transformation language • Sqoop – Import/export between HDFS, HBase, Hive and relational data warehouses • Flume – Log file integration • Mahout – Data Mining
  • 11. Ways to work • Microsoft Hadoop on Azure – Visit www.hadooponazure.com – Request invite – Free, for now • Amazon Web Services Elastic MapReduce – Create AWS account – Select Elastic MapReduce in Dashboard – Cheap for experimenting, but not free • Cloudera CDH VM image – Download as .tar.gz file – “Un-tar” (can use WinRAR, 7zip) – Run via VMWare Player or Virtual Box – Everything’s free
  • 12. Microsoft Hadoop on Azure • Much simpler than the others • Browser-based portal – Provisioning cluster, managing ports, MapReduce jobs – Gathering external data Configure Azure Storage, Amazon S3 Import from DataMarket to Hive • Interactive JavaScript console – HDFS, Pig, light data visualization • Interactive Hive console – Hive commands and metadata discovery • From Portal page you can RDP directly to Hadoop head node – Double click desktop shortcut for CLI access – Certain environment variables may need to be set
  • 14. Amazon Elastic MapReduce • Lots of steps! • At a high level: – Setup AWS account and S3 “buckets” – Generate Key Pair and PEM file – Install Ruby and EMR Command Line Interface – Provision the cluster using CLI A batch file can work very well here – Setup and run SSH/PuTTY – Work interactively at command line
  • 15. Amazon EMR – Prep Steps • Create an AWS account • Create an S3 bucket for log storage – with list permissions for authenticated users • Create a Key Pair and save PEM file • Install Ruby • Install Amazon Web Services Elastic MapReduce Command Line Interface – aka AWS EMR CLI  • Create credentials.json in EMR CLI folder – Associate with same region as where key pair created
  • 16. Amazon – Security and Startup • Security – Download PuTTYgen and run it – Click Load and browse to PEM file – Save it in PPK format – Exit PuTTYgen • In a command window, navigate to EMR CLI folder and enter command: – ruby elastic-mapreduce --create --alive [--num-instance xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large] • In AWS Console, go to EC2 Dashboard and click Instances on left nav bar • Wait until instance is running and get its Public DNS name – Use Compatibility View in IE or copy may not work
  • 17. Connect! • Download and run PuTTY • Paste DNS name of EC2 instance into hostname field • In Treeview, drill down and navigate to ConnectionSSHAuth, browse to PPK file • Once EC2 instance(s) running, click Open • Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert • When prompted for user name, type “hadoop” and hit Enter • cd bin, then hive, pig, hbase shell • Right-click to paste from clipboard; option to go full-screen • (Kill EC2 instance(s) from Dashboard when done)
  • 19. Cloudera CDH4 Virtual Machine • Get it for free, in VMWare and Virtual Box versions. – VMWare player and Virtual Box are free too • Run it, and configure it to have its own IP on your network. • Assuming IP of 192.168.1.59, open browser on your own (host ) machine and navigate to: – http://192.168.1.59:8888 • Can also use browser in VM and hit: – http://localhost:8888 • Work in “Hue”…
  • 20. Hue • Browser based UI, with front ends for: – HDFS (w/ upload & download) – MapReduce job creation and monitoring – Hive (“Beeswax”) • And in-browser command line shells for: – HBase – Pig (“Grunt”)
  • 22. Hadoop commands • HDFS – hadoop fs filecommand – Create and remove directories: mkdir, rm, rmr – Upload and download files to/from HDFS get, put – View directory contents ls, lsr – Copy, move, view files cp, mv, cat • MapReduce – Run a Java jar-file based job hadoop jar jarname params
  • 24. HBase • Concepts: – Tables, column families – Columns, rows – Keys, values • Commands: – Definition: create, alter, drop, truncate – Manipulation: get, put, delete, deleteall, scan – Discovery: list, exists, describe, count – Enablement: disable, enable – Utilities: version, status, shutdown, exit – Reference: http://wiki.apache.org/hadoop/Hbase/Shell • Moreover, – Interesting HBase work can be done in MapReduce, Pig
  • 25. HBase Examples • create 't1', 'f1', 'f2', 'f3' • describe 't1' • alter 't1', {NAME => 'f1', VERSIONS => 5} • put 't1', 'r1', 'c1:f1', 'value' • get 't1', 'r1' • count 't1'
  • 26. HBase
  • 27. Hive • Used by most BI products which connect to Hadoop • Provides a SQL-like abstraction over Hadoop – Officially HiveQL, or HQL • Works on own tables, but also on HBase • Query generates MapReduce job, output of which becomes result set • Microsoft has Hive ODBC driver – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
  • 28. Hive, Continued • Load data from flat HDFS files – LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable; • SQL Queries – CREATE, ALTER, DROP – INSERT OVERWRITE (creates whole tables) – SELECT, JOIN, WHERE, GROUP BY – SORT BY, but ordering data is tricky! – MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code
  • 29. Hive
  • 30. Pig • Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions – Do a combo of Query and ETL • “10 lines of Pig Latin ≈ 200 lines of Java.” • Works with structured or unstructured data • Operations – As with Hive, a MapReduce job is generated – Unlike Hive, output is only flat file to HDFS or text at command line console – With MS Hadoop, can easily convert to JavaScript array, then manipulate • Use command line (“Grunt”) or build scripts
  • 31. Example • A = LOAD 'myfile' AS (x, y, z); B = FILTER A by x > 0; C = GROUP B BY x; D = FOREACH A GENERATE x, COUNT(B); STORE D INTO 'output';
  • 32. Pig Latin Examples • Imperative, file system commands – LOAD, STORE Schema specified on LOAD • Declarative, query commands (SQL-like) – xxx = file or data set – FOREACH xxx GENERATE (SELECT…FROM xxx) – JOIN (WHERE/INNER JOIN) – FILTER xxx BY (WHERE) – ORDER xxx BY (ORDER BY) – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY) – DISTINCT (SELECT DISTINCT) • Syntax is assignment statement-based: – MyCusts = FILTER Custs BY SalesPerson eq 15; • Access Hbase – CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cp u:','-loadKey -returnTuple');
  • 33. Pig
  • 34. Sqoop sqoop import --connect "jdbc:sqlserver://<servername>. database.windows.net:1433; database=<dbname>; user=<username>@<servername>; password=<password>" --table <from_table> --target-dir <to_hdfs_folder> --split-by <from_table_column>
  • 35. Sqoop sqoop export --connect "jdbc:sqlserver://<servername>. database.windows.net:1433; database=<dbname>; user=<username>@<servername>; password=<password>" --table <to_table> --export-dir <from_hdfs_folder> --input-fields-terminated-by "<delimiter>"
  • 36. Flume NG • Source – Avro (data serialization system – can read json- encoded data files, and can work over RPC) – Exec (reads from stdout of long-running process) • Sinks – HDFS, HBase, Avro • Channels – Memory, JDBC, file
  • 37. Flume NG (next generation) • Setup conf/flume.conf # Define a memory channel called ch1 on agent1 agent1.channels.ch1.type = memory # Define an Avro source called avro-source1 on agent1 and tell it # to bind to 0.0.0.0:41414. Connect it to channel ch1. agent1.sources.avro-source1.channels = ch1 agent1.sources.avro-source1.type = avro agent1.sources.avro-source1.bind = 0.0.0.0 agent1.sources.avro-source1.port = 41414 # Define a logger sink that simply logs all events it receives # and connect it to the other end of the same channel. agent1.sinks.log-sink1.channel = ch1 agent1.sinks.log-sink1.type = logger # Finally, now that we've defined all of our components, tell # agent1 which ones we want to activate. agent1.channels = ch1 agent1.sources = avro-source1 agent1.sinks = log-sink1 • From the command line: flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
  • 38. Mahout Algorithms • Recommendation – Your info + community info – Give users/items/ratings; get user-user/item-item – itemsimilarity • Classification/Categorization – Drop into buckets – Naïve Bayes, Complementary Naïve Bayes, Decision Forests • Clustering – Like classification, but with categories unknown – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean- Shift
  • 39. Workflow, Syntax • Workflow – Run the job – Dump the output – Visualize, predict • mahout algorithm -- input folderspec -- output folderspec -- param1 value1 -- param2 value2 … • Example: – mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
  • 41. The Truth About Mahout • Mahout is really just an algorithm engine • Its output is almost unusable by non- statisticians/non-data scientists • You need a staff or a product to visualize, or make into a usable prediction model • Investigate Predixion Software – CTO, Jamie MacLennan, used to lead SQL Server Data Mining team – Excel add-in can use Mahout remotely, visualize its output, run predictive analyses – Also integrates with SQL Server, Greenplum, MapReduce – http://www.predixionsoftware.com
  • 42. Resources • Big On Data blog – http://www.zdnet.com/blog/big-data • Apache Hadoop home page – http://hadoop.apache.org/ • Hive & Pig home pages – http://hive.apache.org/ – http://pig.apache.org/ • Hadoop on Azure home page – https://www.hadooponazure.com/ • Cloudera CDH 4 download – https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads • SQL Server 2012 Big Data – http://bit.ly/sql2012bigdata
  • 43. Thank you • andrew.brust@bluebadgeinsights.com • @andrewbrust on twitter • Get Blue Badge’s free briefings – Text “bluebadge” to 22828

Editor's Notes

  1. Visual Studio Live! Redmond 2012 © 2012 Visual Studio Live! All rights reserved.
  2. Visual Studio Live! Redmond 2012 © 2012 Visual Studio Live! All rights reserved.
  3. Visual Studio Live! Redmond 2012 © 2012 Visual Studio Live! All rights reserved.