Big Data Warehousing Meetup

Today’s Topic: Exploring Big Data
Analytics Techniques with Datameer




                                     Sponsored By:
WELCOME!
  Joe Caserta
  Founder & President, Caserta Concepts
Agenda
7:00     Networking
         Grab a slice of pizza and a drink...



7:15     Joe Caserta                              Welcome
         President, Caserta Concepts              About the Meetup and about Caserta Concepts
         Author, Data Warehouse ETL Toolkit


7:30     Elliott Cordo                            Pig and Hive
         Principal Consultant, Caserta Concepts   Walkthrough of these powerful native Hadoop tools



7:50     Adam Gugliciello                         Datameer
         Solutions Engineer, Datameer



8:10 -   More Networking
9:00     Tell us what you’re up to…
About BDW Meetup
• Big Data is a complex, rapidly
 changing landscape

• We want to share our stories and
 hear about yours

• Great networking opportunity for
 like-minded data nerds

• Opportunities to collaborate on
 exciting projects

• Next BDW Meetup: April 22.
• Topic: Intro to NoSQL Databases
About Caserta Concepts
 Focused Expertise
 •   Big Data Analytics
 •   Data Warehousing
 •   Business Intelligence
 •   Strategic Data Ecosystems

 Industries Served
 •   Financial Services
 •   Healthcare / Insurance
 •   Retail / eCommerce
 •   Digital Media / Marketing
 •   K-12 / Higher Education

     Founded in 2001

     • President: Joe Caserta, industry thought leader,
       consultant, educator and co-author, The Data
       Warehouse ETL Toolkit (Wiley, 2004)
Client Portfolio
Finance
& Insurance




Retail/eCommerce
& Manufacturing




Education
& Services
Expertise & Offerings
 Strategic Roadmap/
 Assessment/Consulting


 Big Data
 Analytics




 Data Warehousing/
 ETL/Data Integration


 BI/Visualization/
 Analytics



 Master Data Management
Opportunities
Does this word cloud excite you?




Speak with us about our open positions: jobs@casertaconcepts.com
Contacts

     Joe Caserta
     President & Founder, Caserta Concepts
     P: (855) 755-2246 x227
     E: joe@casertaconcepts.com


     Erik Laurence
     VP Marketing, Caserta Concepts
     P: (855) 755-2246 x528
     E: erik@casertaconcepts.com

     Elliott Cordo
     Principal Consultant, Caserta Concepts
     P: (855) 755-2246 x267
     E: elliott@casertaconcepts.com

     Caserta Concepts
     info@casertaconcepts.com
     1 (855) 755-2246
     www.casertaconcepts.com
ANALYZING DATA: PIG AND HIVE
    Elliott Cordo
    Principal Consultant, Caserta Concepts
Big Data Analysis
• Let’s review some tools for analyzing and processing Big
 Data




• We will go over some simple use cases – point out what is
 interesting about them

• Develop a point of view of what each one is well suited for.
Big Data Analysis – Map Reduce?
Distributed programming framework – Divide and Conquer!
  • Master divides work into digestible chunks and distributes them to worker
    nodes -> MAP
  • Work from the nodes is then collected by the master and combined to form an
    answer -> REDUCE

Powerful tool for solving interesting computational problems at scale
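
The divide-and-conquer idea above can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps on a single machine, not Hadoop's API: the function names (`split_into_chunks`, `map_chunk`, `reduce_pairs`, `count_by_key`) and the sample occupations are assumptions made up for this sketch.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, num_workers):
    """Master step: divide the input into one chunk per (hypothetical) worker."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_chunk(chunk):
    """MAP: a worker emits a (key, 1) pair for every record in its chunk."""
    return [(occupation, 1) for occupation in chunk]


def reduce_pairs(pairs):
    """REDUCE: combine the pairs from all workers into one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_by_key(records, num_workers=3):
    chunks = split_into_chunks(records, num_workers)
    # On a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_pairs(chain.from_iterable(mapped))


# Hypothetical sample data, not taken from the MovieLens files used later.
occupations = ["student", "engineer", "student", "writer", "engineer", "student"]
counts = count_by_key(occupations)
```

Writing even this small count by hand shows why the next slides turn to higher-level tools: the same result is a one-line GROUP BY in Hive.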
HELP
• We are doing low-level language coding to perform
 low-level operations

• For productivity we need higher level tools!

• We will get help from a few animals!




[Figure: nodes N1 to N5 over the Hadoop Distributed File System (HDFS)]
HIVE
• The Hadoop “Data Warehouse”


• HiveQL is a SQL-like interface that lets you impose a
 "relational-db like" structure on top of non-relational or
 unstructured data
  • Flat files, JSON, web logs
  • HBase, Cassandra, other NoSQL stores like MongoDB


• Thanks to ODBC/JDBC drivers some conventional BI
 tools can interact with Hive

• Ability to integrate custom programming, mappers,
 reducers
HIVE
But don’t get too excited!
• Hive is not a Database, especially in terms of
  optimizations.

• SQL is compiled into MapReduce jobs, so expect even
 simple queries to take around a minute or more.
 (Start query, go get coffee.)



• But now that expectations have been set, it’s still a very
 useful tool
HIVE DDL– Create and load a table
Looks like a typical table declaration, except we are specifying the ingested file format:

hive> create table user_movie_ratings(
  > user_id int,
  > movie_id int,
  > rating int,
  > time_unix_ts string)
  > row format delimited
  > fields terminated by '\t'
  > stored as textfile;
OK
Time taken: 0.395 seconds

hive> load data inpath '/user/hive/staging/data/u.data' overwrite into table
user_movie_ratings;
Loading data to table default.user_movie_ratings
Deleted hdfs://localhost:54310/user/hive/warehouse/user_movie_ratings
Table default.user_movie_ratings stats: [num_partitions: 0, num_files: 1, num_rows: 0,
total_size: 1979173, raw_data_size: 0]
OK
Time taken: 0.474 seconds
HIVE DDL– Create an external table
This time we don’t want Hive to own this data’s lifecycle:

hive> create external table user (
  > user_id int,
  > age int,
  > gender string,
  > occupation string,
  > postal_code int )
  > row format delimited fields terminated by '|'
  > location '/user/hive/staging/user';
OK
Time taken: 0.096 seconds
HIVE – YAY SQL!
hive> select occupation, count(1)
  > from user_movie_ratings m
  > join user u on u.user_id=m.user_id
  > group by occupation;

Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Total MapReduce CPU Time Spent: 47 seconds 170 msec
OK

administrator 7479
artist 2308
doctor 540
educator 9442
engineer 8175
entertainment 2095
….
retired 1609
salesman 856
scientist 2058
student 21957
technician 3506
writer 5536
Time taken: 110.331 seconds

Hmmm..
PIG
• Powerful High Level Programming Language


• SQL-ish, small learning curve for SQL and procedural
 programmers

• Excellent for data transformation, ETL


• Not meant to be an ad-hoc query tool, happy with doing
 grunt work

• Plenty of supported file formats and databases, plus the
 ability to create custom UDFs
PIG Example
grunt> lens_users = load '/user/movie_lens/u.user' using PigStorage('|') as
(user_id:int, age:int, gender:chararray, occupation:chararray, postal_code:int);

grunt> lens_data = load '/user/movie_lens/u.data' using PigStorage('\t') as
(user_id:int, movie_id:int, rating:int, time_unix_ts:chararray);

grunt> joined = join lens_users by user_id, lens_data by user_id;

grunt> grouped = group joined by (occupation);

grunt> results = FOREACH grouped GENERATE COUNT_STAR(joined),*;

grunt> store results into '/user/movie_lens_user_summary';

Interesting: we are doing our aggregate functions after grouping.
PIG - Results
Grouping in PIG is a fair deviation from SQL ->
original elements are preserved in a bag
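
To make the bag idea concrete, here is a small Python sketch that mimics what GROUP followed by FOREACH ... GENERATE does. It is an analogy rather than Pig itself: `pig_group`, `foreach_count` and the sample rows are hypothetical names and data invented for this illustration.

```python
from collections import defaultdict


def pig_group(rows, key):
    """Mimic GROUP: return (group_key, bag) pairs; each bag keeps the original rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return list(bags.items())


def foreach_count(grouped):
    """Mimic FOREACH ... GENERATE with a count: aggregate each bag after grouping."""
    return [(group_key, len(bag)) for group_key, bag in grouped]


# Hypothetical rows standing in for the joined users and ratings.
joined = [
    {"user_id": 1, "occupation": "student", "movie_id": 10, "rating": 4},
    {"user_id": 1, "occupation": "student", "movie_id": 11, "rating": 5},
    {"user_id": 2, "occupation": "engineer", "movie_id": 10, "rating": 3},
    {"user_id": 3, "occupation": "student", "movie_id": 12, "rating": 2},
]

grouped = pig_group(joined, "occupation")  # nothing is aggregated yet
results = foreach_count(grouped)           # aggregation is a separate, later step
```

In SQL the GROUP BY and the aggregate belong to one statement and the detail rows are gone afterwards. Here, as in Pig, `grouped` still carries every original row, so further per-group processing remains possible.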
Summary
Hive:
• Helpful for ETL
• Very good for Ad-Hoc Analysis - Not necessarily suited
  for front end users but definitely helpful for data analysts
• Directly leverages SQL expertise!!


PIG:
• Great for ETL
• Powerful transformation and processing capabilities
• SQL-like, but different in many ways; it will take some time
  to master.
Big Data Warehousing - Meetup
