SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Presented by
Stephen Peter
The Hadoop Data Access Layer
Stephen Peter
E-Mail: Stephen.peter@gmail.com
LinkedIn - https://in.linkedin.com/in/stephenepeter
Hortonworks Certified Trainer.
Hortonworks Certified Developer (Apache Pig & Hive)
Digital Badge : http://bcert.me/sxohnqiq
Professional Experience: Over 20 years of IT experience with
specialization in Business Intelligence , Data warehousing and Big Data.
Worked in organizations such as HCL Tech, Oracle , Cisco Systems.
Presently working as Hadoop trainer at Spring People.
Area of interest: coexistence of Enterprise DW and Hadoop
Introduction
• The motivation for Hadoop
▫ The need for ingesting, storing and analyzing big data.
▫ Use cases on the value of Big Data.
• Hadoop as an integral part of Modern Data Architecture.
• The HDP (Hortonworks Data Platform) reference architecture.
▫ HDP Data Access Layer.
 The different components its functions and application.
• Use case – Data warehouse Optimization using Hadoop.
▫ to achieve better insight and cost effectiveness.
Agenda
Emerging Data landscape
• In the past the world’s data doubled every
century, now its every 2 years.
• The flood of data is driven by IOT, mobile
devices, server logs, geo location coordinates,
social media and sensor data.
• Big data is characterized by:
 Velocity – 90% of world’s data created in the
last two years.
 Volume – from 8 ZB in 2015 expected to grow
to 40 ZB by 2020.
 Variety – 80% of enterprise data unstructured
ranging from docs, emails, images, web logs,
sensor data, geospatial coordinates and server
logs.
Big Data Use Cases
Source: https://hortonworks.com
Hadoop – An integral part of modern Data Architecture
Source: https://hortonworks.com
Hortonworks Hadoop Platform - HDP
www.hortonworks.com
• Batch Processing using Map Reduce Framework
• Interactive SQL Query using Hive on Tez framework.
• Apache Pig scripting language can run on MR or Tez.
• Low latency data access via NoSQL database Hbase.
• Apache Storm processes and analyze streams of data
in real time as it flows into HDFS
• Apache Spark is a fast, in-memory data processing
engine that enables batch, real-time, and advanced
analytics on the Apache Hadoop platform.
HDP - Data Access Layer
www.hortonworks.com
Ingest Data into HDFS using Scoop
▫ The primary use case:
 Stream log entries from multiple machines
 Aggregate them to a centralized, persistent
store such as the Hadoop Distributed File
System
 Log entries can be analyzed by other Hadoop
tools.
▫ Flume is not limited to log entries.
 Flume is used to collect many types of
streaming data.
 Examples include network traffic data, social
media generated data, machine sensor data, and
email messages.
▫ Flume is not the best choice where data is not
regularly generated.
Ingest Data into HDFS using Flume
• Use the Twitter streaming API as the source
• Create a twitter application
• Configure the flume agent by modifying the flume
configuration.
▫ Configure the source, channel and sink.
▫ Source type:
org.apache.flume.source.twitter.TwitterSource
▫ Channel type: MemChannel
▫ Sink type : HDFS
• Run the flume command to extract data from
twitter.
for example
$ flume-ng agent --conf ./conf/ -f conf/twitter.conf
Importing Twitter data into HDFS
Query Data using Hive
Example Hive QL commands
 Create a Hive managed table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT,
change FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’;
 Create a Hive external table:
CREATE EXTERNAL TABLE salaries (gender string, age int, salary
double,zip int
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',‘
LOCATION '/user/train/salaries/';
 Load data from file in HDFS:
LOAD DATA INPATH ‘/user/me/stockdata.csv’
OVERWRITE INTO TABLE stockinfo;
 View everything in the table:
SELECT * from stockinfo;
Performance tuning in Hive
• Hive Partition table
• Hive Buckets
• Use Optimized Row Columnar (ORC) Format storage
• Cost Based SQL Optimization
• Using Hive on Tez for low latency query
Use cases for Apache Pig
• Pig can extract data from multiple sources, transform it and store it in HDFS.
• Research raw data.
• Iterative data processing
database
data
log
data
sensor
data
transform HDFS
extract transform load
Hive
other
tools
PIG
analysis
tools
 Load data from a file and apply a schema:
stockinfo = LOAD ‘stockdata.csv’ using PigStorage(‘,’) AS
(symbol STRING, price FLOAT, change FLOAT) ;
 Display the data in stockinfo:
DUMP stockinfo;
 Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == ‘IBM’);
STORE IBM_only INTO ‘ibm_stockinfo’;
 Load data from a file without applying a schema
a = LOAD ‘flightdelays’ using PigStorage(‘,’);
 Apply schema on read
c = foreach a generate $0 as year:int, $1 as month:int,
$4 as name:chararray;
Example Pig Statements
Create workflow using Apache Oozie
email
distcp
MapReduce
Hive
PigSqoop
Oozie workflow example
data data
Apache Oozie is a server-based workflow engine
used to execute Hadoop jobs.
Used to build and schedule complex data
transformations by combining MapReduce,
Apache Hive, Apache Pig, and Apache Sqoop
jobs into a single, logical unit of work.
Oozie can also perform Java, Linux shell,
distcp, SSH, email, and other operations.
Oozie runs as a Java Web application in
Apache Tomcat.
Use Case -Data warehouse Optimization with Hadoop
Hadoop data access layer v4.0

Weitere ähnliche Inhalte

Was ist angesagt?

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryDataWorks Summit/Hadoop Summit
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperVasu S
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Spark Summit
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"DataConf
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...Cloudera, Inc.
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 

Was ist angesagt? (20)

Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data Discovery
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Loan Decisioning Transformation
Loan Decisioning TransformationLoan Decisioning Transformation
Loan Decisioning Transformation
 
Hadoop
HadoopHadoop
Hadoop
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | Whitepaper
 
Digital Transformation with Microsoft Azure
Digital Transformation with Microsoft AzureDigital Transformation with Microsoft Azure
Digital Transformation with Microsoft Azure
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
Eugene Polonichko "Azure Data Lake: what is it? why is it? where is it?"
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Introduction to Azure HDInsight
Introduction to Azure HDInsightIntroduction to Azure HDInsight
Introduction to Azure HDInsight
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 

Ähnlich wie Hadoop data access layer v4.0

GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxinfinix8
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Joan Novino
 
Data ingestion
Data ingestionData ingestion
Data ingestionnitheeshe2
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...NashvilleTechCouncil
 

Ähnlich wie Hadoop data access layer v4.0 (20)

GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
מיכאל
מיכאלמיכאל
מיכאל
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Atul Mithe
Atul MitheAtul Mithe
Atul Mithe
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016Azure Cafe Marketplace with Hortonworks March 31 2016
Azure Cafe Marketplace with Hortonworks March 31 2016
 
Data ingestion
Data ingestionData ingestion
Data ingestion
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
The Nuts and Bolts of Hadoop and it's Ever-changing Ecosystem, Presented by J...
 

Mehr von SpringPeople

Growth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can tryGrowth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can trySpringPeople
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataSpringPeople
 
Introduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaSIntroduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaSSpringPeople
 
Introduction to Selenium WebDriver
Introduction to Selenium WebDriverIntroduction to Selenium WebDriver
Introduction to Selenium WebDriverSpringPeople
 
Introduction to Open stack - An Overview
Introduction to Open stack - An Overview Introduction to Open stack - An Overview
Introduction to Open stack - An Overview SpringPeople
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...SpringPeople
 
Why 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoftWhy 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoftSpringPeople
 
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorialsMongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorialsSpringPeople
 
Mastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium SuccessfullyMastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium SuccessfullySpringPeople
 
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...SpringPeople
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople
 
SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?SpringPeople
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaSpringPeople
 
Introduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeopleIntroduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeopleSpringPeople
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleSpringPeople
 
Introduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeopleIntroduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeopleSpringPeople
 
Introduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeopleIntroduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeopleSpringPeople
 
Introduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeopleIntroduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeopleSpringPeople
 
Introduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeopleIntroduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeopleSpringPeople
 

Mehr von SpringPeople (20)

Growth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can tryGrowth hacking tips and tricks that you can try
Growth hacking tips and tricks that you can try
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaSIntroduction to Microsoft Azure IaaS
Introduction to Microsoft Azure IaaS
 
Introduction to Selenium WebDriver
Introduction to Selenium WebDriverIntroduction to Selenium WebDriver
Introduction to Selenium WebDriver
 
Introduction to Open stack - An Overview
Introduction to Open stack - An Overview Introduction to Open stack - An Overview
Introduction to Open stack - An Overview
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
 
Why 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoftWhy 2 million Developers depend on MuleSoft
Why 2 million Developers depend on MuleSoft
 
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorialsMongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
Mongo DB: Fundamentals & Basics/ An Overview of MongoDB/ Mongo DB tutorials
 
Mastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium SuccessfullyMastering Test Automation: How To Use Selenium Successfully
Mastering Test Automation: How To Use Selenium Successfully
 
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
An Introduction of Big data; Big data for beginners; Overview of Big Data; Bi...
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?SpringPeople - Devops skills - Do you have what it takes?
SpringPeople - Devops skills - Do you have what it takes?
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Introduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeopleIntroduction To Core Java - SpringPeople
Introduction To Core Java - SpringPeople
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeople
 
Introduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeopleIntroduction To Cloud Foundry - SpringPeople
Introduction To Cloud Foundry - SpringPeople
 
Introduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeopleIntroduction To Spring Enterprise Integration - SpringPeople
Introduction To Spring Enterprise Integration - SpringPeople
 
Introduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeopleIntroduction To Groovy And Grails - SpringPeople
Introduction To Groovy And Grails - SpringPeople
 
Introduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeopleIntroduction To Jenkins - SpringPeople
Introduction To Jenkins - SpringPeople
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Hadoop data access layer v4.0

  • 1. Presented by Stephen Peter The Hadoop Data Access Layer
  • 2. Stephen Peter E-Mail: Stephen.peter@gmail.com LinkedIn - https://in.linkedin.com/in/stephenepeter Hortonworks Certified Trainer. Hortonworks Certified Developer (Apache Pig & Hive) Digital Badge : http://bcert.me/sxohnqiq Professional Experience: Over 20 years of IT experience with specialization in Business Intelligence , Data warehousing and Big Data. Worked in organizations such as HCL Tech, Oracle , Cisco Systems. Presently working as Hadoop trainer at Spring People. Area of interest: coexistence of Enterprise DW and Hadoop Introduction
  • 3. • The motivation for Hadoop ▫ The need for ingesting, storing and analyzing big data. ▫ Use cases on the value of Big Data. • Hadoop as an integral part of Modern Data Architecture. • The HDP (Hortonworks Data Platform) reference architecture. ▫ HDP Data Access Layer.  The different components its functions and application. • Use case – Data warehouse Optimization using Hadoop. ▫ to achieve better insight and cost effectiveness. Agenda
  • 4. Emerging Data landscape • In the past the world’s data doubled every century, now its every 2 years. • The flood of data is driven by IOT, mobile devices, server logs, geo location coordinates, social media and sensor data. • Big data is characterized by:  Velocity – 90% of world’s data created in the last two years.  Volume – from 8 ZB in 2015 expected to grow to 40 ZB by 2020.  Variety – 80% of enterprise data unstructured ranging from docs, emails, images, web logs, sensor data, geospatial coordinates and server logs.
  • 5. Big Data Use Cases Source: https://hortonworks.com
  • 6. Hadoop – An integral part of modern Data Architecture Source: https://hortonworks.com
  • 7. Hortonworks Hadoop Platform - HDP www.hortonworks.com
  • 8. • Batch Processing using Map Reduce Framework • Interactive SQL Query using Hive on Tez framework. • Apache Pig scripting language can run on MR or Tez. • Low latency data access via NoSQL database Hbase. • Apache Storm processes and analyze streams of data in real time as it flows into HDFS • Apache Spark is a fast, in-memory data processing engine that enables batch, real-time, and advanced analytics on the Apache Hadoop platform. HDP - Data Access Layer www.hortonworks.com
  • 9. Ingest Data into HDFS using Scoop
  • 10. ▫ The primary use case:  Stream log entries from multiple machines  Aggregate them to a centralized, persistent store such as the Hadoop Distributed File System  Log entries can be analyzed by other Hadoop tools. ▫ Flume is not limited to log entries.  Flume is used to collect many types of streaming data.  Examples include network traffic data, social media generated data, machine sensor data, and email messages. ▫ Flume is not the best choice where data is not regularly generated. Ingest Data into HDFS using Flume
  • 11. • Use the Twitter streaming API as the source • Create a twitter application • Configure the flume agent by modifying the flume configuration. ▫ Configure the source, channel and sink. ▫ Source type: org.apache.flume.source.twitter.TwitterSource ▫ Channel type: MemChannel ▫ Sink type : HDFS • Run the flume command to extract data from twitter. for example $ flume-ng agent --conf ./conf/ -f conf/twitter.conf Importing Twitter data into HDFS
  • 13. Example Hive QL commands  Create a Hive managed table: CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’;  Create a Hive external table: CREATE EXTERNAL TABLE salaries (gender string, age int, salary double,zip int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',‘ LOCATION '/user/train/salaries/';  Load data from file in HDFS: LOAD DATA INPATH ‘/user/me/stockdata.csv’ OVERWRITE INTO TABLE stockinfo;  View everything in the table: SELECT * from stockinfo;
  • 14. Performance tuning in Hive • Hive Partition table • Hive Buckets • Use Optimized Row Columnar (ORC) Format storage • Cost Based SQL Optimization • Using Hive on Tez for low latency query
  • 15. Use cases for Apache Pig • Pig can extract data from multiple sources, transform it and store it in HDFS. • Research raw data. • Iterative data processing database data log data sensor data transform HDFS extract transform load Hive other tools PIG analysis tools
  • 16.  Load data from a file and apply a schema: stockinfo = LOAD ‘stockdata.csv’ using PigStorage(‘,’) AS (symbol STRING, price FLOAT, change FLOAT) ;  Display the data in stockinfo: DUMP stockinfo;  Filter the stockinfo data and write the filtered data to HDFS: IBM_only = FILTER stockinfo BY (symbol == ‘IBM’); STORE IBM_only INTO ‘ibm_stockinfo’;  Load data from a file without applying a schema a = LOAD ‘flightdelays’ using PigStorage(‘,’);  Apply schema on read c = foreach a generate $0 as year:int, $1 as month:int, $4 as name:chararray; Example Pig Statements
  • 17. Create workflow using Apache Oozie email distcp MapReduce Hive PigSqoop Oozie workflow example data data Apache Oozie is a server-based workflow engine used to execute Hadoop jobs. Used to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache Pig, and Apache Sqoop jobs into a single, logical unit of work. Oozie can also perform Java, Linux shell, distcp, SSH, email, and other operations. Oozie runs as a Java Web application in Apache Tomcat.
  • 18. Use Case -Data warehouse Optimization with Hadoop