Analyzing Hadoop Using Hadoop
DataWorks Summit
Analyzing Hadoop Using Hadoop Sheetal Dolas Principal Architect, Hortonworks
1.
Analyzing Hadoop Using Hadoop
15 Apr 2015
Sheetal Dolas, Principal Architect, Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
2.
Who am I?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business problems
• Last 5+ years in Big Data, including Hadoop, Storm, etc.
• Co-developed Cisco OpenSOC (http://opensoc.github.io)
sheetal@hortonworks.com
@sheetal_dolas
3.
Agenda
• Need for operational insights
• Challenges
• Data sets available
• Using Hadoop to analyze itself
• Sample reports
• Q and A
4.
Need for Operational Insights
5.
Need for Metrics Analysis
• Metrics can reveal the story of your cluster
• They help you understand workload characteristics
o Reveal the pain points
o Clear up misconceptions
o Drive towards an action plan
• Operational insights are critical for SLA management because they help improve
o System reliability
o Uptime
o Performance
o Security
6.
Hadoop Metrics Challenges
• Hadoop generates a lot of metrics
o Host metrics (CPU, memory, disk, network)
o Service metrics (JVM metrics, GC, transactions, performance)
o Service reports (fsck, lsr, dfsadmin, audit logs)
o Job metrics (resource utilization, data processed, performance)
• Understanding and analyzing them is overwhelming
• No existing tool addresses the whole spectrum well enough
• Making sense of them requires a deeper understanding of the technology
7.
Metrics can appear like this conversation
[Illustration: a Hadoop Expert and a Hadoop Newbie exchanging "hmm…", "hmmmm…", "hah!", "ahem!", "ahh!", "eh?"]
8.
Metrics can appear like this conversation: you know all the words and their meaning
9.
Metrics can appear like this conversation: but you still don't get the meaning of the conversation
10.
We need tools that help extract meaning out of it
[Illustration: the same Hadoop Expert and Hadoop Newbie conversation, with each interjection expanded]
• hmm… = Hadoop has Magnificent Metrics
• hmmmm… = Hadoop Metrics Make Me Mad
• hah! = Hadoop Analyzes Hadoop
• ahem! = Analyze Hadoop Easily in Minutes
• ahh! = Awesome! Hail Hadoop!
• eh? = Elucidative Hadoop?
11.
Datasets available
12.
Datasets available for analysis
• MapReduce job history log
• HDFS lsr report
• HDFS audit log
13.
MapReduce Job History Log
14.
Job history log
• Stored on HDFS
• Contains all the events that occurred in a job, plus the event metadata
• Has its own format
o Can be parsed using the Rumen API
15.
Analyzing MapReduce Job history log
(1) Periodically read the job history logs from HDFS
(2) Parse the logs (job log parsing with Rumen), compute the job resource data, and write it back to Hive
(3) Query the data through a preferred interface (JDBC clients, ODBC clients, Hive CLI)
[Diagram: Hadoop cluster (HDFS, YARN, Tez, Hive) feeding the job log parsing (Rumen) and job resource computation step; analysis through JDBC clients, ODBC clients and the Hive CLI]
16.
Sample Reports
17.
CPU Utilization
[Pie chart: CPU Utilization - By Queue - Week To Date. Slice labels: 33%, 28%, 25%, 8%, 3%, 3%, 0%, 0%, 0%. Legend: productintelligence, cfld, adhoc, hive, techsupport, mnm, webhcat, infosecurity, prodintel_small]
18.
Disk IO
[Pie chart: Data IO (GB) - By User - Yesterday. Slice labels: 53%, 33%, 7%, 7%, 0%. Legend (User Ids): katharine.matsumoto, hadoop_sa, ebrown, mzang, justin.meyer, jmarquez, nbhupalam, rchakravarthy, pyan, rchirala, thomas.cox]
19.
Workload Distribution Through Hour of Day
[Three charts, each plotted against job submission hour: Number of jobs submitted; Number of tasks submitted; Total data processed (GB)]
20.
Workload Distribution Through Day of Week
[Three charts, each with the x-axis labelled "Job submission hour" on the slide: Number of jobs submitted; Number of tasks submitted; Total data processed (GB)]
21.
Job Type and Status
[Chart: Job Distribution By Type - Yesterday. Values: 73, 143, 199. Legend: Hive, MapReduce, Pig]
[Pie chart: Job Distribution By Status - Yesterday. SUCCEEDED 98%, FAILED 2%, KILLED 0%]
22.
Top 5 long running jobs - Yesterday
Job Id | Job Name | User Name | Queue Name | Job Duration
job_1409197939494_7043 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 16 h 33 m 15 s
job_1409197939494_7629 | PigLatin:LTF:09:12:Job3 | john_s | infosecurity | 1 d 8 h 40 m 42 s
job_1409197939494_7243 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 6 h 54 m 56 s
job_1409197939494_7042 | PigLatin:mbl_chtr_android_metrics.pig | hadoop_sa | hive | 1 d 3 h 37 m 30 s
job_1409197939494_7328 | INSERT INTO TABLE com...ILE__NAME,'.')[5])(Stage-1) | hadoop_sa | hive | 1 d 1 h 28 m 35 s
23.
Top 5 long waiting jobs - Yesterday
Job Id | Job Name | User Name | Queue Name | Job Submission Wait
job_1409197939494_7621 | ODS_S.ODS_LOG_ORG_TYP_METRICS.jar | joy_d | cfld | 5 h 39 m 38 s
job_1409197939494_8222 | PigLatin:LTF:09:15:Job3 | john_s | infosecurity | 5 h 19 m 46 s
job_1409197939494_8357 | PigLatin:LTF:09:19:Job9 | raj_s | mnm | 5 h 18 m 47 s
job_1409197939494_7622 | PigLatin:Log_U_Org_Metrics.pig | katherine_d | cfld | 5 h 11 m 12 s
job_1409197939494_8071 | PigLatin:LTF:09:16:Job10 | raj_s | mnm | 5 h 4 m
24.
Top 5 resource consuming jobs
Job Id | Total maps | Total reduces | Requested map GB | Requested reduce GB | Total memory blocked by job (GB)
job_1403277400645_1400 | 27,358 | 6 | 4 | 4 | 109,456
job_1403277400645_1423 | 27,358 | 3 | 4 | 4 | 109,444
job_1403277400645_1745 | 5,581 | 1 | 4 | 4 | 22,328
job_1403277400645_1497 | 1,807 | 0 | 4 | 4 | 7,228
job_1403277400645_1564 | 1,794 | 0 | 4 | 4 | 7,176
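The "total memory blocked" column in this report is consistent with a simple product: maps times requested map GB, plus reduces times requested reduce GB. The slide does not state the formula, so the Python sketch below treats it as an assumption inferred from the numbers; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobContainers:
    """Container counts and per-container memory requests for one job."""
    job_id: str
    total_maps: int
    total_reduces: int
    map_gb: int      # memory requested per map container, in GB
    reduce_gb: int   # memory requested per reduce container, in GB


def memory_blocked_gb(job: JobContainers) -> int:
    # Assumed formula (inferred from the report, not stated on the slide):
    # every map and reduce container blocks its full memory request.
    return job.total_maps * job.map_gb + job.total_reduces * job.reduce_gb


# First three rows of the report.
jobs = [
    JobContainers("job_1403277400645_1400", 27358, 6, 4, 4),
    JobContainers("job_1403277400645_1423", 27358, 3, 4, 4),
    JobContainers("job_1403277400645_1745", 5581, 1, 4, 4),
]
blocked = {job.job_id: memory_blocked_gb(job) for job in jobs}
```

Under this assumption the first job blocks 27,358 × 4 + 6 × 4 = 109,456 GB, which matches the report.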
25.
Showback Reports
Queue Name | Total CPU Hours Used | CPU Cost | Total Memory GB Hours Blocked | Memory Cost | Total Data IO GB | Data IO Cost | Total Network IO GB | Network IO Cost | Total Cost
adhoc | 4,422.94 | 17.69 | 20,404.09 | 81.62 | 70,918.29 | 1,418.37 | 394.01 | 7.88 | $1,525.56
cfld | 41,038.93 | 164.16 | 150,130.90 | 600.52 | 446,762.29 | 8,935.25 | 7,258.97 | 145.18 | $9,845.11
hive | 73,322.16 | 293.29 | 372,560.04 | 1,490.24 | 977,333.05 | 19,546.66 | 90,800.40 | 1,816.01 | $23,146.20
infosecurity | 23,476.46 | 93.91 | 77,515.34 | 310.06 | 293,616.02 | 5,872.32 | 7,458.77 | 149.18 | $6,425.47
mnm | 27,113.03 | 108.45 | 100,027.28 | 400.11 | 391,907.76 | 7,838.16 | 10,436.65 | 208.73 | $8,555.45
productintelligence | 74,113.17 | 296.45 | 158,423.62 | 633.69 | 851,435.74 | 17,028.71 | 10,456.78 | 209.14 | $18,167.99
techsupport | 34,037.16 | 136.15 | 100,904.89 | 403.62 | 400,972.22 | 8,019.44 | 7,120.19 | 142.40 | $8,701.61
Resource Pricing
• CPU cost per hour: $0.004
• Memory cost per GB per hour: $0.004
• Data IO cost per GB: $0.020
• Network IO cost per GB: $0.020
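The showback figures can be reproduced from the usage columns and the resource pricing. The following Python sketch assumes each cost is usage times unit price, rounded to cents, and that the total is the sum of the rounded costs; that reading matches the adhoc and cfld rows, but the slide does not spell it out. The dictionary keys and the function name are hypothetical.

```python
# Unit prices taken from the "Resource Pricing" box of the report.
PRICES = {
    "cpu_hours": 0.004,        # $ per CPU hour
    "memory_gb_hours": 0.004,  # $ per GB of memory per hour
    "data_io_gb": 0.020,       # $ per GB of data IO
    "network_io_gb": 0.020,    # $ per GB of network IO
}


def showback(usage: dict) -> dict:
    """Return the cost per resource and the total, each rounded to cents."""
    costs = {name: round(usage[name] * price, 2) for name, price in PRICES.items()}
    # Assumption: the total is the sum of the already rounded component costs.
    costs["total"] = round(sum(costs.values()), 2)
    return costs


# Usage figures of the adhoc queue, copied from the report.
adhoc = {
    "cpu_hours": 4422.94,
    "memory_gb_hours": 20404.09,
    "data_io_gb": 70918.29,
    "network_io_gb": 394.01,
}
adhoc_costs = showback(adhoc)
```

Keeping the prices in one table makes it easy to re-run the report with different rates per resource.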
26.
HDFS lsr report
27.
HDFS lsr report
• lsr is a recursive file listing
• Contains metadata about files
o Permissions
o Owner & Group
o Replication factor
o File size
o Last modified date and time
o File path
-------------------------------------------------------------------------------------------------------
| Permissions | rep factor | user | group | size | date | time | file path |
-------------------------------------------------------------------------------------------------------
drwx------   - sheetal etl_users        0 2014-12-13 01:18 /user/sheetal/analytics
-rw-r--r--   3 sheetal etl_users 15552642 2014-12-13 01:18 /user/sheetal/analytics/server.log
28.
Analyzing lsr report

[Diagram: the HDFS lsr report is loaded into Hive on the Hadoop cluster (Tez, HDFS, YARN, Hive) and analyzed through JDBC clients, ODBC clients, or the Hive CLI]

1. Periodically generate the lsr report:
    hdfs dfs -lsr /
2. Load it into Hive:
    load data local inpath '/tmp/lsr.txt' overwrite into table lsr
3. Query the data through a preferred interface
29.
HDFS lsr report – Hive Table Definition

    CREATE EXTERNAL TABLE lsr (
      permissions STRING,
      replication STRING,
      owner       STRING,
      group       STRING,
      size        STRING,
      date        STRING,
      time        STRING,
      file_path   STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)"
    );
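A quick way to check such a pattern before creating the table is to try it on sample lines outside Hive. The Python sketch below uses the non-whitespace and whitespace classes `\S` and `\s` with single backslashes (inside a Hive string literal each backslash is doubled); `parse_lsr_line` is a hypothetical helper, and the sample line is the one from the lsr report slide.

```python
import re

# Eight capture groups, one per table column; the last group takes the rest
# of the line so that file paths containing spaces stay in one field.
LSR_PATTERN = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)"
)
COLUMNS = ["permissions", "replication", "owner", "group",
           "size", "date", "time", "file_path"]


def parse_lsr_line(line):
    """Map one lsr output line to the table's columns, or None if it does not match."""
    match = LSR_PATTERN.fullmatch(line.strip())
    return dict(zip(COLUMNS, match.groups())) if match else None


row = parse_lsr_line(
    "-rw-r--r-- 3 sheetal etl_users 15552642 2014-12-13 01:18 "
    "/user/sheetal/analytics/server.log"
)
```

All values come back as strings, which is why the table declares every column as STRING and leaves type conversion to a later step.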
30.
HDFS lsr report – Hive View Definition

    CREATE VIEW lsr_view AS
    SELECT
      (CASE Substr(permissions, 1, 1)
         WHEN 'd' THEN 'DIR'
         ELSE 'FILE'
       END) AS file_type,
      permissions,
      (CASE replication
         WHEN '-' THEN 0
         ELSE Cast(replication AS INT)
       END) AS replication,
      owner,
      group,
      -- BIGINT rather than INT: file sizes above 2,147,483,647 bytes do not fit in an INT
      Cast(size AS BIGINT) AS size,
      date,
      time,
      file_path
    FROM lsr;
31.
Sample Reports
32.
Security Checks – Files Readable by All

    SELECT permissions, owner, file_path
    FROM lsr_view
    WHERE file_type = 'FILE'
      AND Substr(permissions, 8, 1) = 'r'
    LIMIT 3;

| Permissions | Owner | File Path |
|-------------|-------|-----------|
| -rwxr-xr-x | sheetal | /user/sheetal/analytics/finance_report/000001_0 |
| -rwxr-xr-x | joe_lee | /apps/hive/warehouse/sales.db/sales/date=2014-08-17/000001_1 |
| -rw-r--r-- | sales_etl | /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt |
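The check relies on the fixed layout of the mode string: one type character, then three characters each for owner, group, and other. Hive's `Substr` is 1-based, so position 8 is the read bit for "other". A short Python sketch of the same test (0-based indexing; `is_world_readable` is a hypothetical helper, and the last sample mode is made up for contrast):

```python
def is_world_readable(permissions):
    """True if the 'other' read bit is set in an ls-style mode string.

    Layout: index 0 is the type flag, 1-3 owner, 4-6 group, 7-9 other.
    Substr(permissions, 8, 1) in Hive corresponds to permissions[7] here.
    """
    return len(permissions) >= 10 and permissions[7] == "r"


# The first three modes appear on the slides; "-rw-r-----" is illustrative.
modes = ["-rwxr-xr-x", "-rw-r--r--", "drwx------", "-rw-r-----"]
readable = [mode for mode in modes if is_world_readable(mode)]
```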
33.
Data loss risk – Files with low replication factor

    SELECT owner, replication, file_path
    FROM lsr_view
    WHERE file_type = 'FILE'
      AND file_path LIKE '/apps/hive/warehouse/%'
      AND replication < 3
    LIMIT 3;

| Owner | Replication | File Path |
|-------|-------------|-----------|
| elizabeth | 1 | /apps/hive/warehouse/sales_stg.db/order/order_summary.txt |
| sales_etl | 2 | /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt |
| john_smith | 1 | /apps/hive/warehouse/archive.db/report_d/000001_0 |
34.
Data storage by user

    SELECT owner, Sum(size) AS total_size
    FROM lsr_view
    WHERE file_type = 'FILE'
    GROUP BY owner
    ORDER BY total_size DESC;

[Pie chart: share of total storage by owner]

| Owner | Share of storage |
|-------|------------------|
| agrissia | 30% |
| albarma | 26% |
| blackupli | 15% |
| blackwardap | 8% |
| brilliantbox | 7% |
| bumpkin | 5% |
| catstoopshard | 4% |
| cozyboyal | 2% |
| fallenvivala | 2% |
| fonetter | 1% |
35.
Small Files

    SELECT relative_size, Count(1) AS total
    FROM (SELECT (CASE size < 134217728
                    WHEN true THEN 'small'
                    ELSE 'large'
                  END) AS relative_size
          FROM lsr_view
          WHERE file_type = 'FILE') tmp
    GROUP BY relative_size;

[Pie chart: large 10%, small 90%]

    SELECT Avg(size)
    FROM lsr_view
    WHERE file_type = 'FILE';

Average file size: 61,305,522 bytes
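The threshold 134217728 is 128 × 1024 × 1024 bytes, that is 128 MiB, a common HDFS block size; the slide does not say so explicitly, so treat that interpretation as an assumption. The classification can be sketched in Python as follows (`relative_size` is a hypothetical helper; only the first sample size comes from the slides):

```python
from collections import Counter

# 128 MiB, the same value as the literal 134217728 in the query.
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024


def relative_size(size_bytes):
    """Label a file 'small' if it is strictly below the threshold, else 'large'."""
    return "small" if size_bytes < SMALL_FILE_THRESHOLD else "large"


# 15552642 is the server.log size from the lsr sample; the rest are illustrative.
sizes = [15552642, 134217728, 500, 268435456]
counts = Counter(relative_size(size) for size in sizes)
```

A file of exactly 134,217,728 bytes counts as large, matching the strict `<` in the query.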
36.
HDFS Audit Logs
37.
HDFS audit logs
• Can be enabled by setting the audit log level to INFO
• Every HDFS access request is logged
• Contains metadata about access requests
  o User name (actual user and proxy user, if any)
  o IP address (where the request came from)
  o Action (command)
  o File name (source and destination files involved)

| Date | Time | Status | User | Auth Type | IP Address | Command | Src Path | Dest Path | Perms |
|------|------|--------|------|-----------|------------|---------|----------|-----------|-------|
| 2014-11-19 | 23:54:57,083 | allowed=true | ugi=hdfs | (auth:SIMPLE) | ip=/10.10.150.103 | cmd=listStatus | src=/mr-history/tmp | dst=null | perm=null |
38.
Analyzing HDFS Audit Log

[Diagram: HDFS audit logs are loaded into Hive on the Hadoop cluster (Tez, HDFS, YARN, Hive) and analyzed through JDBC clients, ODBC clients, or the Hive CLI]

1. The audit log is generated during normal operation of HDFS
2. Periodically load it into Hive:
    load data local inpath '/log/Hadoop/hdfs/hdfs-audit.log.2014-11-19' into table hdfs_audit
3. Query the data through a preferred interface
39.
HDFS Audit Log – Hive Table Definition

    CREATE EXTERNAL TABLE hdfs_audit (
      date                 STRING,
      time                 STRING,
      log_level            STRING,
      class                STRING,
      allowed              STRING,
      user                 STRING,
      auth_str             STRING,
      auth_type            STRING,
      proxy_user           STRING,
      proxy_user_auth_str  STRING,
      proxy_user_auth_type STRING,
      ip                   STRING,
      command              STRING,
      src_path             STRING,
      dest_path            STRING,
      permissions          STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+allowed=(\\S+)\\s+ugi=(\\S+)\\s+.auth:(\\S+)\\S\\s+(via (\\S+))?\\s*(.auth:(\\S+)\\S)?\\s*ip=.(\\S+)\\s+cmd=(\\S+)\\s+src=(\\S+)\\s+dst=(\\S+)\\s+perm=(\\S+)"
    );
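The pattern has 16 capture groups, one per column, and expects four whitespace-separated tokens (date, time, log level, logger class) before `allowed=`. The Python sketch below tries it on the sample line from the audit log slide; the `INFO FSNamesystem.audit:` tokens are an assumption added here, since the slide's sample omits the log level and class fields that the table declares.

```python
import re

# Same pattern as the table definition, with single backslashes for Python.
AUDIT_PATTERN = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+allowed=(\S+)\s+ugi=(\S+)\s+"
    r".auth:(\S+)\S\s+(via (\S+))?\s*(.auth:(\S+)\S)?\s*"
    r"ip=.(\S+)\s+cmd=(\S+)\s+src=(\S+)\s+dst=(\S+)\s+perm=(\S+)"
)
COLUMNS = [
    "date", "time", "log_level", "class", "allowed", "user",
    "auth_str", "auth_type", "proxy_user", "proxy_user_auth_str",
    "proxy_user_auth_type", "ip", "command", "src_path", "dest_path",
    "permissions",
]

# Slide sample, with an assumed log level and logger name inserted after the time.
line = (
    "2014-11-19 23:54:57,083 INFO FSNamesystem.audit: allowed=true "
    "ugi=hdfs (auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus "
    "src=/mr-history/tmp dst=null perm=null"
)

match = AUDIT_PATTERN.fullmatch(line)
record = dict(zip(COLUMNS, match.groups())) if match else None
```

With this group order, `auth_str` receives the authentication method (here `SIMPLE`), `auth_type` receives the whole optional `via ...` clause, and the proxy columns stay empty when no proxy user is involved.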
40.
Sample Reports
41.
Most Frequently Used Datasets

    SELECT src_path, Count(1) AS access_frequency
    FROM hdfs_audit
    GROUP BY src_path
    ORDER BY access_frequency DESC
    LIMIT 3;

| File Path | Access Frequency |
|-----------|------------------|
| /domains/drd/production/config/AnalysisModule02Signatures.log | 5,758,774 |
| /domains/drd/production/config/ANLCustAnalysisModule02Signatures.log | 5,754,181 |
| /domains/drd/production/config/DBFBlockCriteria.log | 4,816,841 |
42.
Datasets not read even once

    SELECT lsr.file_path AS file_path,
           lsr.date      AS creation_date,
           lsr.size      AS file_size
    FROM lsr_view lsr
    LEFT JOIN (SELECT Max(date), src_path
               FROM hdfs_audit
               WHERE command = 'open'
               GROUP BY src_path) audit
      ON (lsr.file_path = audit.src_path)
    WHERE lsr.file_type = 'FILE'
      AND audit.src_path IS NULL
    ORDER BY creation_date DESC
    LIMIT 3;

| File Path | Creation Date | File Size |
|-----------|---------------|-----------|
| /app/hive/warehouse/sales_stg.db/account/account_extract.txt | 2014-10-16 | 76,598,987,465 |
| /app/hive/warehouse/sales_stg.db/order/order_history.txt | 2014-11-26 | 901,341,097,342 |
| /app/hive/warehouse/sales_stg.db/catalog/catalog.txt | 2014-11-28 | 213,353,902,128 |
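The LEFT JOIN combined with `audit.src_path IS NULL` is an anti-join: it keeps only the files that have no matching `open` event. The same idea in a few lines of Python, as an illustration only (the function name and all sample paths are hypothetical):

```python
def never_opened(files, audit_events):
    """Return the paths in `files` that have no 'open' event in `audit_events`.

    files:        iterable of file paths (as listed by lsr)
    audit_events: iterable of (command, src_path) pairs (from the audit log)
    """
    opened = {src_path for command, src_path in audit_events if command == "open"}
    return [path for path in files if path not in opened]


# Hypothetical data for illustration only.
files = ["/data/a.txt", "/data/b.txt", "/data/c.txt"]
events = [
    ("open", "/data/a.txt"),
    ("listStatus", "/data/b.txt"),  # listed but never opened
    ("open", "/data/a.txt"),
]
unread = never_opened(files, events)
```

As with the query, "never read" only means "not opened within the audit logs that were loaded"; a file read before the earliest loaded log would still show up.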
43.
Potentially intrusive users

    SELECT user, Count(1) AS failed_attempts
    FROM hdfs_audit
    WHERE allowed != 'true'
    GROUP BY user
    ORDER BY failed_attempts DESC
    LIMIT 3;

| User | Failed Attempts |
|------|-----------------|
| ryan_m | 266 |
| drown_d | 238 |
| mac_t | 66 |
44.
Potentially malicious client hosts

    SELECT ip, Count(1) AS failed_attempts
    FROM hdfs_audit
    WHERE allowed != 'true'
    GROUP BY ip
    ORDER BY failed_attempts DESC
    LIMIT 3;

| IP Address | Failed Attempts |
|------------|-----------------|
| 10.20.147.245 | 1059 |
| 10.20.145.137 | 1021 |
| 10.20.146.203 | 1018 |
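Both reports follow the same pattern: filter the denied requests, group by one column, count, and keep the largest counts. A compact Python sketch of that pattern, with a hypothetical `top_failed` helper and made-up events:

```python
from collections import Counter


def top_failed(events, key, n=3):
    """Count denied requests per `key` ('user' or 'ip') and return the top n.

    events: iterable of dicts with an 'allowed' field plus the grouping key.
    Returns a list of (value, failed_attempts) pairs, highest count first.
    """
    counts = Counter(event[key] for event in events if event["allowed"] != "true")
    return counts.most_common(n)


# Hypothetical audit records for illustration only.
events = [
    {"allowed": "false", "user": "u1", "ip": "10.0.0.1"},
    {"allowed": "false", "user": "u1", "ip": "10.0.0.2"},
    {"allowed": "true",  "user": "u2", "ip": "10.0.0.1"},
    {"allowed": "false", "user": "u2", "ip": "10.0.0.2"},
    {"allowed": "false", "user": "u1", "ip": "10.0.0.2"},
]
by_user = top_failed(events, "user")
by_ip = top_failed(events, "ip")
```

Sorting before truncating matters: taking the first n groups without ordering by the count would return an arbitrary subset rather than the worst offenders.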
45.
Summary
46.
Summary
• Hadoop generates lots of useful metrics
• Many of the datasets can be analyzed with little effort
  o Hive and Pig are great analytical tools
  o There are built-in SerDes/Loaders for many of the formats
• Simple analytics on the HDFS lsr report, HDFS audit logs, and job history can empower DevOps to manage their clusters better
47.
Thank you! Questions?