Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Analyzing Hadoop Using Hadoop
15 Apr 2015
Sheetal Dolas
Principal Architect, Hortonworks
Who am I?
• Principal Architect @ Hortonworks
• Most of my career has been in the field, solving real-life business problems
• Last 5+ years in Big Data including Hadoop, Storm etc.
• Co-developed Cisco OpenSOC ( http://opensoc.github.io )
sheetal@hortonworks.com
@sheetal_dolas
Agenda
• Need for operational insights
• Challenges
• Data sets available
• Using Hadoop to analyze itself
• Sample reports
• Q and A
Need for Operational Insights
Need for Metrics Analysis
• Metrics can reveal the story about your cluster
• They help you understand workload characteristics
o Reveal pain points
o Clear up misconceptions
o Drive toward an action plan
• Operational insights are critical for SLA management because they improve
o System Reliability
o Uptime
o Performance
o Security
Hadoop Metrics Challenges
• Hadoop generates a lot of metrics
o Host metrics (CPU, Memory, Disk, Network)
o Service metrics (JVM metrics, GC, Transactions, Performance)
o Service reports (fsck, lsr, dfsadmin, audit logs)
o Job metrics (Resource utilization, data processed, performance)
• Understanding and analyzing them is overwhelming
• No single tool covers the whole spectrum well enough
• Making sense of them requires a deep understanding of the technology
Metrics can appear like this conversation
[Cartoon: a Hadoop Expert and a Hadoop Newbie trade interjections: hmm…, hmmmm…, hah!, ahem!, ahh!, eh?]
You know all the words and their meaning
But still don’t get the meaning of the conversation
We need tools that help extract meaning out of it
Hadoop Expert and Hadoop Newbie, decoded:
hmm… = Hadoop has Magnificent Metrics
hmmmm… = Hadoop Metrics Make Me Mad
hah! = Hadoop Analyzes Hadoop
ahem! = Analyze Hadoop Easily in Minutes
ahh! = Awesome! Hail Hadoop!
eh? = Elucidative Hadoop?
Datasets available
Datasets available for analysis
• MapReduce job history log
• HDFS lsr report
• HDFS Audit log
MapReduce Job History Log
Job history log
• Stored on HDFS
• Contains all the events that occurred in a job, plus the event metadata
• Has its own format
o Can be parsed using the Rumen API
Analyzing MapReduce Job history log
[Diagram: job history logs flow from the Hadoop Cluster (Tez, HDFS, YARN, Hive) through Job Log Parsing (Rumen, Job Resource Computations) back into Hive, which the Analysis clients (JDBC Clients, ODBC Clients, Hive CLI) query]
1. Periodically read the job history logs from HDFS
2. Parse the logs, compute data, and write it back to Hive
3. Query data through a preferred interface
Sample Reports
CPU Utilization
[Pie chart: CPU Utilization - By Queue - Week To Date. Legend: productintelligence, cfld, adhoc, hive, techsupport, mnm, webhcat, infosecurity, prodintel_small. Slice labels: 33%, 28%, 25%, 8%, 3%, 3%, 0%, 0%, 0%]
Disk IO
[Pie chart: Data IO (GB) - By User - Yesterday. Legend (User Ids): katharine.matsumoto, hadoop_sa, ebrown, mzang, justin.meyer, jmarquez, nbhupalam, rchakravarthy, pyan, rchirala, thomas.cox. Slice labels: 53%, 33%, 7%, 7%, 0%]
Workload Distribution Through Hour of Day
[Three charts, each plotted against job submission hour (0 to 22):
Number of jobs submitted (y-axis 0 to 350);
Number of tasks submitted (y-axis 0 to 900,000);
Total data processed in GB (y-axis 0 to 500,000)]
Workload Distribution Through Day of Week
[Three charts, each with its x-axis labeled "Job submission hour" in the source:
Number of jobs submitted (y-axis 0 to 300);
Number of tasks submitted (y-axis 0 to 1,800,000);
Total data processed in GB (y-axis 0 to 700,000)]
Job Type and Status
[Pie chart: Job Distribution By Type - Yesterday. Legend: Hive, MapReduce, Pig. Slice labels: 73, 143, 199]
[Pie chart: Job Distribution By Status - Yesterday. SUCCEEDED 98%, FAILED 2%, KILLED 0%]
Top 5 long running jobs - Yesterday
| Job Id | Job Name | User Name | Queue Name | Job Duration |
| job_1409197939494_7043 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 16 h 33 m 15 s |
| job_1409197939494_7629 | PigLatin:LTF:09:12:Job3 | john_s | infosecurity | 1 d 8 h 40 m 42 s |
| job_1409197939494_7243 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 6 h 54 m 56 s |
| job_1409197939494_7042 | PigLatin:mbl_chtr_android_metrics.pig | hadoop_sa | hive | 1 d 3 h 37 m 30 s |
| job_1409197939494_7328 | INSERT INTO TABLE com...ILE__NAME,'.')[5])(Stage-1) | hadoop_sa | hive | 1 d 1 h 28 m 35 s |
Top 5 long waiting jobs - Yesterday
| Job Id | Job Name | User Name | Queue Name | Job Submission Wait |
| job_1409197939494_7621 | ODS_S.ODS_LOG_ORG_TYP_METRICS.jar | joy_d | cfld | 5 h 39 m 38 s |
| job_1409197939494_8222 | PigLatin:LTF:09:15:Job3 | john_s | infosecurity | 5 h 19 m 46 s |
| job_1409197939494_8357 | PigLatin:LTF:09:19:Job9 | raj_s | mnm | 5 h 18 m 47 s |
| job_1409197939494_7622 | PigLatin:Log_U_Org_Metrics.pig | katherine_d | cfld | 5 h 11 m 12 s |
| job_1409197939494_8071 | PigLatin:LTF:09:16:Job10 | raj_s | mnm | 5 h 4 m |
Top 5 resource consuming jobs
| job_id | Total maps | Total reduces | Requested map GB | Requested reduce GB | Total memory blocked by job GB |
| job_1403277400645_1400 | 27,358 | 6 | 4 | 4 | 109,456 |
| job_1403277400645_1423 | 27,358 | 3 | 4 | 4 | 109,444 |
| job_1403277400645_1745 | 5,581 | 1 | 4 | 4 | 22,328 |
| job_1403277400645_1497 | 1,807 | 0 | 4 | 4 | 7,228 |
| job_1403277400645_1564 | 1,794 | 0 | 4 | 4 | 7,176 |
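The last column appears to follow from the others: if every map and reduce task holds its full requested memory, the total is maps × map GB + reduces × reduce GB. The sketch below assumes that formula, which reproduces the figures in the report. The function name is illustrative and not from the deck.

```python
def memory_blocked_gb(total_maps, total_reduces, map_gb, reduce_gb):
    """Memory reserved by a job, assuming every task holds its full request."""
    return total_maps * map_gb + total_reduces * reduce_gb


# Rows from the report: (job_id, maps, reduces, requested map GB, requested reduce GB)
jobs = [
    ("job_1403277400645_1400", 27358, 6, 4, 4),
    ("job_1403277400645_1423", 27358, 3, 4, 4),
    ("job_1403277400645_1745", 5581, 1, 4, 4),
]

for job_id, maps, reduces, map_gb, reduce_gb in jobs:
    print(job_id, memory_blocked_gb(maps, reduces, map_gb, reduce_gb))
```

Under this assumption, a job with many small map tasks can block far more memory than its data volume suggests.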
Showback Reports
| Queue Name | Total CPU Hours Used | CPU Cost | Total Memory GB Hours Blocked | Memory Cost | Total Data IO GB | Data IO Cost | Total Network IO GB | Network IO Cost | Total Cost |
| adhoc | 4,422.94 | 17.69 | 20,404.09 | 81.62 | 70,918.29 | 1,418.37 | 394.01 | 7.88 | $1,525.56 |
| cfld | 41,038.93 | 164.16 | 150,130.90 | 600.52 | 446,762.29 | 8,935.25 | 7,258.97 | 145.18 | $9,845.11 |
| hive | 73,322.16 | 293.29 | 372,560.04 | 1,490.24 | 977,333.05 | 19,546.66 | 90,800.40 | 1,816.01 | $23,146.20 |
| infosecurity | 23,476.46 | 93.91 | 77,515.34 | 310.06 | 293,616.02 | 5,872.32 | 7,458.77 | 149.18 | $6,425.47 |
| mnm | 27,113.03 | 108.45 | 100,027.28 | 400.11 | 391,907.76 | 7,838.16 | 10,436.65 | 208.73 | $8,555.45 |
| productintelligence | 74,113.17 | 296.45 | 158,423.62 | 633.69 | 851,435.74 | 17,028.71 | 10,456.78 | 209.14 | $18,167.99 |
| techsupport | 34,037.16 | 136.15 | 100,904.89 | 403.62 | 400,972.22 | 8,019.44 | 7,120.19 | 142.40 | $8,701.61 |
Resource Pricing
CPU Cost Per Hour: $0.004
Memory Cost Per GB Per Hour: $0.004
Data IO Cost Per GB: $0.020
Network IO Cost Per GB: $0.020
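Each cost column is the usage figure multiplied by its unit price, and the total is the sum of the four costs. The sketch below checks this against the adhoc row, assuming each cost is rounded to cents before summing. The function and constant names are illustrative.

```python
# Unit prices from the Resource Pricing list
CPU_COST_PER_HOUR = 0.004
MEMORY_COST_PER_GB_HOUR = 0.004
DATA_IO_COST_PER_GB = 0.020
NETWORK_IO_COST_PER_GB = 0.020


def showback(cpu_hours, memory_gb_hours, data_io_gb, network_io_gb):
    """Return per-resource costs and their total, rounded to cents."""
    costs = {
        "cpu": round(cpu_hours * CPU_COST_PER_HOUR, 2),
        "memory": round(memory_gb_hours * MEMORY_COST_PER_GB_HOUR, 2),
        "data_io": round(data_io_gb * DATA_IO_COST_PER_GB, 2),
        "network_io": round(network_io_gb * NETWORK_IO_COST_PER_GB, 2),
    }
    costs["total"] = round(sum(costs.values()), 2)
    return costs


# Usage figures of the "adhoc" queue row
print(showback(4422.94, 20404.09, 70918.29, 394.01))
```

With these prices, data IO dominates every queue's bill, so data volume is the first place to look when reducing cost.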
HDFS lsr report
• lsr is a recursive file listing
• Contains metadata about files
o Permissions
o Owner & Group
o Replication factor
o File size
o Last modified date and time
o File path
-------------------------------------------------------------------------------------------------------
|Permissions |rep factor | user | group | size | date | time| file path |
-------------------------------------------------------------------------------------------------------
drwx------ - sheetal etl_users 0 2014-12-13 01:18 /user/sheetal/analytics
-rw-r--r-- 3 sheetal etl_users 15552642 2014-12-13 01:18 /user/sheetal/analytics/server.log
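Each line of the listing has eight whitespace-separated fields, so a single regular expression can split it. A minimal Python sketch using the two sample lines above follows. The function and field names are illustrative, and paths containing newlines are not handled.

```python
import re

# Seven whitespace-delimited fields, then the path (which may contain spaces).
LSR_RE = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)"
)
FIELDS = ("permissions", "replication", "owner", "group",
          "size", "date", "time", "file_path")


def parse_lsr_line(line):
    """Split one lsr line into a dict, or return None if it does not match."""
    match = LSR_RE.fullmatch(line.strip())
    if match is None:
        return None
    record = dict(zip(FIELDS, match.groups()))
    record["is_dir"] = record["permissions"].startswith("d")
    # Directories report "-" instead of a replication factor
    rep = record["replication"]
    record["replication"] = 0 if rep == "-" else int(rep)
    record["size"] = int(record["size"])
    return record


sample = ("-rw-r--r-- 3 sheetal etl_users 15552642 "
          "2014-12-13 01:18 /user/sheetal/analytics/server.log")
print(parse_lsr_line(sample))
```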
Analyzing lsr report
[Diagram: Hadoop Cluster (Tez, HDFS, YARN, Hive), HDFS lsr report, Analysis (JDBC Clients, ODBC Clients, Hive CLI)]
1. Periodically generate the lsr report
hdfs dfs -lsr /
(-lsr is deprecated in Hadoop 2; hdfs dfs -ls -R / produces the same listing)
2. Load it into Hive
load data local inpath '/tmp/lsr.txt' overwrite into table lsr
3. Query data through a preferred interface
HDFS lsr report – Hive Table Definition
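-- `group`, `date` and `time` are backquoted because they are reserved words in newer Hive releases.
-- The regex needs one capture group per column; backslashes are doubled inside the HiveQL string.
CREATE EXTERNAL TABLE lsr (
  permissions STRING,
  replication STRING,
  owner STRING,
  `group` STRING,
  size STRING,
  `date` STRING,
  `time` STRING,
  file_path STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)"
) ;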
HDFS lsr report – Hive View Definition
-- size is cast to BIGINT because file sizes of 2 GB and above overflow INT.
CREATE VIEW lsr_view
AS
  SELECT ( CASE Substr(permissions, 1, 1)
             WHEN 'd' THEN 'DIR'
             ELSE 'FILE'
           END ) AS file_type,
         permissions,
         ( CASE replication
             WHEN '-' THEN 0
             ELSE Cast (replication AS INT)
           END ) AS replication,
         owner,
         `group`,
         Cast (size AS BIGINT) AS size,
         `date`,
         `time`,
         file_path
  FROM lsr ;
Sample Reports
Security Checks – Files Readable by All
SELECT permissions, owner, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND Substr(permissions, 8, 1) = 'r'
LIMIT 3;
Permissions Owner File Path
-rwxr-xr-x sheetal /user/sheetal/analytics/finance_report/000001_0
-rwxr-xr-x joe_lee /apps/hive/warehouse/sales.db/sales/date=2014-08-17/000001_1
-rw-r--r-- sales_etl /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
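The query relies on the fixed layout of the permission string: one type flag, then rwx triplets for owner, group and others. The read bit for others is therefore the 8th character. A small Python sketch of the same check follows. The function name is illustrative, and note that Python indexes from 0 while Hive's Substr counts from 1.

```python
def is_world_readable(permissions):
    """True when the 'others' read bit is set in an ls-style mode string.

    Layout: type flag, then rwx for owner, group, others. The others read
    bit is index 7 in Python (position 8 in Hive's Substr).
    """
    return len(permissions) >= 10 and permissions[7] == "r"


# Permission strings taken from the sample listing and report
for mode in ("-rwxr-xr-x", "-rw-r--r--", "drwx------"):
    print(mode, is_world_readable(mode))
```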
Data loss risk – Files with low replication factor
SELECT owner, replication, file_path
FROM lsr_view
WHERE file_type = 'FILE'
AND file_path LIKE '/apps/hive/warehouse/%'
AND replication < 3
LIMIT 3;
Owner Replication File Path
elizabeth 1 /apps/hive/warehouse/sales_stg.db/order/order_summary.txt
sales_etl 2 /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt
john_smith 1 /apps/hive/warehouse/archive.db/report_d/000001_0
Data storage by user
SELECT owner, Sum(size) AS total_size
FROM lsr_view
WHERE file_type = 'FILE'
GROUP BY owner
ORDER BY total_size DESC;
[Pie chart: share of total_size by owner: agrissia 30%, albarma 26%, blackupli 15%, blackwardap 8%, brilliantbox 7%, bumpkin 5%, catstoopshard 4%, cozyboyal 2%, fallenvivala 2%, fonetter 1%]
Small Files
-- 134217728 bytes = 128 MB
SELECT relative_size, Count(1) AS total
FROM (SELECT ( CASE
                 WHEN size < 134217728 THEN 'small'
                 ELSE 'large'
               END ) AS relative_size
      FROM lsr_view
      WHERE file_type = 'FILE') tmp
GROUP BY relative_size;
[Pie chart: small 90%, large 10%]
Average File Size
SELECT Avg(size)
FROM lsr_view
WHERE file_type = 'FILE';
> 61,305,522
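The threshold 134217728 is 128 × 1024 × 1024 bytes, that is, 128 MB. The sketch below applies the same classification in Python. The sizes are hypothetical and the function names are illustrative.

```python
from collections import Counter

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 134217728 bytes, as in the query


def classify(size_bytes, threshold=SMALL_FILE_THRESHOLD):
    """Label a file 'small' when it is strictly below the threshold."""
    return "small" if size_bytes < threshold else "large"


def size_distribution(sizes):
    """Count files per class, like the GROUP BY relative_size query."""
    return Counter(classify(size) for size in sizes)


# Hypothetical file sizes in bytes, for illustration only
sizes = [15552642, 4096, 134217728, 900000000]
print(size_distribution(sizes))
print(sum(sizes) / len(sizes))  # average file size in bytes
```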
HDFS Audit Logs
• Can be enabled by setting the audit log level to INFO
• Every HDFS access request is logged
• Contains metadata about access requests
o User name (actual user and proxy user if any)
o IP Address (where request came from)
o Action (Command)
o File Name (Source and destination files involved)
-------------------------------------------------------------------------------------------------------------------------------------
|Date |Time | Status | User | Auth Type | IP Address | Command | Src Path |Dest Path|Perms |
-------------------------------------------------------------------------------------------------------------------------------------
2014-11-19 23:54:57,083 allowed=true ugi=hdfs (auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp dst=null perm=null
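An audit line is a timestamp followed by key=value pairs, so it can be split with one regular expression. The Python sketch below parses the sample line above. It assumes that a log level and logger name may optionally follow the timestamp, and that a proxied request adds "via <user> (auth:...)" after the first auth block. The group and function names are illustrative.

```python
import re

# Assumed layout: date, time, optional log level and logger, then key=value pairs.
AUDIT_RE = re.compile(
    r"(?P<date>\S+)\s+(?P<time>\S+)\s+"
    r"(?:(?P<log_level>\S+)\s+(?P<logger>\S+)\s+)?"
    r"allowed=(?P<allowed>\S+)\s+"
    r"ugi=(?P<user>\S+)\s+\(auth:(?P<auth_type>[^)]+)\)\s+"
    r"(?:via\s+(?P<proxy_user>\S+)\s+\(auth:(?P<proxy_auth_type>[^)]+)\)\s+)?"
    r"ip=/(?P<ip>\S+)\s+cmd=(?P<command>\S+)\s+"
    r"src=(?P<src_path>\S+)\s+dst=(?P<dest_path>\S+)\s+perm=(?P<permissions>\S+)"
)


def parse_audit_line(line):
    """Return the named fields of one audit line, or None if it does not match."""
    match = AUDIT_RE.match(line.strip())
    return match.groupdict() if match else None


SAMPLE = ("2014-11-19 23:54:57,083 allowed=true ugi=hdfs (auth:SIMPLE) "
          "ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp dst=null perm=null")
record = parse_audit_line(SAMPLE)
print(record["user"], record["command"], record["src_path"])
```

Fields that are absent from a line, such as the proxy user here, come back as None.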
Analyzing HDFS Audit Log
[Diagram: Hadoop Cluster (Tez, HDFS, YARN, Hive), HDFS Audit Logs, Analysis (JDBC Clients, ODBC Clients, Hive CLI)]
1. Audit log generated during normal operations of HDFS
2. Periodically load it into Hive
load data local inpath '/log/Hadoop/hdfs/hdfs-audit.log.2014-11-19' into table hdfs_audit
3. Query data through a preferred interface
HDFS Audit Log – Hive Table Definition
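-- `date`, `time` and `user` are backquoted because they are reserved words in newer Hive releases.
-- The regex needs one capture group per column (16 here); backslashes are doubled inside the HiveQL string.
-- "via" sits in a non-capturing group so that the auth and proxy_user columns line up with their groups.
CREATE EXTERNAL TABLE hdfs_audit (
  `date` STRING,
  `time` STRING,
  log_level STRING,
  class STRING,
  allowed STRING,
  `user` STRING,
  auth_str STRING,
  auth_type STRING,
  proxy_user STRING,
  proxy_user_auth_str STRING,
  proxy_user_auth_type STRING,
  ip STRING,
  command STRING,
  src_path STRING,
  dest_path STRING,
  permissions STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+allowed=(\\S+)\\s+ugi=(\\S+)\\s+(.auth:(\\S+)\\S)\\s+(?:via (\\S+))?\\s*(.auth:(\\S+)\\S)?\\s*ip=.(\\S+)\\s+cmd=(\\S+)\\s+src=(\\S+)\\s+dst=(\\S+)\\s+perm=(\\S+)"
) ;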
Sample Reports
Most Frequently Used Datasets
SELECT src_path, Count(1) AS access_frequency
FROM hdfs_audit
GROUP BY src_path
ORDER BY access_frequency DESC
LIMIT 3;
File Path Access Frequency
/domains/drd/production/config/AnalysisModule02Signatures.log 5,758,774
/domains/drd/production/config/ANLCustAnalysisModule02Signatures.log 5,754,181
/domains/drd/production/config/DBFBlockCriteria.log 4,816,841
Datasets not read even once
SELECT lsr.file_path AS file_path, lsr.`date` AS creation_date, lsr.size AS file_size
FROM lsr_view lsr
LEFT JOIN (SELECT Max(`date`) AS last_read, src_path
           FROM hdfs_audit
           WHERE command = 'open'
           GROUP BY src_path) audit
ON ( lsr.file_path = audit.src_path )
WHERE lsr.file_type = 'FILE' AND audit.src_path IS NULL
ORDER BY creation_date DESC
LIMIT 3;
File Path Creation Date File Size
/app/hive/warehouse/sales_stg.db/account/account_extract.txt 2014-10-16 76,598,987,465
/app/hive/warehouse/sales_stg.db/order/order_history.txt 2014-11-26 901,341,097,342
/app/hive/warehouse/sales_stg.db/catalog/catalog.txt 2014-11-28 213,353,902,128
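The query is an anti-join: it keeps files from the listing that have no matching 'open' event in the audit log. The same idea in Python, as a minimal sketch with hypothetical paths and an illustrative function name:

```python
def never_read(listed_files, audit_events):
    """Return listed file paths that have no 'open' event in the audit data.

    listed_files: iterable of file paths from the lsr report
    audit_events: iterable of (command, src_path) pairs from the audit log
    """
    opened = {src for command, src in audit_events if command == "open"}
    return [path for path in listed_files if path not in opened]


# Hypothetical data, for illustration only
files = ["/data/a.txt", "/data/b.txt", "/data/c.txt"]
events = [
    ("open", "/data/a.txt"),
    ("listStatus", "/data/b.txt"),
    ("open", "/data/a.txt"),
]
print(never_read(files, events))
```

The result only covers the period for which audit logs were loaded, so a file read before that window still shows up as never read.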
Potentially intrusive users
SELECT `user`, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY `user`
ORDER BY failed_attempts DESC
LIMIT 3;
User Failed Attempts
ryan_m 266
drown_d 238
mac_t 66
Potentially malicious client hosts
SELECT ip, Count(1) AS failed_attempts
FROM hdfs_audit
WHERE allowed != 'true'
GROUP BY ip
ORDER BY failed_attempts DESC
LIMIT 3;
IP Address Failed Attempts
10.20.147.245 1059
10.20.145.137 1021
10.20.146.203 1018
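Both the per-user and per-host reports count denied requests grouped by one field and keep the top entries. A minimal Python sketch of that pattern follows, with hypothetical records and an illustrative function name.

```python
from collections import Counter


def top_denied(records, key, limit=3):
    """Rank the values of `key` by how many denied requests they account for."""
    counts = Counter(r[key] for r in records if r["allowed"] != "true")
    return counts.most_common(limit)


# Hypothetical audit records, for illustration only
records = [
    {"user": "u1", "ip": "10.0.0.1", "allowed": "false"},
    {"user": "u1", "ip": "10.0.0.2", "allowed": "false"},
    {"user": "u2", "ip": "10.0.0.1", "allowed": "false"},
    {"user": "u3", "ip": "10.0.0.3", "allowed": "true"},
]
print(top_denied(records, "user"))
print(top_denied(records, "ip"))
```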
Summary
• Hadoop generates lots of useful metrics
• Many of the datasets can be analyzed with little effort
o Hive and Pig are great analytical tools
o There are built-in SerDes/Loaders for many of the formats
• Simple analytics on the HDFS lsr report, HDFS audit logs, and job history can empower DevOps to manage their clusters better
Thank You!
Questions?

HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Analyzing Hadoop Using Hadoop

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Analyzing Hadoop Using Hadoop 15 Apr 2015 Sheetal Dolas Principal Architect, Hortonworks
  • 2. Who am I?
      • Principal Architect @ Hortonworks
      • Most of my career has been in the field, solving real-life business problems
      • Last 5+ years in Big Data, including Hadoop, Storm, etc.
      • Co-developed Cisco OpenSOC ( http://opensoc.github.io )
      sheetal@hortonworks.com
      @sheetal_dolas
  • 3. Agenda
      • Need for operational insights
      • Challenges
      • Data sets available
      • Using Hadoop to analyze itself
      • Sample reports
      • Q and A
  • 4. Need for Operational Insights
  • 5. Need for Metrics Analysis
      • Metrics can reveal the story of your cluster
      • They help you understand workload characteristics
          o Reveal the pain points
          o Clear up misconceptions
          o Drive towards an action plan
      • Operational insights are critical for SLA management because they improve
          o System reliability
          o Uptime
          o Performance
          o Security
  • 6. Hadoop Metrics Challenges
      • Hadoop generates a lot of metrics
          o Host metrics (CPU, memory, disk, network)
          o Service metrics (JVM metrics, GC, transactions, performance)
          o Service reports (fsck, lsr, dfsadmin, audit logs)
          o Job metrics (resource utilization, data processed, performance)
      • Understanding and analyzing them is overwhelming
      • No existing tool is good enough to address the whole spectrum
      • Need for deeper technology understanding
  • 7. Metrics can appear like this conversation
      Hadoop Expert and Hadoop Newbie: hmm… hmmmm… hah! ahem! ahh! eh?
  • 8. Metrics can appear like this conversation
      You know all the words and their meaning
  • 9. Metrics can appear like this conversation
      But still don't get the meaning of the conversation
  • 10. We need tools that help extract meaning out of it
      Hadoop Expert and Hadoop Newbie:
      • hmm… Hadoop has Magnificent Metrics
      • hmmmm… Hadoop Metrics Make Me Mad
      • hah! Hadoop Analyzes Hadoop
      • ahem! Analyze Hadoop Easily in Minutes
      • ahh! Awesome! Hail Hadoop!
      • eh? Elucidative Hadoop?
  • 11. Datasets available
  • 12. Datasets available for analysis
      • MapReduce job history log
      • HDFS lsr report
      • HDFS audit log
  • 13. MapReduce Job History Log
  • 14. Job history log
      • Stored on HDFS
      • Contains all the events that occurred in a job, plus the event metadata
      • Has its own format
          o Can be parsed using the Rumen API
  • 15. Analyzing MapReduce Job history log
      Diagram: Hadoop cluster (HDFS, YARN, Tez, Hive); job log parsing (Rumen) and job resource computations; analysis through JDBC clients, ODBC clients, or the Hive CLI
      1. Periodically read the job history logs from HDFS
      2. Parse the logs, compute data, and write it back to Hive
      3. Query data through a preferred interface
  • 16. Sample Reports
  • 17. CPU Utilization
      Pie chart: CPU Utilization - By Queue - Week To Date
      Slice labels: 33%, 28%, 25%, 8%, 3%, 3%, 0%, 0%, 0%
      Legend: productintelligence, cfld, adhoc, hive, techsupport, mnm, webhcat, infosecurity, prodintel_small
  • 18. Disk IO
      Pie chart: Data IO (GB) - By User - Yesterday
      Slice labels: 53%, 33%, 7%, 7%, 0%
      Legend (User Ids): katharine.matsumoto, hadoop_sa, ebrown, mzang, justin.meyer, jmarquez, nbhupalam, rchakravarthy, pyan, rchirala, thomas.cox
  • 19. Workload Distribution Through Hour of Day
      Three charts, each plotted against job submission hour (0 to 22):
      • Number of jobs submitted
      • Number of tasks submitted
      • Total data processed (GB)
  • 20. Workload Distribution Through Day of Week
      Three charts (x-axis labeled "Job submission hour" on the slide):
      • Number of jobs submitted
      • Number of tasks submitted
      • Total data processed (GB)
  • 21. Job Type and Status
      Job Distribution By Type - Yesterday: chart values 73, 143, 199; legend Hive, MapReduce, Pig
      Job Distribution By Status - Yesterday: SUCCEEDED 98%, FAILED 2%, KILLED 0%
  • 22. Top 5 long running jobs - Yesterday
      | Job Id | Job Name | User Name | Queue Name | Job Duration |
      | job_1409197939494_7043 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 16 h 33 m 15 s |
      | job_1409197939494_7629 | PigLatin:LTF:09:12:Job3 | john_s | infosecurity | 1 d 8 h 40 m 42 s |
      | job_1409197939494_7243 | PigLatin:mbl_chtr_ios_blackberry_metrics.pig | joy_d | cfld | 1 d 6 h 54 m 56 s |
      | job_1409197939494_7042 | PigLatin:mbl_chtr_android_metrics.pig | hadoop_sa | hive | 1 d 3 h 37 m 30 s |
      | job_1409197939494_7328 | INSERT INTO TABLE com...ILE__NAME,'.')[5])(Stage-1) | hadoop_sa | hive | 1 d 1 h 28 m 35 s |
  • 23. Top 5 long waiting jobs - Yesterday
      | Job Id | Job Name | User Name | Queue Name | Job Submission Wait |
      | job_1409197939494_7621 | ODS_S.ODS_LOG_ORG_TYP_METRICS.jar | joy_d | cfld | 5 h 39 m 38 s |
      | job_1409197939494_8222 | PigLatin:LTF:09:15:Job3 | john_s | infosecurity | 5 h 19 m 46 s |
      | job_1409197939494_8357 | PigLatin:LTF:09:19:Job9 | raj_s | mnm | 5 h 18 m 47 s |
      | job_1409197939494_7622 | PigLatin:Log_U_Org_Metrics.pig | katherine_d | cfld | 5 h 11 m 12 s |
      | job_1409197939494_8071 | PigLatin:LTF:09:16:Job10 | raj_s | mnm | 5 h 4 m |
  • 24. Top 5 resource consuming jobs
      | job_id | Total maps | Total reduces | Requested map GB | Requested reduce GB | Total memory blocked by job GB |
      | job_1403277400645_1400 | 27,358 | 6 | 4 | 4 | 109,456 |
      | job_1403277400645_1423 | 27,358 | 3 | 4 | 4 | 109,444 |
      | job_1403277400645_1745 | 5,581 | 1 | 4 | 4 | 22,328 |
      | job_1403277400645_1497 | 1,807 | 0 | 4 | 4 | 7,228 |
      | job_1403277400645_1564 | 1,794 | 0 | 4 | 4 | 7,176 |
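In every row of the table above, the last column equals total maps times requested map GB plus total reduces times requested reduce GB. The talk does not state this formula, so the Python sketch below treats it as an inferred assumption; the helper name `memory_blocked_gb` is hypothetical.

```python
def memory_blocked_gb(total_maps, total_reduces, map_gb, reduce_gb):
    """Memory requested by all tasks of one job, in GB (formula inferred from the table)."""
    return total_maps * map_gb + total_reduces * reduce_gb


# (job_id, total maps, total reduces, requested map GB, requested reduce GB)
# copied from the slide's table.
JOBS = [
    ("job_1403277400645_1400", 27358, 6, 4, 4),
    ("job_1403277400645_1423", 27358, 3, 4, 4),
    ("job_1403277400645_1745", 5581, 1, 4, 4),
    ("job_1403277400645_1497", 1807, 0, 4, 4),
    ("job_1403277400645_1564", 1794, 0, 4, 4),
]

for job_id, maps, reduces, map_gb, reduce_gb in JOBS:
    print(job_id, memory_blocked_gb(maps, reduces, map_gb, reduce_gb))
```

Under this assumption the computed values reproduce the slide's last column for all five jobs.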
  • 25. Showback Reports
      | Queue Name | Total Cpu Hours Used | Cpu Cost | Total Memory Gb Hours Blocked | Memory Cost | Total Data Io Gb | Data Io Cost | Total Network Io Gb | Network Io Cost | Total Cost |
      | adhoc | 4,422.94 | 17.69 | 20,404.09 | 81.62 | 70,918.29 | 1,418.37 | 394.01 | 7.88 | $1,525.56 |
      | cfld | 41,038.93 | 164.16 | 150,130.90 | 600.52 | 446,762.29 | 8,935.25 | 7,258.97 | 145.18 | $9,845.11 |
      | hive | 73,322.16 | 293.29 | 372,560.04 | 1,490.24 | 977,333.05 | 19,546.66 | 90,800.40 | 1,816.01 | $23,146.20 |
      | infosecurity | 23,476.46 | 93.91 | 77,515.34 | 310.06 | 293,616.02 | 5,872.32 | 7,458.77 | 149.18 | $6,425.47 |
      | mnm | 27,113.03 | 108.45 | 100,027.28 | 400.11 | 391,907.76 | 7,838.16 | 10,436.65 | 208.73 | $8,555.45 |
      | productintelligence | 74,113.17 | 296.45 | 158,423.62 | 633.69 | 851,435.74 | 17,028.71 | 10,456.78 | 209.14 | $18,167.99 |
      | techsupport | 34,037.16 | 136.15 | 100,904.89 | 403.62 | 400,972.22 | 8,019.44 | 7,120.19 | 142.40 | $8,701.61 |
      Resource Pricing
      • CPU Cost Per Hour: $ 0.004
      • Memory Cost Per Gb Per Hour: $ 0.004
      • Data Io Cost Per Gb: $ 0.020
      • Network Io Cost Per Gb: $ 0.020
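The showback figures can be checked with a little arithmetic: each cost column is the usage multiplied by its unit price, and the total is the sum of the four costs. The Python sketch below is one way to do that, not the talk's implementation. It assumes each cost is rounded to cents before summing, which is what the displayed totals suggest; the function name `showback` and the dictionary keys are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Unit prices from the "Resource Pricing" part of the slide.
PRICES = {
    "cpu_hours": Decimal("0.004"),
    "memory_gb_hours": Decimal("0.004"),
    "data_io_gb": Decimal("0.020"),
    "network_io_gb": Decimal("0.020"),
}
CENT = Decimal("0.01")


def showback(usage):
    """Return per-resource costs (rounded to cents) plus their total."""
    costs = {
        resource: (Decimal(amount) * PRICES[resource]).quantize(CENT, rounding=ROUND_HALF_UP)
        for resource, amount in usage.items()
    }
    costs["total"] = sum(costs.values(), Decimal("0"))
    return costs


# Usage figures for the "adhoc" queue, copied from the slide.
adhoc = {
    "cpu_hours": "4422.94",
    "memory_gb_hours": "20404.09",
    "data_io_gb": "70918.29",
    "network_io_gb": "394.01",
}
print(showback(adhoc)["total"])
```

`Decimal` is used instead of floats so that rounding to cents behaves predictably.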
  • 26. HDFS lsr report
  • 27. HDFS lsr report
      • lsr is a recursive file listing
      • Contains metadata about files
          o Permissions
          o Owner & Group
          o Replication factor
          o File size
          o Last modified date time
          o File path
      -------------------------------------------------------------------------
      |Permissions |rep factor | user | group | size | date | time| file path |
      -------------------------------------------------------------------------
      drwx------ - sheetal etl_users 0 2014-12-13 01:18 /user/sheetal/analytics
      -rw-r--r-- 3 sheetal etl_users 15552642 2014-12-13 01:18 /user/sheetal/analytics/server.log
  • 28. Analyzing lsr report
      Diagram: HDFS lsr report; Hadoop cluster (HDFS, YARN, Tez, Hive); analysis through JDBC clients, ODBC clients, or the Hive CLI
      1. Periodically generate the lsr report
          hdfs dfs -lsr /
      2. Load it into Hive
          load data local inpath '/tmp/lsr.txt' overwrite into table lsr
      3. Query data through a preferred interface
  • 29. HDFS lsr report – Hive Table Definition
      CREATE EXTERNAL TABLE lsr (
        permissions STRING,
        replication STRING,
        owner STRING,
        group STRING,
        size STRING,
        date STRING,
        time STRING,
        file_path STRING
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)"
      );
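Before loading a large listing into Hive, it can help to check the pattern against the sample lines from the lsr slide. The Python sketch below is a quick check, not part of the talk. It assumes the pattern's tokens are `\S+` (a run of non-whitespace) and `\s+` (a run of whitespace), and it uses `fullmatch` because a regex SerDe must match the whole row. The names `LSR_PATTERN`, `FIELDS`, and `parse_lsr_line` are hypothetical.

```python
import re

# Seven whitespace-separated tokens followed by the rest of the line (the path).
LSR_PATTERN = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)"
)

# Column names in the same order as the Hive table definition.
FIELDS = (
    "permissions", "replication", "owner", "group",
    "size", "date", "time", "file_path",
)


def parse_lsr_line(line):
    """Return a dict of column name -> text, or None if the line does not match."""
    match = LSR_PATTERN.fullmatch(line.strip())
    return dict(zip(FIELDS, match.groups())) if match else None


sample = (
    "-rw-r--r-- 3 sheetal etl_users 15552642 "
    "2014-12-13 01:18 /user/sheetal/analytics/server.log"
)
print(parse_lsr_line(sample))
```

Every column comes back as text, which matches the all-STRING table definition; type conversion is left to a later step.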
  • 30. HDFS lsr report – Hive View Definition
      CREATE VIEW lsr_view AS
      SELECT
        ( CASE Substr(permissions, 1, 1) WHEN 'd' THEN 'DIR' ELSE 'FILE' END ) AS file_type,
        permissions,
        ( CASE replication WHEN '-' THEN 0 ELSE Cast (replication AS INT) END ) AS replication,
        owner,
        group,
        Cast (size AS BIGINT) AS size,
        date,
        time,
        file_path
      FROM lsr;
  • 31. Sample Reports
  • 32. Security Checks – Files Readable by All
      SELECT permissions, owner, file_path
      FROM lsr_view
      WHERE file_type = 'FILE'
        AND Substr(permissions, 8, 1) = 'r'
      LIMIT 3;
      | Permissions | Owner | File Path |
      | -rwxr-xr-x | sheetal | /user/sheetal/analytics/finance_report/000001_0 |
      | -rwxr-xr-x | joe_lee | /apps/hive/warehouse/sales.db/sales/date=2014-08-17/000001_1 |
      | -rw-r--r-- | sales_etl | /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt |
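The check relies on the layout of the permission string: character 1 is the entry type, characters 2 to 4 are the owner bits, 5 to 7 the group bits, and 8 to 10 the bits for everyone else, so character 8 is the "other" read bit. The Python sketch below restates that test outside Hive for illustration; the helper names and the small `LISTING` sample are hypothetical, with entries taken from the slides.

```python
def is_file(permissions):
    """Directories start with 'd'; this mirrors the view's file_type column."""
    return not permissions.startswith("d")


def is_world_readable(permissions):
    # Index 7 (0-based) is the eighth character, the "other" read bit,
    # mirroring Substr(permissions, 8, 1) = 'r' in the Hive query.
    return len(permissions) >= 8 and permissions[7] == "r"


# (permissions, path) pairs taken from the slides.
LISTING = [
    ("drwx------", "/user/sheetal/analytics"),
    ("-rw-r--r--", "/user/sheetal/analytics/server.log"),
    ("-rwxr-xr-x", "/user/sheetal/analytics/finance_report/000001_0"),
]

exposed = [
    path for perms, path in LISTING
    if is_file(perms) and is_world_readable(perms)
]
print(exposed)
```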
  • 33. Data loss risk – Files with low replication factor
      SELECT owner, replication, file_path
      FROM lsr_view
      WHERE file_type = 'FILE'
        AND file_path LIKE '/apps/hive/warehouse/%'
        AND replication < 3
      LIMIT 3;
      | Owner | Replication | File Path |
      | elizabeth | 1 | /apps/hive/warehouse/sales_stg.db/order/order_summary.txt |
      | sales_etl | 2 | /apps/hive/warehouse/sales_stg.db/user/new_subscribers.txt |
      | john_smith | 1 | /apps/hive/warehouse/archive.db/report_d/000001_0 |
  • 34. Data storage by user
      SELECT owner, Sum(size) AS total_size
      FROM lsr_view
      WHERE file_type = 'FILE'
      GROUP BY owner
      ORDER BY total_size DESC;
      Chart: agrissia 30%, albarma 26%, blackupli 15%, blackwardap 8%, brilliantbox 7%, bumpkin 5%, catstoopshard 4%, cozyboyal 2%, fallenvivala 2%, fonetter 1%
  • 35. Small Files
      SELECT relative_size, Count(1) AS total
      FROM (
        SELECT ( CASE size < 134217728 WHEN true THEN 'small' ELSE 'large' END ) AS relative_size
        FROM lsr_view
        WHERE file_type = 'FILE'
      ) tmp
      GROUP BY relative_size;
      Chart: large 10%, small 90%
      SELECT Avg(size) FROM lsr_view WHERE file_type = 'FILE';
      > 61,305,522 (Average File Size)
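The threshold 134217728 in the query is 128 × 1024 × 1024 bytes (128 MiB). The Python sketch below shows the same small/large bucketing on a handful of file sizes, purely as an illustration. The sizes in `sizes` are made-up sample values apart from 15552642, which comes from the lsr sample earlier; the function name `classify_sizes` is hypothetical.

```python
from collections import Counter

# Same cutoff as the Hive query: files strictly below 128 MiB count as small.
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 134217728 bytes


def classify_sizes(sizes):
    """Count files per bucket, mirroring the CASE expression in the query."""
    return Counter(
        "small" if size < SMALL_FILE_THRESHOLD else "large"
        for size in sizes
    )


# Illustrative sizes in bytes (hypothetical, except the first).
sizes = [15552642, 134217728, 200000000, 1024]
print(classify_sizes(sizes))
```

A file of exactly 134217728 bytes falls in the "large" bucket because the comparison is strict, as in the query.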
  • 36. HDFS Audit Logs
  • 37. HDFS audit logs
    • Can be enabled by setting audit log level to INFO
    • Every HDFS access request is logged
    • Contains metadata about access requests
      o User name (actual user and proxy user if any)
      o IP Address (where request came from)
      o Action (Command)
      o File Name (Source and destination files involved)
    Fields: Date | Time | Status | User | Auth Type | IP Address | Command | Src Path | Dest Path | Perms
    Example: 2014-11-19 23:54:57,083 allowed=true ugi=hdfs (auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp dst=null perm=null
  • 38. Analyzing HDFS Audit Log
    Diagram labels: Hadoop Cluster (Tez, HDFS, Yarn, Hive); HDFS Audit Logs; Analysis (JDBC Clients, ODBC Clients, Hive CLI)
    1. Audit log generated during normal operations of HDFS
    2. Periodically load it into Hive: load data local inpath '/log/Hadoop/hdfs/hdfs-audit.log.2014-11-19' into table hdfs_audit
    3. Query data through a preferred interface
  • 39. HDFS Audit Log – Hive Table Definition
    CREATE EXTERNAL TABLE hdfs_audit (
      date STRING, time STRING, log_level STRING, class STRING,
      allowed STRING, user STRING, auth_str STRING, auth_type STRING,
      proxy_user STRING, proxy_user_auth_str STRING, proxy_user_auth_type STRING,
      ip STRING, command STRING, src_path STRING, dest_path STRING, permissions STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+allowed=(\\S+)\\s+ugi=(\\S+)\\s+.auth:(\\S+)\\S\\s+(via (\\S+))?\\s*(.auth:(\\S+)\\S)?\\s*ip=.(\\S+)\\s+cmd=(\\S+)\\s+src=(\\S+)\\s+dst=(\\S+)\\s+perm=(\\S+)"
    );
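Before loading a day of logs, it can help to try the table's `input.regex` on a few lines. The sketch below applies the same pattern in Python, with the character classes written as `\S` and `\s`. Two things here are assumptions rather than facts from the slides: the example line on the audit log slide omits the log level and class fields that the table expects, so the sample lines add an `INFO FSNamesystem.audit:` prefix, and the proxy-user line is entirely hypothetical. `parse_audit_line` is a hypothetical helper name.

```python
import re

# Same pattern as the table's input.regex, written as a Python raw string.
AUDIT_PATTERN = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+allowed=(\S+)\s+ugi=(\S+)\s+"
    r".auth:(\S+)\S\s+(via (\S+))?\s*(.auth:(\S+)\S)?\s*"
    r"ip=.(\S+)\s+cmd=(\S+)\s+src=(\S+)\s+dst=(\S+)\s+perm=(\S+)"
)


def parse_audit_line(line: str):
    """Return selected fields of one audit log line, or None if it does not match."""
    # fullmatch requires the whole line to match the pattern.
    match = AUDIT_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "date": match.group(1),
        "allowed": match.group(5),
        "user": match.group(6),
        "proxy_user": match.group(9),  # None unless the line has "via <user>"
        "ip": match.group(12),
        "command": match.group(13),
        "src_path": match.group(14),
    }


# Assumed log4j prefix ("INFO FSNamesystem.audit:") added to the slide's example.
SIMPLE_LINE = (
    "2014-11-19 23:54:57,083 INFO FSNamesystem.audit: allowed=true ugi=hdfs "
    "(auth:SIMPLE) ip=/10.10.150.103 cmd=listStatus src=/mr-history/tmp "
    "dst=null perm=null"
)
# Hypothetical proxy-user request.
PROXY_LINE = (
    "2014-11-19 23:55:01,120 INFO FSNamesystem.audit: allowed=false ugi=joe_lee "
    "(auth:PROXY) via oozie (auth:SIMPLE) ip=/10.20.147.245 cmd=open "
    "src=/apps/hive/warehouse/sales.db dst=null perm=null"
)

for line in (SIMPLE_LINE, PROXY_LINE):
    print(parse_audit_line(line))
```

Lines that return `None` here would likely produce rows of NULLs in the Hive table, so a quick count of non-matching lines is a cheap check on the pattern.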
  • 40. Sample Reports
  • 41. Most Frequently Used Datasets
    SELECT src_path, Count(1) AS access_frequency
    FROM hdfs_audit
    GROUP BY src_path
    ORDER BY access_frequency DESC
    LIMIT 3;
    File Path | Access Frequency
    /domains/drd/production/config/AnalysisModule02Signatures.log | 5,758,774
    /domains/drd/production/config/ANLCustAnalysisModule02Signatures.log | 5,754,181
    /domains/drd/production/config/DBFBlockCriteria.log | 4,816,841
  • 42. Datasets not read even once
    SELECT lsr.file_path AS file_path, lsr.date AS creation_date, lsr.size AS file_size
    FROM lsr_view lsr
    LEFT JOIN (SELECT Max(date), src_path FROM hdfs_audit WHERE command = 'open' GROUP BY src_path) audit
      ON ( lsr.file_path = audit.src_path )
    WHERE lsr.file_type = 'FILE' AND audit.src_path IS NULL
    ORDER BY creation_date DESC
    LIMIT 3;
    File Path | Creation Date | File Size
    /app/hive/warehouse/sales_stg.db/account/account_extract.txt | 2014-10-16 | 76,598,987,465
    /app/hive/warehouse/sales_stg.db/order/order_history.txt | 2014-11-26 | 901,341,097,342
    /app/hive/warehouse/sales_stg.db/catalog/catalog.txt | 2014-11-28 | 213,353,902,128
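The LEFT JOIN followed by `audit.src_path IS NULL` is an anti-join: it keeps only the files that have no matching `open` event. Note that "not read even once" can only mean "not read within the audit logs that were loaded". A minimal Python sketch of the same logic follows; `never_read` and the sample paths and events are hypothetical.

```python
def never_read(file_paths, audit_events):
    """Return the paths that have no 'open' event in the audit log."""
    # Equivalent of the subquery: the distinct paths opened at least once.
    opened = {src for command, src in audit_events if command == "open"}
    # Equivalent of LEFT JOIN ... WHERE audit.src_path IS NULL: keep unmatched files.
    return [path for path in file_paths if path not in opened]


# Hypothetical file listing and (command, src_path) audit events.
files = ["/data/a.txt", "/data/b.txt", "/data/c.txt"]
events = [
    ("open", "/data/a.txt"),
    ("listStatus", "/data/b.txt"),  # listed but never opened
    ("open", "/data/a.txt"),
]
print(never_read(files, events))
```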
  • 43. Potentially intrusive users
    SELECT user, Count(1) AS failed_attempts
    FROM hdfs_audit
    WHERE allowed != 'true'
    GROUP BY user
    ORDER BY failed_attempts DESC
    LIMIT 3;
    User | Failed Attempts
    ryan_m | 266
    drown_d | 238
    mac_t | 66
  • 44. Potentially malicious client hosts
    SELECT ip, Count(1) AS failed_attempts
    FROM hdfs_audit
    WHERE allowed != 'true'
    GROUP BY ip
    ORDER BY failed_attempts DESC
    LIMIT 3;
    IP Address | Failed Attempts
    10.20.147.245 | 1059
    10.20.145.137 | 1021
    10.20.146.203 | 1018
  • 45. Summary
  • 46. Summary
    • Hadoop generates lots of useful metrics
    • Many of the datasets can be easily analyzed with a little effort
      o Hive and Pig are great analytical tools
      o There are built-in SerDes/Loaders for many of the formats
    • Simple analytics on HDFS lsr, HDFS Audit, Job History can empower DevOps to manage their clusters better
  • 47. Thank You! Questions?