1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Keep your Hadoop cluster at its best!
Chris Nauroth
Sheetal Dolas
Hadoop Summit, San Jose, 2016
About Us
⬢ Principal Engineer @ Hortonworks
⬢ Committer and PMC, Apache Hadoop
– Key contributor to HDFS ACLs, Windows compatibility, and operability improvements
⬢ Hadoop user since 2010
– Experience deploying, maintaining and using Hadoop clusters
cnauroth@hortonworks.com
cnauroth
Chris Nauroth
About Us
⬢ SmartSense Engineering Lead @ Hortonworks
⬢ Most of the career has been in the field, solving real life business problems
⬢ Last 6+ years in Big Data
⬢ Committer and PMC, Apache Metron
sheetal@hortonworks.com
sheetal_dolas
Sheetal Dolas
Agenda
⬢ Days in the life of Hadoop users – Real war stories!
⬢ Hadoop Operational Challenges
⬢ Winning and avoiding the wars
⬢ Q & A
Days in the life of Hadoop users
Real war stories!
Story I: Unstable NameNode, Frequent Fail Overs
⬢ NameNode periodically becomes unresponsive
⬢ In HA scenario, fails over to standby
⬢ Shortly afterwards, fails back again
⬢ Very frequent fail overs and fail backs
It was the garbage collection!
Story II: Very high CPU usage but low throughput
⬢ Unusually high system CPU usage
⬢ Jobs slowed down
⬢ Reduced data IO
[Chart: system CPU, user CPU, and network IO]
Transparent Huge Pages (THP) was turned on!
Story III: Cascading impact and cluster melt down
⬢ HDFS upgraded
⬢ HDFS utilization kept on increasing even after large data deletion
⬢ Rebalancing made the situation worse
⬢ Eventually HDFS became unresponsive
An un-finalized HDFS upgrade had a cascading impact on the cluster!
Story IV: Overloaded cluster
⬢ Jobs run slower
⬢ Containers and jobs are always waiting; all YARN queues are fully utilized
⬢ Some jobs had to wait for hours to get container slots
Suboptimally configured container sizes!
[Chart: requested memory vs. used memory]
Story V: Accidental deletion of critical datasets
⬢ User accidentally executed hdfs dfs -rm -R on a root directory
⬢ Delete is issued in parallel; Ctrl+C did not help
⬢ In panic, user shuts down HDFS immediately (fortunately)
⬢ Restarts later to check trash, loses all data
⬢ It’s nearly impossible to recover blocks from local file system
This is a more common mistake than one may think!
Story VI: Hive query returning random results
⬢ A Hive query returns different results every time
⬢ Results are usually accurate during office hours
⬢ After office hours, results keep changing randomly on every execution
-- QUERY: WHAT IS TODAY’S TOTAL SALE AS OF NOW ?
SELECT SUM(amount)
FROM sales
WHERE sale_date = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()));
One of the hosts had a different time zone!
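To illustrate why the result changes only after office hours, here is a minimal Python sketch that evaluates "today" for the same instant on two hosts. The UTC offsets and timestamps are made-up illustration values, not details of the actual incident, and `local_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def local_date(instant_utc, utc_offset_hours):
    """Return the calendar date seen by a host at the given UTC offset."""
    host_tz = timezone(timedelta(hours=utc_offset_hours))
    return instant_utc.astimezone(host_tz).date()


# Hypothetical evening instant: 19:30 on a UTC-7 host is already 02:30 next day in UTC.
evening = datetime(2016, 6, 29, 2, 30, tzinfo=timezone.utc)
date_on_host_a = local_date(evening, -7)  # host configured for UTC-7
date_on_host_b = local_date(evening, 0)   # host left on UTC

# Hypothetical office-hours instant: both hosts agree on the date.
midday = datetime(2016, 6, 28, 18, 0, tzinfo=timezone.utc)
same_during_day = local_date(midday, -7) == local_date(midday, 0)

print(date_on_host_a, date_on_host_b, same_during_day)
```

During the day both hosts compute the same date, so the filter on sale_date matches the same rows. In the evening the two hosts disagree about "today", so the answer depends on which host evaluates the expression.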
and the stories continue…
Hadoop operational
challenges
Hadoop has lots of configurations
⬢ So many configurations! Overwhelming for many users
⬢ Best practices are evolving and change across versions
Many configurations are cluster and workload specific
⬢ A configuration good for one cluster may not be suitable for another cluster
⬢ Optimally configured clusters may become suboptimal tomorrow as they grow
Large clusters add to the complexities
⬢ Managing, updating and keeping nodes in sync becomes challenging
⬢ Nodes going down miss the maintenance cycles and get out of sync
⬢ Newly added nodes may have different standards (Java version, OS, user configurations, etc.)
⬢ Clusters accumulate heterogeneous hardware over time
Winning and avoiding the wars with SmartSense
SmartSense
⬢ Proactive support & personalized cluster insights by
– Enabling faster case resolution
– Applying industry best practices
– Providing proactive analysis
⬢ SmartSense is a collection of tools and services
– Evaluates the cluster’s current configuration and runtime environment against a rich set of rules
– Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads
– Rule sets continuously evolve and improve, developed by or in close consultation with active committers, support engineers, and field engineers
SmartSense Architecture
[Diagram. Labeled components: agents, worker nodes, Ambari, server, bundle, gateway, landing zone, SmartSense Analytics]
Addressing: Unstable NameNode, Frequent Fail Overs
Daunting Questions
⬢ What is the right heap size for my NN?
⬢ What should the new generation size be?
⬢ Which GC should I use?
⬢ Which GC options should be configured?
⬢ What if my cluster grows?
SmartSense Answer
⬢ Rule: hdfs_nn_jvm_opts
⬢ Calculates heap size based on
– Current heap usage
– Total number of objects in the file system
– Best practices
⬢ Recalculates dependent JVM options based on heap size
⬢ Validates existing JVM opts
⬢ Provides continuous validation and proactive recommendations
NameNode JVM Opts
⬢ Heap size
– 200 bytes per HDFS object (files, directories, blocks)
– 25% buffer
⬢ -Xms should be the same as -Xmx
⬢ New generation size should be 1/8th of -Xmx (capped at 8G)
⬢ Use Concurrent Mark Sweep (CMS) garbage collection
– -XX:+UseConcMarkSweepGC
– -XX:CMSInitiatingOccupancyFraction=70
– -XX:+UseCMSInitiatingOccupancyOnly
– -XX:ParallelGCThreads=8
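The sizing guidance on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the SmartSense rule itself: the function name, the rounding up to whole gigabytes, and the use of -XX:NewSize / -XX:MaxNewSize to express the new generation size are assumptions.

```python
import math

BYTES_PER_OBJECT = 200   # per file, directory, or block (from the slide)
BUFFER_FRACTION = 0.25   # 25% buffer (from the slide)
NEW_GEN_CAP_MB = 8 * 1024  # new generation capped at 8G (from the slide)


def namenode_jvm_opts(hdfs_objects):
    """Hypothetical helper: derive NameNode JVM options from the object count."""
    heap_bytes = hdfs_objects * BYTES_PER_OBJECT * (1 + BUFFER_FRACTION)
    # Assumption: round the heap up to a whole number of GB, minimum 1 GB.
    heap_gb = max(1, math.ceil(heap_bytes / 1024**3))
    new_gen_mb = min(heap_gb * 1024 // 8, NEW_GEN_CAP_MB)
    return [
        f"-Xms{heap_gb}g",  # same value as -Xmx
        f"-Xmx{heap_gb}g",
        f"-XX:NewSize={new_gen_mb}m",
        f"-XX:MaxNewSize={new_gen_mb}m",
        "-XX:+UseConcMarkSweepGC",
        "-XX:CMSInitiatingOccupancyFraction=70",
        "-XX:+UseCMSInitiatingOccupancyOnly",
        "-XX:ParallelGCThreads=8",
    ]


# Example with a made-up object count of 100 million.
opts = namenode_jvm_opts(100_000_000)
print(" ".join(opts))
```

Because the heap grows with the number of files, directories, and blocks, the calculation needs to be repeated as the cluster grows.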
Addressing: Very high CPU usage but low throughput
Daunting Questions
⬢ Is THP applicable to my OS version?
⬢ Is it disabled? Completely disabled?
⬢ How do I make sure it is disabled on newly added nodes too?
⬢ How do I make these configurations person-independent?
SmartSense Answer
⬢ Rule: os_thp
⬢ Checks if THP is completely disabled
⬢ Provides OS-specific disabling instructions
⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
Disable THP
⬢ For RedHat & CentOS
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
⬢ For Debian, Ubuntu & SUSE
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
[Chart: system CPU, user CPU, and network IO]
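After writing "never", the setting can be verified by reading the same file back: the kernel lists the available modes and marks the active one in square brackets. The Python sketch below parses that format. The helper names and sample strings are illustrative assumptions, and this is not how the os_thp rule is implemented.

```python
import re


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from a THP 'enabled' file, or None."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None


def thp_disabled(sysfs_text):
    """True only when the active mode is 'never'."""
    return active_thp_mode(sysfs_text) == "never"


# Sample file contents (illustrative), as read from the 'enabled' path on each host.
samples = {
    "node1": "always madvise [never]\n",
    "node2": "[always] madvise never\n",
}
for host, text in sorted(samples.items()):
    print(host, active_thp_mode(text), thp_disabled(text))
```

Running a check like this on every host, including newly added ones, is what keeps the setting from depending on one person remembering to apply it.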
Addressing: Cascading impact and cluster melt down
Daunting Questions
⬢ Should I finalize the upgrade?
⬢ What is the right time to finalize?
⬢ How do I make sure it does not fall through the cracks?
SmartSense Answer
⬢ Rule: hdfs_nn_finalize_upgrade
⬢ Checks HDFS health after upgrade
⬢ Evaluates how long HDFS has been running in an un-finalized state
⬢ Reminds until it is finalized
HDFS Upgrade finalization
⬢ Check NN UI / JMX for upgrade status
⬢ Do not finalize HDFS upgrade until
– All files and blocks have been verified after upgrade
– Critical jobs have been executed at least once after upgrade
⬢ Finalize between 2 and 7 days after upgrade
hdfs dfsadmin -finalizeUpgrade
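The checklist on this slide amounts to a small decision procedure, sketched below in Python. The function name, its inputs, and the returned messages are hypothetical. Only the two preconditions and the 2 to 7 day window come from the slide.

```python
def finalize_advice(days_since_upgrade, blocks_verified, critical_jobs_ran):
    """Hypothetical helper: suggest what to do about an un-finalized upgrade."""
    # Preconditions from the slide: never finalize before these are met.
    if not (blocks_verified and critical_jobs_ran):
        return "wait: verification incomplete"
    if days_since_upgrade < 2:
        return "wait: too early"
    if days_since_upgrade <= 7:
        return "finalize now"
    return "overdue: finalize as soon as possible"


# Made-up example states.
for state in [(1, True, True), (3, True, False), (4, True, True), (30, True, True)]:
    print(state, "->", finalize_advice(*state))
```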
Addressing: Overloaded cluster
Daunting Questions
⬢ What is the right container size for my cluster?
⬢ If I add additional components (HBase, Storm), how does the container size change?
⬢ How do container sizes change when I add new types of nodes to the cluster?
⬢ What is the impact on container sizes if I add SSDs to the nodes?
SmartSense Answer
⬢ Rules: yarn_container_size, mr_container_size, tez_container_size
⬢ Evaluates resources available on each host (CPU, memory, disks, running services, etc.)
⬢ Calculates technology-specific container sizes (MR, Tez, Hive)
⬢ Continuously evaluates as the cluster dynamics change
Container sizing
⬢ Identify resources (CPU, memory, disks) available on each node
⬢ Set aside resources required for other processes (OS, DN, NM, HBase RS)
⬢ Calculate max possible containers for each resource (CPU, memory, disks)
– CPU containers: 4 x cores
– Disk containers: (3 x HDD + 10 x SSD)
– Memory containers: (available RAM / 2)
⬢ Number of containers = min(CPU containers, disk containers, memory containers)
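As a worked example of the formula above, the following Python sketch computes the container count for one node. It assumes that "available RAM" is expressed in GB and already excludes what is set aside for the OS and other services. The function name and the sample hardware figures are made up.

```python
def max_containers(cores, hdd_count, ssd_count, available_ram_gb):
    """Hypothetical helper: containers per node, limited by the scarcest resource."""
    cpu_containers = 4 * cores
    disk_containers = 3 * hdd_count + 10 * ssd_count
    memory_containers = available_ram_gb // 2  # assumption: RAM measured in GB
    return min(cpu_containers, disk_containers, memory_containers)


# Made-up node: 16 cores, 12 HDDs, no SSDs, 96 GB of RAM left for YARN.
hdd_only = max_containers(cores=16, hdd_count=12, ssd_count=0, available_ram_gb=96)

# Same node with 2 SSDs added: disks stop being the bottleneck, memory takes over.
with_ssds = max_containers(cores=16, hdd_count=12, ssd_count=2, available_ram_gb=96)

print(hdd_only, with_ssds)
```

The example shows why the answer has to be recomputed when hardware changes: adding SSDs moves the limiting resource from disks to memory.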
Addressing: Accidental deletion of critical datasets
Daunting Questions
⬢ Is HDFS trash enabled?
⬢ What is a safe trash interval?
⬢ How do I prevent accidental deletion of critical data?
SmartSense Answer
⬢ Rule: hdfs_trash_interval
– Checks if trash is enabled
– Validates that the trash interval is within reasonable limits
⬢ Rule: hdfs_nn_protect_imp_dirs
– New feature available in Hadoop 2.8
– Helps you mark critical directories such as “/”, “/user”, “/user/apps/hive”, “/user/apps/hbase” etc. as delete-protected
HDFS Trash interval and directory protection
⬢ fs.trash.interval determines the number of minutes after which trashed data gets deleted
– 0 means trash is disabled (data gets deleted immediately)
– Keep it in the range 1440 (1 day) – 10080 (7 days)
– Recommended: 4320 (3 days)
⬢ fs.protected.directories specifies directories that will be delete-protected
– Available from Hadoop 2.8
– List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp" etc.)
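A check along the lines of this slide could look like the Python sketch below. It only illustrates the thresholds quoted above: the helper names and status strings are hypothetical, and the second helper assumes that fs.protected.directories takes a comma-separated list.

```python
MIN_INTERVAL_MIN = 1440          # 1 day (from the slide)
MAX_INTERVAL_MIN = 10080         # 7 days (from the slide)
RECOMMENDED_INTERVAL_MIN = 4320  # 3 days (from the slide)


def check_trash_interval(minutes):
    """Hypothetical helper: classify an fs.trash.interval value."""
    if minutes == 0:
        return "disabled"
    if minutes < MIN_INTERVAL_MIN:
        return "too short"
    if minutes > MAX_INTERVAL_MIN:
        return "too long"
    return "ok"


def protected_directories_value(directories):
    """Join directories into a single comma-separated property value."""
    return ",".join(directories)


print(check_trash_interval(0), check_trash_interval(RECOMMENDED_INTERVAL_MIN))
print(protected_directories_value(["/user", "/user/apps/hive", "/user/apps/hbase"]))
```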
Addressing: Hive query returning random results
Daunting Questions
⬢ Is my cluster configured consistently?
⬢ How do I prevent such hard-to-analyze issues?
⬢ How do I make sure newly added nodes do not bring these types of issues?
⬢ How do I make these setups person-independent?
SmartSense Answer
⬢ Rule: os_time_zone
⬢ Checks if all hosts have the same time zone
⬢ Rule: os_service_ntpd_on makes sure all host times are in sync
⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
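A consistency check of this kind can be sketched as follows, assuming the time zone of each host has already been collected into a dictionary. The function name, host names, and zone names are hypothetical, and treating the most common zone as the expected one is an assumption rather than the documented behavior of os_time_zone.

```python
from collections import Counter


def hosts_off_majority(host_time_zones):
    """Return the hosts whose time zone differs from the most common one."""
    if not host_time_zones:
        return []
    majority_zone, _ = Counter(host_time_zones.values()).most_common(1)[0]
    return sorted(
        host for host, zone in host_time_zones.items() if zone != majority_zone
    )


# Made-up inventory: one node was provisioned with a different time zone.
inventory = {
    "node1": "America/Los_Angeles",
    "node2": "America/Los_Angeles",
    "node3": "UTC",
}
outliers = hosts_off_majority(inventory)
print(outliers)
```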
There are 250+ more such rules
Operations
⬢ hdfs_dn_volume_tolerance
⬢ hdfs_dn_xceivers
⬢ hdfs_nn_handler_count
⬢ …
⬢ yarn_zk_quorum
⬢ yarn_nm_recovery
⬢ …
⬢ os_hostname_reverse_lookup
⬢ os_ssd_tuning
⬢ …
⬢ hive_mr_strict_mode
⬢ hive_datanucleus_cache
⬢ …
⬢ tez_am_heap
⬢ tez_shuffle_buffer
⬢ …
Performance
⬢ ams_mc_distributed_configs
⬢ ams_mc_write_path
⬢ ...
⬢ hbase_jvm_opts
⬢ hbase_rs_open_region_threads
⬢ hbase_tcp_nodelay
⬢ ...
⬢ hdfs_dn_jvm_opts
⬢ hdfs_mount_options
⬢ hdfs_nn_dn_staleness_interval
⬢ ...
⬢ hive_auto_convert_join
⬢ hive_disable_caching
⬢ hive_enable_cbo
⬢ ...
Security
⬢ hdfs_dn_volume_tolerance
⬢ hdfs_audit_log
⬢ hdfs_block_access_token
⬢ hdfs_enable_security_check
⬢ hdfs_nn_super_user_group
⬢ hdfs_zkfc_ha_acl
⬢ ...
⬢ ranger_policy_refresh_interval
⬢ smartsense_2_way_ssl_enabled
⬢ ...
⬢ yarn_ats_security
⬢ yarn_enable_acl
⬢ ...
There is more than just configurations
⬢ How do I show back/charge back my tenants?
⬢ Who are the top users of my platform?
⬢ What type of workloads are running on my cluster?
⬢ Which jobs have significant impact on my cluster?
⬢ How do I improve performance of key jobs?
⬢ What is a good time for maintenance?
Activity Analysis
Summary
⬢ There are many things involved in managing a Hadoop cluster
⬢ Best practices evolve and change across versions
⬢ What is optimal today may not be optimal tomorrow
⬢ Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments
⬢ SmartSense can significantly help avoid common mistakes, issues, and pitfalls, and simplify Hadoop operations
Let's keep your Hadoop cluster at its best!
Thank You!
Appendix
More Resources
⬢ https://docs.hortonworks.com/index.html
⬢ http://hortonworks.com/products/subscriptions/smartsense/
⬢ http://hortonworks.com/info/smartsense/
⬢ http://hortonworks.com/blog/introducing-hortonworks-smartsense/
⬢ https://www.youtube.com/watch?v=IKulo9c8PjE
⬢ https://community.hortonworks.com/topics/smartsense.html
SmartSense Bundle Security
⬢ All Bundles are Anonymized and Encrypted
⬢ Multiple built-in security measures
– Ambari clear text passwords are not collected
– Hive and Oozie database properties are not collected
– All IP addresses and host names are anonymized
⬢ Extensible security rules
– Exclude properties within specific Hadoop configuration files
– Global REGEX replacements across all configuration, metrics, and logs
SmartSense Stack Support
HDP 2.4 HDP 2.3 HDP 2.2 HDP 2.1 HDP 2.0
SmartSense 1.x
Ambari 2.2
Built-In!
Ambari 2.1
Plug-In
Ambari 2.0
Plug-In
Ambari 1.7 Ambari 1.6
SmartSense 1.x

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 

Keep your hadoop cluster at its best! v4

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Keep your Hadoop cluster at its best! Chris Nauroth Sheetal Dolas Hadoop Summit, San Jose, 2016
  • 2. About Us: Chris Nauroth
      ⬢ Principal Engineer @ Hortonworks
      ⬢ Committer and PMC, Apache Hadoop
        – Key contributor to HDFS ACLs, Windows compatibility, and operability improvements
      ⬢ Hadoop user since 2010
        – Experience deploying, maintaining and using Hadoop clusters
      cnauroth@hortonworks.com / cnauroth
  • 3. About Us: Sheetal Dolas
      ⬢ SmartSense Engineering Lead @ Hortonworks
      ⬢ Most of his career has been in the field, solving real-life business problems
      ⬢ Last 6+ years in Big Data
      ⬢ Committer and PMC, Apache Metron
      sheetal@hortonworks.com / sheetal_dolas
  • 4. Agenda
      ⬢ Days in a life of Hadoop users – Real war stories!
      ⬢ Hadoop Operational Challenges
      ⬢ Winning and avoiding the wars
      ⬢ Q & A
  • 5. Days in a life of Hadoop users: Real war stories!
  • 6. Story I: Unstable NameNode, Frequent Fail Overs
      ⬢ NameNode periodically becomes unresponsive
      ⬢ In HA scenario, fails over to standby
      ⬢ In short time, falls back again
      ⬢ Very frequent fail overs and fail backs
      It was the garbage collection!
  • 7. Story II: Very high CPU usage but low throughput
      ⬢ Unusually high system CPU usage
      ⬢ Jobs slowed down
      ⬢ Reduced data IO
      [Chart labels: System CPU, User CPU, N/W IO]
      Transparent Huge Pages (THP) was turned on!
  • 8. Story III: Cascading impact and cluster melt down
      ⬢ HDFS upgraded
      ⬢ HDFS utilization kept on increasing even after large data deletion
      ⬢ Rebalancing made the situation worse
      ⬢ Eventually HDFS became unresponsive
      The un-finalized HDFS upgrade had a cascading impact on the cluster!
  • 9. Story IV: Overloaded cluster
      ⬢ Jobs run slower
      ⬢ Containers and jobs are always waiting; all YARN queues are fully utilized
      ⬢ Some jobs had to wait for hours to get container slots
      [Chart labels: Requested Memory, Used Memory]
      Sub-optimally configured container sizes!
  • 10. Story V: Accidental deletion of critical datasets
      ⬢ User accidentally executed hdfs dfs -rm -R on a root directory
      ⬢ Delete is issued in parallel, so Ctrl+C did not help
      ⬢ In panic, user shuts down HDFS immediately (fortunately)
      ⬢ Restarts later to check trash, loses all data
      ⬢ It’s nearly impossible to recover blocks from the local file system
      This is a more common mistake than one may think!
  • 11. Story VI: Hive query returning random results
      ⬢ A Hive query returns different results every time
      ⬢ Results are usually accurate during office hours
      ⬢ After office hours, results keep changing randomly on every execution
      -- QUERY: WHAT IS TODAY’S TOTAL SALE AS OF NOW?
      SELECT SUM(amount) FROM sales
      WHERE sale_date = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
      One of the hosts had a different time zone!
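Why a single mis-zoned host produces this pattern can be illustrated with a small sketch. It is not the Hive implementation; it only shows that the same instant maps to different calendar dates under different UTC offsets, so "today" depends on which host evaluates it. The offsets and the sample instant are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_date(instant, utc_offset_hours):
    """Return the calendar date a host with the given UTC offset sees for `instant`."""
    host_zone = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(host_zone).date()


# Hypothetical instant: evening in a UTC-7 zone, already the next day in UTC.
instant = datetime(2016, 6, 29, 2, 0, tzinfo=timezone.utc)

correct_host = local_date(instant, -7)   # host configured like the rest of the cluster
odd_host = local_date(instant, 0)        # host left on a different time zone

# A task scheduled on the odd host filters on a different sale_date.
print(correct_host, odd_host, correct_host == odd_host)
```

During the hours when both zones agree on the date the query looks correct, which matches the "accurate during office hours" symptom on the slide.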
  • 12. and the stories continue…
  • 14. Hadoop has lots of configurations
      ⬢ So many configurations! Overwhelming for many users
      ⬢ Best practices are evolving and change across versions
  • 15. Many configurations are cluster and workload specific
      ⬢ A configuration good for one cluster may not be suitable for another cluster
      ⬢ Optimally configured clusters may become sub optimal tomorrow as they grow
  • 16. Large clusters add to the complexities
      ⬢ Managing, updating and keeping nodes in sync becomes challenging
      ⬢ Nodes that go down miss the maintenance cycles and get out of sync
      ⬢ Newly added nodes may have different standards (Java version, OS, user configurations etc.)
      ⬢ Clusters end up with heterogeneous hardware over time
  • 17. Winning and avoiding the wars with SmartSense
  • 18. SmartSense
      ⬢ Proactive support & personalized cluster insights by
        – Enabling faster case resolution
        – Applying industry best practices
        – Providing proactive analysis
      ⬢ SmartSense is a collection of tools and services
        – Evaluates cluster’s current configuration and runtime environment against rich set of rules
        – Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads
        – Continuously evolving and improving rule sets, developed by or in close consultation with active committers, support engineers, field engineers
  • 19. SmartSense Architecture
      [Diagram labels: Agent (on each Worker Node), Ambari, Server, Bundle, Gateway, Landing Zone, SmartSense Analytics]
  • 20. Addressing: Unstable NameNode, Frequent Fail Overs
      Daunting Questions
      ⬢ What is right Heap size for my NN?
      ⬢ What should be the new gen size?
      ⬢ Which GC should I use?
      ⬢ What GC options to be configured?
      ⬢ What if my cluster grows?
      SmartSense Answer
      ⬢ Rule: hdfs_nn_jvm_opts
      ⬢ Calculates Heap size based on
        – Current heap usage
        – Total number of objects in file system
        – Best practices
      ⬢ Recalculates dependent JVM options based on Heap size
      ⬢ Validates existing JVM opts
      ⬢ Provides continuous validations and proactive recommendations
  • 21. NameNode JVM Opts
      ⬢ Heap Size
        – 200 bytes per HDFS object (files, directories, blocks)
        – 25% buffer
      ⬢ -Xms should be same as -Xmx
      ⬢ New generation size should be 1/8th of -Xmx (capped at 8G)
      ⬢ Use Concurrent Mark Sweep (CMS) Garbage Collection
        – -XX:+UseConcMarkSweepGC
        – -XX:CMSInitiatingOccupancyFraction=70
        – -XX:+UseCMSInitiatingOccupancyOnly
        – -XX:ParallelGCThreads=8
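The sizing arithmetic on this slide can be sketched in a few lines. This is not the `hdfs_nn_jvm_opts` rule itself, only an illustration of the stated numbers. Assumptions not in the slides: the 25% buffer is applied as headroom on top of the object estimate, sizes are rounded to whole megabytes, and the new generation is pinned with the standard `-XX:NewSize` / `-XX:MaxNewSize` JVM flags. The function name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 200        # per file, directory or block (from the slide)
BUFFER_FRACTION = 0.25        # assumed to mean 25% headroom on top of the estimate
NEW_GEN_CAP_MB = 8 * 1024     # new generation capped at 8G (from the slide)


def namenode_jvm_opts(hdfs_objects):
    """Return a list of JVM options sized for the given number of HDFS objects."""
    heap_bytes = hdfs_objects * BYTES_PER_OBJECT * (1 + BUFFER_FRACTION)
    heap_mb = math.ceil(heap_bytes / (1024 * 1024))
    new_gen_mb = min(heap_mb // 8, NEW_GEN_CAP_MB)
    return [
        f"-Xms{heap_mb}m",                 # -Xms same as -Xmx
        f"-Xmx{heap_mb}m",
        f"-XX:NewSize={new_gen_mb}m",      # assumption: fixed-size new generation
        f"-XX:MaxNewSize={new_gen_mb}m",
        "-XX:+UseConcMarkSweepGC",
        "-XX:CMSInitiatingOccupancyFraction=70",
        "-XX:+UseCMSInitiatingOccupancyOnly",
        "-XX:ParallelGCThreads=8",
    ]


print(" ".join(namenode_jvm_opts(100_000_000)))
```

Re-running the calculation as the object count grows is the point of the "What if my cluster grows?" question: the heap and the new generation size move together.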
  • 22. Addressing: Very high CPU usage but low throughput
      Daunting Questions
      ⬢ Is THP applicable to my OS version?
      ⬢ Is it disabled? Completely disabled?
      ⬢ How do I make sure it is disabled on newly added nodes too?
      ⬢ How do I make these configurations person independent?
      SmartSense Answer
      ⬢ Rule: os_thp
      ⬢ Checks if THP is completely disabled
      ⬢ Provides OS specific disabling instructions
      ⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
  • 23. Disable THP
      ⬢ For RedHat & CentOS
        echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
      ⬢ For Debian, Ubuntu & SUSE
        echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
      [Chart labels: System CPU, User CPU, N/W IO]
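A per-node check for the setting above could look like the following sketch. It is not the `os_thp` rule; it only reads the two paths named on the slide and relies on the kernel's convention of marking the active mode with square brackets (for example `always madvise [never]`). The function names are hypothetical, and a fuller check of "completely disabled" would likely also look at the sibling `defrag` file, which is left out here.

```python
import re
from pathlib import Path

# Paths taken from the slide; which one exists depends on the distribution.
THP_PATHS = [
    Path("/sys/kernel/mm/redhat_transparent_hugepage/enabled"),
    Path("/sys/kernel/mm/transparent_hugepage/enabled"),
]


def active_thp_mode(contents):
    """Extract the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", contents)
    return match.group(1) if match else None


def thp_disabled():
    """Return True/False for the first THP file found, or None if none exists."""
    for path in THP_PATHS:
        if path.exists():
            return active_thp_mode(path.read_text()) == "never"
    return None


if __name__ == "__main__":
    print("THP disabled:", thp_disabled())
```

Running the same check from configuration management on every node, including newly added ones, addresses the "person independent" question raised on the previous slide.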
  • 24. Addressing: Cascading impact and cluster melt down
      Daunting Questions
      ⬢ Should I finalize upgrade?
      ⬢ What is right time to finalize?
      ⬢ How do I make sure it does not fall through the cracks?
      SmartSense Answer
      ⬢ Rule: hdfs_nn_finalize_upgrade
      ⬢ Checks HDFS health after upgrade
      ⬢ Evaluates how long HDFS has been running in un-finalized state
      ⬢ Reminds until it is finalized
  • 25. HDFS Upgrade finalization
      ⬢ Check NN UI / JMX for upgrade status
      ⬢ Do not finalize HDFS upgrade until
        – All files and blocks have been verified after upgrade
        – Critical jobs have been executed at least once after upgrade
      ⬢ Finalize between 2 - 7 days after upgrade
        hdfs dfsadmin -finalizeUpgrade
  • 26. Addressing: Overloaded cluster
      Daunting Questions
      ⬢ What is right container size for my cluster?
      ⬢ If I add additional components (HBase, Storm), how does the container size change?
      ⬢ How do container sizes change when I add new types of nodes in the cluster?
      ⬢ What’s the impact on container sizes if I add SSDs to the nodes?
      SmartSense Answer
      ⬢ Rules: yarn_container_size, mr_container_size, tez_container_size
      ⬢ Evaluates resources available on individual host (CPU, Memory, Disks, Running Services etc.)
      ⬢ Calculates technology specific container sizes (MR, Tez, Hive)
      ⬢ Continuously evaluates as the cluster dynamics change
  • 27. Container sizing
      ⬢ Identify resources (CPU, Memory, Disks) available on each node
      ⬢ Keep aside resources required for other processes (OS, DN, NM, HBase RS)
      ⬢ Calculate max possible containers for each resource (CPU, Memory, Disks)
        – CPU Containers: 4x cores
        – Disk Containers: (3x HDD + 10x SSD)
        – Memory Containers: (Available RAM / 2)
      ⬢ Number of containers = Min (CPU Containers, Disk Containers, Memory Containers)
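The container-count formula on this slide reduces to a minimum over three per-resource limits. The sketch below is only an illustration of that arithmetic, not the SmartSense rules. It assumes that "Available RAM" is expressed in GB after reserving memory for the OS and other services, so that dividing by 2 means roughly 2 GB per container; the slide does not state the unit. The function name and the sample node are hypothetical.

```python
def max_containers(cores, hdd_count, ssd_count, available_ram_gb):
    """Apply the slide's per-resource limits and return the most restrictive one."""
    cpu_containers = 4 * cores
    disk_containers = 3 * hdd_count + 10 * ssd_count
    memory_containers = available_ram_gb // 2   # assumption: RAM in GB
    return min(cpu_containers, disk_containers, memory_containers)


# Hypothetical worker node: 16 cores, 12 HDDs, no SSDs, 96 GB left for containers.
print(max_containers(cores=16, hdd_count=12, ssd_count=0, available_ram_gb=96))

# The same node after adding two SSDs: the disk limit rises, so another
# resource may become the bottleneck.
print(max_containers(cores=16, hdd_count=12, ssd_count=2, available_ram_gb=96))
```

Because the result is a minimum, changing one resource (for example adding SSDs) only helps until a different resource becomes the limit, which is why the slide asks for continuous re-evaluation.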
  • 28. Addressing: Accidental deletion of critical datasets
      Daunting Questions
      ⬢ Is HDFS trash enabled?
      ⬢ What is a safe trash interval?
      ⬢ How to prevent accidental deletion of critical data?
      SmartSense Answer
      ⬢ Rule: hdfs_trash_interval
        – Checks if trash is enabled
        – Validates if trash interval is within reasonable limits
      ⬢ Rule: hdfs_nn_protect_imp_dirs
        – New feature available in Hadoop 2.8
        – Helps you mark critical directories such as “/”, “/user”, “/user/apps/hive”, “/user/apps/hbase” etc. as delete protected
  • 29. HDFS Trash interval and directory protection
      ⬢ fs.trash.interval determines the number of minutes after which the trashed data gets deleted
        – 0 means trash disabled (data gets deleted immediately)
        – Keep it in the range 1440 (1 day) – 10080 (7 days)
        – Recommended 4320 (3 days)
      ⬢ fs.protected.directories specifies directories that will be delete protected
        – Available from Hadoop 2.8
        – List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp" etc.)
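The trash-interval guidance above is easy to express as a validation function. This is a minimal sketch of the thresholds on the slide, not the `hdfs_trash_interval` rule; the function name and the returned labels are hypothetical.

```python
MIN_TRASH_MINUTES = 1440      # 1 day
MAX_TRASH_MINUTES = 10080     # 7 days
RECOMMENDED_MINUTES = 4320    # 3 days


def check_trash_interval(minutes):
    """Classify an fs.trash.interval value against the slide's guidance."""
    if minutes == 0:
        return "disabled"     # deletes bypass trash entirely
    if minutes < MIN_TRASH_MINUTES:
        return "too short"
    if minutes > MAX_TRASH_MINUTES:
        return "too long"
    return "ok"


for value in (0, 360, RECOMMENDED_MINUTES, 43200):
    print(value, check_trash_interval(value))
```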
  • 30. Addressing: Hive query returning random results
      Daunting Questions
      ⬢ Is my cluster configured consistently?
      ⬢ How do I prevent such hard to analyze issues?
      ⬢ How do I make sure newly added nodes do not bring these types of issues?
      ⬢ How do I make these set ups person independent?
      SmartSense Answer
      ⬢ Rule: os_time_zone
        – Checks if all hosts have the same time zone
      ⬢ Rule: os_service_ntpd_on
        – Makes sure all host times are in sync
      ⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
  • 31. There are 250+ more such rules
      Operations
        – hdfs_dn_volume_tolerance, hdfs_dn_xceivers, hdfs_nn_handler_count, …
        – yarn_zk_quorum, yarn_nm_recovery, …
        – os_hostname_reverse_lookup, os_ssd_tuning, …
        – hive_mr_strict_mode, hive_datanucleus_cache, …
        – tez_am_heap, tez_shuffle_buffer, …
      Performance
        – ams_mc_distributed_configs, ams_mc_write_path, …
        – hbase_jvm_opts, hbase_rs_open_region_threads, hbase_tcp_nodelay, …
        – hdfs_dn_jvm_opts, hdfs_mount_options, hdfs_nn_dn_staleness_interval, …
        – hive_auto_convert_join, hive_disable_caching, hive_enable_cbo, …
      Security
        – hdfs_dn_volume_tolerance, hdfs_audit_log, hdfs_block_access_token, hdfs_enable_security_check, hdfs_nn_super_user_group, hdfs_zkfc_ha_acl, …
        – ranger_policy_refresh_interval, smartsense_2_way_ssl_enabled, …
        – yarn_ats_security, yarn_enable_acl, …
  • 32. There is more than just configurations
      ⬢ How do I show back / charge back my tenants?
      ⬢ Who are the top users of my platform?
      ⬢ What type of workloads are running on my cluster?
      ⬢ Which jobs have significant impact on my cluster?
      ⬢ How do I improve performance of key jobs?
      ⬢ What is a good time for maintenance?
  • 33. Activity Analysis
  • 34. Summary
      ⬢ There are many things involved in managing a Hadoop cluster
      ⬢ Best practices evolve and change across versions
      ⬢ What is optimal today may not be optimal tomorrow
      ⬢ Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments
      ⬢ SmartSense can significantly help avoid common mistakes, issues and pitfalls, and simplify Hadoop operations
  • 35. Let's keep your Hadoop cluster at its best! Thank You!
  • 36. Appendix
  • 37. More Resources
      ⬢ https://docs.hortonworks.com/index.html
      ⬢ http://hortonworks.com/products/subscriptions/smartsense/
      ⬢ http://hortonworks.com/info/smartsense/
      ⬢ http://hortonworks.com/blog/introducing-hortonworks-smartsense/
      ⬢ https://www.youtube.com/watch?v=IKulo9c8PjE
      ⬢ https://community.hortonworks.com/topics/smartsense.html
  • 38. SmartSense Bundle Security
      ⬢ All Bundles are Anonymized and Encrypted
      ⬢ Multiple built-in security measures
        – Ambari clear text passwords are not collected
        – Hive and Oozie database properties are not collected
        – All IP addresses and host names are anonymized
      ⬢ Extensible security rules
        – Exclude properties within specific Hadoop configuration files
        – Global REGEX replacements across all configuration, metrics, and logs
  • 39. SmartSense Stack Support (SmartSense 1.x)
      ⬢ HDP: 2.4, 2.3, 2.2, 2.1, 2.0
      ⬢ Ambari: 2.2 (Built-In!), 2.1 (Plug-In), 2.0 (Plug-In), 1.7, 1.6

Editor's Notes

  1. SmartSense bundles include configuration and metrics; bundles used for Support Case troubleshooting include configuration, metrics, and log files. This data is captured for the operating system of the cluster nodes, as well as for all of the installed HDP services. The capture process can be configured to exclude specific files, or specific Hadoop properties within HDP configuration files. To protect organization-specific data, such as customer IDs, patient IDs, and credit card numbers, we provide the capability to specify a regular expression whose matches are removed or replaced in any file that is captured by SmartSense. This protects sensitive data in the event that it has unintentionally leaked into log files. By default we remove all properties associated with clear text passwords. Ambari, Hive, and Oozie store DB credentials as clear text by default, unless they have been configured to encrypt them. In case Hadoop operators have not taken the time to do so, we exclude those properties by default.
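The regular-expression replacement described in this note can be illustrated with a short sketch. It does not show SmartSense's own rule format or API; the patterns, placeholder strings, and function name below are hypothetical and only demonstrate the general idea of applying global replacements to captured text before it leaves the cluster.

```python
import re

# Hypothetical scrubbing rules: (compiled pattern, replacement text).
SCRUB_RULES = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),          # IPv4-like addresses
    (re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"), "<card-number>"),  # 16-digit card-like numbers
]


def scrub(text):
    """Apply every rule in order and return the anonymized text."""
    for pattern, replacement in SCRUB_RULES:
        text = pattern.sub(replacement, text)
    return text


log_line = "client 10.0.0.12 paid with 4111-1111-1111-1111"
print(scrub(log_line))
```

Keeping the rules in a single ordered list makes it straightforward to add an organization-specific pattern (a customer ID format, for instance) without touching the scrubbing logic.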