Hadoop for sys_admin
1. Hadoop for System Administrators – Ohio Linux Fest 2014
Justin Miller
Senior Systems Engineer/DevOps at iHealth Technologies
Weston Bassler
Systems Engineer at Verizon Wireless
2. Hadoop for System Administrators – Ohio Linux Fest 2014
What we will be covering:
Intro
Why Hadoop?
How Hadoop Works
Architecture
Planning Hardware/Storage/Network
Processing and Storage
HDFS Components
YARN Components
Operations
Job scheduling
Job alerts
Monitoring
Core Services
Job scheduler and SLA
Hardware
High Availability
YARN
HDFS
Oozie
Security
Security Issues
Authentication
Authorization
Encryption
Backup and Recovery
What to plan for?
How to combat
Hadoop Vendors/Distros
Cloudera
Hortonworks
MapR
4. Hadoop for System Administrators – Ohio Linux Fest 2014
Why Hadoop? Cont...
Sort through TB, even PB worth of data in a matter of minutes
Easily sift through logs (patterns, data mining) → switch logs, application logs
Batch Processing
History → Inspired by two Google papers, on MapReduce and the Google File System (GFS)
Created by Doug Cutting and Mike Cafarella; much of the early development happened at Yahoo!
5. Hadoop for System Administrators – Ohio Linux Fest 2014
Who's using it?
6. Hadoop for System Administrators – Ohio Linux Fest 2014
How Hadoop?
Processing
• MapReduce (MRv1)
What is MapReduce?
Nobody likes it
• YARN (MRv2)
Yet Another Resource Negotiator
Newer, better, more versatile
2 New Roles → ResourceManager and ApplicationMaster
Spark → New Hotness
• Bringing Processing and Storage together
Data locality → avoid network!
“MO NODES MO BETTA”
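To make the "What is MapReduce?" bullet concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps as a word count. It only illustrates the programming model; it is not Hadoop's API, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the map calls run in parallel on the nodes that already hold the input blocks (the data locality point above), and only the grouped intermediate pairs cross the network.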
7. Hadoop for System Administrators – Ohio Linux Fest 2014
YARN in Action
8. Hadoop for System Administrators – Ohio Linux Fest 2014
Storage
• HDFS
What is HDFS?
Why HDFS?
• Components of HDFS
NameNode
Metadata → fsimage + edits (the edit log)
ZooKeeper → HA management
Quorum based journaling
3 JournalNodes
Active/Passive NameNode
DataNodes – what do they do?
Blocks in relation to NameNode Metadata
Block storage
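The relationship between NameNode metadata and DataNode block storage can be sketched as a toy block map. This assumes a 128 MB block size and a replication factor of 3, and it assigns replicas round-robin, which is a simplification and not HDFS's real rack-aware placement policy. The function and DataNode names are hypothetical.

```python
import math
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); configurable on a real cluster
REPLICATION = 3                 # replication factor used throughout the slides


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy block map: block index -> DataNodes holding a replica.

    The NameNode keeps this kind of mapping as metadata; the DataNodes
    hold the actual block bytes.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    ring = cycle(datanodes)  # simplification: round-robin instead of rack-aware placement
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(n_blocks)
    }
```

For example, a 300 MB file on four DataNodes becomes three blocks, each listed against three distinct nodes.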
9. Hadoop for System Administrators – Ohio Linux Fest 2014
HDFS Write Path
10. Hadoop for System Administrators – Ohio Linux Fest 2014
Benefits and Limitations of HDFS
Benefits
Low cost per byte → commodity storage
High Bandwidth/Scales effectively → “Mo nodes Mo speed”
Rock solid data reliability
Supports distributed computing I/O patterns
OPEN SOURCE!!!!!
11. Hadoop for System Administrators – Ohio Linux Fest 2014
Benefits and Limitations of HDFS (Continued...)
Limitations
Updates → data is immutable (can't be updated, only appended)
Write Once
Optimized for sequential reads → not for real-time data processing
Challenging import/export → requires additional tooling
12. Hadoop for System Administrators – Ohio Linux Fest 2014
Architecture
• Planning your Hardware/Storage
Cheap disks
Distributed disk approach → replication factor of 3 for HA
NO LVM, NO RAID, and NO swap
noatime, nodiratime
• Network considerations
Rack awareness affects data distribution
Prefer a faster network when available → 10 GbE if possible
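A rough way to see what a replication factor of 3 means for hardware planning: divide raw disk by the replication factor after setting some space aside. The sketch below is back-of-the-envelope arithmetic only; the 25% reserve for temporary and non-HDFS data is an assumption for illustration, not a figure from the talk.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3, reserve=0.25):
    """Estimate HDFS capacity left for user data, in TB.

    reserve is an assumed fraction of raw disk kept free for temporary
    and non-HDFS data; tune it for your own workload.
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve) / replication
```

With 10 nodes of 12 x 4 TB disks, 480 TB of raw disk works out to about 120 TB of usable space under these assumptions.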
13. Hadoop for System Administrators – Ohio Linux Fest 2014
Hadoop Operations
• Jobs
What is a job?
Scheduling jobs with Oozie
Alerts on Jobs
Oozie SLAs → Start time, end time & duration
File driven Job Configuration
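The three SLA criteria above (start time, end time, duration) boil down to three comparisons. The sketch below shows that logic in plain Python; it is not Oozie's API or configuration format, and the JobRun and Sla classes are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    name: str
    started: datetime
    ended: datetime


@dataclass
class Sla:
    latest_start: datetime   # job should have started by this time
    latest_end: datetime     # job should have finished by this time
    max_duration: timedelta  # job should not run longer than this


def sla_violations(run, sla):
    """Return the SLA criteria that this run missed, in a fixed order."""
    missed = []
    if run.started > sla.latest_start:
        missed.append("start")
    if run.ended > sla.latest_end:
        missed.append("end")
    if run.ended - run.started > sla.max_duration:
        missed.append("duration")
    return missed
```

An empty list means the run met its SLA; anything else is what an alert would report.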
14. Hadoop for System Administrators – Ohio Linux Fest 2014
Example of a Job:
Example of a coordinator:
15. Hadoop for System Administrators – Ohio Linux Fest 2014
Troubleshooting
• Application → Debug Code
16. Hadoop for System Administrators – Ohio Linux Fest 2014
• Job → Debug Execution
17. Hadoop for System Administrators – Ohio Linux Fest 2014
• Service → Debug Linux Process (/var/log/hadoop-*)
Services won't start → port conflicts (nmap, netstat, lsof)
# Neither an application nor a job problem? Search the service logs:
grep -r ERROR /var/log/hadoop-*
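The same log search can be scripted when you want the hits grouped per file. This is a minimal sketch that assumes the service logs live under directories matching /var/log/hadoop-*, as on the slide; the function names are made up for this example.

```python
import glob
import os


def iter_log_files(pattern):
    """Yield every file that matches the pattern or sits below a matching directory."""
    for match in glob.glob(pattern):
        if os.path.isfile(match):
            yield match
        else:
            for dirpath, _dirs, files in os.walk(match):
                for name in files:
                    yield os.path.join(dirpath, name)


def find_errors(pattern, needle="ERROR"):
    """Map each log file to the lines in it that contain the needle."""
    hits = {}
    for path in iter_log_files(pattern):
        # errors="replace" keeps the scan going if a log contains odd bytes
        with open(path, errors="replace") as handle:
            lines = [line.rstrip("\n") for line in handle if needle in line]
        if lines:
            hits[path] = lines
    return hits
```

Calling find_errors("/var/log/hadoop-*") returns a dictionary keyed by log path, which makes it easy to see which service is complaining.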
18. Hadoop for System Administrators – Ohio Linux Fest 2014
Monitoring
• Core Services
HDFS
YARN
JMX → JVM Monitoring
Cloudera Manager
• Performance
Ganglia (Hortonworks)
Cloudera Manager
• Hardware → to each his own (traditional monitoring)
SNMP
Nagios
Zenoss
Cloudera Manager
19. Hadoop for System Administrators – Ohio Linux Fest 2014
High Availability
• HDFS
ZooKeeper → quorum based journaling
• YARN
ZooKeeper
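Quorum-based setups such as the three JournalNodes mentioned earlier rely on simple majority arithmetic, which is why odd node counts are the usual choice. A minimal sketch of that arithmetic (the function names are illustrative only):

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority can still agree."""
    return nodes - quorum_size(nodes)
```

Three nodes tolerate one failure, and so do four; the fourth node adds cost without adding fault tolerance, while five nodes tolerate two.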
21. Hadoop for System Administrators – Ohio Linux Fest 2014
Security (Because people are evil)
22. Hadoop for System Administrators – Ohio Linux Fest 2014
Security Continued....
• Known issues – Stupid/Lazy People
Hadoop can be very secure
• Authentication - Kerberos
Principal (user)
Realm (group of principals)
Keytab file
• Authorization
LDAP
Active Directory
Role based
• Encryption – For your eyes Only!
Kerberos 1st
SSL Certificates
**** SSL must be enabled for all core Hadoop services
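For the Kerberos bullets above, it helps to see how a principal name breaks down. Principals conventionally look like primary/instance@REALM, where the instance part is optional. The sketch below only splits the string; it does not talk to a KDC or read a keytab, and the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str             # user or service name, e.g. "hdfs"
    instance: Optional[str]  # often the host for service principals
    realm: str               # the Kerberos realm the principal belongs to


def parse_principal(text):
    """Split 'primary/instance@REALM' into its parts; the instance is optional."""
    name, sep, realm = text.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError("expected primary[/instance]@REALM, got %r" % text)
    primary, _, instance = name.partition("/")
    return Principal(primary, instance or None, realm)
```

A service principal such as hdfs/nn1.example.com@EXAMPLE.COM (an invented example) splits into primary "hdfs", instance "nn1.example.com", and realm "EXAMPLE.COM".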
23. Hadoop for System Administrators – Ohio Linux Fest 2014
Backup and Recovery – When things go wrong (And they will)
What can go wrong? What to plan for?
Data Corruption
Node crashes
Disk crashes
Ways to combat when things do go wrong
• Data Corruption
Block checksum fails → NameNode replaces the bad replica with a fresh copy from a healthy one
HDFS → hdfs fsck tool
• Node crashes/Disk crashes
HDFS saves the day!
NameNode HA
First 2 replicas of data on different hosts
Heartbeat detection
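The data corruption case can be illustrated with a checksum comparison across replicas. This is a sketch of the idea only: zlib's CRC32 stands in for whatever checksum the cluster actually uses, and the replica dictionary and DataNode names are invented for the example.

```python
import zlib


def checksum(data):
    """CRC32 of a block's bytes (a stand-in for the cluster's real checksum)."""
    return zlib.crc32(data)


def find_corrupt_replicas(replicas, expected_crc):
    """Return the DataNodes whose copy of the block fails the checksum.

    replicas maps a DataNode name to the bytes it holds for one block.
    """
    return [node for node, data in replicas.items() if checksum(data) != expected_crc]


def healthy_source(replicas, expected_crc):
    """Pick any replica that still verifies, to copy from when repairing."""
    for node, data in replicas.items():
        if checksum(data) == expected_crc:
            return node
    return None
```

With three replicas, one bad copy still leaves two good sources to re-replicate from, which is the practical payoff of the replication factor.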
24. Hadoop for System Administrators – Ohio Linux Fest 2014
Hadoop Wars - Vendors and Distributions
• Cloudera
Specializes in Enterprise tools
Auditing
Access Control
Cluster Management (Cloudera Manager)
• Hortonworks
Specializes in Engineering
Also Open Source
Top new cool things
• MapR
Lead developers began Mahout
25. Hadoop for System Administrators – Ohio Linux Fest 2014
Hopefully you enjoyed!
If interested:
Quick Ways to get started Learning Hadoop
• Free Stuff – Who doesn't like free?
Big Data University – Hadoop fundamentals, Pig, Oozie, lots more
Udacity – Intro to Hadoop and MapReduce
MapR, Cloudera, Hortonworks – Training Videos