7. 2017 Software Architecture Summit
Data Ingestion - Logging
Singer
- Logging Agent for Kafka
- Decouple logging and kafka
- Failure Isolation
- Buffering, Pooling
Merced
- Log Persistence and Transportation Service
- Kafka -> Storage Destinations
- S3, HBase and SQS
- Auto rebalancing, failure handling
https://hortonworks.com/apache/kafka/
8. 2017 Software Architecture Summit
Data Ingestion - Full DB Dump
Tracker
â Daily db dump: Hadoop Streaming with mysqldump
â Hadoop node <-> MySQL node traffic pains
â Balancing & Throttling
â Performance pains
â Remote mysqldump
â Python streaming
9. 2017 Software Architecture Summit
Data Ingestion - Inc DB Dump
WaterMill
â Maxwell tails MySQL bin logs and writes to Kafka
â Tail bin logs and write to kafka as json
â Mini-batch job to apply updates on base version
â Distribute records by db shard id
â Small (update) table join Big (base) table
â Alias table points to the latest snapshot
17. 2017 Software Architecture Summit
Hive - Scheduled Query
INSERT OVERWRITE DIRECTORY 's3://hive_query_output/cluster_name/20171019/application_1508263025827_2301/0'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '001'
COLLECTION ITEMS TERMINATED BY '002'
MAP KEYS TERMINATED BY '003'
LINES TERMINATED BY 'n'
STORED AS TEXTFILE
SELECT country, count(*)
FROM db_users
GROUP BY country
SELECT country, count(*)
FROM db_users
GROUP BY country
Rewrite
â Query is submitted by System
â Based on Hive CLI
â Talking to MetaStore (MySQL)
â YARN REST API
â DistShell: AppMaster + Worker
â Hive CLI as pure shell command
â Get stdout/stderr from NM REST API
â Rewrite select query to insert to s3
â Container log is not stable
18. 2017 Software Architecture Summit
Hive - AdHoc Query
â Query is Submitted by Human
â Through beeline/tabeau etc
â HiveServer2 is the Query Submission Service
â via JDBC/ODBC
â Zookeeper based HA
â Talking to Hive MetaStore Server
â HiveServer2 is a shared resource
â LDAP for Authentication
â Impersonation for 3rd party tool
â Hive hook to Audit all queries
19. 2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
â Kind of â2-phase commitâ protocol
â Phase 1 - each task attempt will write to a unique tmp output path
hdfs://${staging_dir}/${query_id}/
_task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id}
_tmp.-ext-${stage_id}/${task_id}_${attempt_id}
Other Hive Operator
FileSinkOperator
process()
closeOp()
âŠâŠâŠ...
MapRed Task
Rename
Write
20. 2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
â Kind of â2-phase commitâ protocol
â Phase 2 - stage controller (HiveDriver) rename query temp to final path
hdfs://${staging_dir}_${query_id}/
_task_tmp.-ext-${stage_id}/_tmp.${task_id}_${attpt_id}
_tmp.-ext-${stage_id}/
${task_id}_0
${task_id}_1
${task_id}_2
${task_id}
hdfs://${staging_dir}_${query_id}/_tmp.-ext-${stage_id}
hdfs://hive_db_name/tab_name/dt=20171026
Other Hive Stage
Output Stage
jobCloseOp()
mvFileToFinalPath
âŠâŠâŠ...
HiveDriver
Invoke
21. 2017 Software Architecture Summit
Hive - Cloud Integration
FileSinkOperator - Where S3 doesn't fit
â Fast & Atomic file/dir rename operation is the key
â Unfortunately, rename on s3 is copy & delete
22. 2017 Software Architecture Summit
Hive - Direct Committer @ S3
Direct Output Committer
â Each task attempt writes to final output directly
â {target folder}/{task_id}
â Consequences
â Query Level Failure -> partial output in target folder
â Cleanup target folder in the very beginning of each query
â Disable INSERT INTO (using INSERT OVERWRITE Q1 UNION ALL Q2)
â Task Level Failure -> FileAlreadyExistsException
â File open mode CREATE -> OVERWRITE (ORC & Parquet)
â Speculative Execution -> corrupted task output
â Cause: outWriter.close() in killed task attempt
â S3A write to local disk first and then upload to s3 in close()
â Solution: Disable speculative execution
23. 2017 Software Architecture Summit
Hive - Speculative Exec @ S3
Direct Committer with Speculative Execution
â S3A Staging Committer (originated from Netflix)
â Based on S3 multipart uploading 1
â Decouple upload & commit, simulate write to tmp & rename to final
â Being actively worked in Apache: HADOOP-13786
â S3A Hive Committer (originated from Pinterest, on-going)
â Ignore the MapReduce commit protocol
â Allow different task attempts to commit at the same time
â Stage controller (HiveDriver) dedup multiple attempts for the same task
24. 2017 Software Architecture Summit
Hive - Dynamic Partition @ S3
New Challenges
â Target folder path is only known at runtime
â Race condition among multiple FileSinkOperators
â Discover new dynamic partitions to update Metastore
Solutions
â Encode the ts into output file name: {target_part_folder}/{dpCtx.queryStartTs}_{task_id}
â New dynamic partition: folders that
â Modification time > query start time
â Contains files starting with query_start_ts
28. 2017 Software Architecture Summit
Presto - Feature Challenges
Concurrency Bug in ObjectInspector library
â Hive runtime is for single thread env
â Presto is highly concurrent and shared
Thrift Table Support
â Presto Hive MetaStore client doesnât support âdynamicâ table
â Deeply nested table will hang in planning phase
Schema Evolution Support
â Historical partition schema mismatch
â Small int -> big int
â Parquet table column rename
29. 2017 Software Architecture Summit
Presto - Security Enhancement
Access Control
â Check LDAP group for proxy user for access rights
â Data Level Access is controlled by IAM role
Security Features
â Authentication
â Integrated middle service with Impersonation
â Audit
â Every Query is Logged
â Plugin based on EventListener
30. 2017 Software Architecture Summit
Presto - Cluster Operations
Separate AdHoc & Application Cluster
â Adhoc cluster is generally less stable
â App cluster only run well known queries
Bad Query Monitor
â Detected by bot based on runtime and splits #
â Kill the query and send slack message to end user
Bad Node Monitor
â Detected by bot based on query failure due to node issue stats
â Bad node - 3 query failed on it in last 5 min
â Specially important when the cluster is scaled up & down on a daily basis