12. Features
Source:
• Hadoop File System
• SFTP
• Apache server filer
• Hive
• …
Converter (byte-level stream transformations):
• Encrypt / Decrypt
• (Un)Gzip
• Untar
Target:
• Atomic publishing
• Data availability notification
• Hive registration
• File deletion / sync
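The byte-level converters listed on this slide can be pictured as wrappers chained around a raw input stream. The following is a minimal sketch in Python using only the standard library, not Gobblin's converter API. The function names (`gunzip_stream`, `untar_members`, `apply_converters`) are hypothetical, and encryption is left out because it would need a third-party library.

```python
import gzip
import tarfile

def gunzip_stream(raw):
    """Wrap a binary stream so that reads return decompressed bytes."""
    return gzip.GzipFile(fileobj=raw, mode="rb")

def untar_members(raw):
    """Yield (name, content) for each regular file in a tar stream."""
    # Mode "r|" reads the archive sequentially, so no seeking is required.
    with tarfile.open(fileobj=raw, mode="r|") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()

def apply_converters(raw, converters):
    """Apply converters in order, each one wrapping the previous output."""
    stream = raw
    for convert in converters:
        stream = convert(stream)
    return stream
```

Calling `apply_converters(stream, [gunzip_stream, untar_members])` on a `.tar.gz` stream yields the archived files one by one without writing the intermediate tar to disk.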
13. File Sets
A file set is the atomic unit of a Distcp copy. A single
dataset can be split into multiple file sets
1. All-or-nothing publish*
2. Isolation: failed file set does not affect other file sets
3. Event emitted on publish per file and file set
* Best-effort. Future: use a write-ahead log for a stronger guarantee.
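One way to picture the all-or-nothing and isolation properties is a staging directory per file set that is renamed into place only after every file has been copied. The sketch below assumes this and uses the local filesystem. The helper names and the staging layout are hypothetical, not Gobblin's implementation, and the single rename is only as atomic as the underlying filesystem makes it.

```python
import os
import shutil

def publish_file_set(staged_dir, final_dir):
    """Move a fully copied file set into place with a single rename."""
    os.makedirs(os.path.dirname(final_dir), exist_ok=True)
    os.rename(staged_dir, final_dir)

def copy_file_sets(file_sets, staging_root, target_root, copy_file):
    """Copy each file set independently.

    file_sets maps a file set name to a list of source paths.
    Returns (published_names, failed_names).
    """
    published, failed = [], []
    for name, sources in file_sets.items():
        staged = os.path.join(staging_root, name)
        try:
            os.makedirs(staged)
            for src in sources:
                copy_file(src, os.path.join(staged, os.path.basename(src)))
            # Publish only after every file in the set was copied.
            publish_file_set(staged, os.path.join(target_root, name))
            published.append(name)
        except OSError:
            # Isolation: discard this file set and continue with the others.
            shutil.rmtree(staged, ignore_errors=True)
            failed.append(name)
    return published, failed
```

A file set with one missing source file ends up in `failed_names` and leaves nothing in the target directory, while the remaining file sets publish normally.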
14. Smart file limits
Limit the number of files copied in a single run
1. File sets are never split
2. Soft limit: stop starting new file sets; file sets
already running can finish
3. Hard limit: do not accept any more files
4. Prioritize file sets (Future)
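The interaction of the two limits can be sketched as a simple admission loop. This is one possible reading of the slide, not Gobblin's actual logic: the function name `select_file_sets` and the rule of skipping a file set that would cross the hard limit are assumptions.

```python
def select_file_sets(file_sets, soft_limit, hard_limit):
    """Choose which file sets to copy in this run.

    file_sets is an iterable of (name, file_count) pairs.
    Returns (admitted_names, total_files).
    """
    admitted, total = [], 0
    for name, count in file_sets:
        if total >= soft_limit:
            # Soft limit reached: start no further file sets.
            break
        if total + count > hard_limit:
            # File sets are never split, so a set that would cross
            # the hard limit is skipped as a whole.
            continue
        admitted.append(name)
        total += count
    return admitted, total
```

With file sets of 40, 70, 30 and 10 files, a soft limit of 60 and a hard limit of 100, the first and third sets are admitted (70 files): the second would cross the hard limit, and the fourth is never started because the soft limit has been passed.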
15. Unpublished File Persistence
1. Files that were copied successfully but not published
are persisted in a private directory (file set failure,
permission failure, etc.).
2. A future run identifies persisted files and reuses them
instead of re-copying.
3. Time-based automatic retention on the persist directory.
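The three points above can be sketched as a small persist-and-reuse cache on the local filesystem. Everything named here is hypothetical: the slide does not say how persisted files are identified, so the key built from path, length and modification time is an assumption, as is the age-based purge rule.

```python
import hashlib
import os
import shutil
import time

def persist_key(src_path, length, mtime):
    """Hypothetical identity of a source file: path, length, modification time."""
    return hashlib.sha256(f"{src_path}|{length}|{mtime}".encode()).hexdigest()

def persist(persist_dir, key, staged_path):
    """Keep a copied-but-unpublished file for a later run."""
    os.makedirs(persist_dir, exist_ok=True)
    shutil.move(staged_path, os.path.join(persist_dir, key))

def fetch(persist_dir, key, dest_path, copy_from_source):
    """Reuse a persisted copy if present, else copy again. Returns True if reused."""
    cached = os.path.join(persist_dir, key)
    if os.path.exists(cached):
        shutil.move(cached, dest_path)
        return True
    copy_from_source(dest_path)
    return False

def purge_expired(persist_dir, max_age_seconds, now=None):
    """Time-based retention: delete persisted files older than max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(persist_dir):
        path = os.path.join(persist_dir, name)
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

Including the modification time in the key means a source file that changed after the failed run no longer matches the persisted copy and is copied again.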
20. Hive Copy - Numbers
100+ tables
3000+ partitions
20,000+ new files per hour
2TB+ new data per hour
File listing 30k files: < 30s
Copy 30k files, 5TB: ~20 min
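As a rough sanity check, the copy figures above translate into aggregate throughput as follows. The calculation assumes decimal units (1 TB = 10^12 bytes) and takes "~20 min" as exactly 20 minutes; the slide specifies neither.

```python
TB = 10**12  # assumption: decimal terabyte, the slide does not say TB vs TiB

files = 30_000
data_bytes = 5 * TB
copy_seconds = 20 * 60  # "~20 min" taken as exactly 20 minutes

throughput_gb_s = data_bytes / copy_seconds / 10**9
files_per_second = files / copy_seconds
avg_file_mb = data_bytes / files / 10**6

print(f"{throughput_gb_s:.2f} GB/s, {files_per_second:.0f} files/s, {avg_file_mb:.0f} MB/file")
# prints "4.17 GB/s, 25 files/s, 167 MB/file"
```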
21. Current bottlenecks
Work unit serialization
• ~100 work units / second
Bad nodes in Hadoop cluster
• Need speculation
Serial publishing of file sets
• Solution in progress
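To see why work unit serialization matters, consider a back-of-the-envelope estimate. It assumes one work unit per file and reuses the 30k-file, ~20-minute copy from the numbers slide. Both are assumptions for illustration; the slides do not state how files map to work units.

```python
def serialization_seconds(work_units, units_per_second=100):
    """Time spent serializing work units at the quoted ~100 units/second."""
    return work_units / units_per_second

copy_seconds = 20 * 60                    # ~20 min copy from the numbers slide
overhead = serialization_seconds(30_000)  # assumes one work unit per file
share = overhead / copy_seconds

print(f"{overhead:.0f} s of serialization, {share:.0%} of the copy time")
# prints "300 s of serialization, 25% of the copy time"
```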
22. Gobblin Distcp vs ReAir
ReAir: Hive warehouse data replication (Airbnb)
Offers batch and incremental replication
Gobblin Distcp:
• File listing and modification times for incremental changes
• Portable Gobblin job (MR, thread based, Helix)
• Same framework can copy non-Hive data
ReAir:
• MySQL and audit log hook store for incremental changes
• MR job
• Monitoring / Web UI (in progress for Gobblin)
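Detecting incremental changes from file listings and modification times can be sketched as a comparison of two listings. This is a simplified illustration, not Gobblin's actual comparison rule: the function name `files_to_copy` and the exact criteria (missing at target, different length, or older at target) are assumptions.

```python
def files_to_copy(source_listing, target_listing):
    """Return the paths that need to be (re)copied.

    Each listing maps a relative path to (length, modification_time).
    """
    changed = []
    for path, (length, mtime) in sorted(source_listing.items()):
        existing = target_listing.get(path)
        if existing is None:
            changed.append(path)   # new file
        elif existing[0] != length or existing[1] < mtime:
            changed.append(path)   # modified since the last copy
    return changed
```

This approach needs nothing but the two file listings, whereas a hook-based design records each change in an external store as it happens.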
Explain: the copy configuration encapsulates the job configuration (preserve attributes, targetfs, target directory) as well as a copy context holding global objects (e.g. a file status cache).
File set is optional
This is all that is needed for a copy
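The copy configuration described in these notes might be modeled roughly as below. This is an illustrative sketch only: the class and field names (`CopyConfiguration`, `CopyContext`, `file_status_cache`, and so on) are hypothetical and are not taken from the Gobblin code base.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class CopyContext:
    """Global objects shared across a copy job, e.g. a file status cache."""
    file_status_cache: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def file_status(self, path, lookup):
        """Return the cached status for path, calling lookup only on a miss."""
        if path not in self.file_status_cache:
            self.file_status_cache[path] = lookup(path)
        return self.file_status_cache[path]

@dataclass(frozen=True)
class CopyConfiguration:
    """Job-level settings plus the shared copy context."""
    target_fs: str
    target_dir: str
    preserve_attributes: frozenset = frozenset()
    copy_context: CopyContext = field(default_factory=CopyContext)
```

Keeping the cache in a shared context means repeated status lookups for the same path reach the file system only once per job.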