
Gobblin Meetup: What's New in 0.7

Learn how Gobblin is evolving from a data ingest tool into a big data management framework.

Published in: Software

  1. Gobblin: What’s new? Vasanth Rajamani, Chavdar Botev
  2. About
       • Vasanth Rajamani: Manager, ETL Infrastructure, LinkedIn
       • Chavdar Botev: Tech Lead, ETL Infrastructure, LinkedIn
  3. Gobblin for Data Ingest (diagram)
       • Data sources: streaming events, OLTP snapshots, OLTP changelog, cloud services
       • Systems and protocols shown: Kafka, JDBC, REST, SOAP, HDFS, SFTP
  4. A Peek in Our Support List: Beyond the Data Ingest
       1. Monitoring: When and how often is the data made available?
       2. Replication: Can you also copy this data onto these other Hadoop clusters?
       3. Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
       4. Optimization: Can you provide certain datasets in a more optimal format like ORC?
       5. Compaction: Can you guarantee that the data doesn’t have duplicates?
       6. Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?
  5. Beyond Data Ingest (diagram)
       • Sources: site-facing clusters (Oracle, Espresso, Kafka, MySQL) and external sources
       • Flow labels: HDFS, Ingest, Replication, Data Load
       • ETL clusters: Monitoring; Retention; Optimization (format, layout, compaction); Auditing; Compliance
       • Prod clusters: Monitoring; Retention; Optimization; Auditing; Compliance
       • Dev clusters: Monitoring; Retention; Optimization; Auditing; Compliance
  6. Data Lifecycle Management: The Next Frontier
  7. Data Lifecycle Management: managing the flow of systems’ data and metadata throughout its life cycle, from creation and receipt through distribution and maintenance to deletion.
  8. Hadoop Data Lifecycle Management at LinkedIn
       • Data and metadata
          • 10K+ datasets
          • Dataset auto-discovery
          • Ownership across many teams
       • Systems
          • Multiple loosely coupled systems
          • Ownership across multiple teams
       • Systems and data evolve independently over time
  9. Hadoop Data Lifecycle Management with Gobblin
  10. Datasets
       • Ubiquitous
       • Heterogeneous
       • Common
          • Dataset URI, e.g. /data/tracking/<TOPIC>, /data/databases/<DATABASE>/<TABLE>
          • Metadata
  11. Dataset Operators
       • Ingest
       • Replication
       • Retention management
       • Data deduping
       • …
       • Different implementations possible
  12. Metadata
       • Ubiquitous
       • Heterogeneous
       • Common
          • Associated with a Dataset URI
          • Can be represented as a collection of K/V pairs
       • Metadata in Gobblin
          • Input: Dataset configuration
          • Output: Metrics and tracking events
  13. Orchestration
       • Dataset operators: independent actors
          • Ingest is unaware of replication and vice versa
       • Interaction through shared state
          • Ingest lands a dataset in a data directory
          • Replication copies all datasets in the directory
          • Retention runs on all datasets in the directory
       • Datasets and metadata: the common language
  14. How About Falcon?
       • Top-down approach
       • Tight coupling: centralized repository for feeds (datasets) and processes
       • Not designed for multi-tenancy
       • Lack of dataset auto-discovery
       • Lack of policies
       • Inflexible flows
  15. Conclusion
       • Data lifecycle management
          • It’s more than just ingest
       • Loosely coupled systems
          • Flexible processing is a must for growth
       • Dataset-centric processing
          • Think about datasets, not jobs
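
The orchestration model in slides 10 to 13 (independent dataset operators that interact only through a shared data directory, with datasets identified by URI and metadata carried as K/V pairs) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Gobblin's actual API: the `Cluster` class, the three operator functions, the dataset names `PageViewEvent` and `Members/Profile`, and the configuration and metric keys such as `retention.days` are all hypothetical names chosen for illustration. Only the two URI patterns come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# URI patterns modelled on the examples from slide 10; '*' stands for one path segment.
DATASET_PATTERNS = ["/data/tracking/*", "/data/databases/*/*"]


def is_dataset(path: str) -> bool:
    """Return True if the path matches a dataset pattern, segment by segment."""
    parts = path.strip("/").split("/")
    for pattern in DATASET_PATTERNS:
        want = pattern.strip("/").split("/")
        if len(parts) == len(want) and all(w in ("*", p) for p, w in zip(parts, want)):
            return True
    return False


@dataclass
class Cluster:
    """Shared state (hypothetical): dataset URI -> ages in days of its partitions."""
    name: str
    data: Dict[str, List[int]] = field(default_factory=dict)

    def datasets(self) -> List[str]:
        """Auto-discovery: every stored path that looks like a dataset URI."""
        return sorted(uri for uri in self.data if is_dataset(uri))


def ingest(cluster: Cluster, uri: str, age_days: int) -> None:
    """Land one partition in the data directory; unaware of the other operators."""
    cluster.data.setdefault(uri, []).append(age_days)


def replicate(source: Cluster, target: Cluster) -> Dict[str, str]:
    """Copy every discovered dataset to the target; return metrics as K/V pairs."""
    for uri in source.datasets():
        target.data[uri] = list(source.data[uri])
    return {"replication.datasets": str(len(source.datasets()))}


def retain(cluster: Cluster, config: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Purge partitions older than each dataset's configured limit."""
    purged = 0
    for uri in cluster.datasets():
        limit = config.get(uri, {}).get("retention.days")  # hypothetical key
        if limit is None:
            continue  # no policy configured for this dataset: keep everything
        kept = [age for age in cluster.data[uri] if age <= int(limit)]
        purged += len(cluster.data[uri]) - len(kept)
        cluster.data[uri] = kept
    return {"retention.partitions.purged": str(purged)}


etl, prod = Cluster("etl"), Cluster("prod")
ingest(etl, "/data/tracking/PageViewEvent", 1)
ingest(etl, "/data/tracking/PageViewEvent", 40)
ingest(etl, "/data/databases/Members/Profile", 5)
ingest(etl, "/tmp/scratch", 1)  # not a dataset URI, so discovery skips it

replicate(etl, prod)
metrics = retain(prod, {"/data/tracking/PageViewEvent": {"retention.days": "30"}})
print(prod.data)
print(metrics)
```

None of the operators calls another. Replication and retention find their work by discovering dataset URIs in the shared directory, and per-dataset behaviour comes from configuration keyed by URI, which is the sense in which datasets and metadata act as the common language.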
