
Extending Twitter's Data Platform to Google Cloud




Twitter's Data Platform is built from multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.


More from DataWorks Summit


Extending Twitter's Data Platform to Google Cloud

  1. Extending Twitter's Data Platform to Google Cloud. Lohit VijayaRenu, Vrushali Channapattan
  2. Data Platform @Twitter (diagram: Oxpecker, Roneobird, Data Access Layer, ETL Pipelines)
  3. Why Cloud? - A convenient way to test Hadoop changes at scale - Rapidly grow/shrink capacity temporarily - A broader geographical footprint for locality and business continuity - Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow, etc.
  4. Partly Cloudy: a project to extend data processing at Twitter from an on-premises-only model to a hybrid on-premises and cloud model
  5. Before Partly Cloudy
  6. Partly Cloudy
  7. Design considerations. User Experience: consistency in user experience for on-premises and in-cloud data processing. Scalability: ability to scale out to handle all datasets and all users from day 1. Onboarding: seamless onboarding experience. New Avenues: data access from new processing tools in the cloud.
  8. Design principles. Authentication: strong authentication for all user and service access to data. Authorization: explicit authorization for all user and service access to data; least-privileged access. Audit: ability to easily determine who performed what actions on the data.
  9. Workstreams ● Various focus areas across the tech stack ○ Networking ○ GCP config ○ Replication ○ Data Processing Tools ○ Internal services ● Collaboration across teams within Twitter ● Collaboration with Google
  10. Partly Cloudy Data Replication: sync datasets to GCS
  11. Data Infrastructure for Analytics (diagram: Hadoop clusters with a Data Access Layer, Replication Service and Retention Service in each data center)
  12. Extending Replication to GCS (diagram: Hadoop clusters in DataCenter 1 and DataCenter 2 replicating to GCS) ● Same dataset available on GCS for users ● Unlocks Presto on GCP, Hadoop on GCP, BigQuery and other tools
  13. Copy (diagram: the Replicator for ClusterY looks up the DAL dataset "partly-cloudy" (source /ClusterX/logs/partly-cloudy, destination /ClusterY/logs/partly-cloudy) and runs a Distcp job that copies hourly partition 2019/04/10/03 from the source cluster to the destination cluster)
  14. Copy + Merge (diagram: for a dataset of type "Multiple Src" with sources /ClusterX-1/logs/partly-cloudy and /ClusterX-2/logs/partly-cloudy, the Replicator for ClusterY runs one Distcp job per source for partition 2019/04/10/03 and then merges the copies into /ClusterY/logs/partly-cloudy/2019/04/10/03)
  15. Architecture behind GCS replication (diagram: in the Twitter DataCenter, the GCS Replicator looks up the DAL dataset "partly-cloudy" (source /ClusterX/logs/partly-cloudy, destination /gcs/logs/partly-cloudy) and runs Distcp on a dedicated Copy Cluster to copy partition 2019/04/10/03 to GCS)
  16. Merge same dataset on GCS (diagram: Copy Clusters in Twitter DataCenters X-1 and X-2 each run Distcp from their source cluster (/ClusterX-1/logs/partly-cloudy and /ClusterX-2/logs/partly-cloudy, partition 2019/04/10/03) into the same multi-region Cloud Storage bucket)
  17. Dataset via EagleEye ● View the different destinations for the same dataset ● GCS is another destination ● Also shows the delay for each hourly partition
  18. Partly Cloudy Resource Hierarchy: organization and project structure
  19. Partly Cloudy Resource Hierarchy (diagram: the TWITTER Org contains a DATA INFRA folder, which contains GCP projects such as twitter-product, twitter-revenue and twitter-infraeng)
  20. Project contents (diagram: each GCP project holds a dataset bucket, user bucket, scratch bucket and scrubbed bucket; name nodes, worker nodes, the Resource Manager and tasks reach them through the ViewFS filesystem layer and the Google Cloud Storage Connector for Hadoop, using shadow-account-based and user-account-based access)
  21. Replicators per project (diagram: in the Twitter DataCenter, a shared Copy Cluster runs a dedicated Replicator per GCP project (X, Y, Z); each Replicator schedules Distcp jobs that write partitions such as /gcs/dataX/2019/04/10/03 to that project's Cloud Storage)
  22. Partly Cloudy Resource Hierarchy: storage in the cloud
  23. Path mapping: the on-premises path /dc1/cluster1/user/helen/some/path/part-001.lzo maps to the logical cloud path /gcs/user/helen/some/path/part-001.lzo, which resolves to the GCS bucket path gs://user.helen.dp.twitter.domain/some/path/part-001.lzo
  24. RegEx based path resolution. Twitter ViewFS mounttable.xml:

  <property>
    <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?<dataset>[^/]+)</name>
    <value>gs://logs.${dataset}</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?<userName>[^/]+)</name>
    <value>gs://user.${userName}</value>
  </property>

  The Twitter ViewFS path /gcs/logs/partly-cloudy/2019/04/10 resolves to the GCS bucket path gs://logs.partly-cloudy/2019/04/10, and /gcs/user/lohit/hadoop-stats resolves to gs://user.lohit/hadoop-stats.
  25. View FileSystem and Google Hadoop Connector (diagram: Twitter's View FileSystem spans namespaces across clusters in DataCenter-1 and DataCenter-2; the Replicator reaches Cloud Storage through the Cloud Storage Connector). Bucket on GCS: gs://logs.partly-cloudy. Connector path: /logs/partly-cloudy. Twitter resolved path: /gcs/logs/partly-cloudy
  26. Partly Cloudy Resource Hierarchy: user management
  27. Users & Accounts (diagram: a user holds UNIX Kerberos credentials and GSuite OAuth2 credentials, and is paired with a shadow account (GCP service account) that has a JSON key)
  28. Key Management - A new key is generated every N days - Each key is valid for 2N + N days - Keys are distributed to compute nodes by Twitter's key distribution service - The shadow account key is readable only by that user - Key management & distribution is transparent to the user
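The rotation window above can be sketched numerically; the function and the 30-day period are illustrative, not Twitter's actual values:

```python
from datetime import date, timedelta

def valid_keys(today, first_key_day, n=30):
    """Creation dates of shadow-account keys still valid today.

    A new key is generated every n days and each key is valid for
    2n + n days, so at most three keys overlap at any moment.
    """
    keys = []
    day = first_key_day
    while day <= today:
        # A key created on `day` expires 3n days later.
        if day + timedelta(days=3 * n) > today:
            keys.append(day)
        day += timedelta(days=n)
    return keys
```

The overlap is what makes rotation transparent: a compute node holding a key up to 3N days old can still authenticate while newer keys are distributed.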
  29. Partly Cloudy Resource Hierarchy (diagram: the DATA INFRA folder contains twitter-[org] projects and twitter-employee-users)
  30. How do the Data Processing Users at Twitter get to use Partly Cloudy? DemiGod Services
  31. What are DemiGod services? DemiGod is a group of services responsible for configuring GCP for Twitter's Data Platform. They run in GCP.
  32. Salient features of DemiGods - Run asynchronously of each other - Run with exactly-scoped, privileged Google service accounts - Idempotent runs - Puppet-like functionality: will override any manual changes - Modular in design - Each kept as simple as possible
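A minimal sketch of the idempotent, puppet-like behavior, with toy dicts standing in for GCP state (this is not the real DemiGod code):

```python
def reconcile(desired, actual):
    """One idempotent pass: make `actual` (live GCP state) match
    `desired` (checked-in config). Missing resources are created,
    manual drift is overridden, and a second run with the same
    inputs reports no changes."""
    changes = {"created": [], "reverted": []}
    for name, cfg in desired.items():
        if name not in actual:
            actual[name] = dict(cfg)       # create missing resource
            changes["created"].append(name)
        elif actual[name] != cfg:
            actual[name] = dict(cfg)       # override manual change
            changes["reverted"].append(name)
    return changes
```

Because each pass converges to the desired state and changes nothing when already converged, the services can safely run asynchronously of each other.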
  33. Deployment of DemiGods (diagram: a Partly Cloudy admin project runs the bucket-creation, shadow-user-creation, policy-granting and key rotation/creation services under per-org service accounts such as svc-acc-ie and svc-acc-product, configuring the Twitter infra eng, product and user projects; the services rely on a key/secrets store, LDAP/Google Groups and a GCS config bucket)
  34. What do the Data Processing Users at Twitter get? ❏ Datasets replicated on GCS ❏ A shadow account to access GCS ❏ GCS buckets for their scratch & scrubbed data ❏ Access to a Twitter managed Hadoop cluster in GCP ❏ Access to a Twitter managed Presto cluster in GCP ❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)
  35. Where are we today ● Copied tens of petabytes of data and keeping them in sync ● Tens of different projects with hundreds of buckets ● A complex set of VPC rules ● Hundreds of users using GCP ● Unlocked multiple use cases on GCP
  36. Thank you! Hiring: https://careers.twitter.com Tweet @TwitterHadoop

Editor's notes

  • To transfer data from on-premises to GCS
    Runs only YARN for GCS transfer, no local data
    Security
    Minimal in-DC hosts connect to GCS
    Networking
    Dedicated high bandwidth
    Requires separate dedicated configuration for routing to public endpoints
    Each worker node has two IP addresses
    Our DC IPs come from RFC private address space that can't be used on the public Internet
    GCS traffic uses the public IP
    Internal traffic (reading from the cluster, observability, puppet, etc.) uses the internal IP
  • Data is identified by a dataset name
    HDFS is the primary storage for Analytics
    Users configure replication rules for different clusters
    Datasets also have retention rules defined per cluster
    Datasets are always represented as fixed-interval partitions (hourly/daily)
    Datasets are defined in a system called the Data Access Layer (DAL)
    Data is made available at different destinations using the Replicator
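The notes above can be sketched as a toy dataset entry; the field names are illustrative, not the real DAL schema:

```python
from datetime import datetime

# Hypothetical shape of a DAL dataset entry (names are illustrative).
dataset = {
    "name": "partly-cloudy",
    "partitioning": "hourly",
    "locations": {
        "ClusterX": "/ClusterX/logs/partly-cloudy",
        "ClusterY": "/ClusterY/logs/partly-cloudy",
        "gcs": "/gcs/logs/partly-cloudy",
    },
    # Retention rules are defined per cluster (days are made up here).
    "retention_days": {"ClusterX": 7, "ClusterY": 30, "gcs": 30},
}

def partition_path(ds, location, ts):
    # Datasets are always laid out on fixed-interval partitions.
    return "{}/{:%Y/%m/%d/%H}".format(ds["locations"][location], ts)
```

GCS then simply becomes one more location entry, which is how the same replication machinery extends to the cloud.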
  • Long-running daemon (on Mesos)
    The daemon checks configuration and schedules a copy per hourly partition
    Copy jobs are executed as Hadoop distcp jobs
    Jobs run on the destination cluster
    After the hourly copy, the partition is published to DAL
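One pass of the daemon described above might look like this sketch; the distcp and DAL calls are stand-in callables, not real APIs:

```python
def pending_partitions(src_partitions, dst_partitions):
    """Hourly partitions published at the source but not yet at the
    destination: what the daemon copies on its next pass."""
    return sorted(set(src_partitions) - set(dst_partitions))

def replicate_once(src_partitions, dst_partitions, run_distcp, publish_to_dal):
    # One pass of the long-running daemon: distcp each missing hourly
    # partition, then publish it to DAL so consumers can discover it.
    for part in pending_partitions(src_partitions, dst_partitions):
        run_distcp(part)        # e.g. a hadoop distcp job per partition
        publish_to_dal(part)    # partition becomes visible in DAL
        dst_partitions.append(part)
```

Publishing only after the copy completes means DAL never advertises a partition that is still in flight.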
  • Some datasets are collected across multiple DataCenters
    The Replicator kicks off multiple DistCp jobs to copy to a tmp location
    The Replicator then merges the dataset into a single directory and does an atomic rename to the final destination
    Renames on HDFS are cheap and atomic, which makes this operation easy
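The copy-then-merge-then-rename flow can be sketched with a toy dict standing in for HDFS:

```python
def copy_and_merge(fs, sources, final_path):
    """Multi-source flow: copy each DC's partition into a temporary
    location, merge, then do a single rename into the final path.
    `fs` is a toy dict standing in for HDFS; on real HDFS the rename
    is a cheap, atomic metadata operation, so readers never observe
    a partially copied partition."""
    tmp = final_path + "._COPYING"
    # Stand-in for the per-source DistCp jobs plus the merge step.
    fs[tmp] = [f for src in sources for f in fs[src]]
    fs[final_path] = fs.pop(tmp)   # atomic rename into place
    return fs[final_path]
```

The temporary-path suffix here is made up; the point is only that the final path appears in a single atomic step.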
  • Use the same Replicator code to sync data to GCS
    Utilize the ViewFileSystem abstraction to hide GCS
    /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03
    Use the Google Hadoop Connector to interact with GCS using Hadoop APIs
    Distcp jobs run on a dedicated Copy cluster
    Create a ViewFileSystem mount point on the Copy cluster to fake the GCS destination
    Distcp tasks stream data from source HDFS to GCS (no local copy)
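The mapping mentioned above can be sketched as follows; the bucket naming follows the example in the note, not the full mount table:

```python
def gcs_path(viewfs_path):
    """Map a logical ViewFS path like /gcs/<dataset>/2019/04/10/03
    to its backing bucket, gs://<dataset>.bucket/2019/04/10/03.
    In production this resolution happens inside ViewFileSystem,
    so the Replicator never sees gs:// paths directly."""
    prefix = "/gcs/"
    assert viewfs_path.startswith(prefix)
    dataset, _, rest = viewfs_path[len(prefix):].partition("/")
    return "gs://{}.bucket/{}".format(dataset, rest)
```

Hiding the gs:// scheme behind /gcs/ is what lets the unchanged Replicator treat GCS as just another destination cluster.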
  • Data for the same dataset is aggregated at multiple DataCenters (DC X and DC Y)
    Replicators in each DC schedule individual DistCp jobs
    Data from multiple DCs ends up under the same path on GCS
  • UI support via EagleEye to view all replication configurations
    Properties associated with a configuration: src, dest, owner, email, etc.
    CLI support to manage replication configurations
    Load new or modify existing configurations
    List all configurations
    Mark configurations active/inactive
    API support for clients and replicators
    Rich API access for all of the above operations
  • GCP projects are based on organization
    Deploy a separate Replicator with its own credentials per project
    Shared Copy cluster per DataCenter
    Enables independent updates and reduces the risk of errors
  • Logs vs user path resolution
    Projects and buckets have a standard naming convention
    Logs at gs://logs.<category name>.twttr.net/
    User data at gs://user.<user name>.twttr.net/
    Access to these buckets is via standard paths
    Logs at /gcs/logs/<category name>/
    User data at /gcs/user/<user name>/
    Typically we would need a mapping of path prefix to bucket name in Hadoop ViewFileSystem mounttable.xml
    We modified ViewFileSystem to dynamically create the mount mapping on demand, since bucket names and path names are standard
    No configuration or update needed
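A minimal sketch of this dynamic resolution, mirroring the mounttable entries on slide 24. We assume the '-' to '--' and '_' to '-' rewrites apply to the captured dataset/user name (our reading of the replaceresolveddstpath options); the exact scope of that rewrite is an assumption:

```python
import re

# Mirrors the linkRegex mounttable entries: tst_/test_/tst-/test-
# prefixed names are excluded, and '-' -> '--' then '_' -> '-' are
# applied so the name stays legal in a domain-style bucket name.
RULES = [
    (re.compile(r"^/gcs/logs/(?!(?:tst|test)[_-])(?P<name>[^/]+)"), "gs://logs.{}"),
    (re.compile(r"^/gcs/user/(?!(?:tst|test)[_-])(?P<name>[^/]+)"), "gs://user.{}"),
]

def resolve(path):
    """Resolve a /gcs/... ViewFS path to its gs:// bucket path on
    demand, with no per-dataset mounttable configuration."""
    for rx, bucket in RULES:
        m = rx.match(path)
        if m:
            name = m.group("name").replace("-", "--").replace("_", "-")
            return bucket.format(name) + path[m.end():]
    raise ValueError("no mount rule for " + path)
```

Because the rules are generated from the naming convention rather than enumerated, new datasets and users need no mounttable update at all.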
