SEC302: Twitter's GCP Architecture for Its Petabyte-Scale Data Storage in GCS and User Identity Management

Twitter collects petabytes of data every day and empowers its engineers and data scientists to process data at scale with a hybrid on-premises and cloud model. In this talk, we will look at its GCP architecture and resource hierarchy. We will deep dive into the storage design, which uses Google Cloud Storage to organize petabytes of data replicated from on-premises HDFS clusters. We will look at how the user-management tooling was designed to create and manage access for thousands of accounts (human and service accounts) at Twitter. We will talk about how the design handles security for accounts and tooling systems running in GCP, and the complexities of dataset permissions. We will share the challenges we faced as we designed our system at scale, along with our learnings and solutions.

Published in: Technology

  1. SEC302: Twitter's GCP Architecture for Its Petabyte-Scale Data Storage in GCS and User Identity Management
     Vrushali Channapattan, Staff Engineer, Twitter
     James Duke, Strategic Cloud Engineer, Google Cloud
  2. Outline
     ● What is Partly Cloudy
     ● Architecture
     ● Project & Bucket Design
     ● User Identity Management
     ● DemiGod services
     ● Deployment
  3. What is Partly Cloudy
     A project to extend Data Processing at Twitter from an on-premises only model to a hybrid on-premises and Cloud model
  4. Why Partly Cloudy
     - A long term desire to have some cloud presence
     - Right strategy will balance developer agility, capabilities, and cost
     - Provides a convenient way to test Hadoop changes at scale
     - A broader geographical footprint for locality and business continuity
     - Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow etc.
  5. Design principles
     - Authentication: Strong authentication for all user and service access to data
     - Authorization: Explicit authorization for all user and service access to data
     - Audit: Ability to easily determine who performed what actions on the data
  6. Before Partly Cloudy
  7. Partly Cloudy
  8. Partly Cloudy Resource Hierarchy: Organization and Project structure
  9. Partly Cloudy Resource Hierarchy
     TWITTER Org
  10. Partly Cloudy Resource Hierarchy
      TWITTER Org > DATA INFRA Folder
  11. Partly Cloudy Resource Hierarchy
      TWITTER Org > DATA INFRA Folder > GCP Projects (twitter-product, twitter-revenue, twitter-infraeng)
  12. What do these Projects contain?
      twitter-project
  13. Project (diagram labels)
      - Buckets (Cloud Storage): Dataset bucket, User Bucket, Scratch bucket, Scrubbed bucket
      - Compute components and their service accounts:
        - Nest: nest-compute@project-name.iam.gserviceaccount.com
        - Name Nodes: nn-per-cluster-compute@project-name.iam.gserviceaccount.com
        - Worker Node(s): wn-per-cluster-compute@project-name.iam.gserviceaccount.com
        - Resource Manager: rm-per-cluster-compute@project-name.iam.gserviceaccount.com
      - Task
      - ViewFS filesystem layer
      - Google Cloud Storage Connector for Hadoop
      - Access modes: Shadow account based access, User account based access
  14. Partly Cloudy Resource Hierarchy: Storage in the Cloud
  15. GCS
      - On-premises path: /dc1/cluster1/user/helen/some/path/part-001.lzo
      - Logical Cloud path: /gcs/user/helen/some/path/part-001.lzo
      - GCS bucket path: gs://user.helen.dp.twitter.domain/some/path/part-001.lzo
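
The three path forms on the GCS slide suggest a mechanical translation from an on-premises path to a logical cloud path and then to a bucket path. The following is a minimal sketch of that translation, not Twitter's implementation. The function names and the `BUCKET_SUFFIX` constant are hypothetical, and the bucket naming rule (`<kind>.<name>.dp.twitter.domain`) is an assumption generalized from the single example on the slide.

```python
from pathlib import PurePosixPath

# Assumption: bucket names follow "<kind>.<name>.dp.twitter.domain".
# This is inferred from one example on the slide; the real rules are not shown.
BUCKET_SUFFIX = "dp.twitter.domain"


def onprem_to_logical(path: str) -> str:
    """Map /<dc>/<cluster>/user/<name>/... to /gcs/user/<name>/..."""
    parts = PurePosixPath(path).parts  # ('/', 'dc1', 'cluster1', 'user', 'helen', ...)
    if len(parts) < 5 or parts[0] != "/" or parts[3] != "user":
        raise ValueError(f"not an on-premises user path: {path}")
    # Drop the datacenter and cluster components, keep everything from 'user' on.
    return "/gcs/" + "/".join(parts[3:])


def logical_to_gcs(path: str) -> str:
    """Map /gcs/<kind>/<name>/... to gs://<kind>.<name>.<suffix>/..."""
    parts = PurePosixPath(path).parts  # ('/', 'gcs', 'user', 'helen', ...)
    if len(parts) < 4 or parts[:2] != ("/", "gcs"):
        raise ValueError(f"not a logical cloud path: {path}")
    kind, name, rest = parts[2], parts[3], parts[4:]
    bucket = f"{kind}.{name}.{BUCKET_SUFFIX}"
    return f"gs://{bucket}/" + "/".join(rest)


if __name__ == "__main__":
    logical = onprem_to_logical("/dc1/cluster1/user/helen/some/path/part-001.lzo")
    print(logical)
    print(logical_to_gcs(logical))
```

Keeping the logical path as an intermediate form matches the slide's layering: callers address `/gcs/...` and only the filesystem layer needs to know the bucket naming.
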
  16. Partly Cloudy Resource Hierarchy: User Management
  17. Users & Accounts
  18. Key Management
      - A new key is generated every N days
      - Each key is valid for 2N + N days
      - Keys are distributed to compute nodes by Twitter's key distribution service
      - The shadow account key is readable only by that user
      - Key management & distribution is transparent to the user
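
The rotation rule on the Key Management slide (a new key every N days, each valid for 2N + N days) implies that several keys overlap at any time. The sketch below works through that arithmetic under the slide's stated numbers. The function names and the choice of N = 30 in the usage example are illustrative assumptions, not values from the talk.

```python
from datetime import date, timedelta
from typing import List, Tuple

Key = Tuple[date, date]  # (created, expires)


def key_schedule(start: date, n_days: int, count: int) -> List[Key]:
    """Return (created, expires) pairs for `count` successive keys."""
    # Validity is "2N + N days", exactly as stated on the slide.
    validity = timedelta(days=2 * n_days + n_days)
    schedule = []
    for i in range(count):
        created = start + timedelta(days=i * n_days)  # one new key every N days
        schedule.append((created, created + validity))
    return schedule


def valid_keys(schedule: List[Key], today: date) -> List[Key]:
    """Keys usable on `today`: already created and not yet expired."""
    return [k for k in schedule if k[0] <= today < k[1]]


if __name__ == "__main__":
    start = date(2019, 1, 1)
    schedule = key_schedule(start, n_days=30, count=4)  # N = 30 is illustrative
    for created, expires in schedule:
        print(created, "->", expires)
    print(len(valid_keys(schedule, start + timedelta(days=75))), "keys valid on day 75")
```

Under this reading, up to three keys are valid at once in steady state, so a compute node that missed one distribution cycle would still hold a usable key.
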
  19. Partly Cloudy Resource Hierarchy
      DATA INFRA: twitter-[org], twitter-employee-users
  20. What do the Data Processing Users at Twitter get
  21. What do the Data Processing Users at Twitter get
      ❏ A shadow account to access GCS
      ❏ A GCS bucket for their data
      ❏ Access to a Twitter managed Hadoop cluster in GCP
      ❏ Access to a Twitter managed Presto cluster in GCP
      ❏ To work with us to leverage other Google offerings (such as BigQuery, Cloud DataProc & Cloud DataFlow)
  22. Who configures GCP for the Data Processing Users at Twitter
  23. Who configures GCP for the Data Processing Users at Twitter
      DemiGod Services
  24. Partly Cloudy Resource Hierarchy: DemiGod Services
  25. What are DemiGod services
      Demigod is a group of service(s) that are responsible for configuring GCP for Twitter's Data Platform. They run in GCP.
  26. Salient features of DemiGods
      - Run asynchronously of each other
      - Run with exactly-scoped, privileged google service accounts
      - Idempotent runs
      - Puppet-like functionality. Will override any manual changes
      - Modular in design
      - Each kept as simple as possible
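
"Idempotent runs" and "Puppet-like functionality" describe a reconcile loop: compare the desired state with what actually exists, apply only the difference, and revert manual drift. The sketch below illustrates that pattern on an in-memory role-to-members mapping. It is a generic illustration, not DemiGod code; the data shapes and the names `reconcile` and `apply_plan` are assumptions, and no GCP API is called.

```python
from typing import Dict, Set

State = Dict[str, Set[str]]  # role -> members (hypothetical shape)


def reconcile(desired: State, actual: State) -> Dict[str, Dict[str, Set[str]]]:
    """Compute the changes needed to make `actual` match `desired`."""
    plan = {}
    for role in desired.keys() | actual.keys():
        want = desired.get(role, set())
        have = actual.get(role, set())
        add, remove = want - have, have - want
        if add or remove:
            plan[role] = {"add": add, "remove": remove}
    return plan


def apply_plan(plan: Dict[str, Dict[str, Set[str]]], actual: State) -> None:
    """Apply a plan in place; anything not in the desired state is removed."""
    for role, change in plan.items():
        members = (actual.get(role, set()) | change["add"]) - change["remove"]
        if members:
            actual[role] = members
        else:
            actual.pop(role, None)


if __name__ == "__main__":
    desired = {"reader": {"alice", "bob"}}
    # "mallory" and the "owner" role stand in for manual changes made out of band.
    actual = {"reader": {"alice", "mallory"}, "owner": {"eve"}}
    apply_plan(reconcile(desired, actual), actual)
    print(actual == desired)           # manual changes were overridden
    print(reconcile(desired, actual))  # a second run finds nothing to do
```

Because each run derives its plan from the current state, running it twice is harmless, which is what lets independent services run asynchronously of each other.
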
  27. Partly Cloudy: Types of DemiGods
  28. Bucket Creation
      ❏ Creates buckets
      ❏ Twitter domain
      ❏ One DemiGod per pillar twitter project
      ❏ Inputs
        ❏ Configurable prefixes
        ❏ LDAP input
        ❏ YAML input
  29. Shadow Account Management
      ❏ Creates shadow accounts
      ❏ Google service accounts
      ❏ One DemiGod
      ❏ Inputs
        ❏ Configurable pattern
        ❏ LDAP input
        ❏ YAML input
  30. Policy Management
      ❏ Creates IAM policies
      ❏ One DemiGod per pillar project
      ❏ Inputs
        ❏ LDAP input
        ❏ YAML input
        ❏ Google groups
        ❏ Ignore list
  31. Key LifeCycle Management
      ❏ Creates keys with expiration
      ❏ Manages lifecycle every N days
      ❏ Adds them to Key Store
      ❏ Inputs
        ❏ Destination for keys
        ❏ LDAP input
        ❏ Shadow account
  32. Partly Cloudy: Deployment of DemiGods
  33. Deployment considerations
      - Demigods will run on GCE with the VM running a demigod service account
      - Demigod service accounts will be created in partly-cloudy-admin project that has limited ssh access
      - Demigod processes will run as a kerberized headless twitter user
      - Demigod Key Creation Service shall NOT write service account keys to disk. It will store in memory until written to Secret Store.
  34. Partly Cloudy: DemiGods execution flow
  35. What happens when a user joins an ldap pillar group?
      Demigod ⇔ twitter user interaction
  36. What happens when a user joins an ldap pillar group?
      Demigod ⇔ twitter user interaction
      ❏ A shadow account is created
        ❏ added to google group
      ❏ A GCS user bucket is created
        ❏ Scratch bucket
      ❏ Keys are generated
        ❏ Added to Secrets Store
      ❏ Keys are distributed, thereby enabling access to a Twitter managed Hadoop cluster in GCP & Presto cluster in GCP
  37. What happens when a new dataset is added?
      Demigod ⇔ twitter dataset interaction
  38. What happens when a new dataset is added?
      Demigod ⇔ twitter dataset interaction
      ❏ Dataset info is replicated to a YAML file in a GCS config bucket
      ❏ A GCS dataset bucket is created
        ❏ Scratch, Scrubbed, Scratch-scrubbed bucket also created
      ❏ Access privileges are granted
        ❏ Owner: read on orig dataset, r/w on scratch & scrubbed
        ❏ Reader group: read on dataset, scrubbed
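
The access rules on the new-dataset slide can be written down as a small grant table. The sketch below encodes only what the slide states. The `Grant` type, the function name, and the example principals are hypothetical; access levels are kept as plain labels rather than mapped to specific IAM roles, since the slides do not name the roles used. The slide lists no grant for the scratch-scrubbed bucket, so none is emitted here.

```python
from typing import List, NamedTuple


class Grant(NamedTuple):
    bucket_kind: str  # "dataset", "scratch", or "scrubbed"
    principal: str    # owner account or reader group (hypothetical identifiers)
    access: str       # "read" or "read-write"


def dataset_grants(owner: str, reader_group: str) -> List[Grant]:
    """Grants for one dataset, following the bullet list on the slide."""
    return [
        # Owner: read on the original dataset, r/w on scratch & scrubbed.
        Grant("dataset", owner, "read"),
        Grant("scratch", owner, "read-write"),
        Grant("scrubbed", owner, "read-write"),
        # Reader group: read on dataset and scrubbed.
        Grant("dataset", reader_group, "read"),
        Grant("scrubbed", reader_group, "read"),
    ]


if __name__ == "__main__":
    for grant in dataset_grants("owner-account", "readers-group"):
        print(grant.bucket_kind, grant.principal, grant.access)
```

Note that under these rules neither the owner nor the reader group gets write access to the original dataset bucket.
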
  39. Thank you! We are hiring
      https://careers.twitter.com
      https://careers.google.com/cloud/
  40. Your Feedback is Greatly Appreciated!
      - Complete the session survey in mobile app
      - 1-5 star rating system
      - Open field for comments
      - Rate icon in status bar
  41. Appendix
      Google Cloud Twitter: https://cloud.google.com/twitter/
