SlideShare a Scribd company logo
1 of 43
SEC302: Twitter's GCP
Architecture for Its Petabyte-
Scale Data Storage
in GCS and User Identity
Management
Vrushali Channapattan, Staff Engineer, Twitter
James Duke, Strategic Cloud Engineer, Google Cloud
● What is Partly Cloudy
● Architecture
● Project & Bucket Design
● User Identity Management
● DemiGod services
● Deployment
Outline
What is Partly Cloudy
A project to extend Data Processing at Twitter
from an on-premises only model
to a hybrid on-premises and Cloud model
Why Partly Cloudy
- A long term desire to have some cloud presence
- Right strategy will balance developer agility, capabilities, and
cost.
- Provides a convenient way to test Hadoop changes at scale
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, CloudML, Cloud
DataFlow etc
Design principles
Authentication
Strong authentication
for all user and service
access to data
Authorization
Explicit authorization
for all user and service
access to data
Audit
Ability to easily
determine who
performed what
actions on the data
Before Partly Cloudy
Partly Cloudy
Partly Cloudy Resource Hierarchy
Organization and Project
structure
Partly Cloudy Resource Hierarchy
TWITTER Org
Partly Cloudy Resource Hierarchy
TWITTER Org
DATA INFRA
Folder
Partly Cloudy Resource Hierarchy
TWITTER Org
DATA INFRA
Folder
twitter-
product
twitter-revenue
twitter-infraeng GCP
Projects
What do these Projects contain?
twitter-project
Project
Dataset bucket
Cloud Storage
User Bucket
Cloud Storage
Google Cloud Storage
Connector for Hadoop
Google Cloud Storage
Connector for Hadoop
Nest
nest-compute@project-
name.iam.gserviceacc
ount.com
Name Nodes
nn-per-cluster-
compute@project-
name.iam.gserviceacc
ount.com
Worker Node(s)
wn-per-cluster-
compute@project-
name.iam.gserviceaccou
nt.com
Resource Manager
rm-per-cluster-
compute@project-
name.iam.gserviceacc
ount.com
Task
ViewFS filesystem layer
ViewFS filesystem layer
Shadow account based access
User account based access
User account based access
Scratch bucket
Cloud Storage
Scrubbed bucket
Cloud Storage
Partly Cloudy Resource Hierarchy
Storage in the Cloud
GCS
On-premises
path
/dc1/cluster1/user/
helen/some/path/par
t-001.lzo
Logical Cloud
path
/gcs/user/helen/
some/path/part-
001.lzo
GCS bucket path
gs://user.helen.dp.
twitter.domain/some
/path/part-001.lzo
Partly Cloudy Resource Hierarchy
User Management
Users & Accounts
Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key
distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
Partly Cloudy Resource Hierarchy
DATA INFRA
twitter-[org]
twitter-
employee-users
What
do the
Data Processing
Users
at Twitter get
What
do the
Data Processing
Users
at Twitter get
❏ A shadow account to access GCS
❏ A GCS bucket for their data
❏ Access to a Twitter managed
Hadoop cluster in GCP
❏ Access to a Twitter managed
Presto cluster in GCP
❏ To work with us to leverage other
Google offerings (such as BigQuery,
Cloud DataProc & Cloud DataFlow)
Who configures
GCP for the
Data Processing Users
at Twitter
Who configures
GCP for the
Data Processing Users
at Twitter
DemiGod
Services
Partly Cloudy Resource Hierarchy
DemiGod Services
What are DemiGod services
Demigod is a group of service(s) that are responsible for
configuring GCP for Twitter’s Data Platform.
They run in GCP.
Salient features of DemiGods
- Run asynchronously of each other.
- Run with exactly-scoped, privileged google service accounts
- Idempotent runs
- Puppet-like functionality. Will override any manual changes
- Modular in design
- Each kept as simple as possible
Partly Cloudy
Types of DemiGods
Bucket Creation
❏ Creates buckets
❏ Twitter domain
❏ One DemiGod per pillar twitter
project
❏ Inputs
❏ Configurable prefixes
❏ LDAP input
❏ YAML input
Shadow Account
Management
❏ Creates shadow accounts
❏ Google service accounts
❏ One DemiGod
❏ Inputs
❏ Configurable pattern
❏ LDAP input
❏ YAML input
Policy
Management
❏ Creates IAM policies
❏ One DemiGod per pillar project
❏ Inputs
❏ LDAP input
❏ YAML input
❏ Google groups
❏ Ignore list
Key LifeCycle
Management
❏ Creates keys with expiration
❏ Manges lifecycle of every N
days
❏ Adds them to Key Store
❏ Inputs
❏ destination for keys
❏ LDAP input
❏ Shadow account
Partly Cloudy
Deployment of DemiGods
Deployment considerations
- Demigods will run on GCE with the VM running a demigod service
account
- Demigod service accounts will be created in partly-cloudy-admin
project that has limited ssh access
- Demigod processes will run as a kerberized headless twitter user
- Demigod Key Creation Service shall NOT write service account
keys to disk. It will store in memory until written to Secret Store.
Partly Cloudy
DemiGods execution flow
What happens when
a user joins
an ldap pillar group?
Demigod ⇔ twitter user interaction
What happens when
a user joins
an ldap pillar group?
Demigod ⇔ twitter user interaction
❏ A shadow account is created
❏ added to google group
❏ A GCS user bucket is created
❏ Scratch bucket
❏ Keys are generated
❏ Added to Secrets Store
❏ Keys are distributed thereby
enabling access to a Twitter
managed Hadoop cluster in GCP &
Presto cluster in GCP
What happens when
a new dataset is added?
Demigod ⇔ twitter dataset interaction
What happens when
a new dataset is added?
Demigod ⇔ twitter dataset interaction
❏ Dataset info is replicated to a YAML
file in a GCS config bucket
❏ A GCS dataset bucket is created
❏ Scratch, Scrubbed, Scratch-scrubbed
bucket also created
❏ Access privileges are granted
❏ Owner - read on orig dataset , r/w on
scratch & scrubbed
❏ Reader group: read on dataset, scrubbed
Thank you!
We are hiring
https://careers.twitter.com https://careers.google.com/cloud/
Your Feedback is Greatly Appreciated!
Complete the
session survey
in mobile app
1-5 star rating
system
Open field for
comments
Rate icon in
status bar
Appendix
Google Cloud Twitter
https://cloud.google.com/twitter/

More Related Content

What's hot

GCP Gaming 2016 Seoul, Korea Firebase
GCP Gaming 2016 Seoul, Korea FirebaseGCP Gaming 2016 Seoul, Korea Firebase
GCP Gaming 2016 Seoul, Korea FirebaseChris Jang
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...tdc-globalcode
 
Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3Simon Su
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Ido Green
 
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingGCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingSimon Su
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
Google compute engine - overview
Google compute engine - overviewGoogle compute engine - overview
Google compute engine - overviewCharles Fan
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud PlatformOpsta
 
Google Compute Engine Starter Guide
Google Compute Engine Starter GuideGoogle Compute Engine Starter Guide
Google Compute Engine Starter GuideSimon Su
 
Introduction to Google's Cloud Technologies
Introduction to Google's Cloud TechnologiesIntroduction to Google's Cloud Technologies
Introduction to Google's Cloud TechnologiesChris Schalk
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineCsaba Toth
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolatsliwowicz
 
#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data UnlimitedAudrey Huvet
 
Introduction to Google Cloud
Introduction to Google CloudIntroduction to Google Cloud
Introduction to Google CloudDSC IEM
 
Google Cloud Networking Deep Dive
Google Cloud Networking Deep DiveGoogle Cloud Networking Deep Dive
Google Cloud Networking Deep DiveMichelle Holley
 
Anthos Security: modernize your security posture for cloud native applications
Anthos Security: modernize your security posture for cloud native applicationsAnthos Security: modernize your security posture for cloud native applications
Anthos Security: modernize your security posture for cloud native applicationsGreg Castle
 
Shakr - Container CI/CD with Google Cloud Platform
Shakr - Container CI/CD with Google Cloud PlatformShakr - Container CI/CD with Google Cloud Platform
Shakr - Container CI/CD with Google Cloud PlatformMinku Lee
 
Google cloud platform
Google cloud platformGoogle cloud platform
Google cloud platformrajdeep
 

What's hot (20)

Gdsc muk - innocent
Gdsc   muk - innocentGdsc   muk - innocent
Gdsc muk - innocent
 
GCP Gaming 2016 Seoul, Korea Firebase
GCP Gaming 2016 Seoul, Korea FirebaseGCP Gaming 2016 Seoul, Korea Firebase
GCP Gaming 2016 Seoul, Korea Firebase
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
 
Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
 
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingGCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Google compute engine - overview
Google compute engine - overviewGoogle compute engine - overview
Google compute engine - overview
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Google Compute Engine Starter Guide
Google Compute Engine Starter GuideGoogle Compute Engine Starter Guide
Google Compute Engine Starter Guide
 
Introduction to Google's Cloud Technologies
Introduction to Google's Cloud TechnologiesIntroduction to Google's Cloud Technologies
Introduction to Google's Cloud Technologies
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App Engine
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboola
 
#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited
 
L2 3.fa19
L2 3.fa19L2 3.fa19
L2 3.fa19
 
Introduction to Google Cloud
Introduction to Google CloudIntroduction to Google Cloud
Introduction to Google Cloud
 
Google Cloud Networking Deep Dive
Google Cloud Networking Deep DiveGoogle Cloud Networking Deep Dive
Google Cloud Networking Deep Dive
 
Anthos Security: modernize your security posture for cloud native applications
Anthos Security: modernize your security posture for cloud native applicationsAnthos Security: modernize your security posture for cloud native applications
Anthos Security: modernize your security posture for cloud native applications
 
Shakr - Container CI/CD with Google Cloud Platform
Shakr - Container CI/CD with Google Cloud PlatformShakr - Container CI/CD with Google Cloud Platform
Shakr - Container CI/CD with Google Cloud Platform
 
Google cloud platform
Google cloud platformGoogle cloud platform
Google cloud platform
 

Similar to SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs and user identity management

Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud lohitvijayarenu
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)bigdata trunk
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
 
Challenges In Modern Application
Challenges In Modern ApplicationChallenges In Modern Application
Challenges In Modern ApplicationRahul Kumar Gupta
 
Introduction to GCP
Introduction to GCPIntroduction to GCP
Introduction to GCPKnoldus Inc.
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...HostedbyConfluent
 
Session 4 GCCP.pptx
Session 4 GCCP.pptxSession 4 GCCP.pptx
Session 4 GCCP.pptxDSCIITPatna
 
Informatica big data relational topics and presentation
Informatica big data relational topics and presentationInformatica big data relational topics and presentation
Informatica big data relational topics and presentationJanardhan Reddy
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud PlatformPradeep Bhadani
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
Serverless and Design Patterns In GCP
Serverless and Design Patterns In GCPServerless and Design Patterns In GCP
Serverless and Design Patterns In GCPOliver Fierro
 
Google Cloud Platform
Google Cloud PlatformGoogle Cloud Platform
Google Cloud PlatformGeneXus
 
Getting more into GCP.pdf
Getting more into GCP.pdfGetting more into GCP.pdf
Getting more into GCP.pdfKnoldus Inc.
 
Developing Microservices Directly in AKS/Kubernetes
Developing Microservices Directly in AKS/KubernetesDeveloping Microservices Directly in AKS/Kubernetes
Developing Microservices Directly in AKS/KubernetesChakradhar Rao Jonagam
 
VisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyVisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyLeonid Nekhymchuk
 

Similar to SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs and user identity management (20)

Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
GCP-pde.pdf
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
Challenges In Modern Application
Challenges In Modern ApplicationChallenges In Modern Application
Challenges In Modern Application
 
Introduction to GCP
Introduction to GCPIntroduction to GCP
Introduction to GCP
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...
 
Session 4 GCCP.pptx
Session 4 GCCP.pptxSession 4 GCCP.pptx
Session 4 GCCP.pptx
 
Informatica big data relational topics and presentation
Informatica big data relational topics and presentationInformatica big data relational topics and presentation
Informatica big data relational topics and presentation
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
Amazon cloud service
Amazon cloud serviceAmazon cloud service
Amazon cloud service
 
Serverless and Design Patterns In GCP
Serverless and Design Patterns In GCPServerless and Design Patterns In GCP
Serverless and Design Patterns In GCP
 
Google Cloud Platform
Google Cloud PlatformGoogle Cloud Platform
Google Cloud Platform
 
Getting more into GCP.pdf
Getting more into GCP.pdfGetting more into GCP.pdf
Getting more into GCP.pdf
 
Developing Microservices Directly in AKS/Kubernetes
Developing Microservices Directly in AKS/KubernetesDeveloping Microservices Directly in AKS/Kubernetes
Developing Microservices Directly in AKS/Kubernetes
 
VisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case studyVisiQuate: Azure cloud migration case study
VisiQuate: Azure cloud migration case study
 

Recently uploaded

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs and user identity management

  • 1. SEC302: Twitter's GCP Architecture for Its Petabyte- Scale Data Storage in GCS and User Identity Management Vrushali Channapattan, Staff Engineer, Twitter James Duke, Strategic Cloud Engineer, Google Cloud
  • 2. ● What is Partly Cloudy ● Architecture ● Project & Bucket Design ● User Identity Management ● DemiGod services ● Deployment Outline
  • 3. What is Partly Cloudy A project to extend Data Processing at Twitter from an on-premises only model to a hybrid on-premises and Cloud model
  • 4. Why Partly Cloudy - A long term desire to have some cloud presence - Right strategy will balance developer agility, capabilities, and cost. - Provides a convenient way to test Hadoop changes at scale - A broader geographical footprint for locality and business continuity - Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow etc
  • 5. Design principles Authentication Strong authentication for all user and service access to data Authorization Explicit authorization for all user and service access to data Audit Ability to easily determine who performed what actions on the data
  • 8. Partly Cloudy Resource Hierarchy Organization and Project structure
  • 9. Partly Cloudy Resource Hierarchy TWITTER Org
  • 10. Partly Cloudy Resource Hierarchy TWITTER Org DATA INFRA Folder
  • 11. Partly Cloudy Resource Hierarchy TWITTER Org DATA INFRA Folder twitter- product twitter-revenue twitter-infraeng GCP Projects
  • 12. What do these Projects contain? twitter-project
  • 13. Project Dataset bucket Cloud Storage User Bucket Cloud Storage Google Cloud Storage Connector for Hadoop Google Cloud Storage Connector for Hadoop Nest nest-compute@project- name.iam.gserviceacc ount.com Name Nodes nn-per-cluster- compute@project- name.iam.gserviceacc ount.com Worker Node(s) wn-per-cluster- compute@project- name.iam.gserviceaccou nt.com Resource Manager rm-per-cluster- compute@project- name.iam.gserviceacc ount.com Task ViewFS filesystem layer ViewFS filesystem layer Shadow account based access User account based access User account based access Scratch bucket Cloud Storage Scrubbed bucket Cloud Storage
  • 14. Partly Cloudy Resource Hierarchy Storage in the Cloud
  • 16. Partly Cloudy Resource Hierarchy User Management
  • 18. Key Management - A new key is generated every N days - Each key is valid for 2N + N days - Keys are distributed to compute nodes by Twitter’s key distribution service - The shadow account key is readable only by that user - Key management & distribution is transparent to the user
  • 19. Partly Cloudy Resource Hierarchy DATA INFRA twitter-[org] twitter- employee-users
  • 21. What do the Data Processing Users at Twitter get ❏ A shadow account to access GCS ❏ A GCS bucket for their data ❏ Access to a Twitter managed Hadoop cluster in GCP ❏ Access to a Twitter managed Presto cluster in GCP ❏ To work with us to leverage other Google offerings (such as BigQuery, Cloud DataProc & Cloud DataFlow)
  • 22. Who configures GCP for the Data Processing Users at Twitter
  • 23. Who configures GCP for the Data Processing Users at Twitter DemiGod Services
  • 24. Partly Cloudy Resource Hierarchy DemiGod Services
  • 25. What are DemiGod services Demigod is a group of service(s) that are responsible for configuring GCP for Twitter’s Data Platform. They run in GCP.
  • 26. Salient features of DemiGods - Run asynchronously of each other. - Run with exactly-scoped, privileged google service accounts - Idempotent runs - Puppet-like functionality. Will override any manual changes - Modular in design - Each kept as simple as possible
  • 28. Bucket Creation ❏ Creates buckets ❏ Twitter domain ❏ One DemiGod per pillar twitter project ❏ Inputs ❏ Configurable prefixes ❏ LDAP input ❏ YAML input
  • 29. Shadow Account Management ❏ Creates shadow accounts ❏ Google service accounts ❏ One DemiGod ❏ Inputs ❏ Configurable pattern ❏ LDAP input ❏ YAML input
  • 30. Policy Management ❏ Creates IAM policies ❏ One DemiGod per pillar project ❏ Inputs ❏ LDAP input ❏ YAML input ❏ Google groups ❏ Ignore list
  • 31. Key LifeCycle Management ❏ Creates keys with expiration ❏ Manges lifecycle of every N days ❏ Adds them to Key Store ❏ Inputs ❏ destination for keys ❏ LDAP input ❏ Shadow account
  • 33.
  • 34. Deployment considerations - Demigods will run on GCE with the VM running a demigod service account - Demigod service accounts will be created in partly-cloudy-admin project that has limited ssh access - Demigod processes will run as a kerberized headless twitter user - Demigod Key Creation Service shall NOT write service account keys to disk. It will store in memory until written to Secret Store.
  • 36. What happens when a user joins an ldap pillar group? Demigod ⇔ twitter user interaction
  • 37. What happens when a user joins an ldap pillar group? Demigod ⇔ twitter user interaction ❏ A shadow account is created ❏ added to google group ❏ A GCS user bucket is created ❏ Scratch bucket ❏ Keys are generated ❏ Added to Secrets Store ❏ Keys are distributed thereby enabling access to a Twitter managed Hadoop cluster in GCP & Presto cluster in GCP
  • 38. What happens when a new dataset is added? Demigod ⇔ twitter dataset interaction
  • 39. What happens when a new dataset is added? Demigod ⇔ twitter dataset interaction ❏ Dataset info is replicated to a YAML file in a GCS config bucket ❏ A GCS dataset bucket is created ❏ Scratch, Scrubbed, Scratch-scrubbed bucket also created ❏ Access privileges are granted ❏ Owner - read on orig dataset , r/w on scratch & scrubbed ❏ Reader group: read on dataset, scrubbed
  • 40.
  • 41. Thank you! We are hiring https://careers.twitter.com https://careers.google.com/cloud/
  • 42. Your Feedback is Greatly Appreciated! Complete the session survey in mobile app 1-5 star rating system Open field for comments Rate icon in status bar