SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Extracting Insights from Data at Twitter
Prasad Wagle
Technical Lead, Core Data and Metrics, Data Platform
twitter.com/prasadwagle
Jan 26, 2016
● What are the properties of Big Data at Twitter?
● Where do we store it and how do we process it?
● What do we learn from the data?
Overview of the talk
● Velocity: Rate at which data is created
○ 313 million monthly active users. (June 2016)
○ Hundreds of millions of Tweets are sent per day. TPS record:
one-second peak of 143,199 Tweets per second
○ 100 Billion interaction events per day
● Volume: 100s of petabytes of data
● Variety: Tweets, Users, Client events and many more
○ Client events logs have a unified Thrift format for wide variety of
application events
3Vs of Big Data @Twitter
Data Processing Big Picture
Production
systems
Batch
Scalding
Spark
Real-time
Heron
Lambda (Batch + Real-time)
Summingbird
TSAR
Interactive
Presto
Vertica
R
Custom
Dashboards
Tableau
Apache Zeppelin
Command line
tools
Batch
Hadoop
(HDFS
MapReduce)
Analytics Tools
Analytics Front-ends
Real-time
Eventbus,
Kafka
Streams
Data Abstraction Layer (DAL), Pipeline Orchestration
Data Platform
● Batch Processing Engine - Hadoop
● Real-time Processing Engine - Heron
● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet
● Data Pipeline - Data Access Layer (DAL), Orchestration
● Interactive SQL - Presto, Vertica
● Data Visualization - Tableau, Apache Zeppelin
● Core Data and Metrics
Data Platform Projects
● Largest Hadoop clusters in the world, some > 10K nodes
● Store 100s of petabytes of data
● More than 100K daily jobs
● Improvements to open source hadoop software
● hRaven - tool that collects run time data of hadoop jobs and lets users
visualize job metrics
○ YARN Timelineserver is next-gen hRaven
● Log pipeline software (scribe -> HDFS)
○ Scribe is being replace by Flume
Hadoop
● Heron - a real-time, distributed, fault tolerant stream processing engine
● Successor of Storm, API compatible with Storm
● Analyze data as it is being produced
● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency
● Use cases
○ Real-time impression and engagement counts
○ Real-time trends, recommendations, spam detection
Real-time Processing
● Tools that make it easy to create MapReduce and Heron jobs
● Scalding
○ Scala DSL on top of Cascading
● Summingbird
○ Lambda architecture: real-time and batch
● Tsar: TimeSeries AggregatoR
○ DSL implemented on top of Summingbird
Core Data Libraries
● DAL is a service that simplifies the discovery, usage, and maintainability
of data
● Users work with logical datasets
● Physical dataset describes the serialization of a logical dataset to a
specific location (hadoop, vertica) and format
● Logical dataset can simultaneously exist in multiple places
● Users can use logical dataset name to consume data with different
tools like Scalding, Presto
Data Access Layer (DAL)
● Eagleeye web application is front-end for end users
● Users discover datasets with Eagleeye
● Eagleeye displays metadata like owners and schema
● Applications access to datasets is recorded
● Enables Eagleye to show dependency graphs for a dataset - jobs that
produce a dataset and jobs that consume it
Data Access Layer (DAL)
Data Discovery
● Statebird service
○ Tracks state of batch jobs
○ Used to manage dependencies
Pipeline Orchestration
● Interactive means that results of a query are available in the range of
seconds to a few minutes
● SQL is still the lingua franca for ad hoc data analysis
● Vertica
○ Columnar architecture, high performance analytics queries
● Presto
○ Data in HDFS in Parquet format
Interactive SQL
● Custom Dashboards
● Apache Zeppelin Strengths
○ Notebook metaphor - notebook is a collection of notes, each note
is a collection of paragraphs (queries)
○ Web based report authoring, collaborative like Google docs
○ Very easy to create a note and then share it
○ > 2K notes, 18K queries
○ Supports JDBC (Presto, Vertica, MySQL)
○ Open source, Easy to add new interpreters like Scalding
Data Visualization
● Tableau Strengths
○ Easy to create reports, does not require SQL expertise
○ Built in analytics functions e.g. Rank, Percentile
○ Polished visualizations
○ Row level security
Data Visualization
● Big part of data analysis is data cleansing
● Makes sense to do this once
● Core Data
○ Create pipelines to create “verified” datasets like Users, Tweets,
Interactions
○ Reliable and easy to use
● Core Metrics
○ Create pipelines to compute Twitter’s important metrics
○ DAU, MAU, Tweet Impressions
Core Data and Metrics
Data Processing
● Analytics - Basic Counting
● A/B Testing
● Data Science - Custom analysis
● Data Science - Machine Learning
Data Processing
● Daily/Monthly Active Users
● Number of Tweets, Retweets, Likes
● Tweet Impressions
● Logic is relatively simple
● Challenges: scale and timeliness
○ Results for previous day should be available by 10 am
○ Some metrics are real-time
Basic Counting
● Goal: find the number of impressions and engagements for a tweet
● Real-time
● Used in analytics.twitter.com
Example - Counting Tweet Impressions
aggregate {
onKeys(
(TweetId)
) produce (
Count
) sinkTo (Manhattan)
} fromProducer {
ClientEventSource(“client_events”)
.filter { event => isImpressionEvent(event) }
.map { event =>
(event.timestamp, ImpressionAttributes(event.tweetId))
}
}
TSAR job
Dimension
Metric
Data Sink
Data Source
● TSAR job is converted to a Summingbird job
● Summingbird job creates
○ Real-time pipeline with Heron
○ Batch pipeline with Scalding
● Users access results using TSAR query service
● Write once, run batch and real-time
Example - Counting Tweet Impressions
● Experimentation is at the heart of Twitter’s product development cycle
● Expertise needed in Statistics and Technology
A/B Testing Framework
● Goal: informative experiment,
● Minimize false positive and false negative errors
● How many users do we need to sample?
● How long should we run the experiment?
A/B Testing Statistics
● Process 100 B events daily, compute intensive.
● Metrics computed using Scalding pipeline that combines client event
logs, internal user models, and other datasets.
● Lightweight statistics are computed in a streaming job using TSAR
running on Heron.
A/B Testing Technology
● Cause of spikes and dips in key metrics
● Growth Trends
○ By country, client
● Analysis to understand user behavior
○ Creators vs Consumers
○ Distribution of followers
○ User clusters
● Analysis to inform product feature decisions
Data Science - Custom Analysis
● Recommendations
○ Users: WTF - who to follow
○ Tweets: Algorithmic timeline
● Cortex, Deep learning based on Torch framework
○ Identify NSFW images
○ Recognize what is happening in live feeds
Data Science - Machine Learning
● Product Safety
○ Detect fake accounts
○ Detect tweet spam and abuse
● Ad Targeting
○ Promoted Trends, Accounts and Tweets
○ Show only if it is likely to be interesting and relevant to that user
○ Predict click probability using signals including what a user
chooses to follow, how they interact with a Tweet and what they
retweet
Machine Learning
● Systems (Hadoop, Vertica)
○ Necessary because higher level abstraction are leaky
● Programming (Scala, Scalding, SQL)
● Math (Statistics, Linear Algebra)
Ideal Talent Stack
Systems Programming Statistics
Data Engineers Data Scientists
Data Platform and Data Science
work hand-in-hand
to extract insights from Big Data at Twitter
Summary
Questions?
● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter
● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron
● Heron http://www.slideshare.net/KarthikRamasamy3
● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview
● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests
● Algorithmic timeline: https://support.twitter.com/articles/164083
● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/
● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/
References

Weitere ähnliche Inhalte

Was ist angesagt?

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 

Was ist angesagt? (20)

Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at Scale
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 
Pandora Paper Leaks With TypeDB
 Pandora Paper Leaks With TypeDB Pandora Paper Leaks With TypeDB
Pandora Paper Leaks With TypeDB
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Gain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta SharingGain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta Sharing
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Open, FAIR data and RDM
Open, FAIR data and RDMOpen, FAIR data and RDM
Open, FAIR data and RDM
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Big Data & Data Science
Big Data & Data ScienceBig Data & Data Science
Big Data & Data Science
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 

Andere mochten auch

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Building Kafka-powered Activity Stream
Building Kafka-powered Activity StreamBuilding Kafka-powered Activity Stream
Building Kafka-powered Activity Stream
Oleksiy Holubyev
 

Andere mochten auch (20)

Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
 
Apache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACKApache Kafka lessons learned @PAYBACK
Apache Kafka lessons learned @PAYBACK
 
Getting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics servicesGetting started with Azure Event Hubs and Stream Analytics services
Getting started with Azure Event Hubs and Stream Analytics services
 
Blr hadoop meetup
Blr hadoop meetupBlr hadoop meetup
Blr hadoop meetup
 
London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)London Apache Kafka Meetup (Jan 2017)
London Apache Kafka Meetup (Jan 2017)
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
 
Kafka connect
Kafka connectKafka connect
Kafka connect
 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
 
Processing IoT Data with Apache Kafka
Processing IoT Data with Apache KafkaProcessing IoT Data with Apache Kafka
Processing IoT Data with Apache Kafka
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, SparkBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
IoT Innovation Lab Berlin @relayr - Kay Lerch on Getting basics right for you...
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Building Kafka-powered Activity Stream
Building Kafka-powered Activity StreamBuilding Kafka-powered Activity Stream
Building Kafka-powered Activity Stream
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 

Ähnlich wie Extracting Insights from Data at Twitter

Ähnlich wie Extracting Insights from Data at Twitter (20)

Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your Enterprise
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data Platform
 
Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Extracting Insights from Data at Twitter

  • 1. Extracting Insights from Data at Twitter Prasad Wagle Technical Lead, Core Data and Metrics, Data Platform twitter.com/prasadwagle Jan 26, 2016
  • 2. ● What are the properties of Big Data at Twitter? ● Where do we store it and how do we process it? ● What do we learn from the data? Overview of the talk
  • 3. ● Velocity: Rate at which data is created ○ 313 million monthly active users. (June 2016) ○ Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second ○ 100 Billion interaction events per day ● Volume: 100s of petabytes of data ● Variety: Tweets, Users, Client events and many more ○ Client events logs have a unified Thrift format for wide variety of application events 3Vs of Big Data @Twitter
  • 4. Data Processing Big Picture Production systems Batch Scalding Spark Real-time Heron Lambda (Batch + Real-time) Summingbird TSAR Interactive Presto Vertica R Custom Dashboards Tableau Apache Zeppelin Command line tools Batch Hadoop (HDFS MapReduce) Analytics Tools Analytics Front-ends Real-time Eventbus, Kafka Streams Data Abstraction Layer (DAL), Pipeline Orchestration
  • 6. ● Batch Processing Engine - Hadoop ● Real-time Processing Engine - Heron ● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet ● Data Pipeline - Data Access Layer (DAL), Orchestration ● Interactive SQL - Presto, Vertica ● Data Visualization - Tableau, Apache Zeppelin ● Core Data and Metrics Data Platform Projects
  • 7. ● Largest Hadoop clusters in the world, some > 10K nodes ● Store 100s of petabytes of data ● More than 100K daily jobs ● Improvements to open source hadoop software ● hRaven - tool that collects run time data of hadoop jobs and lets users visualize job metrics ○ YARN Timelineserver is next-gen hRaven ● Log pipeline software (scribe -> HDFS) ○ Scribe is being replace by Flume Hadoop
  • 8. ● Heron - a real-time, distributed, fault tolerant stream processing engine ● Successor of Storm, API compatible with Storm ● Analyze data as it is being produced ● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency ● Use cases ○ Real-time impression and engagement counts ○ Real-time trends, recommendations, spam detection Real-time Processing
  • 9. ● Tools that make it easy to create MapReduce and Heron jobs ● Scalding ○ Scala DSL on top of Cascading ● Summingbird ○ Lambda architecture: real-time and batch ● Tsar: TimeSeries AggregatoR ○ DSL implemented on top of Summingbird Core Data Libraries
  • 10. ● DAL is a service that simplifies the discovery, usage, and maintainability of data ● Users work with logical datasets ● Physical dataset describes the serialization of a logical dataset to a specific location (hadoop, vertica) and format ● Logical dataset can simultaneously exist in multiple places ● Users can use logical dataset name to consume data with different tools like Scalding, Presto Data Access Layer (DAL)
  • 11. ● Eagleeye web application is front-end for end users ● Users discover datasets with Eagleeye ● Eagleeye displays metadata like owners and schema ● Applications access to datasets is recorded ● Enables Eagleye to show dependency graphs for a dataset - jobs that produce a dataset and jobs that consume it Data Access Layer (DAL)
  • 13.
  • 14.
  • 15. ● Statebird service ○ Tracks state of batch jobs ○ Used to manage dependencies Pipeline Orchestration
  • 16. ● Interactive means that results of a query are available in the range of seconds to a few minutes ● SQL is still the lingua franca for ad hoc data analysis ● Vertica ○ Columnar architecture, high performance analytics queries ● Presto ○ Data in HDFS in Parquet format Interactive SQL
  • 17. ● Custom Dashboards ● Apache Zeppelin Strengths ○ Notebook metaphor - notebook is a collection of notes, each note is a collection of paragraphs (queries) ○ Web based report authoring, collaborative like Google docs ○ Very easy to create a note and then share it ○ > 2K notes, 18K queries ○ Supports JDBC (Presto, Vertica, MySQL) ○ Open source, Easy to add new interpreters like Scalding Data Visualization
  • 18. ● Tableau Strengths ○ Easy to create reports, does not require SQL expertise ○ Built in analytics functions e.g. Rank, Percentile ○ Polished visualizations ○ Row level security Data Visualization
  • 19. ● Big part of data analysis is data cleansing ● Makes sense to do this once ● Core Data ○ Create pipelines to create “verified” datasets like Users, Tweets, Interactions ○ Reliable and easy to use ● Core Metrics ○ Create pipelines to compute Twitter’s important metrics ○ DAU, MAU, Tweet Impressions Core Data and Metrics
  • 21. ● Analytics - Basic Counting ● A/B Testing ● Data Science - Custom analysis ● Data Science - Machine Learning Data Processing
  • 22. ● Daily/Monthly Active Users ● Number of Tweets, Retweets, Likes ● Tweet Impressions ● Logic is relatively simple ● Challenges: scale and timeliness ○ Results for previous day should be available by 10 am ○ Some metrics are real-time Basic Counting
  • 23. ● Goal: find the number of impressions and engagements for a tweet ● Real-time ● Used in analytics.twitter.com Example - Counting Tweet Impressions
  • 24. aggregate { onKeys( (TweetId) ) produce ( Count ) sinkTo (Manhattan) } fromProducer { ClientEventSource(“client_events”) .filter { event => isImpressionEvent(event) } .map { event => (event.timestamp, ImpressionAttributes(event.tweetId)) } } TSAR job Dimension Metric Data Sink Data Source
  • 25. ● TSAR job is converted to a Summingbird job ● Summingbird job creates ○ Real-time pipeline with Heron ○ Batch pipeline with Scalding ● Users access results using TSAR query service ● Write once, run batch and real-time Example - Counting Tweet Impressions
  • 26. ● Experimentation is at the heart of Twitter’s product development cycle ● Expertise needed in Statistics and Technology A/B Testing Framework
  • 27. ● Goal: informative experiment, ● Minimize false positive and false negative errors ● How many users do we need to sample? ● How long should we run the experiment? A/B Testing Statistics
  • 28. ● Process 100 B events daily, compute intensive. ● Metrics computed using Scalding pipeline that combines client event logs, internal user models, and other datasets. ● Lightweight statistics are computed in a streaming job using TSAR running on Heron. A/B Testing Technology
  • 29. ● Cause of spikes and dips in key metrics ● Growth Trends ○ By country, client ● Analysis to understand user behavior ○ Creators vs Consumers ○ Distribution of followers ○ User clusters ● Analysis to inform product feature decisions Data Science - Custom Analysis
  • 30. ● Recommendations ○ Users: WTF - who to follow ○ Tweets: Algorithmic timeline ● Cortex, Deep learning based on Torch framework ○ Identify NSFW images ○ Recognize what is happening in live feeds Data Science - Machine Learning
  • 31. ● Product Safety ○ Detect fake accounts ○ Detect tweet spam and abuse ● Ad Targeting ○ Promoted Trends, Accounts and Tweets ○ Show only if it is likely to be interesting and relevant to that user ○ Predict click probability using signals including what a user chooses to follow, how they interact with a Tweet and what they retweet Machine Learning
  • 32. ● Systems (Hadoop, Vertica) ○ Necessary because higher level abstraction are leaky ● Programming (Scala, Scalding, SQL) ● Math (Statistics, Linear Algebra) Ideal Talent Stack Systems Programming Statistics Data Engineers Data Scientists
  • 33. Data Platform and Data Science work hand-in-hand to extract insights from Big Data at Twitter Summary
  • 35. ● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator ● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter ● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron ● Heron http://www.slideshare.net/KarthikRamasamy3 ● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview ● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests ● Algorithmic timeline: https://support.twitter.com/articles/164083 ● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/ ● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/ References