SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Mining Large Scale Temporal Dynamics
                with Hadoop
                           Nish Parikh
                     Senior Research Scientist
eBay Research Labs
                     eBay Research Labs


       Joint Work With: Evan Chiu, Hui Hong, Neel Sundaresan
About eBay Research Labs

•  Who we are
   –  The group's charter is to conduct research and deliver innovative
      solutions.
•  What we do
   –  Computer Vision
   –  Economics and Optimization
   –  Human Language Technologies
   –  Data Science
   –  Machine Learning
   –  Infrastructure & Systems
   –  Innovation Programs & Incubations


•  Basically we “Dig” into data!
Why study temporal dynamics?

•  Stock Markets
•  Bio-Medical Signals
•  Traffic, Weather and Network Systems
•  Web Search & Ranking
•  Recommender Systems
•  eCommerce…
Challenges

•  Large Scale data
   –  100M+ users
   –  Petabytes of click-stream logs
   –  Billions of user sessions
   –  Billions of unique queries

•  Noisy Data
   –  Robots
   –  API Calls
   –  Crawlers, Spiders
   –  Tools, Scripts
   –  Data Biases

•  Data spread across long time frames
   –  Differences in collection methodologies

•  Complexity of certain algorithms
Hadoop Cluster at eBay

•  Nodes
   –  Cent OS 4 64 Bit
   –  Intel Dual Hex Core Xeon 2.4
      GHz
   –  72 GB RAM
   –  2 * 12 (24TB) HDD
   –  SSD for OS

•  Network
   –  TOR 1Gbps
   –  Core Switches uplink 40 Gbps

•  Cluster
   –  532n – 1008n
   –  4000+ cores – 24000 vCPUs
   –  5 – 18 PB
Mobius – Computation Platform

Application
              Click Stream Visualizer          Metrics Dashboard          Research Projects
  Layer

                                      Mobius Studio (Eclipse plugin)

 Mobius           Generic Java Dataset API                         Query Language
 Layer

                                       Low level Dataset access API

   eBay
   Infra-                                                Hadoop Cluster
structure &
   Data
  Source
   Layer                                  eBay Data (Logs, Tables)

                    E. Chiu, W. Mann, J. Shen, G. Singh, Z. Yang
Mobius – Generic JAVA Dataset API


•  Java-based, high-level data processing framework built
   on top of Apache Hadoop.
•  Tuple oriented.
•  Supports job chaining.
•  Supports high level operators such as join (inner or
   outer) or grouping.
•  Supports filtering.
•  Used internally at eBay for various data science
   applications.
•  http://labs.ebay.com/openmobius/
Mobius – Inner Join
Mobius – Filtering
Mobius – Grouping
Mobius – Chaining
Mining Temporal Data

•  When its in your mind, its in the Query Logs!
   –  Queries as a proxy for demand
Mining Temporal Data

•  Data Preparation
                                           Christmas trend – raw
   –  Robot Filtering                      data
   –  Session Log Analysis

•  Data Cleaning
   –  Normalization
   –  De-duplication




                       Christmas trend –
                       prepared data
Mining Temporal Data – What’s Buzzing?

•  Automatic Buzz Detection
Mining Temporal Data – Does History Repeat
Itself?

•  Seasonality and Trend Prediction




     Why are searches related to monopoly pieces popular every
     October?
Air conditioner searches become popular as summer approaches
Mining Temporal Data – Temporal Similarity




Similar patterns for queries related to Hanukkah
Preparing Data – Getting Queries from User
Sessions

                              Typical eBay flow


        Search                       View                      Purchase



•  Search: specify a query, with optional constraints
•  View: click on an item shown on search results page
•  Purchase: buy a fixed-price item or place winning bid on an auction item


Consider only queries typed in by humans. Ignore
   page views from robots or views from paid
advertisements, campaigns or natural search links.
Cleaning Data

•  Apply default robot detection and removal algorithm
   –  Based on IP, number of actions per day, agent information.

•  Find the right flows from the sessions.
   –  Filter out noisy search events.
   –  Remove anomalies due to outlier users.
   –  Limit the impact a single user can have on aggregated data (de-
      duplication).
Finding the right flow in the session

  Session 1

    Search               Exit
May not consider flows without any interesting activity like clicks


  Session 2

    Ads/paid
                                View                    Purchase
     search
   May not consider searches coming from advertisements


   Session 3

     Search                         View                        Purchase

These kind of sessions are considered and information is aggregated.
Data Preparation - Map Reduce Flow


                                         Save the result so it
                                          can be reused by
        Preprocessing stage                  other apps.                  Collecting stage



      M                        R                                       M                             R

                                                                 •  Find the right flow.       Calculate sum per
Read raw events   •    Group events into sessions.
                                                                 •  Emit query as key.         key
                  •    Group sessions by GUID
                                                                 •  Emit de-duplicated query
                  •    Apply bot filtering algorithm
                                                                 volume as value




                                                                   Query Volume
                                                                   output daily as
                                                                   dailyQueryData
Time Series Generation

                           Input: dailyQueryData for multi-year time-frames
          •  Data Cleaning.
 Mapper   •  Query normalization.



                                                             Key: query
                                                             Value: date: query volume
            Data not to scale and only shown as an example




          •  Time Series formation for all unique queries
Reducer   •  Time Series indicating total daily activity volume



              Output: Vectors of Query à Volume Time Series
Buzz Detection – 2 state automaton model


•  Arrival of queries as a stream.
•  “low rate” state (q0) and a “high rate” state (q1).
•      f 0 ( x) = α 0e   −α 0 x
                                  f1 (x ) = α1e−α1x     where α1 > α0.
•  The automaton changes state with probability p ε (0,
   1) between query arrivals.
•  Let Q = (qi1, qi2… qin) be a state sequence. Each
   state sequence Q induces a density function fQ over
   sequences of gaps, which has the form
                       n
    fQ(x1, x2 …xn) = ∏t =1 fi (xt )       t




     N. Parikh, N. Sundaresan. KDD 2008.
     Scalable and Near Real-time Burst Detection from eCommerce Queries.
Buzz Detection – Modeling Queries as a
Stream




 Frequency of Query


                        Gaps between arrival
                        times for queries
Buzz Detection – 2 state automaton model

•  If number of state transitions in sequence Q are denoted as b
•  Prior probability of Q is given asb
     ⎛        ⎞⎛            ⎞               n −b     ⎛ p ⎞
     ⎜ ∏ p ⎟⎜ ∏1 −
     ⎜ i ≠ i ⎟⎜ i = i
                             p ⎟
                               ⎟   = pb (1 − p )      = ⎜       ⎟ (1 − p )n
                                                         ⎜ 1 − p ⎟
     ⎝ t t +1 ⎠⎝ t t +1     ⎠                        ⎝       ⎠
•  Using Bayes theorem, the cost equation is
                         1− p        n
      C (Q | X ) = b. ln(     ) + (∑ − ln fit ( xt ))
                          p        t =1

•  Sequence that minimizes the cost would depend on
   –  Ease of jumps between 2 states.
   –  How well the sequence conforms to the rate of query arrivals.
•  Configurable Parameters for model are α0, α1 and cost p.
    –  α0, α1 are calculated from data in the MR job.
    –  Heuristically determined value of p = 0.38 is used.
Query Volume Time Series – 2 State
Representation
Time Series Normalization and Buzz Detection

                        Input: 4-7 Years Query Time Series Vectors
            •  Normalize Time Series
            •  Transform Time Series to two state model
                  •  Calculate parameters α0, α1 for every query and
 Mapper           apply dynamic programming for 2 state calculation
            •  Calculate probability of being a periodic event query e.g.
            superbowl

                                           Key: query
                                           Value: normalized time series,
                                           two state model, probability of
                                           being a seasonal event query

                                           Key: time-frame
                                           Value: query that buzzes
                                           during that time frame
            Group queries buzzing at similar time intervals
Reducer

  Output: time-frame à Queries Buzzing during that time-period
Catman – http://labs.ebay.com/Catman/
Trends Application for eBay sellers & buyers
Binary data structure generation from MR job


•  Created new FileOutputFormat
•  Write time series data to two files
   – Binary File with fixed sized records indicating time
     series volume
   – Text file mapping each unique query string to binary
     file and offset
•  Index created by reducers directly loaded by custom
   servers written in C++.
•  Used for an internal Query Trends Application
Query Trends
Query Trends – Mapping to External Events
Trends – Comparing Queries
Temporal Similarity

•  1+ Billion Queries
•  Naïve Algorithm – Quadratic Complexity
•  Pearson’s Correlation




•  Candidate Set Reduction
   –  Correlations useful only for event-based or seasonal queries
   –  Correlations useful in applications only for head and torso queries
   –  These filters reduce candidate space from B+ to a few M.
Exact Correlations amongst candidates – All
pairs similarity on Reduced Set
Applications of Temporal Correlations – Query
Suggestions
Remarks


•  Log Mining and Time Series mining algorithms are
   parallelizable.
•  Easy to scale such algorithms using Hadoop.
•  Hadoop empowers us to look at data-sets spanning
   years and years.
•  Hadoop enables us to iterate faster and hence run more
   user-facing experiments.
References

•  Parikh et al. Scalable and near real-time burst detection from
   eCommerce queries. KDD 2008.
•  Sundaresan et al. Scalable Stream Processing and Map Reduce.
   Hadoop World 2009.
•  Anil Madan. Hadoop at eBay.
   http://www.slideshare.net/madananil/hadoop-at-ebay.
•  N Sundaresan. Popup Commerce, Towards Building Transient and
   Thematic Stores. X.Innovate 2011.
•  Pantel et al. Web-Scale Distributional Similarity and Entity Set
   Expansion. EMNLP 2009.
Acknowledgments : Project Members &
Alumni

•  Long Hoang
•  Rifat Joyee
•  Wallace Mann
•  Karin Mauge
•  Hill Nguyen
•  Jack Shen
•  Gyanit Singh
•  Zhou Yang
Questions?

•  nparikh@ebay.com

Weitere ähnliche Inhalte

Ähnlich wie Mining Large-Scale Temporal Dynamics with Hadoop

IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesNish Parikh
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningitstuff
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixJerome Boulon
 
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing EcosystemsAmazon Web Services
 
Actions speak louder than words: Analyzing large-scale query logs to improve ...
Actions speak louder than words: Analyzing large-scale query logs to improve ...Actions speak louder than words: Analyzing large-scale query logs to improve ...
Actions speak louder than words: Analyzing large-scale query logs to improve ...Raman Chandrasekar
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Lucidworks
 
Associations.ppt
Associations.pptAssociations.ppt
Associations.pptQuyn590023
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
 
Monitoring and mining network traffic in clouds
Monitoring and mining network traffic in cloudsMonitoring and mining network traffic in clouds
Monitoring and mining network traffic in cloudsEdward Yoon
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 
Associations1
Associations1Associations1
Associations1mancnilu
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesIsuru Suriarachchi
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NETDavid Giard
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout SessionSplunk
 
MachineLearning_Seminar_final.pptx
MachineLearning_Seminar_final.pptxMachineLearning_Seminar_final.pptx
MachineLearning_Seminar_final.pptxEhsanUllah221132
 

Ähnlich wie Mining Large-Scale Temporal Dynamics with Hadoop (20)

IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slides
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log mining
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
 
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
 
Actions speak louder than words: Analyzing large-scale query logs to improve ...
Actions speak louder than words: Analyzing large-scale query logs to improve ...Actions speak louder than words: Analyzing large-scale query logs to improve ...
Actions speak louder than words: Analyzing large-scale query logs to improve ...
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
 
Associations.ppt
Associations.pptAssociations.ppt
Associations.ppt
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
Operational-Analytics
Operational-AnalyticsOperational-Analytics
Operational-Analytics
 
Monitoring and mining network traffic in clouds
Monitoring and mining network traffic in cloudsMonitoring and mining network traffic in clouds
Monitoring and mining network traffic in clouds
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Associations1
Associations1Associations1
Associations1
 
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data LakesCrossing Analytics Systems: Case for Integrated Provenance in Data Lakes
Crossing Analytics Systems: Case for Integrated Provenance in Data Lakes
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NET
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
 
MachineLearning_Seminar_final.pptx
MachineLearning_Seminar_final.pptxMachineLearning_Seminar_final.pptx
MachineLearning_Seminar_final.pptx
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 

Kürzlich hochgeladen (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Mining Large-Scale Temporal Dynamics with Hadoop

  • 1. Mining Large Scale Temporal Dynamics with Hadoop Nish Parikh Senior Research Scientist eBay Research Labs eBay Research Labs Joint Work With: Evan Chiu, Hui Hong, Neel Sundaresan
  • 2. About eBay Research Labs •  Who we are –  The group's charter is to conduct research and deliver innovative solutions. •  What we do –  Computer Vision –  Economics and Optimization –  Human Language Technologies –  Data Science –  Machine Learning –  Infrastructure & Systems –  Innovation Programs & Incubations •  Basically we “Dig” into data!
  • 3. Why study temporal dynamics? •  Stock Markets •  Bio-Medical Signals •  Traffic, Weather and Network Systems •  Web Search & Ranking •  Recommender Systems •  eCommerce…
  • 4. Challenges •  Large Scale data –  100M+ users –  Petabytes of click-stream logs –  Billions of user sessions –  Billions of unique queries •  Noisy Data –  Robots –  API Calls –  Crawlers, Spiders –  Tools, Scripts –  Data Biases •  Data spread across long time frames –  Differences in collection methodologies •  Complexity of certain algorithms
  • 5. Hadoop Cluster at eBay •  Nodes –  Cent OS 4 64 Bit –  Intel Dual Hex Core Xeon 2.4 GHz –  72 GB RAM –  2 * 12 (24TB) HDD –  SSD for OS •  Network –  TOR 1Gbps –  Core Switches uplink 40 Gbps •  Cluster –  532n – 1008n –  4000+ cores – 24000 vCPUs –  5 – 18 PB
  • 6. Mobius – Computation Platform Application Click Stream Visualizer Metrics Dashboard Research Projects Layer Mobius Studio (Eclipse plugin) Mobius Generic Java Dataset API Query Language Layer Low level Dataset access API eBay Infra- Hadoop Cluster structure & Data Source Layer eBay Data (Logs, Tables) E. Chiu, W. Mann, J. Shen, G. Singh, Z. Yang
  • 7. Mobius – Generic JAVA Dataset API •  Java-based, high-level data processing framework built on top of Apache Hadoop. •  Tuple oriented. •  Supports job chaining. •  Supports high level operators such as join (inner or outer) or grouping. •  Supports filtering. •  Used internally at eBay for various data science applications. •  http://labs.ebay.com/openmobius/
  • 12. Mining Temporal Data •  When its in your mind, its in the Query Logs! –  Queries as a proxy for demand
  • 13. Mining Temporal Data •  Data Preparation Christmas trend – raw –  Robot Filtering data –  Session Log Analysis •  Data Cleaning –  Normalization –  De-duplication Christmas trend – prepared data
  • 14. Mining Temporal Data – What’s Buzzing? •  Automatic Buzz Detection
  • 15. Mining Temporal Data – Does History Repeat Itself? •  Seasonality and Trend Prediction Why are searches related to monopoly pieces popular every October? Air conditioner searches become popular as summer approaches
  • 16. Mining Temporal Data – Temporal Similarity Similar patterns for queries related to Hanukkah
  • 17. Preparing Data – Getting Queries from User Sessions Typical eBay flow Search View Purchase •  Search: specify a query, with optional constraints •  View: click on an item shown on search results page •  Purchase: buy a fixed-price item or place winning bid on an auction item Consider only queries typed in by humans. Ignore page views from robots or views from paid advertisements, campaigns or natural search links.
  • 18. Cleaning Data •  Apply default robot detection and removal algorithm –  Based on IP, number of actions per day, agent information. •  Find the right flows from the sessions. –  Filter out noisy search events. –  Remove anomalies due to outlier users. –  Limit the impact a single user can have on aggregated data (de- duplication).
  • 19. Finding the right flow in the session Session 1 Search Exit May not consider flows without any interesting activity like clicks Session 2 Ads/paid View Purchase search May not consider searches coming from advertisements Session 3 Search View Purchase These kind of sessions are considered and information is aggregated.
  • 20. Data Preparation - Map Reduce Flow Save the result so it can be reused by Preprocessing stage other apps. Collecting stage M R M R •  Find the right flow. Calculate sum per Read raw events •  Group events into sessions. •  Emit query as key. key •  Group sessions by GUID •  Emit de-duplicated query •  Apply bot filtering algorithm volume as value Query Volume output daily as dailyQueryData
  • 21. Time Series Generation Input: dailyQueryData for multi-year time-frames •  Data Cleaning. Mapper •  Query normalization. Key: query Value: date: query volume Data not to scale and only shown as an example •  Time Series formation for all unique queries Reducer •  Time Series indicating total daily activity volume Output: Vectors of Query à Volume Time Series
  • 22. Buzz Detection – 2 state automaton model •  Arrival of queries as a stream. •  “low rate” state (q0) and a “high rate” state (q1). •  f 0 ( x) = α 0e −α 0 x f1 (x ) = α1e−α1x where α1 > α0. •  The automaton changes state with probability p ε (0, 1) between query arrivals. •  Let Q = (qi1, qi2… qin) be a state sequence. Each state sequence Q induces a density function fQ over sequences of gaps, which has the form n fQ(x1, x2 …xn) = ∏t =1 fi (xt ) t N. Parikh, N. Sundaresan. KDD 2008. Scalable and Near Real-time Burst Detection from eCommerce Queries.
  • 23. Buzz Detection – Modeling Queries as a Stream Frequency of Query Gaps between arrival times for queries
  • 24. Buzz Detection – 2 state automaton model •  If number of state transitions in sequence Q are denoted as b •  Prior probability of Q is given asb ⎛ ⎞⎛ ⎞ n −b ⎛ p ⎞ ⎜ ∏ p ⎟⎜ ∏1 − ⎜ i ≠ i ⎟⎜ i = i p ⎟ ⎟ = pb (1 − p ) = ⎜ ⎟ (1 − p )n ⎜ 1 − p ⎟ ⎝ t t +1 ⎠⎝ t t +1 ⎠ ⎝ ⎠ •  Using Bayes theorem, the cost equation is 1− p n C (Q | X ) = b. ln( ) + (∑ − ln fit ( xt )) p t =1 •  Sequence that minimizes the cost would depend on –  Ease of jumps between 2 states. –  How well the sequence conforms to the rate of query arrivals. •  Configurable Parameters for model are α0, α1 and cost p. –  α0, α1 are calculated from data in the MR job. –  Heuristically determined value of p = 0.38 is used.
  • 25. Query Volume Time Series – 2 State Representation
  • 26. Time Series Normalization and Buzz Detection Input: 4-7 Years Query Time Series Vectors •  Normalize Time Series •  Transform Time Series to two state model •  Calculate parameters α0, α1 for every query and Mapper apply dynamic programming for 2 state calculation •  Calculate probability of being a periodic event query e.g. superbowl Key: query Value: normalized time series, two state model, probability of being a seasonal event query Key: time-frame Value: query that buzzes during that time frame Group queries buzzing at similar time intervals Reducer Output: time-frame à Queries Buzzing during that time-period
  • 27. Catman – http://labs.ebay.com/Catman/ Trends Application for eBay sellers & buyers
  • 28. Binary data structure generation from MR job •  Created new FileOutputFormat •  Write time series data to two files – Binary File with fixed sized records indicating time series volume – Text file mapping each unique query string to binary file and offset •  Index created by reducers directly loaded by custom servers written in C++. •  Used for an internal Query Trends Application
  • 30. Query Trends – Mapping to External Events
  • 32. Temporal Similarity •  1+ Billion Queries •  Naïve Algorithm – Quadratic Complexity •  Pearson’s Correlation •  Candidate Set Reduction –  Correlations useful only for event-based or seasonal queries –  Correlations useful in applications only for head and torso queries –  These filters reduce candidate space from B+ to a few M.
  • 33. Exact Correlations amongst candidates – All pairs similarity on Reduced Set
  • 34. Applications of Temporal Correlations – Query Suggestions
  • 35. Remarks •  Log Mining and Time Series mining algorithms are parallelizable. •  Easy to scale such algorithms using Hadoop. •  Hadoop empowers us to look at data-sets spanning years and years. •  Hadoop enables us to iterate faster and hence run more user-facing experiments.
  • 36. References •  Parikh et al. Scalable and near real-time burst detection from eCommerce queries. KDD 2008. •  Sundaresan et al. Scalable Stream Processing and Map Reduce. Hadoop World 2009. •  Anil Madan. Hadoop at eBay. http://www.slideshare.net/madananil/hadoop-at-ebay. •  N Sundaresan. Popup Commerce, Towards Building Transient and Thematic Stores. X.Innovate 2011. •  Pantel et al. Web-Scale Distributional Similarity and Entity Set Expansion. EMNLP 2009.
  • 37. Acknowledgments : Project Members & Alumni •  Long Hoang •  Rifat Joyee •  Wallace Mann •  Karin Mauge •  Hill Nguyen •  Jack Shen •  Gyanit Singh •  Zhou Yang