SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Downloaden Sie, um offline zu lesen
Antoine Amend (Barclays)
Andrew Morgan (ByteSumo Ltd)
STORY DEDUPLICATION
AND MUTATION
#EUstr9
Challenge
2#EUstr9
How do we study
evolving geopolitics,
as it happens, using global news?
It raises questions
3#EUstr9
How to use global news as a source?
How do we work at this scale?
What design patterns are needed?
What specialist techniques do we need?
How do we trace evolving storylines?
News Data: GDELT
• TLDR: It’s big
• global news feed, published every 15min
• ~63k news publishing websites scraped
• ~60 Languages
• ~400k to ~800k articles / month
4#EUstr9
http://gdeltproject.org
http://data.gdeltproject.org/gdeltv2/masterfilelist.txt
The Data
5#EUstr9
GKG pipeline emits URLs, tags, sentiments. But not full text.
All languages ~15k articles each 15 minutes.
This charthighlights the global media
machine’s heartbeat,as news articles
are published in all languages,globally.
It’s a very regular pattern.
Notice clearly weekday vs weekend
volumes.Spikes drive NFRs.
Who are we?
6#EUstr9
Antoine Amend
Andrew Morgan
Matthew Hallett
David George
We are engineers doing
big data science.
We wrote a text book to
teach others about data
science at scale.
For it, we wrote lots of new
code, much not available
elsewhere.
We open-sourced all our
code.
Our Approach
7#EUstr9
Source near real-time URLs from GDELT
Scrape article text from source
De-duplicate articles
Use “Random Indexing” to build stable features
Streaming K-Means to build top Stories
Track Story Mutation, to trace Evolving Geopolitics
Interpret the analysis
Ingestion pipeline
NiFi + Kafka + Spark Streaming
8#EUstr9
Indexing using Simhash
The secret sauce
9#EUstr9
Deduplication Graph
Detecting near duplicates
10#EUstr9
2
1
2
0
RDD[String] RDD[Hash,String]
RDD[Int,String]
RDD[Int,String]
RDD[Int,String]
RDD[Int,String]
RDD[(Int, Iterable[String])]
RDD[(String, String, Int)]
RDD[(String, String, Int)]
RDD[(String, String, Int)]
RDD[(String, String, Int)]
Expand
&
Conquer
Dimensionality reduction
Hashing + Random Indexing = stable streaming
11#EUstr9
• TF-IDF is commonly used as a feature vector
• Oxford dictionary contains 170K words
• We need dimensionality reduction!
– PCA
• Computationally intensive
• Not fit for streaming, where future words are unknown
– Hashing TF
• Fixed number of features (220
by default), fit for streaming
• Collisions leads to dramatic overestimates
– Random Indexing
• Preserved distancemeasure(as per Johnson-Lindenstrauss Lemma)
• Combined with Hashing TF, guarantees fixed number ofdimensions embeddedin lower vector space
https://github.com/derrickburns/generalized-kmeans-clustering
https://www.sics.se/~mange/papers/RI_intro.pdf
Streaming Kmeans
Remember the past, drift towards the future
12#EUstr9
• Defining k = 10 (number of topics at a given time)
• Defining forgetfulness parameter = 1h
– Infinite decay: we remember all the past,our code is more defensive towards fake news
– Zero decay: our clusters are re-definedatevery batch,we can’ttrack stories over time
• Cluster centers adjust to new articles, allowing stories to drift over time
Story drift
Tracking stories over time
13#EUstr9
• Index each article + simhash + predicted cluster
• Detected Paris attack articles on Nov. 15’ at 9:30pm
• Story burst from the [TERROR / VIOLENCE] topic
• We track GDELT metadata over time
• We ensure homogeneous clusters (RI embedding)
14#EUstr9
Story drift:
Tracking the Paris attacks
Recycling dying clusters
Evolving geopolitics
15#EUstr9
• Searching for “paris attack” shows additional clusters after 1h
– Spark streaming started to recycle dying clusters
– All eyes are now on breaking news, ignoring the least popular topics
• Topic modelling (LDA) exhibits different “flavors” of a same event
– Main cluster stays focused on the fact
– Second most important is around social media
– Third one is about politics and tributes paid
Connecting stories
Branching / Merging model
16#EUstr9
If we observe many articles at a time t that belonged to a cluster S and now belong to a cluster S’ at
a time t+dt, then S migrated to S’ in dt time.
• We store our feature vectors alongside the articles
• We predict outdated vectors against an up-to-date model
• We save these connections as Edges and explore branching as a weighted directed graph
Watching the world unfold
A fragile equilibrium
17#EUstr9
• Before the attack, the world fitted into 10 distinct topics
– The attack recycled dying clusters and created new flavors of a same event
– The extra coverage on Paris pushed other topics upwards, resulting in this scatter shape
• The Paris attack perturbed an equilibrium, reshaped the vision of the world
– Will we reach a new equilibrium after pulling those strings so badly?
– Perturbative theory, studying cycles and harmonics?
– Will the world ever heal from that “wound”? What if we increase forgetfulness?
Your Challenge
18#EUstr9
Now we can create
stable geopolitical signals
how will we use this to better predict?
Thanks!
19#EUstr9
Antoine Amend
Andrew Morgan
Matthew Hallett
David George
All the code needed to try this at home is here:
https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science

Weitere ähnliche Inhalte

Ähnlich wie Story Deduplication and Mutation with Antoine Amend and Andrew Morgan

2007 amazon as hub of world brain final-6.51
2007 amazon as hub of world brain final-6.512007 amazon as hub of world brain final-6.51
2007 amazon as hub of world brain final-6.51Robert David Steele Vivas
 
Digital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the fieldDigital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the fieldaelang
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar
 
Our Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureOur Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureC4Media
 
Science communication workshop at PBE2021
Science communication workshop at PBE2021Science communication workshop at PBE2021
Science communication workshop at PBE2021Global Plant Council
 
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...Ahmed Saleh
 
iSGTW - What it is
iSGTW - What it isiSGTW - What it is
iSGTW - What it isiSGTW
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
Lecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfLecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfRanvinuHewage
 
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)Classifying Crisis Information Relevancy with Semantics (ESWC 2018)
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)Prashant Khare
 
N8_R_for_Text_Analysis_Slides.pptx
N8_R_for_Text_Analysis_Slides.pptxN8_R_for_Text_Analysis_Slides.pptx
N8_R_for_Text_Analysis_Slides.pptxNafisa Vaz
 
R in the Humanities: Text Analysis
R in the Humanities: Text AnalysisR in the Humanities: Text Analysis
R in the Humanities: Text AnalysisLeah Henrickson
 
Transformers for time series
Transformers for time seriesTransformers for time series
Transformers for time seriesEzeLanza
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseSimon Lia-Jonassen
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018Olaf de Leeuw
 
Module 1 Introduction to Big and Smart Data- Online
Module 1 Introduction to Big and Smart Data- Online Module 1 Introduction to Big and Smart Data- Online
Module 1 Introduction to Big and Smart Data- Online caniceconsulting
 

Ähnlich wie Story Deduplication and Mutation with Antoine Amend and Andrew Morgan (20)

2007 amazon as hub of world brain final-6.51
2007 amazon as hub of world brain final-6.512007 amazon as hub of world brain final-6.51
2007 amazon as hub of world brain final-6.51
 
Digital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the fieldDigital Humanities: A brief introduction to the field
Digital Humanities: A brief introduction to the field
 
Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018Kim Hammar Msc Thesis Defense - 2018
Kim Hammar Msc Thesis Defense - 2018
 
Our Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed FutureOur Concurrent Past; Our Distributed Future
Our Concurrent Past; Our Distributed Future
 
Science communication workshop at PBE2021
Science communication workshop at PBE2021Science communication workshop at PBE2021
Science communication workshop at PBE2021
 
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...
Attend2trend: Attention Model for Real-Time Detecting and Forecasting of Tren...
 
iSGTW - What it is
iSGTW - What it isiSGTW - What it is
iSGTW - What it is
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
Checklists for transformation
Checklists for transformationChecklists for transformation
Checklists for transformation
 
Lecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdfLecture 1 Slides -Introduction to algorithms.pdf
Lecture 1 Slides -Introduction to algorithms.pdf
 
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)Classifying Crisis Information Relevancy with Semantics (ESWC 2018)
Classifying Crisis Information Relevancy with Semantics (ESWC 2018)
 
N8_R_for_Text_Analysis_Slides.pptx
N8_R_for_Text_Analysis_Slides.pptxN8_R_for_Text_Analysis_Slides.pptx
N8_R_for_Text_Analysis_Slides.pptx
 
R in the Humanities: Text Analysis
R in the Humanities: Text AnalysisR in the Humanities: Text Analysis
R in the Humanities: Text Analysis
 
Transformers for time series
Transformers for time seriesTransformers for time series
Transformers for time series
 
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
 
ma
mama
ma
 
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018
 
Module 1 Introduction to Big and Smart Data- Online
Module 1 Introduction to Big and Smart Data- Online Module 1 Introduction to Big and Smart Data- Online
Module 1 Introduction to Big and Smart Data- Online
 
Digital humanities
Digital humanitiesDigital humanities
Digital humanities
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 

Kürzlich hochgeladen (20)

Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 

Story Deduplication and Mutation with Antoine Amend and Andrew Morgan

  • 1. Antoine Amend (Barclays) Andrew Morgan (ByteSumo Ltd) STORY DEDUPLICATION AND MUTATION #EUstr9
  • 2. Challenge 2#EUstr9 How do we study evolving geopolitics, as it happens, using global news?
  • 3. It raises questions 3#EUstr9 How to use global news as a source? How do we work at this scale? What design patterns are needed? What specialist techniques do we need? How do we trace evolving storylines?
  • 4. News Data: GDELT • TLDR: It’s big • global news feed, published every 15min • ~63k news publishing websites scraped • ~60 Languages • ~400k to ~800k articles / month 4#EUstr9 http://gdeltproject.org http://data.gdeltproject.org/gdeltv2/masterfilelist.txt
  • 5. The Data 5#EUstr9 GKG pipeline emits URLs, tags, sentiments. But not full text. All languages ~15k articles each 15 minutes. This charthighlights the global media machine’s heartbeat,as news articles are published in all languages,globally. It’s a very regular pattern. Notice clearly weekday vs weekend volumes.Spikes drive NFRs.
  • 6. Who are we? 6#EUstr9 Antoine Amend Andrew Morgan Matthew Hallett David George We are engineers doing big data science. We wrote a text book to teach others about data science at scale. For it, we wrote lots of new code, much not available elsewhere. We open-sourced all our code.
  • 7. Our Approach 7#EUstr9 Source near real-time URLs from GDELT Scrape article text from source De-duplicate articles Use “Random Indexing” to build stable features Streaming K-Means to build top Stories Track Story Mutation, to trace Evolving Geopolitics Interpret the analysis
  • 8. Ingestion pipeline NiFi + Kafka + Spark Streaming 8#EUstr9
  • 9. Indexing using Simhash The secret sauce 9#EUstr9
  • 10. Deduplication Graph Detecting near duplicates 10#EUstr9 2 1 2 0 RDD[String] RDD[Hash,String] RDD[Int,String] RDD[Int,String] RDD[Int,String] RDD[Int,String] RDD[(Int, Iterable[String])] RDD[(String, String, Int)] RDD[(String, String, Int)] RDD[(String, String, Int)] RDD[(String, String, Int)] Expand & Conquer
  • 11. Dimensionality reduction Hashing + Random Indexing = stable streaming 11#EUstr9 • TF-IDF is commonly used as a feature vector • Oxford dictionary contains 170K words • We need dimensionality reduction! – PCA • Computationally intensive • Not fit for streaming, where future words are unknown – Hashing TF • Fixed number of features (220 by default), fit for streaming • Collisions leads to dramatic overestimates – Random Indexing • Preserved distancemeasure(as per Johnson-Lindenstrauss Lemma) • Combined with Hashing TF, guarantees fixed number ofdimensions embeddedin lower vector space https://github.com/derrickburns/generalized-kmeans-clustering https://www.sics.se/~mange/papers/RI_intro.pdf
  • 12. Streaming Kmeans Remember the past, drift towards the future 12#EUstr9 • Defining k = 10 (number of topics at a given time) • Defining forgetfulness parameter = 1h – Infinite decay: we remember all the past,our code is more defensive towards fake news – Zero decay: our clusters are re-definedatevery batch,we can’ttrack stories over time • Cluster centers adjust to new articles, allowing stories to drift over time
  • 13. Story drift Tracking stories over time 13#EUstr9 • Index each article + simhash + predicted cluster • Detected Paris attack articles on Nov. 15’ at 9:30pm • Story burst from the [TERROR / VIOLENCE] topic • We track GDELT metadata over time • We ensure homogeneous clusters (RI embedding)
  • 15. Recycling dying clusters Evolving geopolitics 15#EUstr9 • Searching for “paris attack” shows additional clusters after 1h – Spark streaming started to recycle dying clusters – All eyes are now on breaking news, ignoring the least popular topics • Topic modelling (LDA) exhibits different “flavors” of a same event – Main cluster stays focused on the fact – Second most important is around social media – Third one is about politics and tributes paid
  • 16. Connecting stories Branching / Merging model 16#EUstr9 If we observe many articles at a time t that belonged to a cluster S and now belong to a cluster S’ at a time t+dt, then S migrated to S’ in dt time. • We store our feature vectors alongside the articles • We predict outdated vectors against an up-to-date model • We save these connections as Edges and explore branching as a weighted directed graph
  • 17. Watching the world unfold A fragile equilibrium 17#EUstr9 • Before the attack, the world fitted into 10 distinct topics – The attack recycled dying clusters and created new flavors of a same event – The extra coverage on Paris pushed other topics upwards, resulting in this scatter shape • The Paris attack perturbed an equilibrium, reshaped the vision of the world – Will we reach a new equilibrium after pulling those strings so badly? – Perturbative theory, studying cycles and harmonics? – Will the world ever heal from that “wound”? What if we increase forgetfulness?
  • 18. Your Challenge 18#EUstr9 Now we can create stable geopolitical signals how will we use this to better predict?
  • 19. Thanks! 19#EUstr9 Antoine Amend Andrew Morgan Matthew Hallett David George All the code needed to try this at home is here: https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science