SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Building a Gigaword Corpus
Data Ingestion, Management, and Processing for NLP
Rebecca Bilbro
Data Intelligence 2017
● Me and my motivation
● Why make a custom corpus?
● Things likely to go wrong
○ Ingestion
○ Management
○ Loading
○ Preprocessing
○ Analysis
● Lessons we learned
● Open source tools we made
Rebecca Bilbro
Data Scientist
Yellowbrick
Natural language
processing
Everyone’s doing it
NLTK
So many great tools
The Natural Language Toolkit
import nltk
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
print(moby.similar("ahab"))
print(moby.common_contexts(["ahab", "starbuck"]))
print(moby.concordance("monstrous", 55, lines=10))
Gensim + Wikipedia
import bz2
import gensim
# Load id to word dictionary
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')
# Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))
# Do latent Semantic Analysis and find 10 prominent topics
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
lsa.print_topics(10)
Case Study:
Predicting Political Orientation
Partisan Discourse: Architecture
Initial ModelDebate Transcripts
Submit URL
Preprocessing
Feature
Extraction
Evaluate
Model
Fit Model
Model Storage
ModelMonitoring
Corpus
Storage
CorpusMonitoring
Classification
Feedback
Model Selection
Partisan Discourse: New Documents
Partisan Discourse: User Model
A custom corpus
RSS
import os
import requests
import feedparser
feed = "http://feeds.washingtonpost.com/rss/national"
for entry in feedparser.parse(feed)['entries']:
r = requests.get(entry['link'])
path = entry['title'].lower().replace(" ", "-") + ".html"
with open(path, 'wb') as f:
f.write(r.content)
Ingestion
● Scheduling
● Adding new feeds
● Synchronizing feeds, finding duplicates
● Parsing different feeds/entries into a
standard form
● Monitoring
Storage
● Database choice
● Data representation, indexing, fetching
● Connection and configuration
● Error tracking and handling
● Exporting
And as the corpus began
to grow …
… new questions arose about costs
(storage, time) and surprising results
(videos?).
Post
+ title
+ url
+ content
+ hash()
+ htmlize()
Feed
+ title
+ link
+ active
OPML Reader
+ categories()
+ counts()
+ __iter__()
+ __len__()
ingest()
Configuration
+ logging
+ database
+ flags
Exporter
+ export()
+ readme()
Admin
+ ingest_feeds()
+ ingest_opml()
+ summary()
+ run()
+ export()
Utilities
+ timez
Logging
+ logger
+ mongolog
Ingest
+ feeds()
+ started()
+ finished()
+ process()
+ ingest()
Feed Sync
+ parse()
+ sync()
+ entries()
Post Wrangler
+ wrangle()
+ fetch()
connect()
<OPML />
Production-grade
ingestion:
Baleen
Raw corpus != Usable data
From each doc, extract html, identify paras/sents/words, tag with part-of-speech
Raw Corpus
HTML
corpus = [(‘How’, ’WRB’),
(‘long’, ‘RB’),
(‘will’, ‘MD’),
(‘this’, ‘DT’),
(‘go’, ‘VB’),
(‘on’, ‘IN’),
(‘?’, ‘.’),
...
]
Paras
Sents
Tokens
Tags
Streaming
Corpus Preprocessing
Tokenized
Corpus
CorpusReader for streaming access, preprocessing, and saving the tokenized version
HTML
Paras
Sents
Tokens
Tags
Raw
Corpus
Vectorization
...so many features
Visualize top tokens,
document distribution
& part-of-speech
tagging
Yellowbrick
Data Loader
Text
Normalization
Text
Vectorization
Feature
Transformation
Estimator
Data Loader
Feature Union Pipeline
Estimator
Text
Normalization
Document
Features
Text Extraction
Summary
Vectorization
Article
Vectorization
Concept Features
Metadata Features
Dict Vectorizer
Minke
Dynamic graph
analysis
1.5 M documents, 7,500 jobs, 524 GB (uncompressed)
Keyphrase Graph:
- 2.7 M nodes
- 47 M edges
- Average
degree of 35
Lessons learned
Meaningfully literate data products rely
on…
...a custom, domain-specific corpus.
Meaningfully literate data products rely
on…
...a data management layer for flexibility
and iteration during
modeling.
Feature
Analysis
Algorithm
Selection
Hyperparameter
Tuning
corpus
├── citation.bib
├── feeds.json
├── LICENSE.md
├── manifest.json
├── README.md
└── books
├── 56d629e7c1808113ffb87eaf.html
├── 56d629e7c1808113ffb87eb3.html
└── 56d629ebc1808113ffb87ed0.html
└── business
├── 56d625d5c1808113ffb87730.html
├── 56d625d6c1808113ffb87736.html
└── 56d625ddc1808113ffb87752.html
└── cinema
├── 56d629b5c1808113ffb87d8f.html
├── 56d629b5c1808113ffb87d93.html
└── 56d629b6c1808113ffb87d9a.html
└── cooking
├── 56d62af2c1808113ffb880ec.html
├── 56d62af2c1808113ffb880ee.html
└── 56d62af2c1808113ffb880fa.html
Preprocessing
Transformer
Raw
CorpusReader
Tokenized
Corpus
Post-processed
CorpusReader
Meaningfully literate data products rely
on…
...a custom CorpusReader for streaming,
and also intermediate storage.
Meaningfully literate data products rely
on…
...visual steering and
graph analysis for
interpretation.
Corpus
Processing
Extract noun
keyphrases
weighted by
TF-IDF.
Corpus
Ingestion
Routine
Document
Collection
Every Hour
Baleen & Minke
Yellowbrick
Thank you!
Rebecca Bilbro
Twitter: twitter.com/rebeccabilbro
Github: github.com/rebeccabilbro
Email: rebecca.bilbro@bytecubed.com

Weitere ähnliche Inhalte

Was ist angesagt?

Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSMongoDB
 
Java development with MongoDB
Java development with MongoDBJava development with MongoDB
Java development with MongoDBJames Williams
 
MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know Norberto Leite
 
Building Your First Application with MongoDB
Building Your First Application with MongoDBBuilding Your First Application with MongoDB
Building Your First Application with MongoDBMongoDB
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkMongoDB
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB AppHenrik Ingo
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationMongoDB
 
PhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHPPhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHPichikaway
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopSteven Francia
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Henrik Ingo
 
Mongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg SolutionsMongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg SolutionsMetatagg Solutions
 
MongoDB: Intro & Application for Big Data
MongoDB: Intro & Application  for Big DataMongoDB: Intro & Application  for Big Data
MongoDB: Intro & Application for Big DataTakahiro Inoue
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework MongoDB
 
Webinar: Building Your First App
Webinar: Building Your First AppWebinar: Building Your First App
Webinar: Building Your First AppMongoDB
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Miguel Gallardo
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDBNate Abele
 

Was ist angesagt? (20)

Getting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJSGetting Started with MongoDB and NodeJS
Getting Started with MongoDB and NodeJS
 
Java development with MongoDB
Java development with MongoDBJava development with MongoDB
Java development with MongoDB
 
MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know MongoDB + Java - Everything you need to know
MongoDB + Java - Everything you need to know
 
Building Your First Application with MongoDB
Building Your First Application with MongoDBBuilding Your First Application with MongoDB
Building Your First Application with MongoDB
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation FrameworkBack to Basics Webinar 5: Introduction to the Aggregation Framework
Back to Basics Webinar 5: Introduction to the Aggregation Framework
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB App
 
Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
 
PhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHPPhpstudyTokyo MongoDB PHP CakePHP
PhpstudyTokyo MongoDB PHP CakePHP
 
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDBMongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
MongoDB Days Silicon Valley: Winning the Dreamforce Hackathon with MongoDB
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and Hadoop
 
MongoDB
MongoDBMongoDB
MongoDB
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
 
Mongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg SolutionsMongo Presentation by Metatagg Solutions
Mongo Presentation by Metatagg Solutions
 
MongoDB: Intro & Application for Big Data
MongoDB: Intro & Application  for Big DataMongoDB: Intro & Application  for Big Data
MongoDB: Intro & Application for Big Data
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
Reverse on go
Reverse on goReverse on go
Reverse on go
 
Webinar: Building Your First App
Webinar: Building Your First AppWebinar: Building Your First App
Webinar: Building Your First App
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDB
 

Ähnlich wie Data Intelligence 2017 - Building a Gigaword Corpus

Kotlin for android 2019
Kotlin for android 2019Kotlin for android 2019
Kotlin for android 2019Shady Selim
 
Fast REST APIs Development with MongoDB
Fast REST APIs Development with MongoDBFast REST APIs Development with MongoDB
Fast REST APIs Development with MongoDBMongoDB
 
What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?Trisha Gee
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDBNorberto Leite
 
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpus
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpusNOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpus
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpusNOVA DATASCIENCE
 
HTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book AuthorshipHTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book AuthorshipSanders Kleinfeld
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBMongoDB
 
The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180Mahmoud Samir Fayed
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysCAPSiDE
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go ProgrammingLin Yo-An
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Future-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointFuture-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointBob German
 
The Ring programming language version 1.7 book - Part 14 of 196
The Ring programming language version 1.7 book - Part 14 of 196The Ring programming language version 1.7 book - Part 14 of 196
The Ring programming language version 1.7 book - Part 14 of 196Mahmoud Samir Fayed
 
Origins of Elixir programming language
Origins of Elixir programming languageOrigins of Elixir programming language
Origins of Elixir programming languagePivorak MeetUp
 

Ähnlich wie Data Intelligence 2017 - Building a Gigaword Corpus (20)

Kotlin for android 2019
Kotlin for android 2019Kotlin for android 2019
Kotlin for android 2019
 
Fast REST APIs Development with MongoDB
Fast REST APIs Development with MongoDBFast REST APIs Development with MongoDB
Fast REST APIs Development with MongoDB
 
Oreilly
OreillyOreilly
Oreilly
 
What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?What do you mean, Backwards Compatibility?
What do you mean, Backwards Compatibility?
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDB
 
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpus
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpusNOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpus
NOVA Data Science Meetup 5/10/2017 - Presentation Building a gigaword corpus
 
HTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book AuthorshipHTML5 Is the Future of Book Authorship
HTML5 Is the Future of Book Authorship
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180The Ring programming language version 1.5.1 book - Part 10 of 180
The Ring programming language version 1.5.1 book - Part 10 of 180
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
 
Current state-of-php
Current state-of-phpCurrent state-of-php
Current state-of-php
 
Python for Penetration testers
Python for Penetration testersPython for Penetration testers
Python for Penetration testers
 
2018 03 20_biological_databases_part3
2018 03 20_biological_databases_part32018 03 20_biological_databases_part3
2018 03 20_biological_databases_part3
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go Programming
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
From render() to DOM
From render() to DOMFrom render() to DOM
From render() to DOM
 
Future-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointFuture-proof Development for Classic SharePoint
Future-proof Development for Classic SharePoint
 
The Ring programming language version 1.7 book - Part 14 of 196
The Ring programming language version 1.7 book - Part 14 of 196The Ring programming language version 1.7 book - Part 14 of 196
The Ring programming language version 1.7 book - Part 14 of 196
 
Procesamiento del lenguaje natural con python
Procesamiento del lenguaje natural con pythonProcesamiento del lenguaje natural con python
Procesamiento del lenguaje natural con python
 
Origins of Elixir programming language
Origins of Elixir programming languageOrigins of Elixir programming language
Origins of Elixir programming language
 

Mehr von Rebecca Bilbro

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionRebecca Bilbro
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Rebecca Bilbro
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine LearningRebecca Bilbro
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyRebecca Bilbro
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsRebecca Bilbro
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusRebecca Bilbro
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningRebecca Bilbro
 
EuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleEuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleRebecca Bilbro
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scaleRebecca Bilbro
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Rebecca Bilbro
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistRebecca Bilbro
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with YellowbrickRebecca Bilbro
 
Escaping the Black Box
Escaping the Black BoxEscaping the Black Box
Escaping the Black BoxRebecca Bilbro
 
Yellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersYellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersRebecca Bilbro
 
Visualizing the model selection process
Visualizing the model selection processVisualizing the model selection process
Visualizing the model selection processRebecca Bilbro
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability ProjectRebecca Bilbro
 

Mehr von Rebecca Bilbro (20)

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in Production
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual Consistency
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big Models
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
 
EuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scaleEuroSciPy 2019: Visual diagnostics at scale
EuroSciPy 2019: Visual diagnostics at scale
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scale
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in space
Words in spaceWords in space
Words in space
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data Scientist
 
Camlis
CamlisCamlis
Camlis
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
 
Escaping the Black Box
Escaping the Black BoxEscaping the Black Box
Escaping the Black Box
 
Yellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformersYellowbrick: Steering machine learning with visual transformers
Yellowbrick: Steering machine learning with visual transformers
 
Visualizing the model selection process
Visualizing the model selection processVisualizing the model selection process
Visualizing the model selection process
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability Project
 

Kürzlich hochgeladen

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Kürzlich hochgeladen (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Data Intelligence 2017 - Building a Gigaword Corpus

  • 1. Building a Gigaword Corpus Data Ingestion, Management, and Processing for NLP Rebecca Bilbro Data Intelligence 2017
  • 2. ● Me and my motivation ● Why make a custom corpus? ● Things likely to go wrong ○ Ingestion ○ Management ○ Loading ○ Preprocessing ○ Analysis ● Lessons we learned ● Open source tools we made
  • 5.
  • 7. Everyone’s doing it NLTK So many great tools
  • 8. The Natural Language Toolkit import nltk moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt')) print(moby.similar("ahab")) print(moby.common_contexts(["ahab", "starbuck"])) print(moby.concordance("monstrous", 55, lines=10))
  • 9. Gensim + Wikipedia import bz2 import gensim # Load id to word dictionary id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt') # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!) mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2')) # Do latent Semantic Analysis and find 10 prominent topics lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400) lsa.print_topics(10)
  • 11. Partisan Discourse: Architecture Initial ModelDebate Transcripts Submit URL Preprocessing Feature Extraction Evaluate Model Fit Model Model Storage ModelMonitoring Corpus Storage CorpusMonitoring Classification Feedback Model Selection
  • 15. RSS import os import requests import feedparser feed = "http://feeds.washingtonpost.com/rss/national" for entry in feedparser.parse(feed)['entries']: r = requests.get(entry['link']) path = entry['title'].lower().replace(" ", "-") + ".html" with open(path, 'wb') as f: f.write(r.content)
  • 16. Ingestion ● Scheduling ● Adding new feeds ● Synchronizing feeds, finding duplicates ● Parsing different feeds/entries into a standard form ● Monitoring Storage ● Database choice ● Data representation, indexing, fetching ● Connection and configuration ● Error tracking and handling ● Exporting
  • 17. And as the corpus began to grow … … new questions arose about costs (storage, time) and surprising results (videos?).
  • 18. Post + title + url + content + hash() + htmlize() Feed + title + link + active OPML Reader + categories() + counts() + __iter__() + __len__() ingest() Configuration + logging + database + flags Exporter + export() + readme() Admin + ingest_feeds() + ingest_opml() + summary() + run() + export() Utilities + timez Logging + logger + mongolog Ingest + feeds() + started() + finished() + process() + ingest() Feed Sync + parse() + sync() + entries() Post Wrangler + wrangle() + fetch() connect() <OPML /> Production-grade ingestion: Baleen
  • 19. Raw corpus != Usable data
  • 20. From each doc, extract html, identify paras/sents/words, tag with part-of-speech Raw Corpus HTML corpus = [(‘How’, ’WRB’), (‘long’, ‘RB’), (‘will’, ‘MD’), (‘this’, ‘DT’), (‘go’, ‘VB’), (‘on’, ‘IN’), (‘?’, ‘.’), ... ] Paras Sents Tokens Tags
  • 21. Streaming Corpus Preprocessing Tokenized Corpus CorpusReader for streaming access, preprocessing, and saving the tokenized version HTML Paras Sents Tokens Tags Raw Corpus
  • 23. Visualize top tokens, document distribution & part-of-speech tagging Yellowbrick
  • 24. Data Loader Text Normalization Text Vectorization Feature Transformation Estimator Data Loader Feature Union Pipeline Estimator Text Normalization Document Features Text Extraction Summary Vectorization Article Vectorization Concept Features Metadata Features Dict Vectorizer Minke
  • 26. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed) Keyphrase Graph: - 2.7 M nodes - 47 M edges - Average degree of 35
  • 28. Meaningfully literate data products rely on… ...a custom, domain-specific corpus.
  • 29. Meaningfully literate data products rely on… ...a data management layer for flexibility and iteration during modeling. Feature Analysis Algorithm Selection Hyperparameter Tuning
  • 30. corpus ├── citation.bib ├── feeds.json ├── LICENSE.md ├── manifest.json ├── README.md └── books ├── 56d629e7c1808113ffb87eaf.html ├── 56d629e7c1808113ffb87eb3.html └── 56d629ebc1808113ffb87ed0.html └── business ├── 56d625d5c1808113ffb87730.html ├── 56d625d6c1808113ffb87736.html └── 56d625ddc1808113ffb87752.html └── cinema ├── 56d629b5c1808113ffb87d8f.html ├── 56d629b5c1808113ffb87d93.html └── 56d629b6c1808113ffb87d9a.html └── cooking ├── 56d62af2c1808113ffb880ec.html ├── 56d62af2c1808113ffb880ee.html └── 56d62af2c1808113ffb880fa.html Preprocessing Transformer Raw CorpusReader Tokenized Corpus Post-processed CorpusReader Meaningfully literate data products rely on… ...a custom CorpusReader for streaming, and also intermediate storage.
  • 31. Meaningfully literate data products rely on… ...visual steering and graph analysis for interpretation.
  • 35. Thank you! Rebecca Bilbro Twitter: twitter.com/rebeccabilbro Github: github.com/rebeccabilbro Email: rebecca.bilbro@bytecubed.com