Data Mining News Articles
Amir Othman
About myself.
* Software engineer @ Instance
* Education from Bauhaus Universität
Weimar and Hochschule Ulm
* Love my wife, building cool pieces
of software and making music
* http://www.instance.com.sg
* http://www.amirmeludah.com
About this project.
* Initially intended to be a part of
a thesis project
* Grew into a fun side project
* Fulfilling a weird obsession about
web scraping
What is data mining news
articles?
"Data mining is the computing process of
discovering patterns in large data sets
involving methods at the intersection of
machine learning, statistics, and database
systems."
source:
https://en.wikipedia.org/wiki/Data_mining
What is data mining news
articles?
"Data mining, the science of extracting
useful knowledge from such huge data
repositories, has emerged as a young and
interdisciplinary field in computer science."
source:
http://www.kdd.org/curriculum/index.html
What is data mining news
articles?
"Collecting as much relevant data as possible
in the hope of gaining insights."
- me
Collecting what?
* News articles:
* German news articles
- Regional and national
* Malaysian news articles
- ALL OF THEM!
Why collect this data?
* Building a corpus as raw material
for testing NLP findings
* Piece of digital history
* News organizations go missing -
Wayback Machine not practical
* Cross-validating news sources
How to collect links to
news articles?
* As starting point before expanding
* News aggregators :)
* Search engine
* Curated news from news portals
* Result: Links pointing to news
websites
How to get even more links?
* Related articles – news aggregators
* Tweets from journalists and news
organizations
What about upcoming news?
* We collected a bunch of static
links
* News needs to be fresh and young
What about upcoming news?
* Information retrieval
* age
* freshness
* Effective Web Crawling: PhD Thesis
by Carlos Castillo
Simple is better than complex
What about upcoming news?
* News will (almost) always have RSS
feeds
* Slowly being replaced by Twitter
feeds.
* Advantage
- Subscription instead of frontiers
- Convenient way to get recent news
articles
- Structured
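A minimal sketch of why the "structured" point matters: an RSS item already carries title, link and date as separate fields. It uses only the standard library; the sample feed and the example.com URLs are made up for illustration, and a real crawler would download the feed first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (assumption); real feeds come from the news sites.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://news.example.com/first</link>
      <pubDate>Mon, 05 Jun 2017 08:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://news.example.com/second</link>
      <pubDate>Mon, 05 Jun 2017 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


items = extract_items(SAMPLE_RSS)
for entry in items:
    print(entry["published"], entry["link"])
```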
What about old news?
* Identify the next/other/more links
* Machine learning approach
* Text classification task:
- Is this link with this text a
next/more/other link?
- Train with labeled data - 400
sites from different news
websites
- FastText
- On one iteration:
from 5443 articles to 349111
articles
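To make the classification task concrete, here is a rough sketch. The project used FastText; the code below is NOT FastText but a tiny multinomial naive Bayes classifier with add-one smoothing, swapped in so the example runs with the standard library only. The training texts and labels are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical labelled link texts (assumption): 1 = next/more/other link, 0 = not.
TRAIN = [
    ("next page", 1), ("more articles", 1), ("older posts", 1), ("load more", 1),
    ("contact us", 0), ("about the newspaper", 0), ("subscribe now", 0), ("sports", 0),
]


def train(samples):
    """Count words per class; returns (word_counts, class_counts, vocabulary)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "more news"), predict(model, "contact the editor"))
```

With real data the features would probably include more than the link text (for example the URL pattern), but that is a guess and not something the slides state.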
How To Verify?
* Similarity – above similarity
threshold
* Put through information extraction
pipeline.
* Second layer of sanity check:
- randomly pick link and inspect.
What do we have so far?
* Links pointing to old articles
* RSS and Twitter feeds for links
pointing to new articles
How to retrieve and store
the data?
* Politeness when hitting servers -
schedule delay when on the same
domain
* Queueing with Redis
- One process to push it in a queue
each time we find a new link
- A different process pops the
queue to get the content
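The push/pop split and the per-domain delay can be sketched as follows. This is an illustration under assumptions, not the project's code: a collections.deque stands in for the Redis list, the 5-second delay is an arbitrary value, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque
from urllib.parse import urlparse

CRAWL_DELAY = 5.0  # seconds between hits on the same domain (assumed value)


class PoliteQueue:
    """FIFO link queue that spaces out requests to the same domain."""

    def __init__(self, delay=CRAWL_DELAY):
        self.delay = delay
        self.queue = deque()     # stand-in for a Redis list
        self.next_allowed = {}   # domain -> earliest time of the next fetch

    def push(self, url):
        """Producer side: called each time a new link is found."""
        self.queue.append(url)

    def pop(self, now):
        """Consumer side: return the first URL whose domain may be hit at
        time `now`, or None if every queued domain is still cooling down."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            domain = urlparse(url).netloc
            if self.next_allowed.get(domain, 0.0) <= now:
                self.next_allowed[domain] = now + self.delay
                return url
            self.queue.append(url)  # not allowed yet: rotate to the back
        return None


q = PoliteQueue(delay=5.0)
for link in ("https://a.example.com/1", "https://a.example.com/2",
             "https://b.example.com/1"):
    q.push(link)

# Ask for work at t = 0, 1, 2 and 5 seconds.
fetched = [q.pop(now=t) for t in (0.0, 1.0, 2.0, 5.0)]
print(fetched)
```

In a multi-process setup the producer and consumer would talk to a shared Redis list instead of an in-memory deque, and the consumer would sleep or block rather than receive None.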
How to retrieve and store
the data?
* Scaling with Redis
- Redis Cluster
- multiple servers to get the
content
How to retrieve and store
the data?
* Store with MongoDB
- Document database for documents
- Require the flexibility of
document database
- Save all the extracted
information inside MongoDB
- Sharded Cluster
How to clean the data?
* HTML ==> structured information
{
"title": <news title>,
"content": <content of news>,
"date": <published date>
}
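A small sketch of that HTML-to-record step. It uses the standard library's html.parser only to stay dependency-free; the tag choices (<title>, <p>, <time datetime=...>) and the sample page are assumptions, and real news pages need site-specific rules or an extraction library.

```python
from html.parser import HTMLParser

# Hypothetical article page (assumption).
SAMPLE_HTML = """<html><head><title>Flood hits town</title></head>
<body>
<time datetime="2017-06-05">5 June 2017</time>
<p>Heavy rain fell overnight.</p>
<p>Roads remain closed.</p>
</body></html>"""


class ArticleParser(HTMLParser):
    """Collect <title> text, <p> text and the first <time datetime=...> value."""

    def __init__(self):
        super().__init__()
        self.current = None    # tag whose text is currently being collected
        self.title = ""
        self.paragraphs = []
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.current = "title"
        elif tag == "p":
            self.current = "p"
            self.paragraphs.append("")
        elif tag == "time" and self.date is None:
            self.date = dict(attrs).get("datetime")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data


def to_record(html):
    """Turn raw HTML into the title/content/date structure."""
    parser = ArticleParser()
    parser.feed(html)
    content = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"title": parser.title.strip(), "content": content, "date": parser.date}


record = to_record(SAMPLE_HTML)
print(record)
```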
How to clean the data?
* Alternative 1:
- BeautifulSoup
- disadvantage: manual
- advantage: precise
* Alternative 2:
- readability-lxml
- date and title extraction for
free!
- disadvantage: error prone
- advantage: fully automated
How to clean the data?
* For data coming from RSS and
Twitter feeds:
- Cross-validate with metadata
What can be extracted from
the data?
* Language detection:
- pycld2
* Named Entity Recognition:
- spaCy
- Polyglot
* Topic Modelling:
- Gensim
Computing what is trending.
* Extract named entities and rank
them by their tf-idf score.
* Named entity recognition:
- extract names, places, etc.
* tf-idf
- A fancier way of counting the
frequency of words
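A rough sketch of the ranking step, assuming the named entities have already been extracted. It uses one common tf-idf variant (term frequency normalised by document length, times log(N / document frequency)); the slides do not say which variant the project used, and the entity lists are made up.

```python
import math
from collections import Counter

# Hypothetical input (assumption): entities found in each article.
corpus = [
    ["Weimar", "Berlin", "Weimar"],
    ["Berlin", "Ulm"],
    ["Berlin", "Kuala Lumpur"],
]


def rank_entities(doc, corpus):
    """Return (entity, tf-idf score) pairs for `doc`, highest score first."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for entity, count in counts.items():
        doc_freq = sum(1 for other in corpus if entity in other)
        idf = math.log(n_docs / doc_freq)
        scores[entity] = (count / len(doc)) * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranked = rank_entities(corpus[0], corpus)
print(ranked)
```

"Berlin" appears in every article, so its idf is log(1) = 0 and it drops to the bottom, while "Weimar" is frequent here and rare elsewhere, so it ranks first.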
Querying and similarity
* Querying:
- Elasticsearch for full-text
search
* Similarity lookup:
- Run word2vec on entire corpus
- Filter dictionary to only contain
named entities
- Get nearest neighbours
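The similarity lookup can be sketched like this, assuming the word vectors already exist. The three-dimensional vectors below are invented; real word2vec vectors are trained on the corpus and have far more dimensions. Cosine similarity is assumed as the distance measure.

```python
import math

# Hypothetical toy vectors (assumption).
vectors = {
    "Berlin": [0.9, 0.1, 0.0],
    "Weimar": [0.8, 0.2, 0.1],
    "Kuala Lumpur": [0.1, 0.9, 0.0],
    "said": [0.4, 0.4, 0.4],
}
named_entities = {"Berlin", "Weimar", "Kuala Lumpur"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_entities(query, k=2):
    """Return the k named entities most similar to `query`."""
    # Filter the vocabulary so only named entities can be neighbours.
    candidates = [w for w in vectors if w in named_entities and w != query]
    candidates.sort(key=lambda w: cosine(vectors[query], vectors[w]), reverse=True)
    return candidates[:k]


neighbours = nearest_entities("Berlin")
print(neighbours)
```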
Use case: Automated
timeline creation
* Web application that consumes the
data through a REST API
* www.kronologimalaysia.com
* www.diezeitachse.de
Questions
othman.amir@gmail.com

Data mining news articles by Amir Othman for PyCon APAC 2017
