3. moving to linked data
• moving from static HTML to a dynamic,
responsive site
• introducing linked data to power content
aggregations around related topics
• starting to embed linked open data in every
page as RDFa
• using the IPTC rNews vocabulary to
describe content in a machine-readable way
4. impact on journalists
• annotating (“tagging”) content
with topics
• tool embedded into existing
CMS
• concept extraction/NLP for
topic suggestion
• journalists accept/reject
suggested topics for
annotation
6. learning from the pilot
• generally - it works
• but duplication for
big events
• also need pinning
• concept extraction
poor
• journalists gaming
the system
12. next steps
• rolling out tagging to journalists throughout
BBC News
• making better use of rNews/RDFa - full
mark-up integration
• piloting the use of organising content by
storylines
15. BBC News Labs
• Explore opportunities for BBC News
• Using real data
• Prototype quickly
• …which is normally hard in big Orgs…
16. Unlocking the Data in BBC News
• All we have is a bunch of articles...
• What does a “tagged” world look like?
• The Juicer does [badly] what Journalists will do
The News Juicer
• 1: Grab BBC News & Sport articles
• 2: Extract concepts
• 3: Match to DBpedia
• 4: Annotate article
• 5: Push to triplestore
• 6: Expose via API
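The six Juicer steps can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the article, the list of known names, the `about` predicate and the in-memory set standing in for the triplestore are all hypothetical, and the extraction is a plain substring match rather than whatever the Juicer actually uses.

```python
from dataclasses import dataclass, field

DBPEDIA = "http://dbpedia.org/resource/"

# Hypothetical list of names standing in for real concept extraction.
KNOWN_NAMES = ["Andy Murray", "Cheshire"]


@dataclass
class Article:
    url: str
    text: str
    concepts: list = field(default_factory=list)


def grab():
    """Step 1: stand-in for fetching BBC News & Sport articles."""
    return [Article("http://example.org/articles/1", "Andy Murray wins in Cheshire.")]


def extract(article):
    """Step 2: naive extraction by substring match against the known names."""
    return [name for name in KNOWN_NAMES if name in article.text]


def match(name):
    """Step 3: build a DBpedia resource URI (spaces become underscores)."""
    return DBPEDIA + name.replace(" ", "_")


def annotate(article):
    """Step 4: attach the matched URIs to the article."""
    article.concepts = [match(name) for name in extract(article)]
    return article


def push(store, article):
    """Step 5: record (article, 'about', concept) triples in an in-memory set."""
    for uri in article.concepts:
        store.add((article.url, "about", uri))


def articles_about(store, uri):
    """Step 6: the kind of lookup an API endpoint might expose."""
    return sorted(s for (s, p, o) in store if p == "about" and o == uri)


store = set()
for art in grab():
    push(store, annotate(art))
```

Calling `articles_about(store, DBPEDIA + "Andy_Murray")` then returns the URLs of articles annotated with that concept, which is roughly what the Person demo page needs.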
17. Demo
• Juicer : http://staging.juicer.bbcnewslabs.co.uk/
• Person : http://staging.juicer.bbcnewslabs.co.uk/demo/person?q=Andy_Murray
• Place : http://staging.juicer.bbcnewslabs.co.uk/demo/place?q=Cheshire
• News Near Me : http://newsnearme2.herokuapp.com/
18. Next
• “Juice” more of BBC Archive
• Build prototypes
• See what works
• Storyline : News Org Partnerships
- UK's most popular news website: 6 million unique browsers every day (3rd biggest site in the UK after Google and Facebook)
- publish around 500 articles every day: local, national, global
- publish in 27 languages as World Service (+ 2 UK languages alongside English)
- hundreds of journalists, many working cross-media (TV/radio/online)
- articles created in a home-grown Content Management System
- flat page publishing via FTP: good for high-load events but limits our UX and data potential
- migrating to a dynamic publishing platform
- typical three-tier architecture: presentation – service – data
- data layer is a content store (MarkLogic) + a triple store (BigOWLIM) that holds annotations made by journalists about content in the content store
- need to minimize impact on journalists
- integrate with existing tools and workflow as much as possible
- tagging rather than semantic annotation
- suggest concepts rather than free-hand annotation
- Sheffield University's GATE framework for Natural Language Processing identifies the 'things' in an article
- use the concepts in the triple store as a data dictionary
- journalists should mostly just have to accept or reject tags
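A minimal sketch of the suggest-then-review loop described in these notes. It does not use GATE: a whole-phrase match against a small dictionary stands in for the NLP step, and the dictionary entries, `suggest_tags`, `review` and `Annotation` are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical data dictionary (label -> concept URI). In the notes this role
# is played by the concepts held in the triple store; these entries are made up.
CONCEPTS = {
    "Andy Murray": "http://dbpedia.org/resource/Andy_Murray",
    "Cheshire": "http://dbpedia.org/resource/Cheshire",
    "Birmingham": "http://dbpedia.org/resource/Birmingham",
}


def suggest_tags(text, concepts=CONCEPTS):
    """Return URIs of concepts whose label appears as a whole phrase in the text."""
    found = []
    for label, uri in concepts.items():
        if re.search(r"\b" + re.escape(label) + r"\b", text):
            found.append(uri)
    return found


@dataclass
class Annotation:
    article_id: str
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def review(article_id, suggestions, decisions):
    """Apply accept (True) / reject (False) decisions; unreviewed tags are dropped."""
    ann = Annotation(article_id)
    for uri in suggestions:
        if decisions.get(uri) is True:
            ann.accepted.append(uri)
        elif decisions.get(uri) is False:
            ann.rejected.append(uri)
    return ann


text = "Andy Murray visited a school in Cheshire on Monday."
suggestions = suggest_tags(text)
annotation = review("article-1", suggestions, {
    "http://dbpedia.org/resource/Andy_Murray": True,
    "http://dbpedia.org/resource/Cheshire": False,
})
```

Keeping the rejected list alongside the accepted one is a design choice of this sketch: it would let the quality of the suggestions be measured over time.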
- pilot: can we automate the production of the 58 local news region sub-index pages? (old transmitter locations)
- currently an entirely manual task to maintain these pages
- GET articles about or mentioning places that fall within the BBC News region
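The region query in the last note can be illustrated with in-memory data. Everything here is an assumption made for the example: the place-to-region lookup, the region name, the `about`/`mentions` predicates and the article ids are invented, and a real implementation would query the triple store instead.

```python
# Hypothetical place -> region lookup; not the BBC's actual region definitions.
PLACE_REGION = {
    "Birmingham": "West Midlands",
    "Wolverhampton": "West Midlands",
    "Chester": "Cheshire",
}

# Hypothetical journalist annotations as (article, predicate, place) triples.
ARTICLE_TAGS = [
    ("article-1", "about", "Birmingham"),
    ("article-2", "mentions", "Wolverhampton"),
    ("article-3", "about", "Chester"),
    ("article-1", "mentions", "Wolverhampton"),
]


def region_index(region, tags=ARTICLE_TAGS, place_region=PLACE_REGION):
    """Return ids of articles about or mentioning any place inside the region."""
    places = {place for place, r in place_region.items() if r == region}
    seen = []
    for article, predicate, place in tags:
        if predicate in ("about", "mentions") and place in places:
            if article not in seen:  # list each article once per index
                seen.append(article)
    return seen
```

`region_index("West Midlands")` yields the articles for that sub-index page without anyone maintaining the page by hand.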
- generally worked well: journalists' tagging did not cause too much disruption, and we were able to generate aggregations of topic by concept
- BUT we saw some problems:
- duplication where multiple articles were written about large events
- journalists wanted the ability to set the running order (defaults to chronologically most recent)
- quality of concept extraction was poor (may improve over time?)
- journalists gaming the system: adding tags to get on specific indexes, republishing to effect pinning
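One way the requested running order could be modelled is an explicit pin list layered over the default most-recent-first sort, so that republishing is not needed to move an article up. This is a hypothetical sketch, not what was built; `running_order`, the `pinned_ids` parameter and the sample articles are assumptions.

```python
from datetime import datetime


def running_order(articles, pinned_ids=()):
    """Pinned articles first, in the order given; the rest newest first."""
    by_id = {a["id"]: a for a in articles}
    pinned = [by_id[i] for i in pinned_ids if i in by_id]
    rest = sorted(
        (a for a in articles if a["id"] not in pinned_ids),
        key=lambda a: a["published"],
        reverse=True,
    )
    return [a["id"] for a in pinned + rest]


# Made-up articles for illustration only.
articles = [
    {"id": "a", "published": datetime(2013, 5, 1, 9, 0)},
    {"id": "b", "published": datetime(2013, 5, 1, 12, 0)},
    {"id": "c", "published": datetime(2013, 5, 1, 10, 0)},
]
```

With no pins the order is `b, c, a`; with `pinned_ids=("a",)` the oldest article leads the index and the others keep their chronological order.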
- a simple ontology for people, organisations, places and intangibles (themes) and their intersection with events
- based on rNews, the Event ontology and PA's SNaP Stuff ontology
- annotate articles with events, where the event:place is Birmingham etc.
- IPTC rNews terms in RDFa
- basic publishing metadata in the <head> for rich snippets
- linked open data in the body
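As a rough illustration of rNews terms carried as RDFa in the page body, the sketch below renders an article wrapper using RDFa 1.1 attributes (`vocab`, `typeof`, `resource`, `property`). The function name, the example URLs and the choice of the `headline`, `datePublished` and `about` properties are assumptions for the example; the namespace and property names should be checked against the rNews specification, and this is not the markup BBC News actually publishes.

```python
from html import escape

# rNews namespace as assumed here; verify against the IPTC specification.
RNEWS = "http://iptc.org/std/rNews/2011-10-07#"


def article_rdfa(url, headline, published, about_uris):
    """Render an article wrapper with RDFa attributes using rNews terms."""
    lines = [
        f'<div vocab="{RNEWS}" typeof="Article" resource="{escape(url)}">',
        f'  <h1 property="headline">{escape(headline)}</h1>',
        f'  <time property="datePublished" datetime="{escape(published)}">'
        f"{escape(published)}</time>",
    ]
    # One link per concept the article is about (the linked open data part).
    for uri in about_uris:
        lines.append(f'  <link property="about" href="{escape(uri)}" />')
    lines.append("</div>")
    return "\n".join(lines)


markup = article_rdfa(
    "http://example.org/articles/1",
    "Murray & Co. win",
    "2013-05-01",
    [
        "http://dbpedia.org/resource/Andy_Murray",
        "http://dbpedia.org/resource/Cheshire",
    ],
)
```

Escaping every interpolated value keeps the generated markup well-formed even when a headline contains characters such as `&`.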
- immediate results
- rich snippets for articles
- apparently better ranking by topic (anecdotal)
- we introduced the change in the first week of May
- by the end of May we were seeing some positive press coverage; people were noticing