
TNA taxonomies 20160525

Presentation of Taxonomy applications

Published in: Technology

  1. Jeremie Charlet, 25th May 2016: Presentation of Taxonomy Applications and their development, to the BBC
  2. Introduction – Categorisation was initially done with Autonomy: two years' work from the Taxonomy team to write and perfect category queries – Since we migrated our search engine to Solr, we had to build the taxonomy tools from scratch. Example category query for “air force”: "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
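A category query of this kind is essentially an OR of phrases and synonyms matched against a record's text. A minimal sketch of the idea (class and method names are illustrative, not from the actual codebase; the real queries are evaluated by a search engine with analysis and boosting, not substring matching):

```java
import java.util.List;

public class CategoryQueryDemo {

    // A category "matches" when any of its phrases occurs in the text.
    // Case-insensitive substring matching is a deliberate simplification
    // of what real Autonomy/Lucene query evaluation does.
    static boolean matches(String text, List<String> phrases) {
        String lower = text.toLowerCase();
        return phrases.stream().anyMatch(p -> lower.contains(p.toLowerCase()));
    }

    public static void main(String[] args) {
        List<String> airForce = List.of("air force", "air forces", "Air Ministry",
                "Air Historical Branch", "Air Department");
        System.out.println(matches("Records of the Air Ministry, 1918-1964", airForce)); // true
    }
}
```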
  3. Plan: Introduction; 1. Solution; 2. How we implemented it; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  4. 1/ Solution: Categories displayed on Discovery, our archives portal; Administration User Interface for taxonomists; Command Line Interface to categorise everything once; Batch Job to categorise documents every day
  5. 1. Solution / Discovery
  6. 1. Solution / admin GUI
  7. 1. Solution / admin GUI
  8. 1. Solution / daily updates: Application to categorise documents every day: 1. to categorise new documents; 2. to re-categorise documents when they are updated
  9. 1. Solution / daily updates
  10. 1. Solution / categorise all docs: Application to categorise everything once: 1. to do it for the first time; 2. to apply the latest modifications from taxonomists on all documents
  11. 1. Solution / categorise all docs: under the hood of taxonomy-batch-cat-all
  12. 1. Solution / categorise all docs: Categorisation and updates on Solr are decoupled
  13. 1. Solution: Architecture diagram for daily updates (Java side)
  14. Plan: Introduction; 1. Solution – Discovery portal – Administration UI – Tool to categorise everything once – Batch Job to categorise every day; 2. How we implemented it; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  15. 2. Implementation: To get it right / To get it fast • Algorithm • Fine tuning • Distributed system with Akka
  16. 2. Implementation / get it right: Many parameters to take into account • Is case sensitivity important? It depends • Use punctuation? It depends • Use synonyms? Yes • Ignore stop words (of, the, a, …)? No, use stop words • Use wildcards? * ? • Which metadata to use? Title, description, context description, categories, people, places, corporate bodies. = Iterative process. How to evaluate whether our results are valid? > Use documents and categories from the former system > Categorise them again and compare the results. To do that quickly, we created a Command Line Interface: [jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
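The "categorise again and compare" evaluation can be sketched as an agreement score between the former system's category assignments and the new ones (all names here are illustrative; the actual CLI compares against the categories the Autonomy-era system had assigned):

```java
import java.util.Map;
import java.util.Set;

public class CategorisationEval {

    // For each document, count how many of the former system's categories
    // the new system also assigned, and return the overall proportion.
    // This is a sketch of the evaluation idea, not the real tool.
    static double agreement(Map<String, Set<String>> formerCats,
                            Map<String, Set<String>> newCats) {
        long matched = 0, total = 0;
        for (var e : formerCats.entrySet()) {
            Set<String> expected = e.getValue();
            Set<String> actual = newCats.getOrDefault(e.getKey(), Set.of());
            total += expected.size();
            matched += expected.stream().filter(actual::contains).count();
        }
        return total == 0 ? 1.0 : (double) matched / total;
    }

    public static void main(String[] args) {
        var former = Map.of("C456321", Set.of("Military", "Air Force"));
        var fresh = Map.of("C456321", Set.of("Air Force"));
        System.out.println(agreement(former, fresh)); // 0.5
    }
}
```

Re-running this after each tuning change (synonyms on/off, stop words, metadata fields) makes the iterative process measurable.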
  17. 2. Implementation / get it fast: We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc) • We create an in-memory index with a single document and run our queries against it; then we run the matching queries against the complete index to obtain a score that enables us to rank matches • Distributed system with Akka (13 processes running on 2 servers: 2 × 24-core CPUs, 40 GB RAM)
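The two-phase matching described above can be sketched abstractly: phase 1 tests every category query against the single document, phase 2 scores only the candidates to rank them. The real implementation does this with Lucene's per-document in-memory index; the types below are stand-ins for Lucene queries and scorers:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

public class TwoPhaseCategoriser {

    // Stand-in for a category query: a cheap boolean match plus a scorer.
    record Category(String id, Predicate<String> query, ToDoubleFunction<String> scorer) {}

    // Phase 1: keep only categories whose query matches this one document.
    // Phase 2: score the surviving candidates and rank them.
    static List<String> categorise(String docText, List<Category> categories) {
        return categories.stream()
                .filter(c -> c.query().test(docText))
                .sorted(Comparator.comparingDouble(
                        (Category c) -> c.scorer().applyAsDouble(docText)).reversed())
                .map(Category::id)
                .toList();
    }
}
```

The point of phase 1 is that matching 136 queries against one tiny index is cheap; only the handful of matches pay the cost of full scoring.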
  18. 2. Implementation / get it fast: Use the right directory implementation for your system (NRTCachingDirectory instead of the default one) > 1 line in 1 file = 20% faster on search queries. Use a filter instead of a query to search on only 1 document, and use the low-level API carefully. Profile your application frequently > identify ugly code, where to add caching, where to add concurrency. We spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory.
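The last point, building each category's Query object once and reusing it for every document, is plain memoisation. A sketch with a generic cache (names are illustrative, and Object stands in for Lucene's Query type):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class QueryCache {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    final AtomicInteger builds = new AtomicInteger(); // exposed only to show the saving

    // Build the (expensive) query for a category at most once, then reuse it
    // for every document instead of rebuilding it per document.
    Object queryFor(String categoryId, Function<String, Object> build) {
        return cache.computeIfAbsent(categoryId, id -> {
            builds.incrementAndGet();
            return build.apply(id);
        });
    }
}
```

With 22 million documents and 136 categories, moving query construction out of the per-document loop is exactly the kind of win a profiler makes visible.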
  19. 2. Implementation / get it fast: How to transmit documents to categorise efficiently? By sending messages to workers. See the problem? [Diagram: a Categorisation Supervisor pushes batches of document IDs (C456321; C65465; C654879; C56879; …) to three Categorisation Workers, with far more messages queued up than the workers can consume]
  20. 2. Implementation / get it fast: Solution: the Akka work-pulling pattern (http://www.michaelpollmeier.com/akka-work-pulling-pattern/)
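The essence of work pulling is that workers ask for a batch only when they are free, instead of the supervisor flooding their mailboxes. A JDK-only sketch of that idea using a bounded queue for back-pressure (Akka's pattern achieves the same with explicit "give me work" messages; none of the names below come from the taxonomy codebase):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkPullingSketch {

    // Workers take a batch only when idle; the small bounded queue blocks the
    // supervisor when all workers are busy, so no mailbox can fill up.
    public static int run(List<List<String>> batches, int workers) {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(2);
        ConcurrentHashMap<String, Boolean> categorised = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    List<String> batch;
                    while (!(batch = queue.take()).isEmpty()) {           // empty batch = stop signal
                        batch.forEach(doc -> categorised.put(doc, true)); // "categorise" the batch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (List<String> b : batches) queue.put(b);            // blocks while workers are busy
            for (int i = 0; i < workers; i++) queue.put(List.of()); // one stop signal per worker
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return categorised.size();
    }
}
```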
  21. 2. Implementation / get it fast: Applied to the taxonomy applications (https://github.com/nationalarchives/taxonomy). There are 2 types of batch applications (each runs in its own application server): • 1 instance of taxonomy-cat-all-supervisor • N instances of taxonomy-cat-all-worker. The categorisation supervisor browses the whole index and retrieves 1000 documents at a time; each categorisation worker receives categorisation requests that contain a list of documents to categorise.
  22. Plan: Introduction; 1. Solution – Discovery portal – Administration UI – Tool to categorise everything once – Batch Job to categorise every day; 2. How we implemented it – Get it right – Get it fast • Fine tuning • Distributed system with Akka; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  23. 3. Attempt on Machine Learning: Research on a training-set-based solution for 2 months. 1. Take a data set of known (already classified) documents. 2. Split it into a test set and a training set – train the system with the training set – evaluate it using the test set – iterate until satisfactory. 3. Move it to production – classify new documents using the trained system.
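Step 2 above starts with a reproducible split of the labelled documents. A minimal sketch (the 80/20 ratio and fixed seed are assumptions for illustration; the slides do not say what split was used):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplit {

    // Shuffle the labelled documents with a fixed seed (so the experiment is
    // repeatable) and cut them into a training set and a test set.
    // Returns [trainingSet, testSet].
    static List<List<String>> split(List<String> labelledDocs, double trainRatio, long seed) {
        List<String> shuffled = new ArrayList<>(labelledDocs);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * trainRatio);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }
}
```

Training on one part and evaluating on the held-out part is what lets the iteration in step 2 detect overfitting rather than just memorisation.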
  24. 3. Attempt on Machine Learning: Why it did not work. 1. Using category queries to create the training set – highly dependent on the validity/accuracy of the category queries. 2. The nature of our categories – far too many (136), and categories too vague/broad or too similar (“Poverty”, “Military”) do not suit such a system. 3. Not the right tool? We used Lucene's (search engine) built-in tool. 4. The nature of the data? The quality of the metadata?
  25. Plan: Introduction; 1. Solution – Discovery portal – Administration UI – Tool to categorise everything once – Batch Job to categorise every day; 2. How we implemented it – Get it right – Get it fast • Fine tuning • Distributed system with Akka; 3. Attempt on Machine Learning; Conclusion: learnings and next steps. http://discovery.nationalarchives.gov.uk/
  26. Conclusion: learnings and next steps. Gains and losses: no * wildcards within words; categorisation 10 times faster; use of free solutions (*); admin interface more fluid and usable
  27. Conclusion: learnings and next steps. Possible improvements: – Update documents for 1 category on demand – Create a more generic solution – Add the missing GUIs (reporting, categorise all) – Build the solution upon Solr, not Lucene – Use cloud services instead of onsite servers. Next steps: – Categorise other archives – Work on new born-digital records → new categories? → new research on machine learning?
  28. Thank you for listening. Any questions?
