Aspects of NLP Practice

Some notes on aspects of applying NLP research in an industrial environment

Published in: Technology

Aspects of NLP Practice

  1. Practical Aspects of NLP Work. Vsevolod Dyomkin, Grammarly. TAAC2012, Kyiv, Ukraine
  2. Topics: * Practical vs Theoretical NLP work * Working with Data for NLP * NLP Tools
  3. A bit about Grammarly (c) xkcd
  4. An example of what we deal with
  5. Research vs Development: “Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that.” --Jay Kreps https://twitter.com/jaykreps/status/219977241839411200
  6. NLP practice. R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy. D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
  7. Research: 1. Set a goal. Business goal: * Develop the best/good enough/better than Word/etc. spellchecker * Develop a set of grammar rules that will catch errors according to MLA Style * Develop a thesaurus that will produce synonyms relevant to context
  8. Translate it to a measurable goal: * On a test corpus of 10000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers/Word spellchecker/etc. * On a corpus of example sentences with each kind of error (and similar sentences without this kind of error), find all sentences with errors and do not find errors in correct sentences * On a test corpus of 1000 sentences, suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
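
The first measurable goal on this slide (fewer FNs and FPs on a test corpus) can be sketched as a small counting routine. This is only an illustration under assumptions: the corpus format (tokens plus the set of positions that are real errors), `count_errors`, and the dictionary-lookup `toy_checker` are all hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical corpus format: (tokens, indices of tokens that are real errors).
Example = Tuple[List[str], Set[int]]


def count_errors(checker: Callable[[List[str]], Set[int]],
                 corpus: Iterable[Example]) -> Dict[str, int]:
    """Count true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for tokens, gold in corpus:
        flagged = checker(tokens)
        tp += len(flagged & gold)   # real errors that were flagged
        fp += len(flagged - gold)   # correct words that were flagged
        fn += len(gold - flagged)   # real errors that were missed
    return {"tp": tp, "fp": fp, "fn": fn}


def toy_checker(tokens: List[str]) -> Set[int]:
    """Assumed baseline: flag every token missing from a tiny word list."""
    vocab = {"this", "is", "a", "test", "sentence"}
    return {i for i, t in enumerate(tokens) if t.lower() not in vocab}


corpus = [
    (["This", "is", "a", "tset"], {3}),          # misspelling, caught
    (["This", "is", "Kyiv"], set()),             # proper noun, wrongly flagged
    (["A", "test", "sentance"], {2}),            # misspelling, caught
    (["This", "is", "a", "test", "test"], {4}),  # repeated word, missed
]

print(count_errors(toy_checker, corpus))
```

Running the same routine over two checkers on the same corpus gives directly comparable FN and FP counts, which is what the goal asks for.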
  9. Research: 1. Set a goal 2. Devise an algorithm 3. Train & improve the algorithm
  10. Research: 1. Set a goal 2. Devise an algorithm 3. Train & improve the algorithm http://nlp-class.org
  11. Research: 4. Test its performance. ML: one corpus, divided into training, development, test
  12. Research: 4. Test its performance. ML: one corpus, divided into training, development, test. Often, different corpora: * for training some part of the algorithm * for testing the whole system
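
A minimal sketch of the single-corpus case (one corpus divided into training, development and test parts). The function name `split_corpus`, the 80/10/10 proportions and the fixed seed are assumptions made for illustration; the slides do not prescribe any of them.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(sentences: Sequence,
                 dev_frac: float = 0.1,
                 test_frac: float = 0.1,
                 seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))
```

Tuning happens on the development part only, so the test part stays untouched until the final measurement.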
  13. Theoretical maxima: Theoretical maxima are rarely achievable. Why?
  14. Theoretical maxima: Theoretical maxima are rarely achievable. Why? * because you need their data
  15. Theoretical maxima: Theoretical maxima are rarely achievable. Why? * because you need their data * domains might differ
  16. Pre/post-processing: What ultimately matters is not crude performance, but...
  17. Pre/post-processing: What ultimately matters is not crude performance, but... Acceptance by users (much harder to measure & depends on domain).
  18. Pre/post-processing: What ultimately matters is not crude performance, but... Acceptance by users (much harder to measure & depends on domain). The real world is messier than any lab set-up.
  19. Examples of pre-processing. For spellcheck: * some people tend to use words separated by slashes, like: spell/grammar check * handling of abbreviations
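
The two pre-processing cases named on this slide could look roughly like the sketch below. It is an assumption-laden illustration, not Grammarly's pipeline: the `ABBREVIATIONS` list, the `preprocess` function and the rule "split on slashes only between purely alphabetic words" are all hypothetical choices.

```python
import re
from typing import List

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr."}

# Two or more alphabetic words joined by slashes, e.g. "spell/grammar".
SLASHED = re.compile(r"[A-Za-z]+(?:/[A-Za-z]+)+")

# A word followed by optional trailing punctuation.
WORD_PUNCT = re.compile(r"^(.*?)([.,;:!?]*)$")


def preprocess(text: str) -> List[str]:
    """Tokenize on whitespace, keep abbreviations whole, split slashed words."""
    tokens: List[str] = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep the final period attached
            continue
        word, punct = WORD_PUNCT.match(chunk).groups()
        if SLASHED.fullmatch(word):
            tokens.extend(word.split("/"))  # check each alternative separately
        elif word:
            tokens.append(word)  # "1/2", URLs, etc. are left untouched
        tokens.extend(punct)  # each punctuation mark becomes its own token
    return tokens


print(preprocess("Run a spell/grammar check, e.g. now."))
```

Splitting only purely alphabetic alternatives is a deliberately conservative choice here, so that fractions and paths are not broken apart before the spellchecker sees them.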
  20. Data: “Data is the next Intel Inside.” --Tim O'Reilly, What is Web 2.0 http://oreilly.com/web2/archive/what-is-web-20.html?page=3
  21. Categorization of Data: * Structured: small * Semi-structured: medium * Unstructured: big
  22. Where to get data? Well-known sources: * Penn Treebank * WordNet * BNC * Web1T Google N-gram Corpus * Linguistic Data Consortium (http://www.ldc.upenn.edu/)
  23. More data. Also well-known sources, but with a twist: * Wikipedia & Wiktionary, DBpedia * OpenWeb Common Crawl * Public APIs of some services: Twitter, Wordnik
  24. Academic resources: * Stanford * CoNLL * Oxford (http://www.ota.ox.ac.uk/) * CMU, MIT,... * LingPipe, OpenNLP, NLTK,...
  25. Crowd-sourced data. Jonathan Zittrain, The Future of the Internet http://goo.gl/hs4qB
  26. And remember... “Data is ten times more powerful than algorithms.” --Peter Norvig, The Unreasonable Effectiveness of Data http://youtu.be/yvDCzhbjYWs
  27. Tools
  28. Levels of NLP tools: High-level: user services. Middle-level: NLP algorithms. Low-level: data-crunching
  29. Choosing a language. Requirement types: * Research * NLP-specific * Production
  30. Research requirements: * Interactivity * Mathematical basis * Expressiveness * Agility Malleability * Advanced tools
  31. Specific NLP requirements: * Good support for statistics & number-crunching (Statistical AI) * Good support for working with trees & symbols (Symbolic AI)
  32. Production requirements: * Scalability * Maintainability * Integrability * ...
  33. Choose Lisp (c) xkcd
  34. Lisp FTW: * Truly interactive environment * Very flexible => DSLs * Native tree support * Fast and solid - No OpenNLP/NLTK
  35. Heterogeneous systems: “Java way” vs. “Unix way”. Create language-agnostic systems that can easily communicate!
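
One possible reading of the "Unix way" is a small filter that exchanges one JSON object per line, so that a component written in any language can sit on either side of a pipe. The sketch below assumes that reading; the message fields (`id`, `text`, `tokens`) and the function names are hypothetical, and the whitespace tokenizer merely stands in for a real NLP component.

```python
import json
import sys
from typing import Iterable, Iterator, List


def tokenize(text: str) -> List[str]:
    """Stand-in for a real NLP component: plain whitespace tokenization."""
    return text.split()


def respond(lines: Iterable[str]) -> Iterator[str]:
    """Turn each JSON request line into one JSON response line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        request = json.loads(line)
        response = {"id": request.get("id"),
                    "tokens": tokenize(request["text"])}
        yield json.dumps(response)


def main() -> None:
    """Filter mode: read requests from stdin, write responses to stdout."""
    for out in respond(sys.stdin):
        print(out)
```

Calling `main()` at the end of a script turns it into a pipe stage; a client in another language then only needs to write and read JSON lines, without knowing how the component is implemented.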
  36. Take-aways: * As they say, in theory research and practice are the same, but in practice... * Data is key. There are 3 types of it. Collect it, build tools to work with it easily and efficiently * Choose a good language for R&D: interactive & malleable, with as few barriers as possible
  37. Thanks! Vsevolod Dyomkin @vseloved