
Hacking Human Language (PyCon Sweden 2015)



Video: https://www.youtube.com/watch?v=JXjB8yO-M7k
Abstract: This talk introduces computational social science as a new research discipline, gives a brief introduction to natural language processing and explains how word vector representations are computed and how to use them in Python. Word vector representations like word2vec encode semantic relationships like gender and "is the capital city of". This makes it easy to find similar words and compare them visually. To illustrate this, I am using the gensim and scikit-learn Python libraries to compare my own Google searches from 2011 and 2014.


Hacking Human Language (PyCon Sweden 2015)

  1. 1. Hacking Human Language. Hendrik Heuer. PyCon Sweden, Stockholm.
  2. 2. Hacking?!
  3. 3. “Access to computers — and anything which might teach you something about the way the world works — should be unlimited and total. Always yield to the Hands-On Imperative!” – Hacker Ethics. Levy, Steven (2001). Hackers: Heroes of the Computer Revolution (updated ed.). New York: Penguin Books. ISBN 0141000511. OCLC 47216793.
  4. 4. Agenda • Computational Social Science • Natural Language Processing • Word Vector Representations • Visualising and comparing my Google searches
  5. 5. D. Crandall and N. Snavely, ‘Modeling People and Places with Internet Photo Collections’, Commun. ACM, vol. 55, no. 6, pp. 52–60, Jun. 2012. DOI: 10.1145/2184319.2184336
  6. 6. Computational Social Science / Digital Humanities • combines computer science & social sciences • makes new research possible, e.g. the analysis of massive social networks and the content of millions of books. immersion.media.mit.edu
  7. 7. Massive-scale automated analysis of news content • 2.5 million articles from 498 different English-language news outlets (Reuters & New York Times Corpus) • automatically annotated into 15 topic areas • the topics were compared with regard to readability, linguistic subjectivity and gender imbalances. I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, and N. Cristianini, ‘Research Methods in the Age of Digital Journalism: Massive-scale automated analysis of news-content: topics, style and gender’, Digital Journalism, vol. 1, no. 1, 2013. DOI:10.1080/21670811.2012.714928
  8. 8. Linguistic Subjectivity: adjectives (Part-of-Speech Tagging) & SentiWordNet
  9. 9. “Low level of political interest and engagement could be connected to the lack of subjectivity (adjectival excess)” Linguistic Subjectivity: adjectives (Part-of-Speech Tagging) & SentiWordNet
  10. 10. Male-to-Female Ratio: Named Entity Recognition
  11. 11. Male-to-Female Ratio: Named Entity Recognition. “Gender bias in sports coverage (...) females only account for between 7 and 25 per cent of coverage”
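The slides only show results for this step; as a rough illustration of how a male-to-female ratio could be estimated with the NLTK tools introduced later in the talk, here is a minimal sketch (classifying PERSON entities by first name via NLTK's names corpus is my assumption, not the method of Flaounas et al.):

      import nltk
      from nltk.corpus import names   # requires nltk.download('names') plus the standard tagger/chunker data

      male_names = set(names.words('male.txt'))
      female_names = set(names.words('female.txt'))

      def gender_counts(text):
          # tokenize, POS-tag and chunk named entities
          tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
          counts = {'male': 0, 'female': 0}
          for person in tree.subtrees(lambda t: t.label() == 'PERSON'):
              first_name = person.leaves()[0][0]
              if first_name in male_names:
                  counts['male'] += 1
              elif first_name in female_names:
                  counts['female'] += 1
          return counts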
  12. 12. Tools by area: Machine Learning (scikit-learn), Text Processing (Natural Language Toolkit, spaCy), Topic Modeling (gensim, word2vec), Visualization (d3.js, Google Chart API, Highcharts)
  13. 13. Introduction to Natural Language Processing
  14. 14. nltk.org/book
  15. 15. Word Tokenization: splitting a sentence into single words
      >>> from nltk.tokenize import word_tokenize
      >>> word_tokenize("All your base are belong to us")
      ['All', 'your', 'base', 'are', 'belong', 'to', 'us']
  16. 16. Sentence Tokenization: splitting a text into sentences
      >>> from nltk.tokenize import sent_tokenize
      >>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
      ['Hello, Mr. Anderson.', 'We missed you!']
  17. 17. Sentence Tokenization: splitting a text into sentences (other languages, e.g. Swedish)
      >>> import nltk
      >>> swedish_tokenizer = nltk.data.load("tokenizers/punkt/swedish.pickle")
      >>> swedish_tokenizer.tokenize(text)   # the loaded Punkt model exposes .tokenize()
  18. 18. Stemming: finding the word stem or root form
      >>> import nltk
      >>> porter = nltk.PorterStemmer()
      >>> lancaster = nltk.LancasterStemmer()
      >>> wnl = nltk.WordNetLemmatizer()
      >>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
      ['investigation', 'woman']
      >>> [porter.stem(w) for w in ['investigation', 'women']]
      ['investig', 'women']
      >>> [lancaster.stem(w) for w in ['investigation', 'women']]
      ['investig', 'wom']
  19. 19. Part-of-Speech Tagging: identifying nouns, verbs, adjectives, ...
      >>> import nltk
      >>> text = "In the middle ages Sweden had the same king as Denmark and Norway."
      >>> words = nltk.word_tokenize(text)
      >>> nltk.pos_tag(words)
      [('In', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('ages', 'NNS'), ('Sweden', 'NNP'), ('had', 'VBD'),
       ('the', 'DT'), ('same', 'JJ'), ('king', 'NN'), ('as', 'IN'), ('Denmark', 'NNP'), ('and', 'CC'),
       ('Norway', 'NNP'), ('.', '.')]
      Tags: NN* noun, VB* verb, JJ* adjective, RB* adverb, DT determiner, IN preposition
  20. 20. Named Entity Recognition: identifying people, organizations, locations, ...
      >>> import nltk
      >>> text = "New York City is the largest city in the United States."
      >>> words = nltk.word_tokenize(text)
      >>> nltk.ne_chunk(nltk.pos_tag(words))
      Tree('S', [Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'),
       ('largest', 'JJS'), ('city', 'NN'), ('in', 'IN'), ('the', 'DT'),
       Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]), ('.', '.')])
      Entity types: ORGANIZATION (Georgia-Pacific Corp., WHO), PERSON (Eddy Bonte, President Obama),
      LOCATION (Murray River, Mount Everest), DATE (June, 2008-06-29), TIME (two fifty a m, 1:30 p.m.),
      MONEY (GBP 10.40), PERCENT (twenty pct, 18.75 %), FACILITY (Washington Monument, Stonehenge),
      GPE (South East Asia, Midlothian) - geo-political entity
  21. 21. Sentiment Analysis: telling whether a sentence is positive or negative
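The deck only names the task here and points to the Stanford CoreNLP tools on the next slide; as a Python-only stand-in, a minimal sketch with NLTK's bundled VADER analyzer (the analyzer and lexicon are my choice, not what the talk used):

      >>> import nltk
      >>> nltk.download('vader_lexicon')   # one-time download of the VADER lexicon
      >>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
      >>> sia = SentimentIntensityAnalyzer()
      >>> sia.polarity_scores("I love this talk!")          # 'compound' > 0: positive
      >>> sia.polarity_scores("This was a waste of time.")  # 'compound' < 0: negative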
  22. 22. Stanford Core NLP Tools
  23. 23. Vector Representations
  24. 24. “You shall know a word by the company it keeps” – J. R. Firth, 1957. Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
  25. 25. “You shall know a word by the company it keeps” – J. R. Firth, 1957. Quoted after Socher.
  26. 26. Vectors are directions in space
  27. 27. Vectors are directions in space (quoted after Socher). word2vec: representing a word with a vector
  28. 28. word2vec: representing a word with a vector. Vectors can encode relationships: MAN : WOMAN, UNCLE : AUNT, KING : QUEEN. T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781
  29. 29. word2vec: man is to woman as king is to ? (KING : QUEEN, KINGS : QUEENS). T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781
  30. 30. word2vec: representing a word with a vector. T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘Efficient Estimation of Word Representations in Vector Space’, CoRR, vol. abs/1301.3781, 2013 [Online]. Available: http://arxiv.org/abs/1301.3781
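The man : woman :: king : ? analogy maps to a single gensim query; a minimal sketch, assuming a pre-trained model file such as the GoogleNews vectors (the file name is hypothetical, and older gensim versions exposed the same loader on Word2Vec instead of KeyedVectors):

      from gensim.models import KeyedVectors

      # load pre-trained word vectors (path is hypothetical)
      vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

      # man is to woman as king is to ?
      print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
      # with these vectors, 'queen' is expected to rank first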
  31. 31. Sweden Most similar words
  32. 32. Sweden Most similar words
  33. 33. Sweden Most similar words
  34. 34. Sweden Most similar words
  35. 35. Harvard Most similar words
  36. 36. Link: https://radimrehurek.com/gensim/models/word2vec.html
  37. 37. Link: https://radimrehurek.com/gensim/models/word2vec.html
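The "most similar words" screenshots correspond to gensim's most_similar call; a short sketch using the same hypothetical pre-trained vectors as above:

      from gensim.models import KeyedVectors

      vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
      # nearest neighbours in vector space, as shown for 'Sweden' and 'Harvard' in the slides
      print(vectors.most_similar("Sweden", topn=10))
      print(vectors.most_similar("Harvard", topn=10))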
  38. 38. Link: https://honnibal.github.io/spaCy/
  39. 39. Link: https://honnibal.github.io/spaCy/
  40. 40. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al.
  41. 41. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al. (2 words context window)
  42. 42. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al. (5 words context window / 2 words context window)
  43. 43. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al.
  44. 44. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al.
  45. 45. spaCy: dependency-based word representations by Levy and Goldberg. Gensim: word2vec by Mikolov et al.
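The slides contrast spaCy's dependency-based vectors with gensim's word2vec without showing code; a minimal sketch of querying word similarity through spaCy's token vectors (the model name en_core_web_md is an assumption; any model that ships with word vectors works):

      import spacy

      nlp = spacy.load("en_core_web_md")      # hypothetical model with word vectors
      doc = nlp("Stockholm Sweden banana")
      print(doc[0].similarity(doc[1]))        # cosine similarity: Stockholm vs Sweden (high)
      print(doc[0].similarity(doc[2]))        # Stockholm vs banana (low)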
  46. 46. Link: https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
  47. 47. Applications
  48. 48. Machine Translation T. Mikolov, Q. V. Le, and I. Sutskever, ‘Exploiting Similarities among Languages for Machine Translation’, CoRR, vol. abs/1309.4168, 2013 [Online]. Available: http://arxiv.org/abs/1309.4168
  49. 49. Compare my Google searches
  50. 50. Link: https://support.google.com/websearch/answer/6068625?hl=en
  51. 51. Google search history JSON (excerpt):
      { "event": [
          {"query": {"id": [{"timestamp_usec": "1317002730153183"}], "query_text": "google hangout"}},
          {"query": {"id": [{"timestamp_usec": "1316577601549660"}], "query_text": "eurokrise"}},
          {"query": {"id": [{"timestamp_usec": "1315592145720230"}], "query_text": "hoverboard"}}
      ] }
      parsed_json['event'][42]['query']['query_text']
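A short sketch for reading the exported search history into the parsed_json structure used above (the file name Searches.json is hypothetical; the Takeout export splits the history into several quarterly JSON files):

      import json

      with open("Searches.json") as f:         # hypothetical file from the Takeout archive
          parsed_json = json.load(f)
      queries = [e["query"]["query_text"] for e in parsed_json["event"]]
      print(queries[:10])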
  52. 52. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON
  53. 53. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON
  54. 54. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON Link gensim: https://radimrehurek.com/gensim/! Link word2vec: https://code.google.com/p/word2vec/
  55. 55. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON
  56. 56. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON linguistics
  57. 57. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON Link: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
  58. 58. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON
  59. 59. 1. Find Word Representations word2vec 2. Dimensionality Reduction t-SNE 3. Output JSON Link: https://github.com/mbostock/d3/wiki/Gallery
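A minimal sketch of the three-step pipeline from the slides, assuming the tokenized search queries are already available as lists of words in a variable named sentences (the variable name and output file name are assumptions):

      import json
      import numpy as np
      from gensim.models import Word2Vec
      from sklearn.manifold import TSNE

      # 1. word representations: train word2vec on the tokenized queries
      model = Word2Vec(sentences, vector_size=100, min_count=1)   # the parameter is called 'size' in older gensim
      words = list(model.wv.index_to_key)                         # 'index2word' in older gensim
      vectors = np.array([model.wv[w] for w in words])

      # 2. dimensionality reduction: project the vectors down to 2D with t-SNE
      coords = TSNE(n_components=2).fit_transform(vectors)

      # 3. output JSON for a d3.js scatter plot
      points = [{"word": w, "x": float(x), "y": float(y)} for w, (x, y) in zip(words, coords)]
      with open("searches.json", "w") as f:
          json.dump(points, f)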
  60. 60. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  61. 61. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  62. 62. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  63. 63. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  64. 64. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  65. 65. My Google Searches Oct – Dec 2014 Jul – Sep 2011 Both, 2011 & 2014
  66. 66. Hacking Human Language. Hendrik Heuer, PyCon Sweden, Stockholm. hendrikheuer@gmail.com, http://hen-drik.de, @hen_drik. Thanks to Andrii, Jussi & Roelof. Slides: https://tinyurl.com/pycon-word2vec
  67. 67. CBOW: predict the current word. Input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}; output: w_i
  68. 68. CBOW: predict the current word (input: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}; output: w_i). Skip-gram: predict the surrounding words (input: w_i; output: w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2})
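In gensim, the two training objectives on these last slides correspond to the sg flag of Word2Vec; a small sketch, with sentences again standing in for the tokenized corpus:

      from gensim.models import Word2Vec

      # sg=0 (default): CBOW, predict the current word from its context window
      cbow_model = Word2Vec(sentences, sg=0, window=2)
      # sg=1: skip-gram, predict the surrounding words from the current word
      skipgram_model = Word2Vec(sentences, sg=1, window=2)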
