NLP from scratch

  1. Natural Language Processing (Almost) From Scratch (Bryan Hang Zhang)
  2. • Presents a deep neural network architecture for NLP tasks • Presents results comparable to the state of the art on 4 NLP tasks: Part-of-Speech tagging, Chunking, Named Entity Recognition, Semantic Role Labeling • Presents word embeddings learned from a large unlabelled corpus and shows that using these features improves results • Presents results of joint training for the above tasks
  3. Motivation • Propose a unified neural network architecture and learning algorithm that can be applied to various NLP tasks • Instead of creating hand-crafted features, acquire task-specific features (internal representations) from large amounts of labelled and unlabelled training data
  4. Task Introduction • Part-of-Speech Tagging: automatically assign a part-of-speech tag to each word in a text sequence • Chunking: also called shallow parsing; identifies parts of speech and short phrases (such as noun phrases) • Named Entity Recognition: classify elements of the text into predefined categories such as person, location, etc.
  5. Semantic Role Labeling • SRL, sometimes also called shallow semantic parsing, is the task of detecting the semantic arguments associated with the predicate (verb) of a sentence and classifying them into their specific roles • e.g. "Mark sold the car to Mary": Mark = agent, sold = predicate, the car = theme, Mary = recipient
  6. State-of-the-art Systems: Experiment Setup
  7. Benchmark Systems
  8. Network Architecture
  9. The Networks • Traditional approach: hand-designed features • New approach: multi-layer neural networks
  10. Outline • Transforming Words into Feature Vectors • Extracting Higher Level Features from Word Feature Vectors • Training • Benchmark Results
  11. Outline • Transforming Words into Feature Vectors • Extracting Higher Level Features from Word Feature Vectors • Training • Benchmark Results
  12. Neural Network
  13. Two Approaches Overview • Window approach network • Sentence approach network
  14. Lookup Tables • Each of the K discrete features is mapped through its own lookup-table matrix; a minimal sketch follows below
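A minimal sketch of a lookup table, assuming PyTorch and illustrative sizes taken from the slides (130,000 words, 50-dimensional word features): a discrete feature index simply selects a row of a trainable matrix.

```python
import torch
import torch.nn as nn

# A lookup table as a trainable matrix: one row per discrete feature value.
# Sizes are illustrative (word lookup table from the slides).
vocab_size, word_dim = 130_000, 50
word_table = nn.Embedding(vocab_size, word_dim)

word_indices = torch.tensor([17, 42, 7])   # hypothetical indices for "My Name is"
word_vectors = word_table(word_indices)    # shape: (3, 50), one row per word
print(word_vectors.shape)
```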
  15. Words to Features: Window Approach • Window size: for example, 5 • Raw text features: lower-case word, capitalisation feature
  16. Window Approach • Sliding a size-5 window over "My Name is Bryan" with padding gives: [PADDING PADDING My Name is], [PADDING My Name is Bryan], [My Name is Bryan PADDING], [Name is Bryan PADDING PADDING]
  17. Words to Features • Each word is mapped to a word index into the word lookup table (vocabulary size 130,000, feature dimension 50) and a caps index into the caps lookup table (5 options, feature dimension 5)
  18. Words to Features • The window [PADDING PADDING My Name is] is mapped to a 275-dimensional feature vector: 5 words × (50 + 5) features per word; a sketch follows below
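A sketch of this words-to-features step under the same assumptions (PyTorch, hypothetical word and caps indices): each word contributes a 50-d word vector and a 5-d capitalisation vector, and a window of 5 words flattens to the 275-dimensional vector mentioned above.

```python
import torch
import torch.nn as nn

# Window-approach input construction. Indices are hypothetical placeholders.
word_table = nn.Embedding(130_000, 50)     # word lookup table
caps_table = nn.Embedding(5, 5)            # caps lookup table (5 options)

PAD = 0                                    # hypothetical padding index
window_words = torch.tensor([PAD, PAD, 11, 23, 35])   # "PADDING PADDING My Name is"
window_caps  = torch.tensor([0, 0, 1, 1, 0])          # e.g. 1 = initial capital

per_word = torch.cat([word_table(window_words), caps_table(window_caps)], dim=1)
window_vector = per_word.flatten()         # 5 words * (50 + 5) = 275 dimensions
print(window_vector.shape)                 # torch.Size([275])
```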
  19. Outline • Transforming Words into Feature Vectors • Extracting Higher Level Features from Word Feature Vectors • Training • Benchmark Results
  20. Extracting Higher Level Features from Word Feature Vectors • Any feed-forward neural network with L layers can be seen as a composition of functions, one per layer l, each with its own parameters: f_θ(·) = f_θ^L(f_θ^(L-1)(... f_θ^1(·) ...))
  21. Window Approach • For the word at position t = 3 with window half-size dwin = 2, the window vector stacks the lookup-table columns of the 2·dwin + 1 = 5 surrounding words into a single vector, e.g. the 275-dimensional vector for [PADDING PADDING My Name is]
  22. Linear Layer (window approach) • The fixed-size vector f_θ^1 can be fed to one or several standard neural-network layers that perform affine transformations: f_θ^l = W^l f_θ^(l-1) + b^l, where W^l ∈ R^(n_hu^l × n_hu^(l-1)) and b^l ∈ R^(n_hu^l) are the parameters to be trained, and n_hu^l is the number of hidden units of the l-th layer • Several linear layers are stacked, interleaved with a non-linearity, to extract highly non-linear features; without the non-linearity the network would be just a linear model
  23. HardTanh Layer (window approach) • Non-linear feature: HardTanh(x) = -1 for x < -1, x for -1 ≤ x ≤ 1, 1 for x > 1 • HardTanh is used instead of the hyperbolic tangent because it is cheaper to compute; a sketch of the full window network follows below
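A sketch of the window-approach scoring network described on the last two slides, assuming PyTorch and illustrative layer sizes (300 hidden units, 45 output tags): a linear layer, a HardTanh non-linearity, and a task-specific linear output layer.

```python
import torch
import torch.nn as nn

# Window-approach network: 275-d window vector -> hidden layer -> HardTanh
# -> one score per tag for the centre word. Sizes are illustrative.
n_input, n_hidden, n_tags = 275, 300, 45

window_net = nn.Sequential(
    nn.Linear(n_input, n_hidden),   # f^2 = W^2 f^1 + b^2
    nn.Hardtanh(),                  # clamp to [-1, 1], cheaper than tanh
    nn.Linear(n_hidden, n_tags),    # tag scores for the centre word
)

scores = window_net(torch.randn(275))
print(scores.shape)                 # torch.Size([45])
```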
  24. Window Approach Remark • The window approach works well for most NLP tasks; however, it fails for Semantic Role Labelling • Reason: the tag of a word depends on the verb (predicate) chosen beforehand in the sentence, and if the verb falls outside the window the word cannot be expected to be tagged correctly • This motivates the sentence approach
  25. Convolutional Layer: Sentence Approach • A generalisation of the window approach: all windows in the sentence are taken into consideration
  26. Neural Network Architecture (sentence approach): Words → Lookup Table → Convolution → Max Over Time → Linear Layer → HardTanh → Linear Layer
  27. Convolutional NN
  28. Time Delay Neural Network
  29. Max Layer: Sentence Approach • The maximum of each hidden unit is taken over all positions t, giving a fixed-size representation of the whole sentence; a sketch of convolution plus max over time follows below
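A sketch of the sentence-approach feature extractor (a convolution over every window, followed by a max over time), assuming PyTorch; the channel, hidden and sentence sizes are illustrative.

```python
import torch
import torch.nn as nn

# Sentence approach: convolve over all windows in the sentence, then take
# the max over time so that any sentence length yields a fixed-size vector.
per_word_dim, n_hidden, window = 55, 300, 5

conv = nn.Conv1d(per_word_dim, n_hidden, kernel_size=window, padding=window // 2)

sentence = torch.randn(1, per_word_dim, 12)   # batch of 1 sentence, 12 words
hidden   = conv(sentence)                     # (1, 300, 12): one column per position
features, _ = hidden.max(dim=2)               # max over time -> (1, 300)
print(features.shape)
```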
  30. Tagging Schemes
  31. Outline • Transforming Words into Feature Vectors • Extracting Higher Level Features from Word Feature Vectors • Training • Benchmark Results
  32. Training • Maximise the log-likelihood with respect to the parameters θ over the training set T
  33. Training: Word-Level Log-Likelihood • A softmax over all tags gives a cross-entropy criterion for each word • This is not ideal, because the tag of a word in a sentence depends on its neighbouring tags; a sketch follows below
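A minimal sketch of the word-level criterion, assuming PyTorch and random placeholder scores: a softmax over tag scores trained with cross-entropy, independently for every word.

```python
import torch
import torch.nn.functional as F

# Word-level log-likelihood: softmax + cross-entropy per word. It treats
# every word independently and ignores neighbouring tags.
scores = torch.randn(7, 45, requires_grad=True)   # 7 words, 45 tags (illustrative)
gold   = torch.randint(0, 45, (7,))               # hypothetical gold tags

loss = F.cross_entropy(scores, gold)              # = -mean log p(gold tag | word)
loss.backward()
print(loss.item())
```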
  34. Training: Sentence-Level Log-Likelihood • A_(k,l) is a transition score for jumping from tag k to tag l • The sentence score of a tag path [i]_1^T adds the network scores and the transition scores along the path
  35. Training: Sentence-Level Log-Likelihood • The conditional likelihood is obtained by normalising the path score with respect to all possible tag paths
  36. Training • The normalisation over all tag paths is computed with the recursive forward algorithm • Inference uses the Viterbi algorithm (replace logAdd by max); a sketch of both ingredients follows below
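A sketch of the sentence-level criterion and its forward recursion, assuming PyTorch; the per-word scores, transition matrix A and gold path are random placeholders.

```python
import torch

# Sentence-level log-likelihood: path score = emissions + transitions;
# the log-partition over all paths comes from the recursive forward
# algorithm (logsumexp). Replacing logsumexp by max gives Viterbi.
T, n_tags = 7, 45
scores = torch.randn(T, n_tags)                 # network score per word and tag
A      = torch.randn(n_tags, n_tags)            # transition scores A[from, to]
path   = torch.randint(0, n_tags, (T,))         # hypothetical gold tag path

# Score of one path: emissions plus transitions along the path.
path_score = scores[torch.arange(T), path].sum() + A[path[:-1], path[1:]].sum()

# Forward algorithm: log-sum over all possible tag paths.
delta = scores[0]                               # (n_tags,)
for t in range(1, T):
    delta = scores[t] + torch.logsumexp(delta.unsqueeze(1) + A, dim=0)
log_partition = torch.logsumexp(delta, dim=0)

log_likelihood = path_score - log_partition     # sentence-level log-likelihood
print(log_likelihood.item())
```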
  37. Outline • Transforming Words into Feature Vectors • Extracting Higher Level Features from Word Feature Vectors • Training • Benchmark Results
  38. Pre-processing • Use lower-case words in the dictionary • Add a 'caps' feature to words that have at least one non-initial capital letter • Numbers within a word are replaced with the string 'NUMBER' • A sketch follows below
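A minimal sketch of this pre-processing in plain Python; the function name and return format are hypothetical.

```python
import re

# Pre-processing sketch: lower-case the word for the dictionary lookup,
# flag words with at least one non-initial capital letter for the 'caps'
# feature, and replace digit sequences with the string "NUMBER".
def preprocess(word: str) -> tuple[str, bool]:
    has_noninitial_cap = any(c.isupper() for c in word[1:])
    normalized = re.sub(r"\d+", "NUMBER", word.lower())
    return normalized, has_noninitial_cap

print(preprocess("iPhone7"))   # ('iphoneNUMBER', True)
```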
  39. Hyper-parameters
  40. Benchmark Results • Sentences with similar words should be tagged in the same way, e.g. "The cat sat on the mat." vs. "The feline sat on the mat."
  41. Neighbouring Words • Word embeddings in the word lookup table of an SRL neural network trained from scratch • 10 nearest neighbours using the Euclidean metric; a sketch of such a query follows below
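A sketch of the nearest-neighbour query behind such a table, assuming PyTorch; the embedding matrix and query index are placeholders.

```python
import torch

# Rank all rows of the word lookup table by Euclidean distance to a query
# word's embedding and keep the 10 closest. The matrix is random here.
embeddings = torch.randn(130_000, 50)          # word lookup table (illustrative)
query_index = 123                              # hypothetical word index

distances = torch.cdist(embeddings[query_index].unsqueeze(0), embeddings)[0]
nearest = distances.argsort()[1:11]            # skip the word itself, keep 10
print(nearest)
```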
  42. Word Embeddings • The lookup table can also be trained on unlabelled data by optimising it to learn a language model • This gives word features that map semantically similar words to similar vectors
  43. Sentence Embedding • Document Embedding • Word Embedding
  44. Ranking Language Model (a sketch of the pairwise ranking criterion follows below)
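A minimal sketch of the pairwise ranking criterion, assuming PyTorch; the two scores stand for the network outputs on a genuine window and on the same window with its centre word replaced by a random one.

```python
import torch

# Ranking language model: a genuine window should score higher, by a
# margin of 1, than the corrupted window (centre word replaced).
f_positive = torch.tensor(2.3, requires_grad=True)   # score of the real window
f_negative = torch.tensor(1.9, requires_grad=True)   # score of the corrupted window

loss = torch.clamp(1.0 - f_positive + f_negative, min=0.0)   # hinge ranking loss
loss.backward()
print(loss.item())   # 0.6
```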
  45. Lots of Unlabelled Data • Two window-approach networks (window size 11, 100 hidden units) trained on two corpora • LM1: Wikipedia, 631M words; dictionary words ordered by frequency; dictionary size increased in steps: 5,000, 10,000, 30,000, 50,000, 100,000; about 4 weeks of training • LM2: Wikipedia + Reuters = 631M + 221M = 852M words; initialised with LM1; dictionary size 130,000 (the 30,000 most frequent additional Reuters words); about 3 additional weeks of training
  46. Word Embeddings: neighbouring words
  47. Benchmark Performance
  48. Multitask Learning
  49. Joint Training • Example of multitasking with a neural network: Task 1 and Task 2 are trained with the window approach architecture; the lookup tables and the first hidden layer are shared, and the last layer is task-specific (the principle is the same with more than two tasks) • POS, CHUNK and NER are trained jointly with the window approach network; SRL can only be trained with the sentence approach network because of long-range dependencies related to the verb • A sketch of the shared/task-specific split follows below
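A sketch of that shared/task-specific split, assuming PyTorch; layer sizes and tag counts are illustrative, and the shared lookup tables are folded into the 275-d input for brevity.

```python
import torch
import torch.nn as nn

# Multitask setup: the first hidden layer (and, in the paper, the lookup
# tables) is shared between tasks; each task keeps its own output layer.
shared = nn.Sequential(
    nn.Linear(275, 300),    # shared first hidden layer
    nn.Hardtanh(),
)
pos_head   = nn.Linear(300, 45)   # task 1: POS tags (illustrative count)
chunk_head = nn.Linear(300, 23)   # task 2: chunk tags (illustrative count)

window_vector = torch.randn(275)
hidden = shared(window_vector)
print(pos_head(hidden).shape, chunk_head(hidden).shape)
```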
  50. Multitask Learning
  51. Temptation
  52. The Temptation • Suffix features: use the last two characters as a feature • Gazetteers: 8,000 locations, person names, organisations and misc entries from CoNLL 2003 • POS: use POS as a feature for CHUNK and NER • CHUNK: use CHUNK as a feature for SRL
  54. Ensembles • 10 neural networks • Voting ensemble: vote the ten network outputs on a per-tag basis • Joined ensemble: the parameters of the combining layer are trained on the existing training set while keeping the networks fixed • A sketch of per-tag voting follows below
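A minimal sketch of per-tag voting, assuming PyTorch; the ten networks' predictions are random placeholders.

```python
import torch

# Voting ensemble: ten networks each predict a tag per word, and the final
# tag is the per-word majority vote.
predictions = torch.randint(0, 45, (10, 7))   # 10 networks, 7 words (placeholders)
voted, _ = predictions.mode(dim=0)            # majority tag for each word
print(voted)
```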
  55. Conclusion • Achievements: an "all purpose" neural network architecture for NLP tagging; limits task-specific engineering; relies on very large unlabelled datasets; "we do not plan to stop here" • Critics: why forget NLP expertise in favour of neural network training skills? NLP goals are not limited to the existing NLP tasks, and excessive task-specific engineering is not desirable • Why neural networks? They scale to massive datasets and discover hidden representations; most of the neural network technology already existed in 1997 (Bottou, 1997)
  56. Thank you!
