9. IR Quick Intro
• Doc 1: “I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.”
• Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:”
26. Analysis
• From long, continuous text to the individual tokens/words used for indexing
27. Analysis
• Text -> Tokenizer -> (TokenFilter)* -> Tokens
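The chain on this slide can be sketched in a few lines of Python. This is a toy stand-in for what Lucene-style analyzers do; the `tokenize` and `lowercase_filter` helpers are illustrative, not a real library API:

```python
import re

def tokenize(text):
    # Toy tokenizer: keep runs of word characters, drop punctuation.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Toy token filter: case-fold every token.
    return [t.lower() for t in tokens]

def analyze(text, tokenizer, token_filters):
    # Text -> Tokenizer -> (TokenFilter)* -> Tokens
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens
```

Any number of filters can be chained, which is exactly what the `(TokenFilter)*` in the pipeline notation expresses.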
28. Tokenizer
• Splits the text into words by whitespace, punctuation, and other rules
• Text: “So, it has come to this!”
• Tokens: [ “So”, “it”, “has”, “come”, “to”, “this” ]
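A minimal sketch of such a tokenizer, splitting on runs of non-word characters so the “,” and “!” disappear. The regex rule is an illustrative simplification; real tokenizers also have rules for apostrophes, hyphens, URLs, CJK text, and so on:

```python
import re

def tokenize(text):
    # Split on anything that is not a word character; filter out
    # the empty strings that splitting can produce at the edges.
    return [token for token in re.split(r"\W+", text) if token]
```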
29. Token Filters
• Change existing tokens or add new ones
• Case-Folding
• Synonyms
• Stemming
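Of these, a synonym filter is the one that adds new tokens rather than just rewriting existing ones. A toy sketch follows; the synonym map is made up for illustration, and real implementations also track token positions so phrase queries still work:

```python
# Hypothetical synonym map, for illustration only.
SYNONYMS = {
    "quick": ["fast"],
    "doctor": ["physician"],
}

def synonym_filter(tokens):
    # Emit each original token, followed by any synonyms for it.
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded
```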
30. Token Filters
• Text: “The Pandorica was constructed to ensure the safety of the Alliance.”
• Tokens: [ “The”, “Pandorica”, “was”, “constructed”, “to”, “ensure”, “the”, “safety”, “of”, “the”, “Alliance” ]
• Filtered: [ “pandorica”, “was”, “construct”, “to”, “ensure”, “safe”, “of”, “alliance” ]
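The filtered output above can be reproduced with a toy chain of case-folding, stop-word removal, and suffix stripping. The stop list and stemming rules below are deliberately minimal, tuned only to this example; a real stemmer such as Porter’s behaves differently (it would, e.g., reduce “safety” to “safeti”, not “safe”):

```python
import re

STOP_WORDS = {"the"}     # toy stop list, just enough for this example
SUFFIXES = ("ed", "ty")  # toy stemming rules, not a real stemmer

def stem(token):
    # Strip a known suffix if the remaining stem is long enough.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # tokenize
    tokens = [t.lower() for t in tokens]                 # case-fold
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [stem(t) for t in tokens]                     # stem
```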