NLP and LSA getting started
1. Latent semantic analysis (LSA) is a technique in natural
language processing, in particular in vectorial semantics,
of analyzing relationships between a set of documents and
the terms they contain by producing a set of concepts
related to the documents and terms.
Wikipedia
Latent semantic analysis
Getting started
2. Natural language processing (NLP) is a field of computer
science, artificial intelligence, and linguistics concerned with the
interactions between computers and human (natural) languages.
Wikipedia
Natural language processing can be divided into 4 phases:
Grammar analysis
Lexical analysis
Semantic analysis
Syntactic analysis
Apache OpenNLP
Machine learning based toolkit
for the processing of natural
language text.
http://opennlp.apache.org/
LSA
LSA can be seen as a part of NLP
3. Apache OpenNLP usage examples:
Lexical analysis: Tokenization
Grammar analysis: Part-of-speech tagging
Syntactic analysis: Chunker - Parser
NOTE:
Before the lexical analysis, it is possible to
run a sentence analysis tool: the sentence
detector (Apache OpenNLP).
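The order described in the note (sentence detection first, then tokenization) can be sketched as follows. This is a rule-based stand-in written with Python regular expressions, not the Apache OpenNLP sentence detector or tokenizer, which rely on trained models; the function names and the sample text are assumptions made for illustration.

```python
import re

def detect_sentences(text):
    """Naive rule: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical sample text, used only to show the two steps in order.
text = "LSA finds hidden concepts. OpenNLP tags words!"
sentences = detect_sentences(text)          # step 1: sentence detection
tokens = [tokenize(s) for s in sentences]   # step 2: lexical analysis
```

A trained sentence detector handles abbreviations and similar cases that this simple rule gets wrong.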
4. Supervised machine learning concepts
INPUT DATA
(ex: wikipedia corpus)
Humans produce a finite set of
(INPUT, OUTPUT) pairs.
It represents the training set.
It can be seen as a discrete
function.
Machine learning algorithm
(ex: linear regression, maximum
entropy, perceptron)
MODEL
OUTPUT DATA
(ex: POS-tagged corpus)
Machine produces a model.
It can be seen as a continuous function.
INPUT DATA
(ex: just a document)
OUTPUT DATA
(that document,
POS-tagged)
Input data are taken
from an infinite set.
Machine, using model
and input, produces
the expected output.
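The training flow above (a finite training set of pairs, a model, then new input) can be illustrated with a deliberately simple learner. The sketch below uses a most-frequent-tag baseline, not maximum entropy or a perceptron; the tiny training set, the tag names and the default tag for unknown words are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def train(training_set):
    """Build a model: for each word, the tag it received most often."""
    counts = defaultdict(Counter)
    for word, tag in training_set:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(model, tokens, default="NN"):
    """Apply the model to new input; unseen words get a default tag (assumption)."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

# Hypothetical hand-labelled (INPUT, OUTPUT) pairs: the training set.
training_set = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
]
model = train(training_set)
tagged = tag_tokens(model, ["The", "cat", "barks"])
```

The model here is only a lookup table, so it covers the training words plus a fallback; the algorithms named on the slide generalize to unseen input in a less crude way.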
5. LSA assumes that words that are close in
meaning will occur in similar pieces of text.
LSA is a method for discovering hidden
concepts in document data.
LSA key concepts
Set of documents (Doc 1 to Doc 4); each
document contains
several words.
The LSA algorithm takes docs and words and
computes vectors in a semantic vector
space using:
• A documents/words matrix
• Singular value decomposition (SVD)
[Figure: semantic vector space, with word1, word2 and doc1 to doc4 plotted as vectors.]
Word1 and word2 are close,
which means that their (latent)
meanings are related.
6. Example:
        Doc1  Doc2  Doc3  Doc4
Word1   1     0     1     0
Word2   1     0     1     1
Word3   0     1     0     1
…
Words/documents matrix
1: the i-th word occurs
in the j-th document.
0: the i-th word does not occur
in the j-th document.
The matrix is very
large (thousands of
words, hundreds of
documents).
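A minimal sketch of how such a binary words/documents matrix could be built. The document contents below are assumptions chosen so that the result matches the three example rows shown on the slide; `build_matrix` is a hypothetical helper, not part of any library mentioned here.

```python
def build_matrix(docs):
    """Return (vocabulary, doc ids, binary matrix) with one row per word."""
    vocab = sorted({w for words in docs.values() for w in words})
    doc_ids = list(docs)
    matrix = [
        [1 if word in docs[d] else 0 for d in doc_ids]  # 1 = word occurs in doc
        for word in vocab
    ]
    return vocab, doc_ids, matrix

# Hypothetical tokenized documents matching the example matrix.
docs = {
    "Doc1": ["word1", "word2"],
    "Doc2": ["word3"],
    "Doc3": ["word1", "word2"],
    "Doc4": ["word2", "word3"],
}
vocab, doc_ids, matrix = build_matrix(docs)
```

With real corpora the matrix is mostly zeros, so a sparse representation is usually preferable to nested lists.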
Singular value decomposition (SVD) of the matrix
To reduce the matrix dimension
SemanticVectors or JLSI
libraries:
• Compute the SVD.
• Build the semantic
vector space.
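To make the dimension reduction concrete, here is a small sketch on the three example rows of the words/documents matrix. To stay dependency-free it does not call a library SVD routine: it recovers the two largest singular values and the left singular vectors by power iteration with deflation on the word-by-word Gram matrix, which is an assumption of this sketch and not how SemanticVectors or JLSI necessarily work. All function names are hypothetical.

```python
import math

A = [
    [1, 0, 1, 0],  # Word1
    [1, 0, 1, 1],  # Word2
    [0, 1, 0, 1],  # Word3
]

def top_eigenpairs(M, k, iterations=500):
    """Top-k eigenpairs of a symmetric matrix via power iteration + deflation."""
    n = len(M)
    M = [row[:] for row in M]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, deterministic start vector
        for _ in range(iterations):
            w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                M[i][j] -= lam * v[i] * v[j]
    return pairs

def word_vectors(A, k):
    """Rows of U_k * S_k: each word as a k-dimensional vector."""
    n = len(A)
    gram = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)]
            for i in range(n)]  # A * A^T, eigenvalues = squared singular values
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
            for i in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = word_vectors(A, k=2)
sim_12 = cosine(vectors[0], vectors[1])  # Word1 vs Word2
sim_13 = cosine(vectors[0], vectors[2])  # Word1 vs Word3
```

In the reduced two-dimensional space Word1 ends up much closer to Word2 than to Word3, which matches the intuition that words occurring in the same documents are related.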
[Figure: reduced semantic vector space, with word1, word2, doc1, doc2 and doc4.]
UIMA to manage the solution
8. Some snippets and console commands
OpenNLP has a command line tool which is used to train the models.
Trained Model
9. Models and document
to manage
This snippet takes 4 files as input and produces a new file that is sentence-detected, tokenized and POS-tagged.
Sentences
tokens
tags
Document that is
sentence-detected,
tokenized and
POS-tagged, and that
could be, for example,
indexed in a search
engine like Apache Solr.
10. Note that lucene-core is
a hierarchical (transitive) dependency.
.bat file to load the classpath
SemanticVectors has two main functions:
1. Building wordSpace models.
To build the wordSpace model, SemanticVectors
needs indexes created by Apache Lucene.
2. Searching through the vectors in such models.
Ex: Bible chapter indexed by Lucene
11. 1. Building wordSpace models using the pitt.search.semanticvectors.LSA class from
the index created by Apache Lucene (from a Bible chapter).
In this example the Bible
chapter contains 29
documents, and in total
there are 2460 terms.
SemanticVectors builds:
1. 29 vectors that represent the documents (docvector.bin)
2. 2460 vectors that represent the terms (termvector.bin)
These two files represent the wordSpace.
Note that it is also possible to use the pitt.search.semanticvectors.BuildIndex class, which uses random projection
instead of LSA to reduce the dimensionality of the representation.
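The idea of random projection can be sketched generically: give each document a random low-dimensional vector, then represent each term as the sum of the vectors of the documents it occurs in. This is an illustration of the general technique under that assumption, not the BuildIndex implementation; the function name, the +1/-1 entries and the seed are choices made for this sketch.

```python
import random

def random_projection(matrix, dim, seed=0):
    """Project each row (term) onto `dim` dimensions using random +1/-1 vectors."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    # One random basis vector per original column (document).
    basis = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_cols)]
    return [
        [sum(row[j] * basis[j][d] for j in range(n_cols)) for d in range(dim)]
        for row in matrix
    ]

# Toy words/documents matrix (3 terms, 4 documents), reduced to 2 dimensions.
A = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
]
projected = random_projection(A, dim=2, seed=42)
```

Unlike SVD, nothing is decomposed here, so the cost stays low for large matrices; the trade-off is that similarities are preserved only approximately, and reducing 4 columns to 2 as in this toy case is far too aggressive to be meaningful.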
12. 2. Searching through docVector and termVector
2.1 Searching for Documents using Terms
Search for document vectors closest to the vector "Abraham":
13. 2.2 Using a document file as a source of queries
Find terms most closely related to Chapter 1 of Chronicles:
14. 2.3 Search a general word
Find terms most closely related to "Abraham".
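Searching a wordSpace of this kind typically comes down to ranking vectors by cosine similarity to a query vector. The sketch below shows that ranking on made-up three-dimensional vectors: the terms and numbers are hypothetical, do not come from the Bible index, and the `nearest` function is not the SemanticVectors search API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(query, vectors, top_n=3):
    """Rank all other terms by cosine similarity to the query term."""
    q = vectors[query]
    scored = [(cosine(q, vec), name) for name, vec in vectors.items() if name != query]
    return sorted(scored, reverse=True)[:top_n]

# Hypothetical term vectors, invented for illustration only.
term_vectors = {
    "abraham": [0.9, 0.1, 0.0],
    "isaac": [0.8, 0.2, 0.1],
    "sarah": [0.7, 0.3, 0.0],
    "temple": [0.0, 0.2, 0.9],
}
results = nearest("abraham", term_vectors, top_n=2)
names = [name for _, name in results]
```

The same function works for searching documents by term if the query vector is taken from the term vectors and the candidates from the document vectors, provided both live in the same space.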