NLP and LSA getting started
Latent semantic analysis (LSA) is a technique in natural
language processing, in particular in vectorial semantics,
of analyzing relationships between a set of documents and
the terms they contain by producing a set of concepts
related to the documents and terms.
Wikipedia
Latent semantic analysis
Getting started
Natural language processing (NLP) is a field of computer
science, artificial intelligence, and linguistics concerned with the
interactions between computers and human (natural) languages.
Wikipedia
Natural language processing can be divided into four phases:
• Lexical analysis
• Grammar analysis
• Syntactic analysis
• Semantic analysis
Apache OpenNLP: a machine-learning-based toolkit for processing natural language text.
http://opennlp.apache.org/
LSA can be seen as a part of NLP (semantic analysis).
Apache OpenNLP usage examples:
• Lexical analysis: tokenization
• Grammar analysis: part-of-speech tagging
• Syntactic analysis: chunker, parser
NOTE:
Before lexical analysis, it is possible to run a sentence-analysis tool: the sentence detector (Apache OpenNLP).
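The ordering described in the note (sentence detection first, then tokenization) can be sketched as follows. This is a minimal illustration using regular expressions, not OpenNLP's API or its trained models; the function names `detect_sentences` and `tokenize` and the sample text are assumptions made for this example.

```python
import re

def detect_sentences(text):
    """Naive sentence detector: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Naive tokenizer: words (with an optional inner apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

# Hypothetical sample text, used only to show the order of the two steps.
text = "LSA finds hidden concepts. It uses SVD!"
sentences = detect_sentences(text)          # step 1: sentence detection
tokens = [tokenize(s) for s in sentences]   # step 2: tokenization, per sentence
```

A trained sentence detector handles cases this rule-based sketch gets wrong, such as abbreviations ending in a period.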
Supervised machine learning concepts
Training:
• Humans produce a finite set of (INPUT, OUTPUT) pairs. This is the training set, and it can be seen as a discrete function.
  Example input data: a Wikipedia corpus. Example output data: the same corpus with part-of-speech (POS) tags.
• A machine learning algorithm (e.g. linear regression, maximum entropy, perceptron) takes the training set and the machine produces a model. The model can be seen as a continuous function.
Prediction:
• Input data are taken from an infinite set (e.g. any single document).
• The machine, using the model and the input, produces the expected output (that document with POS tags).
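The two phases above can be illustrated with a toy perceptron, one of the algorithms named on the slide. This is a sketch under strong assumptions: the task (guessing whether a word looks verb-like from its suffix), the two features, and the six training pairs are all invented for this example and are far simpler than real POS tagging.

```python
def features(word):
    """Hypothetical features: does the word end in 'ing' or in 'ed'?"""
    return [1 if word.endswith("ing") else 0, 1 if word.endswith("ed") else 0]

def predict(model, word):
    """Apply the model (weights and bias) to a word; returns 1 or 0."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, features(word))) + bias
    return 1 if score > 0 else 0

def train(pairs, epochs=10):
    """Perceptron training: adjust weights whenever a training pair is misclassified."""
    model = ([0, 0], 0)
    for _ in range(epochs):
        for word, label in pairs:
            error = label - predict(model, word)
            if error != 0:
                weights, bias = model
                x = features(word)
                model = ([w + error * xi for w, xi in zip(weights, x)], bias + error)
    return model

# Finite training set of (INPUT, OUTPUT) pairs produced by a human (made-up data).
training_set = [("running", 1), ("jumped", 1), ("table", 0),
                ("walking", 1), ("cat", 0), ("played", 1)]
model = train(training_set)

# The model can now be applied to inputs that were never in the training set.
new_outputs = {word: predict(model, word) for word in ["singing", "house", "painted"]}
```

The training set is a finite lookup table, while the resulting model answers for any input string, which is the discrete versus continuous distinction made above.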
LSA key concepts
LSA assumes that words that are close in meaning will occur in similar pieces of text.
LSA is a method for discovering hidden concepts in document data.
Input: a set of documents (Doc 1 to Doc 4), each containing several words.
The LSA algorithm takes the documents and words and computes vectors in a semantic vector space using:
• A words/documents matrix
• Singular value decomposition (SVD)
[Figure: semantic vector space containing word1, word2 and doc1 to doc4.]
In the semantic vector space, word1 and word2 are close, which means that their (latent) meanings are related.
Example:
Words/documents matrix for Doc 1 to Doc 4:

        Doc1  Doc2  Doc3  Doc4
Word1    1     0     1     0
Word2    1     0     1     1
Word3    0     1     0     1

1: the i-th word occurs in the j-th document.
0: the i-th word does not occur in the j-th document.
In practice the matrix is very large (thousands of words, hundreds of documents).
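A minimal sketch of how such a binary words/documents matrix could be built. The document texts are invented so that the result matches the 3 x 4 example; `build_matrix` is a hypothetical helper, not part of any library mentioned here.

```python
def build_matrix(docs):
    """Return (vocabulary, document names, binary words/documents matrix)."""
    doc_names = list(docs)
    # Split each document into a set of lower-cased words.
    tokenized = {name: set(text.lower().split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*tokenized.values()))
    # Entry [i][j] is 1 if the i-th word occurs in the j-th document, else 0.
    matrix = [[1 if word in tokenized[name] else 0 for name in doc_names]
              for word in vocabulary]
    return vocabulary, doc_names, matrix

# Made-up documents chosen to reproduce the example matrix.
docs = {
    "Doc1": "word1 word2",
    "Doc2": "word3",
    "Doc3": "word1 word2",
    "Doc4": "word2 word3",
}
vocabulary, doc_names, matrix = build_matrix(docs)
```

Real systems usually store weighted counts (for example term frequencies) in a sparse structure rather than a dense 0/1 list of lists.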
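Matrix SVD decomposition
SVD is used to reduce the dimension of the matrix.
The SemanticVectors or JLSI libraries can be used to:
• Compute the SVD.
• Build the semantic vector space.
[Figure: semantic vector space containing word1, word2, doc1, doc2 and doc4.]
UIMA can be used to manage the overall solution.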
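To make the reduction step concrete, here is a small pure-Python sketch that maps the three words of the example matrix into a 2-dimensional space. It obtains the leading left singular vectors of the matrix A from the eigenvectors of A * A^T, using power iteration with deflation. This is an assumption chosen to keep the example dependency-free; it is not how SemanticVectors or JLSI compute the SVD, and a real implementation would rely on a linear algebra library.

```python
import math

# Words/documents matrix from the example (rows: Word1..Word3, columns: Doc1..Doc4).
A = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
]

def gram(matrix):
    """Return matrix * matrix^T (a word-by-word matrix)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpairs(sym, k, iterations=500):
    """Leading k eigenpairs of a symmetric matrix: power iteration plus deflation."""
    n = len(sym)
    work = [[float(x) for x in row] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed start vector, deterministic run
        for _ in range(iterations):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(work, v)))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before searching for the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

def word_vectors(matrix, k):
    """k-dimensional word vectors: left singular vectors scaled by singular values."""
    pairs = top_eigenpairs(gram(matrix), k)
    return [[math.sqrt(lam) * vec[w] for lam, vec in pairs] for w in range(len(matrix))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vectors = word_vectors(A, k=2)
sim_word1_word2 = cosine(vectors[0], vectors[1])
sim_word1_word3 = cosine(vectors[0], vectors[2])
```

In the reduced space Word1 and Word2, which share Doc1 and Doc3, end up much closer to each other than Word1 and Word3, which share no document.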
Online references:
http://opennlp.apache.org/documentation/manual/opennlp.html
https://code.google.com/p/semanticvectors/
http://hlt.fbk.eu/en/technology/jlsi
http://uima.apache.org/
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Coursera video references:
http://www.coursera.org/course/nlangp
http://www.coursera.org/course/ml
Some snippets and console commands
OpenNLP has a command-line tool that is used to train the models. Its output is a trained model.
[Snippet shown as an image on the slide.]
This snippet takes four files as input (the models and the document to process) and produces a new file that is sentence-detected, tokenized and POS-tagged.
The output contains the sentences, the tokens and the tags.
A document that has been sentence-detected, tokenized and POS-tagged could then, for example, be indexed in a search engine such as Apache Solr.
Note that lucene-core is a hierarchical dependency.
A .bat file is used to load the classpath.
SemanticVectors has two main functions:
1. Building wordSpace models.
To build a wordSpace model, SemanticVectors needs an index created by Apache Lucene.
2. Searching through the vectors in such models.
Example: a Bible chapter indexed by Lucene.
1. Building wordSpace models using the pitt.search.semanticvectors.LSA class from
the index created by Apache Lucene (from a Bible chapter).
In this example the Bible chapter contains 29 documents and 2460 terms in total.
SemanticVectors builds:
1. 29 vectors that represent the documents (docvector.bin)
2. 2460 vectors that represent the terms (termvector.bin)
These two files represent the wordSpace.
Note that it is also possible to use the pitt.search.semanticvectors.BuildIndex class, which uses Random Projection
instead of LSA to reduce the dimensionality of the representation.
2. Searching through docVector and termVector
2.1 Searching for documents using terms
Search for the document vectors closest to the vector “Abraham”:
2.2 Using a document file as a source of queries
Find the terms most closely related to Chapter 1 of Chronicles:
2.3 Searching for a general word
Find the terms most closely related to “Abraham”.
2.4 Comparing words
Compare “abraham” with “Isaac”.
Compare “abraham” with “massimo”.
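Searching and comparing in a vector space both come down to ranking vectors by a similarity score. The sketch below uses cosine similarity, a common choice, on a handful of hand-made 3-dimensional vectors. The numbers are invented for illustration and are not output from SemanticVectors, and `nearest` and `compare` are hypothetical helpers rather than the library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(query, vectors, top_n=2):
    """Return the top_n (score, name) pairs closest to the query vector."""
    scored = [(cosine(query, vec), name) for name, vec in vectors.items()]
    return sorted(scored, reverse=True)[:top_n]

def compare(name_a, name_b, vectors):
    """Similarity score between two named vectors."""
    return cosine(vectors[name_a], vectors[name_b])

# Made-up term vectors; a real wordSpace would have thousands of terms.
term_vectors = {
    "abraham": [0.9, 0.1, 0.0],
    "isaac": [0.8, 0.2, 0.1],
    "chariot": [0.0, 0.1, 0.9],
}

# Search: terms closest to "abraham", excluding the query term itself.
others = {name: vec for name, vec in term_vectors.items() if name != "abraham"}
results = nearest(term_vectors["abraham"], others)

# Compare: a related pair scores high, an unrelated pair scores low.
related = compare("abraham", "isaac", term_vectors)
unrelated = compare("abraham", "chariot", term_vectors)
```

Searching for documents using terms follows the same pattern, with the query taken from the term vectors and the candidates taken from the document vectors.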
