SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Semantic Annotation
of Documents
Team 20
Sharvil Katariya
Rohit SVK
Nikhil Chavanke
Mentor - Priya RadhaKrishna Course Instructor - Vasudev Varma
Problem Statement
➢ Semantic Annotation of documents - To annotate a document with a
Wikipedia article that matches its contents most closely.
➢ Assigning wikipedia topics to any document using
○ Random forests
○ Gradient Boosting Classifier
○ Support Vector Machines
○ Logistic Regression
○ or any popular classification algorithm.
Methodology
➢ Data Pre - Processing.
➢ Training word and paragraph vectors.
➢ Training new vectors.
➢ Training the classifier using supervised learning.
➢ Testing the classifier
Pre - Processing
➢ This phase involved mapping the field of the Research Paper to a
higher or more broader topic.
➢ The preprocessing phase also involved the
○ Removal of Stopwords.
○ Stripping of excess whitespaces.
○ Removing Punctuations.
○ Removing Tags from text, etc.
➢ Splitting the data into train, test texts.
Training word and paragraph vectors
➢ Word2vec representation is used to train words in the corpus
➢ Doc2vec representation is used to train paragraphs in the corpus
➢ Dimension, which can be tweaked, is set to 400
➢ Epochs can be set to 50 for deep learning. An epoch is just a measure
of the number of times all of the training vectors are used once to
update the weights.
➢ The abstract in the corpus is represented in the vector space model.
Training and testing the classifier
➢ Classifier takes list of arrays, that is computed model vectors,
corresponding to its labels.
➢ The randomly split training and testing data, enables the classifier to
use these model vectors for the subsequent training of the classifier.
➢ The model vectors of the testing data are sent to the classifier and
compared with the labels associated with the testing data, with the
help of various evaluation parameters.
Corpus Used
➢ The Dataset used is that of ACM research papers.
➢ Each Datapoint in the Dataset contains
○ Title
○ Abstract
○ Authors
○ Location
○ Timestamp
○ Conference
○ Index Number
➢ Number of Research Paper with abstract: 247543
➢ Test-Size: 10% of the entire dataset, which is split randomly.
Feature Selection
➢ The dimensions of the vector representation of the paragraph is taken
as the features of the data and trained.
➢ It is up to our novelty to set the no of features.
➢ More the number of features, the more is the result obtained on the
trained data.
➢ However, one must ensure to avoid problems of Data Overfitting.
Architecture Models
➢ As stated in the paper, there are 2 architectures, continuous bag of
words based (CBOW ) and the other skip-gram (PV-DBOW) based
➢ Word2Vec and Doc2Vec are trained using individual and both
architectures and the results are visualised.
➢ Doc2vec is similar to Word2Vec, except now we represent not only
words, but entire sentences and documents.
➢ Doc2Vec enables us to represent an entire sentence using a fixed-
length vector and proceeding to run all our standard classification
algorithms.
Architecture for CBOW and Skip-gram method
CBOW forces the neural network to predict current word with the help of surrounding words, and
Skip-Gram forces the neural net to predict surrounding words of the current word.
Training is essentially a classic back-propagation method with a few optimization and
approximation tricks (e.g. hierarchical softmax).
Architecture for Doc2vec
➢
➢
Distributed Memory (DM) model Distributed Bag of Words (DBOW) model
Architecture for Doc2vec
➢ DM (Distributed Memory ) attempts to predict a word given its previous
words and a paragraph vector. Even though the context window moves
across the text, the paragraph vector does not (hence distributed
memory) and allows for some word-order to be captured.
➢ DBOW (Distributed Bag of Words) predicts a random group of words in
a paragraph given only its paragraph vector
Evaluation Parameters
Mean Average Precision (MAP)
➢ Useful for multiple relevance.
➢ Mean average precision (MAP) is the Average of the average precision
value (average of the precision values at the points at which each
relevant document is retrieved ) for a set of queries.
➢ If a relevant document never gets retrieved, we assume the precision
corresponding to that relevant doc to be zero
Evaluation Parameters
Normalized discounted cumulative gain (NDCG)
➢ NDCG measures the performance of a recommendation system based
on the graded relevance of the recommended entities.
➢ Uses graded relevance as a measure of the usefulness, or gain, from
examining a document.
➢ Gain is accumulated starting at the top of the ranking and may be
reduced, or discounted, at lower ranks.
➢ Discount Function used
➢ DCG values are often normalized by comparing the DCG at each rank
with the DCG value for the perfect ranking.
➢ It varies from 0.0 to 1.0, with 1.0 being the ideal ranking of the entities.
Comparison Graphs - Different Architectures (ACM)
Comparison Graphs - Different Architectures(ACM)
Comparison Graphs - Different Models (IMDB
Dataset)
Skip Gram CBOW
Challenges Faced
➢ Multiclass and Multilabel Data, where the set of classes scales with the
number of available training examples.
○ In this type of problem, the standard assumption of having a fixed set of classes is too
simplistic, and straightforward generalizations of methods for binary classification
(such as multi class SVM) may be impractical.
○ We used the One-vs-all classifier, where we fitting one classifier per class. For each
classifier, the class is fitted against all the other classes. In addition to its
computational efficiency (only n_classes classifiers are needed), one advantage of this
approach is its interpretability. Since each class is represented by one and one
classifier only, it is possible to gain knowledge about the class by inspecting its
corresponding classifier.
References
➢ https://radimrehurek.com/gensim/models/doc2vec.html
➢ http://rare-technologies.com/Doc2Vec-tutorial/
➢ https://github.com/piskvorky/gensim
➢ Quoc V. Le, and Tomas Mikolov, “Distributed Representations of
Sentences and Documents ICML”, 2014
➢ Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient
Estimation of Word Representations in Vector Space. In Proceedings of
Workshop at ICLR”, 2013.
➢ http://web.stanford.edu/class/cs276/handouts/EvaluationNew-
handout-6-per.pdf
Resources Link
Project Webpage: http://rohitsakala.github.
io/semanticAnnotationAcmCategories/
Source Code Repository: https://github.
com/rohitsakala/semanticAnnotationAcmCategories
Video: https://youtu.be/706HJteh1xc
Slides: https://www.slideshare.net/secret/ELAqfEHI6F0uDq
Any Questions?
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDataminingTools Inc
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysisguest0edcaf
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONSUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
report.doc
report.docreport.doc
report.docbutest
 
Deep Neural Networks in Text Classification using Active Learning
Deep Neural Networks in Text Classification using Active LearningDeep Neural Networks in Text Classification using Active Learning
Deep Neural Networks in Text Classification using Active LearningMirsaeid Abolghasemi
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsIJERA Editor
 

Was ist angesagt? (20)

lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
 
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
 
Text categorization
Text categorizationText categorization
Text categorization
 
Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONSUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
report.doc
report.docreport.doc
report.doc
 
Deep Neural Networks in Text Classification using Active Learning
Deep Neural Networks in Text Classification using Active LearningDeep Neural Networks in Text Classification using Active Learning
Deep Neural Networks in Text Classification using Active Learning
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
Clustering
ClusteringClustering
Clustering
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
Data clustering
Data clustering Data clustering
Data clustering
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data Streams
 

Andere mochten auch

Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15Akash Kant
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2VecKouhei Nakaji
 
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...Yujiro Terazawa
 
Deep Learning for NLP Applications
Deep Learning for NLP ApplicationsDeep Learning for NLP Applications
Deep Learning for NLP ApplicationsSamiur Rahman
 
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...BOAZ Bigdata
 
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practicehen_drik
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남Eunjeong (Lucy) Park
 
Deep learning for natural language embeddings
Deep learning for natural language embeddingsDeep learning for natural language embeddings
Deep learning for natural language embeddingsRoelof Pieters
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithmAndrew Koo
 
Presentation on Memorandum of Association
Presentation on Memorandum of AssociationPresentation on Memorandum of Association
Presentation on Memorandum of AssociationNaveen Chopra
 

Andere mochten auch (11)

Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
 
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...
Sentiment Polarity Analysis for Generating Search Result Snippets based on Pa...
 
Deep Learning for NLP Applications
Deep Learning for NLP ApplicationsDeep Learning for NLP Applications
Deep Learning for NLP Applications
 
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...
[분석] 모바일 sns 사용자들의 감성 용어 사전 제작 및 공인대상 감성...
 
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practice
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남한국어와 NLTK, Gensim의 만남
한국어와 NLTK, Gensim의 만남
 
Deep learning for natural language embeddings
Deep learning for natural language embeddingsDeep learning for natural language embeddings
Deep learning for natural language embeddings
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithm
 
Presentation on Memorandum of Association
Presentation on Memorandum of AssociationPresentation on Memorandum of Association
Presentation on Memorandum of Association
 

Ähnlich wie IRE Semantic Annotation of Documents

Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Sebastian Ruder
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Pramati Technologies
 
Amazon Product Sentiment review
Amazon Product Sentiment reviewAmazon Product Sentiment review
Amazon Product Sentiment reviewLalit Jain
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Classification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenClassification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenMonis Javed
 
Text Processing Framework for Hindi
Text Processing Framework for HindiText Processing Framework for Hindi
Text Processing Framework for HindiUtsav Chokshi
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesSimon Lia-Jonassen
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?Tuan Yang
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text ClassificationSai Srinivas Kotni
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Johann Petrak
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruptionjagan477830
 

Ähnlich wie IRE Semantic Annotation of Documents (20)

Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
Amazon Product Sentiment review
Amazon Product Sentiment reviewAmazon Product Sentiment review
Amazon Product Sentiment review
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Classification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or EvergreenClassification of webpages as Ephemeral or Evergreen
Classification of webpages as Ephemeral or Evergreen
 
Text Processing Framework for Hindi
Text Processing Framework for HindiText Processing Framework for Hindi
Text Processing Framework for Hindi
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
TensorFlow.pptx
TensorFlow.pptxTensorFlow.pptx
TensorFlow.pptx
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 

Kürzlich hochgeladen

Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 

Kürzlich hochgeladen (20)

Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

IRE Semantic Annotation of Documents

  • 1. Semantic Annotation of Documents Team 20 Sharvil Katariya Rohit SVK Nikhil Chavanke Mentor - Priya RadhaKrishna Course Instructor - Vasudev Varma
  • 2. Problem Statement ➢ Semantic Annotation of documents - To annotate a document with a Wikipedia article that matches its contents most closely. ➢ Assigning wikipedia topics to any document using ○ Random forests ○ Gradient Boosting Classifier ○ Support Vector Machines ○ Logistic Regression ○ or any popular classification algorithm.
  • 3. Methodology ➢ Data Pre - Processing. ➢ Training word and paragraph vectors. ➢ Training new vectors. ➢ Training the classifier using supervised learning. ➢ Testing the classifier
  • 4. Pre - Processing ➢ This phase involved mapping the field of the Research Paper to a higher or more broader topic. ➢ The preprocessing phase also involved the ○ Removal of Stopwords. ○ Stripping of excess whitespaces. ○ Removing Punctuations. ○ Removing Tags from text, etc. ➢ Splitting the data into train, test texts.
  • 5. Training word and paragraph vectors ➢ Word2vec representation is used to train words in the corpus ➢ Doc2vec representation is used to train paragraphs in the corpus ➢ Dimension, which can be tweaked, is set to 400 ➢ Epochs can be set to 50 for deep learning. An epoch is just a measure of the number of times all of the training vectors are used once to update the weights. ➢ The abstract in the corpus is represented in the vector space model.
  • 6. Training and testing the classifier ➢ Classifier takes list of arrays, that is computed model vectors, corresponding to its labels. ➢ The randomly split training and testing data, enables the classifier to use these model vectors for the subsequent training of the classifier. ➢ The model vectors of the testing data are sent to the classifier and compared with the labels associated with the testing data, with the help of various evaluation parameters.
  • 7. Corpus Used ➢ The Dataset used is that of ACM research papers. ➢ Each Datapoint in the Dataset contains ○ Title ○ Abstract ○ Authors ○ Location ○ Timestamp ○ Conference ○ Index Number ➢ Number of Research Paper with abstract: 247543 ➢ Test-Size: 10% of the entire dataset, which is split randomly.
  • 8. Feature Selection ➢ The dimensions of the vector representation of the paragraph is taken as the features of the data and trained. ➢ It is up to our novelty to set the no of features. ➢ More the number of features, the more is the result obtained on the trained data. ➢ However, one must ensure to avoid problems of Data Overfitting.
  • 9. Architecture Models ➢ As stated in the paper, there are 2 architectures, continuous bag of words based (CBOW ) and the other skip-gram (PV-DBOW) based ➢ Word2Vec and Doc2Vec are trained using individual and both architectures and the results are visualised. ➢ Doc2vec is similar to Word2Vec, except now we represent not only words, but entire sentences and documents. ➢ Doc2Vec enables us to represent an entire sentence using a fixed- length vector and proceeding to run all our standard classification algorithms.
  • 10. Architecture for CBOW and Skip-gram method CBOW forces the neural network to predict current word with the help of surrounding words, and Skip-Gram forces the neural net to predict surrounding words of the current word. Training is essentially a classic back-propagation method with a few optimization and approximation tricks (e.g. hierarchical softmax).
  • 11. Architecture for Doc2vec ➢ ➢ Distributed Memory (DM) model Distributed Bag of Words (DBOW) model
  • 12. Architecture for Doc2vec ➢ DM (Distributed Memory ) attempts to predict a word given its previous words and a paragraph vector. Even though the context window moves across the text, the paragraph vector does not (hence distributed memory) and allows for some word-order to be captured. ➢ DBOW (Distributed Bag of Words) predicts a random group of words in a paragraph given only its paragraph vector
  • 13. Evaluation Parameters Mean Average Precision (MAP) ➢ Useful for multiple relevance. ➢ Mean average precision (MAP) is the Average of the average precision value (average of the precision values at the points at which each relevant document is retrieved ) for a set of queries. ➢ If a relevant document never gets retrieved, we assume the precision corresponding to that relevant doc to be zero
  • 14. Evaluation Parameters Normalized discounted cumulative gain (NDCG) ➢ NDCG measures the performance of a recommendation system based on the graded relevance of the recommended entities. ➢ Uses graded relevance as a measure of the usefulness, or gain, from examining a document. ➢ Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks. ➢ Discount Function used ➢ DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking. ➢ It varies from 0.0 to 1.0, with 1.0 being the ideal ranking of the entities.
  • 15. Comparison Graphs - Different Architectures (ACM)
  • 16. Comparison Graphs - Different Architectures(ACM)
  • 17. Comparison Graphs - Different Models (IMDB Dataset) Skip Gram CBOW
  • 18. Challenges Faced ➢ Multiclass and Multilabel Data, where the set of classes scales with the number of available training examples. ○ In this type of problem, the standard assumption of having a fixed set of classes is too simplistic, and straightforward generalizations of methods for binary classification (such as multi class SVM) may be impractical. ○ We used the One-vs-all classifier, where we fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and one classifier only, it is possible to gain knowledge about the class by inspecting its corresponding classifier.
  • 19. References ➢ https://radimrehurek.com/gensim/models/doc2vec.html ➢ http://rare-technologies.com/Doc2Vec-tutorial/ ➢ https://github.com/piskvorky/gensim ➢ Quoc V. Le, and Tomas Mikolov, “Distributed Representations of Sentences and Documents ICML”, 2014 ➢ Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR”, 2013. ➢ http://web.stanford.edu/class/cs276/handouts/EvaluationNew- handout-6-per.pdf
  • 20. Resources Link Project Webpage: http://rohitsakala.github. io/semanticAnnotationAcmCategories/ Source Code Repository: https://github. com/rohitsakala/semanticAnnotationAcmCategories Video: https://youtu.be/706HJteh1xc Slides: https://www.slideshare.net/secret/ELAqfEHI6F0uDq