1. AUTOMATED HELPDESK
FINAL YEAR PROJECT (7TH SEM)
SUBMITTED BY
NIKHIL PATHANIA
PARTHA PRATIM KURMI
PRANAV SHARMA
RISHABH KUMAR
SOURAV KUMAR PAUL
2. PRESENTATION TIMELINE
Theoretical NLP, Knowledge Base, Design – Pranav Sharma
Practical NLP, Application, Forming of Tokens – Rishabh Kumar
Clustering – Sourav Kr Paul
Tensorflow – Nikhil Pathania
Query Model – Partha Pratim Kurmi
3. PROJECT TIMELINE
Problem Formulation – Sep 2016
Literature Survey – Sep–Oct 2016
Design Methodology – Nov 2016
Synchronizing Modules – Nov 2016
Basic Implementation – Jan–Feb 2017
Working Model – Mar 2017
Accuracy Improvements – Mar–Apr 2017
4. PROBLEM STATEMENT
Automate the tasks of customer support centers.
AIM - Build a system to answer questions such as:
"How to recharge my mobile?" - PayTM
"How to pay my bills?" - PayTM
"Why is my refund not credited?" - Book My Show
7. DATA EXTRACTION MODELS
WHY NLP?
NLP
• 3-step process.
• Extends with clustering.
• Fast and accurate.
PATTERN MATCHING
• 4-step process.
• No extension with clustering.
• Smaller domain.
Example
Knowledge Base - "The CEO of IBM is Samuel Palmisano."
Query - "Who is the CEO of IBM?"
Format - Q is A
15. NLTK ( NATURAL LANGUAGE TOOLKIT )
• Suite of libraries.
• Python support.
• Libraries we will be using include:
• Lexical analysis.
• Parts-of-speech tagger.
21. STEMMING:-
Word = Stem + Affixes
Example:- playing = play (stem) + ing (affix)
TARGET:- Remove the affixes from a word (this is stemming).
E.g. plays, playing, playful are all reduced to 'play'
Library in NLTK :- PorterStemmer
22. EXAMPLE :-
From stop-word removal :-
['Recharge', 'mobile', 'visiting', 'link']
After stemming :-
['Recharge', 'mobile', 'visit', 'link'] // input for clustering is generated
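NLTK's `PorterStemmer` performs this affix stripping. As a minimal sketch without the NLTK dependency, the toy stemmer below strips a small, invented suffix list; the real Porter algorithm has many more rules and can reduce stems further (e.g. it lowercases and trims trailing 'e's):

```python
# Toy suffix-stripping stemmer -- a simplified sketch of the idea behind
# NLTK's PorterStemmer. The suffix list is illustrative, not Porter's rules.
SUFFIXES = ["ing", "ful", "es", "s"]

def toy_stem(word):
    """Strip the first matching affix from the end of the word."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[:-len(suffix)]
    return lower

tokens = ["Recharge", "mobile", "visiting", "link"]  # output of stop-word removal
print([toy_stem(t) for t in tokens])  # ['recharge', 'mobile', 'visit', 'link']
```

The slide's example (plays, playing, playful → play) falls out of the same rule.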
23. POS TAGGING:-
POS (part of speech) = the linguistic category of a token, such as verb, noun, etc.
Target :- Tag the tokens with their POS in a universal format.
24. EXAMPLE :-
From stemming:-
['Recharge', 'mobile', 'visit', 'link']
After POS tagging:-
[('Recharge', 'NN'), ('mobile', 'NN'), ('visit', 'VBG'), ('link', 'NN')]
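In NLTK the call is `nltk.pos_tag(tokens)`, which produces Penn Treebank tags like those above. The toy rule-based tagger below only sketches the idea; its suffix rules are invented and far cruder than NLTK's trained tagger:

```python
import re

# Minimal rule-based POS tagger sketch. The rules are illustrative guesses;
# NLTK's nltk.pos_tag uses a trained model and is far more accurate.
RULES = [
    (r".*ing$", "VBG"),   # gerunds, e.g. 'visiting'
    (r".*ed$", "VBD"),    # past-tense verbs
    (r".*s$", "NNS"),     # plural nouns
    (r".*", "NN"),        # default: singular noun
]

def toy_pos_tag(tokens):
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:  # first matching rule wins
            if re.match(pattern, token.lower()):
                tagged.append((token, tag))
                break
    return tagged

print(toy_pos_tag(["Recharge", "mobile", "visiting", "link"]))
# [('Recharge', 'NN'), ('mobile', 'NN'), ('visiting', 'VBG'), ('link', 'NN')]
```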
27. DOCUMENT CLUSTERING – WHAT AND WHY?
• Unsupervised document organization
• Automatic topic organization
• Topic extraction
• Fast Information retrieval and filtering
28. EXAMPLES
• Web document clustering for search users.
• QA document clustering to solve common problems and questions.
30. CLUSTERING
• Algorithm
  • Find the k most dissimilar documents
  • Assign them as the k centroids
  • Repeat until no change:
    • For each document, find the most similar cluster (using the cosine similarity function)
    • Recalculate the centroid of each cluster
    • Stop if no document was reassigned
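A minimal sketch of this loop on tiny term-count vectors, assuming the k initial centroids are already chosen (the data and vocabulary are illustrative):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, centroids, max_iter=100):
    """The slide's loop: assign each document to the most similar centroid,
    recompute centroids, and stop when no document is reassigned."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(range(len(centroids)), key=lambda k: cosine_sim(doc, centroids[k]))
            for doc in docs
        ]
        if new_assignment == assignment:  # no document was reassigned
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [d for d, a in zip(docs, assignment) if a == k]
            if members:  # average the member vectors component-wise
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Term-count vectors over the vocabulary [recharge, mobile, cancel, ticket]
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(kmeans(docs, [[1, 1, 0, 0], [0, 0, 1, 1]]))  # [0, 0, 1]
```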
31. K-MEANS USING JACCARD DISTANCE MEASURE
• Problems with the simple K-Means procedure:
  • Greedy algorithm.
  • Doesn't guarantee the best solution.
• JACCARD distance measure:
  • Used to find the k most dissimilar documents as initial centroids.
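One way to sketch this seeding step: compute Jaccard distance on token sets and greedily pick documents farthest from the seeds chosen so far. The greedy strategy and starting seed are illustrative choices, not prescribed by the slide:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on token sets; 1.0 means completely dissimilar."""
    a, b = set(a), set(b)
    union = a | b
    return (1.0 - len(a & b) / len(union)) if union else 0.0

def pick_seeds(docs, k):
    """Greedy seed pick: start from doc 0, then repeatedly take the document
    farthest (by Jaccard distance) from the seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        best = max(
            (i for i in range(len(docs)) if i not in seeds),
            key=lambda i: min(jaccard_distance(docs[i], docs[s]) for s in seeds),
        )
        seeds.append(best)
    return seeds

docs = [
    ["recharge", "mobile", "visit", "link"],
    ["recharge", "landline", "visit", "link"],
    ["cancel", "ticket", "process"],
    ["add", "money", "wallet"],
]
print(pick_seeds(docs, 2))  # [0, 2]: docs 2 and 3 share no tokens with doc 0
```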
32. OUTPUT OF PREPROCESSING
• Possible text documents are :
• Recharge mobile visit link
• Recharge landline visit link
• Cancel ticket process
• Add money wallet
33. CALCULATING TF-IDF VECTORS
• Term Frequency – Inverse Document Frequency
• A weight that ranks the importance of a term
• High for terms frequent in a document but rare in the document set
• Ex: "College name NITS" - 'name' is frequent but not rare across documents, so it gets a low weight.
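A sketch of the weighting on the preprocessed documents from slide 32, assuming the common formulation tf = count / document length and idf = log(N / document frequency):

```python
import math

docs = [
    ["recharge", "mobile", "visit", "link"],
    ["recharge", "landline", "visit", "link"],
    ["cancel", "ticket", "process"],
    ["add", "money", "wallet"],
]

def tf_idf(term, doc, docs):
    """tf = occurrences / doc length; idf = log(N / docs containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# 'recharge' appears in 2 of 4 documents, 'mobile' in only 1,
# so 'mobile' gets the higher weight within the first document:
print(round(tf_idf("recharge", docs[0], docs), 3))  # 0.25 * ln(2) ≈ 0.173
print(round(tf_idf("mobile", docs[0], docs), 3))    # 0.25 * ln(4) ≈ 0.347
```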
53. RECOMMENDATION ENGINE
• The Recommendation Engine analyzes the available data to answer questions.
• The various steps are:
1. Data collection
2. Preprocessing and Transformations
3. Classifier Ensemble
54. PREPROCESSING AND TRANSFORMATIONS
• The training set consists of FAQs, past forum posts, etc.
• Given a question, we want to deduce its genre from the text.
• Only the text of the question is extracted.
• Feature selection evaluates the importance of a word using TF-IDF.
55. PREPROCESSING AND TRANSFORMATIONS
• Training set derived from the key parts of speech in each sentence
Example        - "How to recharge my mobile"
Part of speech - Verb | Noun | Object
Decision label - Task | Electronics
57. CLASSIFIER ENSEMBLE
• Ensemble modelling is used for classification using three classifiers
• Naïve Bayesian using FAQ training set
• POS Naïve Bayesian
• Threshold Biasing classifier
58. ENSEMBLE STRUCTURE
• A learning algorithm that combines multiple classifiers
• Classifies using a weighted vote over their decisions
• The decision of the classifier with better precision is favoured
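The weighted vote can be sketched as follows; the classifier names and weight values are hypothetical stand-ins for the three classifiers above and their measured precisions:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine classifier decisions by a weighted vote. The weights stand in
    for each classifier's measured precision (hypothetical values here)."""
    tally = defaultdict(float)
    for clf, label in predictions.items():
        tally[label] += weights[clf]
    return max(tally, key=tally.get)

predictions = {
    "faq_naive_bayes": "recharge",   # Naïve Bayesian on the FAQ training set
    "pos_naive_bayes": "recharge",   # POS Naïve Bayesian
    "threshold_bias": "billing",     # threshold biasing classifier
}
weights = {"faq_naive_bayes": 0.8, "pos_naive_bayes": 0.6, "threshold_bias": 0.7}
print(weighted_vote(predictions, weights))  # 'recharge' (0.8 + 0.6 beats 0.7)
```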
59. RESULTS
• Documents are hand-tagged with genres.
• The Ensemble approach uses a bag (tally) of genre votes.
• The count of each genre is taken into account.
• The top-tallied genre is used to generate the result.
• Answer: "recharge mobile visit link"
62. CONCLUSION AND OUTCOMES
The outcomes of this project can be summarised (but are not limited to) in the following points :-
1. Complete designed architecture.
2. Modules and their uses properly defined.
3. Model solution to the problem.
Hence we conclude that the theoretical and survey aspects of the problem are complete. We have selected the best technical solutions after surveying the existing alternatives. A working model is therefore expected from the team soon.
63. LITERATURE SURVEY
1. Natural Language Annotations for Question Answering - Boris Katz, Gary Borchardt and Sue Felshin
2. Using English for Indexing and Retrieving - Boris Katz
3. Recommendation engine: Matching individual/group profiles for better shopping experience - Sanjeev Kulkarni, Ashok M. Sanpal, Ravindra R. Mudholkar, Kiran Kumari
4. Recommendation engine for Reddit - Hoang Nguyen, Rachel Richards, C.C. Chan, Kathy J. Liszka
5. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems - Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo
6. Executing a program on the MIT tagged-token dataflow architecture (IEEE Trans. Comput., 1990) - Arvind and Rishiyur S. Nikhil
64. LITERATURE SURVEY
7. An efficient K-Means Algorithm integrated with Jaccard Distance Measure for Document Clustering - Mushfeq-Us-Saleheen Shameem, Raihana Ferdous
8. An Intelligent Similarity Measure for Effective Text Document Clustering - M. L. Aishwarya, K. Selvi
9. K Means Clustering with Tf-idf Weights - Jonathan Zong
10. Comparison Between K-Mean and Hierarchical Algorithm Using Query Redirection - Manpreet Kaur, Usvir Kaur
11. Question Answering System on Education Acts Using NLP Techniques - Dr. M. M. Raghuwanshi
65. LITERATURE SURVEY
12. Affective – Hierarchical Classification of Text – An Approach Using NLP Toolkit - Dr. R. Venkatesan
13. Building high-level features using large scale unsupervised learning (ICML 2012) - Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, and Andrew Ng
14. Preprocessing Techniques for Text Mining - An Overview - Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya
For example, as the amount of online information increases rapidly, users as well as information retrieval systems need to classify the desired documents against a specific query.
- Hierarchical algorithms include single link, complete linkage, and group average.
- Applications can be online or offline.
- Online applications are usually constrained by efficiency problems compared to offline applications.
- Hierarchical algorithms produce more in-depth information for detailed analyses.
- K-means is more efficient and provides sufficient information for most purposes.
- Term frequency (TF): the ratio of the number of occurrences of a word in its document to the total number of words in that document, i.e. the fraction of the document that is a particular term.
- Inverse document frequency (IDF): based on the ratio of the number of documents in the corpus to the number of documents containing the given term. Taking the logarithm of this ratio assigns a higher weight to rarer terms.
Where used - for conducting research and deploying ML into production. Wide range of applications in fields like NLP, recommendation engines, geographical information extraction and computational drug discovery.
Node: an instantiation of an operation (with multiple inputs and outputs).
Tensor: an arbitrary-dimensional array (flowing from outputs to inputs).
Client programs interact with TensorFlow by creating a session.
Initially the graph has no nodes and edges.
The session interface supports Extend and Run:
- Extend augments the graph with additional nodes and edges.
- Run takes the set of output names that need to be computed.
A Variable is a special kind of operation that returns a handle to a persistent, mutable tensor that survives across executions of the graph.
Single device: nodes of the graph are executed in an order that respects the dependencies between the nodes.
Multi-device: deciding which device should perform the computation for each node of the graph, and managing communication of data across device boundaries.
Device placement feasibility:
- A greedy heuristic chooses the placement expected to give the best results.
- The device's kernel must implement the particular operation.
- Any cross-device edge from x to y is replaced by a send node and a receive node.
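The execution model above can be illustrated with a toy dataflow executor. This is not TensorFlow's real API, only a sketch of the idea: nodes are operations, edges carry values, and a run evaluates nodes in an order that respects their dependencies:

```python
# Toy dataflow graph: nodes are operations, edges carry "tensors" (plain
# numbers here). Running a node evaluates its dependencies first, which is
# the single-device execution model described above.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op          # callable that computes this node's output
        self.inputs = inputs  # upstream nodes whose outputs feed this one

def run(node, cache=None):
    """Compute a node's value, evaluating dependencies first (post-order).
    The cache ensures each node is executed at most once per run."""
    if cache is None:
        cache = {}
    if node not in cache:
        args = [run(dep, cache) for dep in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]

# Graph for (a + b) * b with constants a=2, b=3
a = Node(lambda: 2)
b = Node(lambda: 3)
add = Node(lambda x, y: x + y, (a, b))
mul = Node(lambda x, y: x * y, (add, b))
print(run(mul))  # 15
```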
Data Parallel Training
One simple technique for speeding up SGD is to parallelize the computation of the gradient for a mini-batch across the mini-batch elements.
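A sketch of that idea for a one-parameter least-squares model; the shards are processed sequentially here as stand-ins for parallel devices, and the data, learning rate, and shard count are illustrative:

```python
# Data-parallel SGD sketch: split the mini-batch into shards, compute each
# shard's gradient independently (in TensorFlow these would run on separate
# devices), average the gradients, then apply one parameter update.
# Model: y = w * x with squared loss.

def shard_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parallel_sgd_step(w, batch, num_shards, lr=0.05):
    size = len(batch) // num_shards
    shards = [batch[i * size:(i + 1) * size] for i in range(num_shards)]
    grads = [shard_gradient(w, s) for s in shards]   # computed "in parallel"
    return w - lr * sum(grads) / len(grads)          # average, then update

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x
w = 0.0
for _ in range(50):
    w = parallel_sgd_step(w, batch, num_shards=2)
print(round(w, 2))  # converges toward 2.0
```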