Category & Training Texts Selection for
Scientific Article Categorization in
an Expert Search System
By
Gan Keng Hoon*, Chua San Thai,
Khoh Zhuo Yan, Goh Kau Yang
School of Computer Sciences,
Universiti Sains Malaysia
Motivation
Scientific articles are produced as results of research.
Organizing scientific articles into subject areas or topics
helps with discovery, navigation, etc.
Motivation
Microsoft Academic
Motivation
Google Scholar
Motivation
Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa
(2015), "Facet-value extraction scheme from textual contents in XML
data".
Scope
Application oriented research
Expert Search System
DBLP Dataset
School of Computer Sciences, USM
Goal
Improving the categorization of scientific articles
For
Capturing experts' expertise based on their publications.
Enabling category filtering during search.
Existing Approaches
Labelled Scientific Article
Supervised Learning method to train and test
Feature Selection
Bag of Words, N-gram, POS, Term Frequency, TF-IDF
This research
Train with Labelled Scientific Related Domain Texts
Test with Scientific Article
Research Justification
Avoid the use of a large number of labelled training texts.
Focus on differentiating good sources of training texts.
Use a reasonably small number of training texts to build
the subject category model.
Process of category model construction on
scientific article domain.
Feature Selection
Feature Term Generation
An n-gram technique is used to generate candidate terms from the training text. E.g.
D = “Search engine is an artificial intelligence system.”
2-gram word: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] =>
artificial intelligence [5] => intelligence system)
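The term generation step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `word_ngrams` and the simple whitespace tokenization with punctuation stripping are assumptions made for the example.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of `text` in order of appearance."""
    # Assumption: tokens are whitespace-separated words with surrounding
    # punctuation removed, so "system." becomes "system".
    tokens = [tok.strip(".,;:!?\"'()") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    # Slide a window of n tokens over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


D = "Search engine is an artificial intelligence system."
bigrams = word_ngrams(D, 2)
for index, term in enumerate(bigrams):
    print(f"[{index}] => {term}")
```

For the sample sentence this yields the same six 2-gram candidates listed on the slide. A text with fewer than n words yields no candidates.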
Feature Selection by TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a common method for keyword
weighting. The TF-IDF value of each candidate term is computed, and the terms with the top N
TF-IDF values are selected as features. This method penalizes a term that occurs in many
different training texts. The TF-IDF values are computed as
\[ \text{TF-IDF}_{Term_i} = TF_{Term_i} \times \log \frac{N_D}{DF_{Term_i}} \]
where $TF_{Term_i}$ is the term frequency of $Term_i$, $DF_{Term_i}$ is the number of
documents containing $Term_i$, and $N_D$ is the total number of documents.
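A minimal sketch of this weighting and top-N selection follows. Several details are assumptions, since the slides do not state them: term frequency is taken as the raw count within one training text, the logarithm is the natural log, and the names `tf_idf` and `top_n_features` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    weights = []
    for terms in docs:
        tf = Counter(terms)  # raw term counts (assumed TF definition)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def top_n_features(weight, n):
    """Pick the n highest-weighted terms; ties are broken alphabetically."""
    return sorted(weight, key=lambda t: (-weight[t], t))[:n]


# Hypothetical training texts, already converted to 2-gram terms.
docs = [
    ["search engine", "search engine", "artificial intelligence"],
    ["search engine", "data mining"],
]
weights = tf_idf(docs)
features = top_n_features(weights[0], 1)
```

Here "search engine" appears in both texts, so its weight is zero despite its higher count, and "artificial intelligence" is selected instead. This is the penalizing effect described above.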
Transfer Training Approach
Intuition
If the training texts are representative enough to cover the concept of a
category, then the training set can be obtained from any source that shares
similar concepts or semantics.
Criteria
The two text sources share the same or partially similar categories.
The categories must bear the same concept or meaning.
The training source must be comprehensive enough to cover a category’s concept.
The training source must be available; the testing source need not be.
This approach is particularly useful when sources of unseen texts are not
readily available.
Training and Testing Category Model
The training of the category model, CM, is defined by the function $CM_{Build}$. For each category,
$Cat$, the function takes in a set of documents, $D_{Cat}$, i.e. the training texts, and maps them to a set of
features, $F_{Cat}$.
\[ CM_{Build}: D_{Cat} \rightarrow F_{Cat} \]
The testing of the category model is defined by the function $CM_{Sim}$. For each new document, $D_{new}$,
the function maps the document to a set of most relevant categories, $Cat$.
\[ CM_{Sim}: D_{new} \rightarrow Cat \]
Feature Similarity Scoring
The scoring technique is based on the cosine similarity measure of the Vector Space Model. The
feature sets of the category models are viewed as a set of vectors in a vector space in which each term
has its own axis. The similarity of a category and a document, $Sim_F$, is calculated from the
angle between their vectors as follows.
\[ Sim_F = \frac{F_{Cat} \cdot F_{D_{new}}}{\lVert F_{Cat} \rVert \, \lVert F_{D_{new}} \rVert} \]
where $F_{Cat}$ is the feature vector of a category and $F_{D_{new}}$
is the feature vector of a new document.
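The scoring and category assignment can be sketched as follows. This is an illustration under stated assumptions, not the system's code: feature vectors are stored as sparse `{term: weight}` dictionaries, the "most relevant categories" are taken to be the top k by score, and the category names, weights, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def rank_categories(category_features, doc_features, k=3):
    """Return the k categories most similar to the document, best first."""
    scored = [(cosine(f, doc_features), cat) for cat, f in category_features.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(cat, score) for score, cat in scored[:k]]


# Hypothetical category models (the output of the training step).
category_features = {
    "Information Retrieval": {"search engine": 1.0, "query": 1.0},
    "Artificial Intelligence": {"artificial intelligence": 1.0, "neural network": 1.0},
}
# Hypothetical feature vector of a new document.
new_doc = {"search engine": 1.0, "artificial intelligence": 1.0, "query": 1.0}
ranking = rank_categories(category_features, new_doc, k=2)
```

In this toy case the document shares two terms with the first category and one with the second, so "Information Retrieval" ranks first. Because only the angle between the vectors matters, the relative length of the training texts and the new document does not affect the score.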
Evaluation Settings
Performance Metric
Whether a scientific article is correctly assigned to a category.
Correctness is evaluated by expert judgement.
Training Texts
Title and Abstract are used.
Tasks
Common categories (30 general) vs. common + specific categories (30
general + 12 domain-specific)
Automated vs. manual selection of training texts
Evaluation Results
Expert | Common categories + automated training texts (%) | Common and specific categories + automated training texts (%) | Common and specific categories + manual training texts (%)
Expert 1 | 62.50 | 68.75 | 81.25
Expert 2 | 46.67 | 46.67 | 53.33
Expert 3 | 33.33 | 33.33 | 66.67
Expert 4 | 33.33 | 41.67 | 41.67
Expert 5 | 43.75 | 37.50 | 28.13
(Average) | (43.92) | (45.59) | (54.21)
Conclusion
Possibility
It is possible to train a category model using training texts from one source and
apply it to a different source.
Challenge
The selection of training texts, as it can influence the accuracy of the trained
model.
Limitation
The selection of categories: the selected set is too small to cover the
research areas of the domain (e.g. Computer Science).
Thank You
For more of our work, please visit ir.cs.usm.my
Email me at khgan@usm.my
