Author-Paper Identification Problem
Guided By
Prof Duc Tran
Team :
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Problem Statement
Determine the correct author, from the author dataset, for a particular paper.
Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
The challenge is to determine which papers in an author profile were truly written by that author.
Data Points
The data points include all papers written by an author and the author's affiliation (university, technical society, groups).
• PaperAuthor (PaperId, AuthorId, Name, Affiliation)
The metadata includes the journals an author published in and the conferences an author attended.
• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author (Id, Name, Affiliation)
• Conference (Id, ShortName, FullName, HomePage)
• Journal (Id, ShortName, FullName, HomePage)
Machine Learning Task
• Feature Engineering
• Algorithms
• Model Tuning
• Results
• Evaluation
Steps Taken to Solve Problem
• Data preprocessing and cleaning
• Feature engineering
• Choosing a model: Random Forest / Gradient Boost
• Building the model
• Evaluating the results
• Extracting feature values
• Building the model using the modified train file
• Evaluating the results
• Tuning the model
Data preprocessing and cleaning
Issues with data
The CSV files needed cleaning:
• A few records had attributes spilled over 3 rows
• Some rows had more attributes than the required number
• Special characters caused issues
We wrote a Perl script to clean and format the data
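The Perl script itself is not shown in the slides. As a rough illustration only, the three issues listed here could be handled along the following lines in Python. The function name `repair_rows`, the choice to fold surplus fields into the last column, and the control-character filter are all assumptions, not the team's actual logic.

```python
import csv
import re

# Assumption: "special characters" are treated here as ASCII control characters.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def parse_line(text):
    """Parse one CSV record from a single string."""
    return next(csv.reader([text]))


def repair_rows(lines, expected_fields):
    """Rebuild records that spilled over several physical lines.

    Hypothetical helper: lines are glued together until a record has at
    least `expected_fields` fields; surplus fields are folded back into
    the last column.
    """
    repaired = []
    pending = ""
    for raw in lines:
        line = CONTROL_CHARS.sub("", raw.rstrip("\r\n"))
        if not line.strip() and not pending:
            continue  # skip blank lines between records
        pending = f"{pending} {line}".strip() if pending else line
        fields = parse_line(pending)
        if len(fields) < expected_fields:
            continue  # record is incomplete, keep reading
        if len(fields) > expected_fields:
            head = fields[: expected_fields - 1]
            tail = ",".join(fields[expected_fields - 1:])
            fields = head + [tail]
        repaired.append([f.strip() for f in fields])
        pending = ""
    return repaired
```

This heuristic cannot tell a spilled record from a short one, so it only suits files where every record is expected to have the same number of attributes.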
Issues with data-I
Issues with data-II
Feature Engineering Steps
• Aggregation: combining multiple features into one.
How we used it: built an elaborated train file by combining AuthorId, PaperId and Confirmation with the name and affiliation from the Author file.
• Discretization: converting continuous features or variables to discretized or nominal features.
How we used it: the year the paper was published, and the max and min years in which the author was actively publishing papers.
• Construction: creating new features out of the original ones.
How we used it: keywords in the Paper file.
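A minimal sketch of the aggregation and year steps, assuming the train file yields (AuthorId, PaperId, Confirmation) rows and the Author file has been loaded into a dictionary. The names `elaborate_train` and `active_year_range` are hypothetical, and treating a missing year as `None` is an assumption.

```python
def elaborate_train(train_rows, authors):
    """Join train rows with the author's name and affiliation.

    train_rows: iterable of (author_id, paper_id, confirmation)
    authors: dict mapping author_id -> (name, affiliation)
    """
    elaborated = []
    for author_id, paper_id, confirmation in train_rows:
        # Authors missing from the Author file get empty strings.
        name, affiliation = authors.get(author_id, ("", ""))
        elaborated.append({
            "AuthorId": author_id,
            "PaperId": paper_id,
            "Confirmation": confirmation,
            "Name": name,
            "Affiliation": affiliation,
        })
    return elaborated


def active_year_range(paper_years):
    """Return (min_year, max_year) over an author's papers, ignoring missing years."""
    known = [year for year in paper_years if year is not None]
    if not known:
        return (None, None)
    return (min(known), max(known))
```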
Author features
• distance between the author names in the PaperAuthor and Author files
• matched substring ratio between the author names in the PaperAuthor and Author files
• keywords used by a particular author (lower weight)
• count of keywords for the author
• count of co-authors for a given author
• weighted TF-IDF measure of all author keywords inside the author's papers
• count of distinct papers by the author
• years during which the author wrote many papers
• number of times an author is repeated (sum over distinct ids)
• list of distinct ids assigned to the same author
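The slides do not define the name distance or the matched substring ratio. One possible reading, sketched with Python's standard `difflib`, is shown here. The normalization step and both formulas are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace (an assumed normalization)."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def name_features(paper_author_name, author_name):
    """Return (distance, matched_substring_ratio) for two spellings of a name.

    distance: 1 minus difflib's similarity ratio (0.0 means identical).
    matched_substring_ratio: longest common substring length divided by
    the length of the longer name.
    """
    a = normalize_name(paper_author_name)
    b = normalize_name(author_name)
    if not a or not b:
        return (1.0, 0.0)  # nothing to compare
    matcher = SequenceMatcher(None, a, b)
    distance = 1.0 - matcher.ratio()
    longest = matcher.find_longest_match(0, len(a), 0, len(b))
    substring_ratio = longest.size / max(len(a), len(b))
    return (distance, substring_ratio)
```

For example, "J. Smith" and "John Smith" share the substring " smith" (6 characters) out of a longer name of 10 characters, so the matched substring ratio under this definition is 0.6.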
Paper features
• year of the paper
• count of authors of the paper
• count of duplicated papers for the paper
• count of duplicated authors for the paper
• count of keywords in the paper
• how many times the exact same set of authors is repeated in different papers (without duplicates)
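Two of these paper features, the author count and the number of papers sharing the exact same author set, can be sketched as follows. The input shape (a list of (PaperId, AuthorId) pairs) and the name `paper_features` are assumptions.

```python
from collections import Counter, defaultdict


def paper_features(paper_author_pairs):
    """Compute per-paper author counts and repeated-author-set counts.

    paper_author_pairs: iterable of (paper_id, author_id); repeated pairs
    are collapsed because authors are stored in a set.
    """
    authors_by_paper = defaultdict(set)
    for paper_id, author_id in paper_author_pairs:
        authors_by_paper[paper_id].add(author_id)

    # Count how many papers carry each distinct author set.
    set_counts = Counter(frozenset(authors) for authors in authors_by_paper.values())

    return {
        paper_id: {
            "author_count": len(authors),
            "same_author_set_papers": set_counts[frozenset(authors)],
        }
        for paper_id, authors in authors_by_paper.items()
    }
```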
Paper-Author features
• correct affiliation from the PaperAuthor table: binary feature
• which year of the author's publishing activity the paper belongs to (1 for papers from the author's first year, 2 for the second year, and so on)
• count sources: number of times the author-paper pair appears in the PaperAuthor table
• whether the names in the PaperAuthor table and the Author table are the same
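A small sketch of the count-sources and publishing-year features. The slides leave the year feature open to interpretation; this version assumes it is the calendar offset from the author's first publication year (ranking the distinct active years would be another reading). Function names are hypothetical.

```python
from collections import Counter


def count_sources(paper_author_rows):
    """Count how often each (paper_id, author_id) pair appears in PaperAuthor."""
    return Counter((paper_id, author_id) for paper_id, author_id in paper_author_rows)


def publishing_year_index(paper_years):
    """Map each paper of one author to its 1-based publishing year.

    paper_years: dict mapping paper_id -> year (or None when unknown).
    Assumption: the index is year - first_year + 1.
    """
    known = [year for year in paper_years.values() if year is not None]
    if not known:
        return {paper_id: None for paper_id in paper_years}
    first_year = min(known)
    return {
        paper_id: (None if year is None else year - first_year + 1)
        for paper_id, year in paper_years.items()
    }
```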
Machine Learning Models Used
Models used earlier
• K-means clustering
• ZeroR
• J48 decision tree
• Naïve Bayes
Machine Learning Models Used
• RandomForest
• Gradient Boost
Feature values extraction
Count of papers for each author
Maximum active year for a given author
Minimum active year for a given author
Jaccard distance between the author name in the Author file and the PaperAuthor file
Jaccard distance between the affiliation in the Author file and the PaperAuthor file
The year the paper was published
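The Jaccard distance between two strings is 1 minus the size of the intersection of their token sets divided by the size of the union. The sketch below assumes lowercase whitespace-separated word tokens; the slides do not say whether words or character n-grams were used, and the handling of two empty strings is also an assumption.

```python
def jaccard_distance(text_a, text_b):
    """Jaccard distance between the word sets of two strings.

    Returns 0.0 for identical token sets and 1.0 for disjoint ones.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Assumption: two empty strings give no evidence of a match.
        return 1.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)
```

For example, "John Smith" and "J Smith" share one token out of three distinct tokens, so the distance is 1 - 1/3.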
Build Random Forest using Weka
With the elaborated train file
Build Random Forest using Mahout
Build Gradient Boost using H2O
Evaluation Metrics
• Accuracy
• Error percentage
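Both metrics follow directly from the confirmed and predicted labels: accuracy is the fraction of correct predictions, and the error percentage is 100 × (1 − accuracy). A minimal sketch, with a hypothetical function name:

```python
def accuracy_and_error(y_true, y_pred):
    """Return (accuracy, error_percentage) for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("label lists must not be empty")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    accuracy = correct / len(y_true)
    return accuracy, 100.0 * (1.0 - accuracy)
```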
Lessons Learned!
When you know exactly what you need to find in the provided data, use a supervised learning model such as classification rather than an unsupervised learning model such as clustering.
When you want to find patterns or structure in the provided data, use unsupervised learning models such as clustering.
Try building the model with the provided train file, but it might not always give the best results. You can extend it using the existing data, as long as you do not alter the original data.
Choosing the features is the most important step; feature values can be extracted from the given data and used to build the model.
Choosing features from different data points gives better results than choosing them from only one.
Thank you!!
