SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Similarity  measurement:Folksonomyvs.LSA Preliminary Results
The Tripartite structure of tagging Folksonomy is a set of triples < user , tag, object> A folksonomy is a tuple F :=(U, T, R, Y) where U, T, and R are finite sets, whose elements are called users, tags and resources. Y is a ternary relation between user, tags and resources.
Del.icio.us Tag distribution Tag distribution Log-log Tag distribution After crawling the delicious.com site, the total number tags (tokens) obtained was 7,528,528, among which the number of types was 188,964.  All the tags are stemmed using the Porter Stemmer and the total number of stemmed tags ended up to be 174,887.
LSA Processing Workflow in R tm = textmatrix(‘dir/‘) tm = lw_logtf(tm) * gw_idf(tm) space = lsa(tm, dims=dimcalc_share()) as.textmatrix(tm)
LSA corpus preparing A total number of 17,085 web pages were crawled and were later parsed to remove all the HTML markups.   Stemming and Stop-word removal The processed corpus:14,993,620 tokens, 259,464 types of words.   Only words with frequency more than 100 were kept to be entered into a word-by-document matrix.  There were 1047 words with frequency more than 100.  The text length kept in the corpus is from ranges from 51 words per document to 4009 words per document . Therefore, the resulting term-document matrix have 3465 columns (documents) and 1082 rows (words). 
LSA document length distribution The text length kept in the corpus is from ranges from 51 words per document to 4009 words per document .
Three similarity measurements Tag Co-coccurrence counts Tag vector cosine similarity LSA
Similarity Measurement Tag Co-occurrence Counts:     1)simple count: how many times two tags are used by the same user to annotate the same resource     2)normalized count: Jaccard IndexThe co-occurrence counts of tag A and tag B divided by the joint frequency of A and B.
Distribution of Tag co-occurrence Counts (simple counts)
Distribution of Tag co-occurrence Counts (normalized)
Measurement 2: Cosine Similarity Based on the co-occurrence vector of each tag with every other tag. Since there are normalized and unnormalized tag co-occurrence counts, we then ended up with two  X and Y are the co-occurrence vectors of two distinct tags.
Distribution of Tag Cosine Similarity
Distribution of Tag Cosine Similarity based on normalized Tag Co-occurrence counts
Results The pair-wise Pearson correlation and Spearman correlation among 5 measurements [Tag-cooccurrence count, Tag cosine similarity, LSA, normalized Tag-cooccurrence count, Tag cosine similarity based on normalized tag-tag cooccurrence matrix]
Correlation
Qualitative Insight – ‘linguistics’ Top 10 “linguistics” related words according to 5 measurements
Correlation between two measurements:normalized tag co-occurrence counts vs. normalized tag cosine P=  0.7547
Similarity  Measurement  Preliminary Results

Weitere Àhnliche Inhalte

Was ist angesagt?

Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704IJERA Editor
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysissrinivasa teja
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification Mahmoud Alfarra
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Using SweetSpotSimilarity for Solr Fulltext Indexing
Using SweetSpotSimilarity for Solr Fulltext IndexingUsing SweetSpotSimilarity for Solr Fulltext Indexing
Using SweetSpotSimilarity for Solr Fulltext IndexingJay Luker
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)KU Leuven
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
A Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaA Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaINRIA-OAK
 
A survey of Stemming Algorithms for Information Retrieval
A survey of Stemming Algorithms for Information RetrievalA survey of Stemming Algorithms for Information Retrieval
A survey of Stemming Algorithms for Information Retrievaliosrjce
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersSriTeja Allaparthi
 
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING mlaij
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsShubhangi Tandon
 

Was ist angesagt? (17)

Ghost
GhostGhost
Ghost
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysis
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Using SweetSpotSimilarity for Solr Fulltext Indexing
Using SweetSpotSimilarity for Solr Fulltext IndexingUsing SweetSpotSimilarity for Solr Fulltext Indexing
Using SweetSpotSimilarity for Solr Fulltext Indexing
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
L0261075078
L0261075078L0261075078
L0261075078
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
EDI 2009- Advanced Search: What’s Under the Hood of your Favorite Search System?
 
A Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaA Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social Media
 
A survey of Stemming Algorithms for Information Retrieval
A survey of Stemming Algorithms for Information RetrievalA survey of Stemming Algorithms for Information Retrieval
A survey of Stemming Algorithms for Information Retrieval
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
 

Andere mochten auch

Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel  sem_info_rec_learning_resources_v6.0_20120921_maEctel  sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel sem_info_rec_learning_resources_v6.0_20120921_maMojisola Erdt née Anjorin
 
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Mojisola Erdt née Anjorin
 
Iknow ranking sem_info_v9.0__2012.09.07_anjorin
Iknow ranking sem_info_v9.0__2012.09.07_anjorinIknow ranking sem_info_v9.0__2012.09.07_anjorin
Iknow ranking sem_info_v9.0__2012.09.07_anjorinMojisola Erdt née Anjorin
 
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...Mojisola Erdt née Anjorin
 
Context Determines Content - An Approach to Resource Recommendation in Folkso...
Context Determines Content - An Approach to Resource Recommendation in Folkso...Context Determines Content - An Approach to Resource Recommendation in Folkso...
Context Determines Content - An Approach to Resource Recommendation in Folkso...Mojisola Erdt née Anjorin
 
Otot Manusia
Otot ManusiaOtot Manusia
Otot ManusiaOrlin Moria
 
A Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without DictionariesA Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without Dictionaries鍟èȘ  é™łéŸèȘ 
 

Andere mochten auch (9)

Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel  sem_info_rec_learning_resources_v6.0_20120921_maEctel  sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
 
Eval rec algo_crowdsourcing__icalt_2014_ma
Eval rec algo_crowdsourcing__icalt_2014_maEval rec algo_crowdsourcing__icalt_2014_ma
Eval rec algo_crowdsourcing__icalt_2014_ma
 
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
 
Iknow ranking sem_info_v9.0__2012.09.07_anjorin
Iknow ranking sem_info_v9.0__2012.09.07_anjorinIknow ranking sem_info_v9.0__2012.09.07_anjorin
Iknow ranking sem_info_v9.0__2012.09.07_anjorin
 
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...
Towards Ranking in Folksonomies for Personalized Recommender Systems in E-Lea...
 
Context Determines Content - An Approach to Resource Recommendation in Folkso...
Context Determines Content - An Approach to Resource Recommendation in Folkso...Context Determines Content - An Approach to Resource Recommendation in Folkso...
Context Determines Content - An Approach to Resource Recommendation in Folkso...
 
Rs web context_content__v4.0__20120908_ma
Rs web context_content__v4.0__20120908_maRs web context_content__v4.0__20120908_ma
Rs web context_content__v4.0__20120908_ma
 
Otot Manusia
Otot ManusiaOtot Manusia
Otot Manusia
 
A Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without DictionariesA Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without Dictionaries
 

Ähnlich wie Similarity Measurement Preliminary Results

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity
Stop thinking, start tagging - Tag Semantics emerge from Collaborative VerbosityStop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity
Stop thinking, start tagging - Tag Semantics emerge from Collaborative VerbosityInovex GmbH
 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachFindwise
 
Chat bot using text similarity approach
Chat bot using text similarity approachChat bot using text similarity approach
Chat bot using text similarity approachdinesh_joshy
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomiesdermotte
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDataminingTools Inc
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDatamining Tools
 
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...IJCSEA Journal
 
unit-4.ppt
unit-4.pptunit-4.ppt
unit-4.pptMsRAMYACSE
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAijistjournal
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations onijistjournal
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_fariaPaulo Faria
 

Ähnlich wie Similarity Measurement Preliminary Results (20)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
K017367680
K017367680K017367680
K017367680
 
L0261075078
L0261075078L0261075078
L0261075078
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity
Stop thinking, start tagging - Tag Semantics emerge from Collaborative VerbosityStop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity
Stop thinking, start tagging - Tag Semantics emerge from Collaborative Verbosity
 
Token
TokenToken
Token
 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
 
Chat bot using text similarity approach
Chat bot using text similarity approachChat bot using text similarity approach
Chat bot using text similarity approach
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomies
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
 
unit-4.ppt
unit-4.pptunit-4.ppt
unit-4.ppt
 
unit 4.ppt
unit 4.pptunit 4.ppt
unit 4.ppt
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria
 

KĂŒrzlich hochgeladen

Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂșjo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

KĂŒrzlich hochgeladen (20)

Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls đŸ„° 8617370543 Service Offer VIP Hot Model
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Similarity Measurement Preliminary Results

  • 2. The Tripartite structure of tagging Folksonomy is a set of triples < user , tag, object> A folksonomy is a tuple F :=(U, T, R, Y) where U, T, and R are finite sets, whose elements are called users, tags and resources. Y is a ternary relation between user, tags and resources.
  • 3. Del.icio.us Tag distribution Tag distribution Log-log Tag distribution After crawling the delicious.com site, the total number tags (tokens) obtained was 7,528,528, among which the number of types was 188,964. All the tags are stemmed using the Porter Stemmer and the total number of stemmed tags ended up to be 174,887.
  • 4. LSA Processing Workflow in R tm = textmatrix(‘dir/‘) tm = lw_logtf(tm) * gw_idf(tm) space = lsa(tm, dims=dimcalc_share()) as.textmatrix(tm)
  • 5. LSA corpus preparing A total number of 17,085 web pages were crawled and were later parsed to remove all the HTML markups. Stemming and Stop-word removal The processed corpus:14,993,620 tokens, 259,464 types of words. Only words with frequency more than 100 were kept to be entered into a word-by-document matrix. There were 1047 words with frequency more than 100. The text length kept in the corpus is from ranges from 51 words per document to 4009 words per document . Therefore, the resulting term-document matrix have 3465 columns (documents) and 1082 rows (words). 
  • 6. LSA document length distribution The text length kept in the corpus is from ranges from 51 words per document to 4009 words per document .
  • 7. Three similarity measurements Tag Co-coccurrence counts Tag vector cosine similarity LSA
  • 8. Similarity Measurement Tag Co-occurrence Counts: 1)simple count: how many times two tags are used by the same user to annotate the same resource 2)normalized count: Jaccard IndexThe co-occurrence counts of tag A and tag B divided by the joint frequency of A and B.
  • 9. Distribution of Tag co-occurrence Counts (simple counts)
  • 10. Distribution of Tag co-occurrence Counts (normalized)
  • 11. Measurement 2: Cosine Similarity Based on the co-occurrence vector of each tag with every other tag. Since there are normalized and unnormalized tag co-occurrence counts, we then ended up with two X and Y are the co-occurrence vectors of two distinct tags.
  • 12. Distribution of Tag Cosine Similarity
  • 13. Distribution of Tag Cosine Similarity based on normalized Tag Co-occurrence counts
  • 14. Results The pair-wise Pearson correlation and Spearman correlation among 5 measurements [Tag-cooccurrence count, Tag cosine similarity, LSA, normalized Tag-cooccurrence count, Tag cosine similarity based on normalized tag-tag cooccurrence matrix]
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. Qualitative Insight – ‘linguistics’ Top 10 “linguistics” related words according to 5 measurements
  • 22. Correlation between two measurements:normalized tag co-occurrence counts vs. normalized tag cosine P= 0.7547

Hinweis der Redaktion

  1. Spearman’s rank correlation