Adaptive Web-page Content Identification. Authors: John Gibson, Ben Wellner, Susan Lubar. Publication: WIDM 2007. Presenter: Jhih-Ming Chen 1
Outline Introduction Content Identification as Sequence Labeling Models Conditional Random Fields Maximum Entropy Classifier Maximum Entropy Markov Models Data Harvesting and Annotation Dividing into Blocks Experimental Setup Results and Analysis Conclusion and Future Work 2
Introduction Web pages containing news stories also include many other pieces of extraneous information such as navigation bars, JavaScript, images and advertisements. This paper’s goal: Detect identical and near duplicate articles within a set of web pages from a conglomerate of websites. Why do this work? Provide input for an application such as a Natural Language tool or for an index for a search engine. Re-display it on a small screen such as a cell phone or PDA. 3
Introduction Typically, content extraction is done via a hand-crafted tool targeted to handle a single web page format. Shortcoming: When the page format changes, the extractor is likely to break. It is labor intensive. Web page formats change fairly quickly and custom extractors often become obsolete a short time after they are written. Some websites use multiple formats concurrently and identifying each one and handling them properly makes this a complex task. In general, the site-specific extractors are unworkable as a long-term solution. The approach described in this paper is meant to overcome these issues. 4
Introduction The data set for this work consisted of web pages from 27 different news sites. Identifying portions of relevant content in web-pages can be construed as a sequence labeling problem. i.e. each document is broken into a sequence of blocks and the task is to label each block as Content or NotContent. The best system, based on Conditional Random Fields, can correctly identify individual Content blocks with recall above 99.5% and precision above 97.9%. 5
Content Identification as Sequence Labeling Problem description Identify the portions of news-source web-pages that contain relevant content – i.e. the news article itself. Two general ways to approach this problem: Boundary Detection Method Identify positions where content begins and ends. Sequence Labeling Method Divide the original Web document into a sequence of some appropriately sized units or blocks. Categorize each block as Content or NotContent. 6
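A toy illustration (hypothetical blocks and labels, not from the paper's data) of the sequence labeling framing: the page becomes a sequence of blocks, and each block receives a Content or NotContent label:

```python
# Hypothetical example of the sequence-labeling framing: a page split into blocks,
# each paired with a Content / NotContent label.
blocks = ["Top navigation links",
          "WASHINGTON - The president said on Tuesday ...",
          "Advertisement",
          "Officials added that the plan would ..."]
labels = ["NotContent", "Content", "NotContent", "Content"]

for block, label in zip(blocks, labels):
    print(f"{label:>10}  {block}")
```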
Content Identification as Sequence Labeling In this work, the authors focus largely on the sequence labeling method. Shortcomings of boundary detection methods: A number of web pages contain ‘noncontiguous’ content. The paragraphs of the article body have other page content, such as advertisements, interspersed. Boundary detection methods aren’t able to nicely model transitions from Content to NotContent back to Content. If a boundary is not identified at all, an entire section of content can be missed. When developing a statistical classifier to identify boundaries, there are many, many more negative examples of boundaries than positive ones. It may be possible to sub-sample negative examples or identify some reasonable set of candidate boundaries and train a classifier on those, but this complicates matters greatly. 7
Models This section describes the three statistical sequence labeling models employed in the experiments: Conditional Random Fields (CRF) Maximum Entropy Classifiers (MaxEnt) Maximum Entropy Markov Models (MEMM) 8
Conditional Random Fields Let x = x_1, x_2, …, x_n be a sequence of observations, such as a sequence of words, paragraphs, or, as in our setting, a sequence of HTML “segments”. Given a set of possible output values (i.e., labels), sequence CRFs define the conditional probability of a label sequence y = y_1, y_2, …, y_n as shown below, where Z_x is a normalization term over all possible label sequences, i is the current position, and y_i and y_{i-1} are the current and previous labels. Often, the range of the feature functions f_k is {0,1}. Associated with each feature function f_k is a learned parameter λ_k that captures the strength and polarity of the correlation between the feature and the label transition from y_{i-1} to y_i. 9
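The equation itself appeared as an image on the original slide and is missing from this transcript; the standard linear-chain CRF form consistent with the description above is:

```latex
p_{\Lambda}(\mathbf{y}\mid\mathbf{x})
  = \frac{1}{Z_{\mathbf{x}}}
    \exp\!\Big(\sum_{i=1}^{n}\sum_{k}\lambda_k\, f_k(y_{i-1},\, y_i,\, \mathbf{x},\, i)\Big),
\qquad
Z_{\mathbf{x}}
  = \sum_{\mathbf{y}'}
    \exp\!\Big(\sum_{i=1}^{n}\sum_{k}\lambda_k\, f_k(y'_{i-1},\, y'_i,\, \mathbf{x},\, i)\Big)
```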
Conditional Random Fields Training D = {(y^(1), x^(1)), (y^(2), x^(2)), …, (y^(m), x^(m))} is a set of training data consisting of pairs of sequences. The model parameters are learned by maximizing the conditional log-likelihood of the training data, which is simply the sum of the log-probabilities assigned by the model to each label–observation sequence pair. The second term in the objective below is a Gaussian prior over the parameters (a regularization term), which helps the model overcome over-fitting. 10
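The objective was also shown as an image on the slide; the standard penalized conditional log-likelihood matching the description (sum of log-probabilities plus a Gaussian prior) is:

```latex
\mathcal{L}(\Lambda)
  = \sum_{j=1}^{m}\log p_{\Lambda}\big(\mathbf{y}^{(j)}\mid\mathbf{x}^{(j)}\big)
    \;-\; \sum_{k}\frac{\lambda_k^{2}}{2\sigma^{2}}
```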
Conditional Random Fields Decoding (Testing) Given a trained model, the problem of decoding in sequence CRFs involves finding the most likely label sequence for a given observation sequence. There are N^M possible label sequences, where N is the number of labels and M is the length of the sequence. Dynamic programming, specifically a variation on the Viterbi algorithm, can find the optimal sequence in time linear in the length of the sequence and quadratic in the number of possible labels. 11
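A minimal Viterbi sketch (not the authors' code) for a linear-chain model, assuming precomputed log-space potentials over (previous label, current label) pairs at each position; it runs in O(M·N²) time rather than enumerating all N^M sequences:

```python
import numpy as np

def viterbi(log_potentials):
    """Most likely label sequence for a linear-chain model.

    log_potentials: array of shape (M, N, N); entry [i, a, b] is the log-score of
    having label a at position i-1 and label b at position i. Position 0 uses row 0
    as a dummy "start" label by convention in this sketch.
    """
    M, N, _ = log_potentials.shape
    delta = log_potentials[0, 0, :].copy()            # best score of each label at position 0
    backptr = np.zeros((M, N), dtype=int)
    for i in range(1, M):
        scores = delta[:, None] + log_potentials[i]   # scores[a, b] for transition a -> b
        backptr[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]                      # trace back the best path
    for i in range(M - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    return list(reversed(path))

# Tiny usage example: 5 positions, 2 labels (0 = NotContent, 1 = Content).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 2, 2))))
```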
Maximum Entropy Classifier MaxEnt classifiers are conditional models that, given a set of parameters, produce a conditional multinomial distribution according to the formula below. In contrast with CRFs, MaxEnt models (and classifiers generally) are “state-less” and do not model any dependence between different positions in the sequence. 12
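The formula was an image on the slide; the standard maximum-entropy (multinomial logistic regression) form consistent with the description is:

```latex
p_{\Lambda}(y \mid \mathbf{x})
  = \frac{\exp\!\big(\sum_{k}\lambda_k\, f_k(\mathbf{x}, y)\big)}
         {\sum_{y'}\exp\!\big(\sum_{k}\lambda_k\, f_k(\mathbf{x}, y')\big)}
```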
Maximum Entropy Markov Models MEMMs model a state sequence just as CRFs do, but use a “local” training method rather than performing global inference over the sequence at each training iteration. Viterbi decoding is employed as with CRFs to find the best label sequence. MEMMs are prone to various biases that can reduce their accuracy, because they do not normalize over the entire sequence as CRFs do. 13
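To make the contrast concrete, a standard MEMM factorization with per-position (local) normalization is:

```latex
p(\mathbf{y}\mid\mathbf{x}) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, \mathbf{x}, i),
\qquad
p(y_i \mid y_{i-1}, \mathbf{x}, i)
  = \frac{\exp\!\big(\sum_{k}\lambda_k\, f_k(y_{i-1}, y_i, \mathbf{x}, i)\big)}
         {\sum_{y'}\exp\!\big(\sum_{k}\lambda_k\, f_k(y_{i-1}, y', \mathbf{x}, i)\big)}
```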
Data Harvesting and Annotation 1620 labeled documents from 27 of the sites. 388 distinct articles within the 1620 documents. This large number of duplicate and near duplicate documents introduced bias into our training set. 14
Data: Dividing into Blocks Sanitize the raw HTML and transform it into XHTML. Exclude all words inside style and script tags. Tokenize the document. Divide up the entire sequence of <lex> tokens into smaller sequences called blocks; divisions occur at every tag except the following: <a>, <span>, <strong>, etc. Wrap each block as tightly as possible with a <span>. Create features based on the material from the <span> tags in each block. There is a large skew of NotContent vs. Content blocks: 234,436 NotContent blocks and 24,388 Content blocks. 15
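A simplified sketch (not the authors' code, which operates on <lex> elements in the sanitized XHTML) of the block-division rule described above, assuming the document has already been flattened into a stream of tag markers and word tokens:

```python
# A new block starts at every tag except inline ones such as <a>, <span>, <strong>.
INLINE_TAGS = {"a", "span", "strong", "em", "b", "i"}

def divide_into_blocks(stream):
    """stream: list of ('tag', name) or ('word', text) items in document order."""
    blocks, current = [], []
    for kind, value in stream:
        if kind == "tag" and value not in INLINE_TAGS:
            if current:
                blocks.append(current)   # a non-inline tag closes the current block
            current = []
        elif kind == "word":
            current.append(value)
    if current:
        blocks.append(current)
    return blocks

# Example: the inline <a> run stays inside the surrounding <p> block.
stream = [("tag", "p"), ("word", "Officials"), ("word", "said"),
          ("tag", "a"), ("word", "Tuesday"), ("word", "."),
          ("tag", "div"), ("word", "Related"), ("word", "stories")]
print(divide_into_blocks(stream))
# [['Officials', 'said', 'Tuesday', '.'], ['Related', 'stories']]
```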
Experimental Setup 16
Experimental Setup Data Set Creation The authors ran four separate cross validation experiments to measure the bias introduced by duplicate articles and mixed sources. Duplicates, Mixed Sources. Split documents with 75% in the training set and 25% in the testing set. No Duplicates, Mixed Sources. This split prevented duplicate documents from spanning the training and testing boundary. Duplicates Allowed, Separate Sources. Create four separate bundles each containing 6 or 7 sources. Fill the bundles in round-robin fashion by selecting a source at random. Three bundles were assigned to training and one to testing. No Duplicates, Separate Sources. Ensure that no duplicates crossed the training/test set boundary. 17
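A small sketch, assuming 27 hypothetical source names, of the round-robin bundling used for the “Separate Sources” setups; each cross-validation fold then trains on three bundles and tests on the fourth:

```python
import random

def make_bundles(sources, n_bundles=4, seed=0):
    """Assign sources to bundles round-robin, picking a source at random each time."""
    rng = random.Random(seed)
    remaining = list(sources)
    rng.shuffle(remaining)                       # random selection order of sources
    bundles = [[] for _ in range(n_bundles)]
    for i, source in enumerate(remaining):
        bundles[i % n_bundles].append(source)    # round-robin over the four bundles
    return bundles

sources = [f"site{n:02d}" for n in range(1, 28)]   # 27 hypothetical news sources
bundles = make_bundles(sources)
print([len(b) for b in bundles])                   # bundle sizes, e.g. [7, 7, 7, 6]
```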
Results and Analysis Each of the three different models, CRF, MEMM and MaxEnt, is evaluated on each of the four different data setups. All results are a weighted average of four-fold cross validation. 18
Results and Analysis Feature Type Analysis Shows results for the CRF when individually removing each class of feature using the “No Duplicates; Separate Sources” data set. 19
Results and Analysis Contiguous vs. Non-Contiguous performance Document level results. 20
Results and Analysis Amount of Training Data Shows the results for the CRF with varying quantities of training data using the “No Duplicates; Separate Sources” experimental setup. 21
Results and Analysis Error Analysis NotContent blocks found interspersed within portions of Content were falsely labeled as Content. Sometimes section headers would be incorrectly singled out as NotContent. At the beginning and end of article boundaries, sometimes the first or last block would incorrectly be labeled NotContent. 22
Conclusion and Future Work Sequence labeling emerged as the clear winner, with CRF edging out MEMM. MaxEnt was only competitive on the easiest of the four data sets and is not a viable alternative to site-specific wrappers. Future work includes applying the techniques to additional data, including additional sources, different languages, and other types of data such as weblogs. Another interesting avenue: semi-Markov CRFs. 23
