1. 2-Layered HMMs for Search Interface Segmentation
Ritu Khare (under the supervision of Dr. Yuan An, Assistant Professor, iSchool)
2. Order of Presentation
- Background
  - Deep Web
  - What is Search Interface Understanding?
  - What is Interface Segmentation?
  - Why is Segmentation Challenging?
- Our Approach for Segmentation
  - Interface Representation
  - HMM: The Artificial Designer
  - 2-Layered Approach Architecture
- Experimentation
  - Parameters
  - Results
- Contributions
- Future Work
- References
3. Background: Deep Web
- What is the deep Web? The data that exists on the Web but is not returned by search engines through traditional crawling and indexing. The primary way to access this data is by filling out HTML forms on search interfaces.
- Characteristics [6]: a large proportion of structured databases, diversity of domains, and growing scale.
- Researchers have several goals for the deep Web:
  - design intra-domain meta-search engines [22, 8, 15, 5, 21];
  - increase content visibility on existing search engines [17, 12];
  - derive ontologies from search interfaces [1].
- A prerequisite for attaining these goals is an understanding of search interfaces (slide 4). In this project, we propose an approach to the segmentation portion (slide 5) of the search interface understanding problem.
4. Background: What is Search Interface Understanding?
Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves four stages:
- Representation: A suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radio button, checkbox, file input) that exists inside an HTML form.
- Parsing: Components are parsed into a suitable structure.
- Segmentation: The interface components are assigned semantic labels, and related components are grouped together. Questions such as "Which surrounding text is associated with which form element?" are also answered in this stage (in figure 2, "Gene ID" is associated with the textbox placed next to it).
- Segment-processing: Additional information about each segment component, such as domain, constraints, and data type, is extracted.
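The parsing stage can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` rather than a full DOM library, and relies on the fact that visiting tags in document order matches a depth-first, pre-order walk of the DOM tree. The class name `FormComponentParser`, the component kind names, and the sample form are hypothetical and are not taken from the slides.

```python
from html.parser import HTMLParser

# Hypothetical mapping from <input type=...> to the component kinds named on the slide.
INPUT_KINDS = {"text": "textbox", "radio": "radiobutton",
               "checkbox": "checkbox", "file": "file input"}


class FormComponentParser(HTMLParser):
    """Collects text and form elements found inside <form>, in document order."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.in_select = False
        self.components = []  # list of (kind, detail) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if tag == "form":
            self.in_form = True
        elif not self.in_form:
            return
        elif tag == "input":
            kind = INPUT_KINDS.get((attrs.get("type") or "text").lower())
            if kind:  # submit buttons, hidden fields, etc. are skipped
                self.components.append((kind, name))
        elif tag == "textarea":
            self.components.append(("textarea", name))
        elif tag == "select":
            self.in_select = True
            self.components.append(("selection list", name))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag == "select":
            self.in_select = False

    def handle_data(self, data):
        text = data.strip()
        # Option labels inside a selection list are not separate components here.
        if self.in_form and not self.in_select and text:
            self.components.append(("text", text))


html = """
<form>
  Gene ID <input type="text" name="gene_id">
  Gene Name
  <select name="op"><option>contains</option><option>equals</option></select>
  <input type="text" name="gene_name">
</form>
"""
parser = FormComponentParser()
parser.feed(html)
components = parser.components
```

The resulting `components` list is the kind of observation sequence that a later segmentation stage would label and group.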
5. What is Interface Segmentation?
This project focuses on segmentation, the third stage of this process. The figure shows a segmented interface in which related components are grouped together.
- The left segment has 7 components.
- The right segment has 4 components ("cM Position:", a selection list, a textbox, and the text 'e.g., "10.0-40.0"').
6. Why is Segmentation Challenging?
- From a user's (or designer's) standpoint: by looking at the visual arrangement of components, and drawing on past experience, the user draws a logical boundary around related components because they appear to belong to the same atomic query.
- A machine, on the other hand, is unable to "see" a segment, for the following reasons:
  - Components that are visually close to each other might be located far apart in the HTML source code.
  - A machine does not implicitly have any search experience that it can leverage to identify a segment boundary.
- This project investigates whether a machine can "learn" how to understand and segment an interface.
- Existing works have two shortcomings:
  - Some [9, 13, 17] do not group all related components together, i.e., they do not create complete segments.
  - Others [23, 7] use rules and heuristics to segment a search interface. These techniques have problems handling scalability and heterogeneity [10].
7. Our Approach for Segmentation
- We incorporate the first-hand implicit knowledge that a human designer is assumed to have used when designing an interface. This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9's slides for an introduction to HMMs).
- We view segmentation as a two-fold problem:
  - identification of the boundaries of logical attributes (slide 9);
  - assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.
8. Interface Representation
In the figure, each component of the lower segment is marked with a label, which we term a semantic label. The semantic label of a component denotes the meaning of the component from a user's or designer's standpoint.
[Figure labels: Search Entity; Logical Attribute (two); Attribute-name; Operator; Operand]
9. Interface Representation
- Attribute-name: denotes a criterion available for searching a particular entity, e.g., the entity "Genes" can be searched by "Gene ID" and by "Gene Name".
- Operand: An attribute-name is usually associated with one or more operands, the values entered by the user that are matched against the corresponding field values in the underlying database.
- Operator: The user may also be given the option of specifying an operator that further qualifies an operand.
- Filling out an HTML form is similar to writing SQL queries. Assuming the underlying database table is named "Gene", the SQL queries for the figure would be:
  SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344';
  SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie';
- Logical attribute: The predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components to create a composite semantic label called a logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
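The correspondence between a logical attribute and a WHERE predicate can be sketched as a small data structure. The class `LogicalAttribute`, the function `to_sql`, and the rule that turns "Gene ID" into the column `Gene_ID` are illustrative assumptions, not part of the presented method. The string building is for illustration only and is not a safe way to construct real queries.

```python
from dataclasses import dataclass


@dataclass
class LogicalAttribute:
    """One segment: an attribute-name, an operand, and an optional operator."""
    attribute_name: str
    operand: str
    operator: str = "="  # assumed default when the interface offers no operator

    def to_predicate(self):
        # Assumed naming rule: spaces in the label become underscores.
        column = self.attribute_name.strip().replace(" ", "_")
        value = self.operand.replace("'", "''")  # escape single quotes
        return f"{column} {self.operator} '{value}'"


def to_sql(table, attribute):
    """Build the SELECT statement that a filled-in segment corresponds to."""
    return f"SELECT * FROM {table} WHERE {attribute.to_predicate()};"


gene_id = LogicalAttribute("Gene ID", "PF11_0344")
gene_name = LogicalAttribute("Gene Name", "maggie", operator="LIKE")
queries = [to_sql("Gene", gene_id), to_sql("Gene", gene_name)]
```

Each `LogicalAttribute` instance plays the role of one segment: the components grouped in it jointly produce one predicate.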
10. HMM: The Artificial Designer
- We assume that an HMM can act like a human designer who is able to design an interface using acquired knowledge and to determine (decode) the segment boundaries and the semantic labels of components.
- The design process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind.
[Figure labels: Knowledge of Semantic Labels; Bag of Components; Search Interface; Designing; 2-Layered HMM (Artificial Designer); Segments & Tagged Components; Decoding]
11. HMM: The Artificial Designer
- While the components are observable, their semantic roles are hidden from a machine.
- One semantic label following another is analogous to a transition between HMM states.
- In the figure, ovals are states (semantic labels) and rectangles are emitted symbols (components).
- The designing ability is acquired by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for decoding, i.e., for explaining the design of a given search interface.
[Figure labels: states (ovals): Attribute Name, Operand, Operator, Attribute Name, Operand; emitted symbols (rectangles): Text (Gene ID), Textbox, Text (Gene Name), RadioButton Group, Textbox]
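Decoding can be illustrated with a toy tagging HMM and a plain Viterbi implementation. Every probability below is invented for the example and is not a trained value from this work. The symbol names are hypothetical, and the observation sequence is one plausible reading of the component sequence in the figure.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for a non-empty observation list."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ["attribute-name", "operator", "operand"]
# All numbers below are made up for illustration.
start_p = {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}
trans_p = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand": {"attribute-name": 0.6, "operator": 0.2, "operand": 0.2},
}
emit_p = {
    "attribute-name": {"text": 0.9, "textbox": 0.02, "radiobutton_group": 0.04, "selection_list": 0.04},
    "operator": {"text": 0.1, "textbox": 0.05, "radiobutton_group": 0.45, "selection_list": 0.4},
    "operand": {"text": 0.1, "textbox": 0.6, "radiobutton_group": 0.1, "selection_list": 0.2},
}

observed = ["text", "textbox", "text", "radiobutton_group", "textbox"]
tags = viterbi(observed, states, start_p, trans_p, emit_p)
# tags == ["attribute-name", "operand", "attribute-name", "operator", "operand"]
```

Working in log space avoids numeric underflow on longer component sequences. A zero probability maps to negative infinity, so a path through an unseen event is never selected when an alternative exists.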
12. 2-Layered HMM
The decoding problem that we address in this paper is two-fold: it involves segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with two layers.
- The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand).
- The second layer, S-HMM, segments the interface into logical attributes.
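How the outputs of the two layers might be combined can be sketched as follows. The slides do not say what the S-HMM states are, so the sketch assumes that the second layer emits a boundary label ("begin" or "inside") for each component. Both predicted sequences are hard-coded stand-ins for real model output, and `group_segments` is a hypothetical helper.

```python
def group_segments(components, tags, boundaries):
    """Group tagged components into segments using begin/inside boundary labels."""
    if not (len(components) == len(tags) == len(boundaries)):
        raise ValueError("all three sequences must have the same length")
    segments = []
    for component, tag, boundary in zip(components, tags, boundaries):
        # A "begin" label (or the very first component) opens a new segment.
        if boundary == "begin" or not segments:
            segments.append([])
        segments[-1].append((component, tag))
    return segments


components = ["Text (Gene ID)", "Textbox", "Text (Gene Name)", "RadioButton Group", "Textbox"]
# Stand-in for the first layer's (T-HMM) predicted semantic labels.
tags = ["attribute-name", "operand", "attribute-name", "operator", "operand"]
# Stand-in for the second layer's (S-HMM) output; the label names are assumed.
boundaries = ["begin", "inside", "begin", "inside", "inside"]

segments = group_segments(components, tags, boundaries)
for number, segment in enumerate(segments, start=1):
    print(f"logical attribute {number}: {segment}")
```

With these inputs the five components fall into two segments, one per logical attribute.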
13. 2-Layered Approach Architecture
[Figure: architecture diagram. Labels: DOM-tree parsing; training interfaces; test interfaces; T-HMM training and S-HMM training (manually tagged state sequences, T-HMM specs, S-HMM specs); T-HMM testing and S-HMM testing (predicted state sequences)]
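The training boxes in the diagram turn manually tagged state sequences into model specifications. A minimal sketch of one way to do this is supervised maximum-likelihood estimation, where each probability is a relative frequency counted from the tagged data. The function names and the two tiny training sequences are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train_supervised_hmm(tagged_sequences):
    """Estimate start, transition, and emission tables from (symbol, state) pairs."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sequence in tagged_sequences:
        if not sequence:
            continue
        start[sequence[0][1]] += 1
        for symbol, state in sequence:
            emit[state][symbol] += 1
        for (_, prev), (_, nxt) in zip(sequence, sequence[1:]):
            trans[prev][nxt] += 1
    return (normalize(start),
            {state: normalize(counts) for state, counts in trans.items()},
            {state: normalize(counts) for state, counts in emit.items()})


# Two invented, manually tagged interfaces.
training = [
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("radiobutton", "operator"), ("textbox", "operand")],
]
start_p, trans_p, emit_p = train_supervised_hmm(training)
```

Events that never occur in the training data receive no entry, which is equivalent to a probability of zero.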
14. Experimentation Parameters
- Data set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/
- Parsing: DOM trees [3] of components. The trees were traversed in depth-first order.
- Training and testing data: The examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each with 190 training and 10 testing examples.
- Training and testing algorithms: In both layers, training and testing were performed using the maximum likelihood method and the Viterbi algorithm, respectively.
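The data split described above can be sketched as follows. The function name `make_folds`, the fixed seed, and the use of integers as stand-ins for the 200 interfaces are assumptions made for the example.

```python
import random


def make_folds(examples, n_folds=20, seed=0):
    """Yield (train, test) pairs from n_folds equal-sized random folds."""
    if len(examples) % n_folds != 0:
        raise ValueError("examples must divide evenly into folds")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    size = len(shuffled) // n_folds
    for i in range(n_folds):
        test = shuffled[i * size:(i + 1) * size]
        train = shuffled[:i * size] + shuffled[(i + 1) * size:]
        yield train, test


interfaces = list(range(200))  # stand-ins for the 200 parsed interfaces
folds = list(make_folds(interfaces))
```

Each interface appears in exactly one test set, so the 20 experiments together evaluate every interface once.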
15. Results
16. Contributions
- We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups together components belonging to the same logical attribute.
- We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces.
- The interface is represented in terms of the underlying database, which helped in extracting database querying semantics.
- We tested our method on a less-explored domain (biology) and found promising results.
17. Future Work
- To recover the schema of deep Web databases by extracting finer details, such as the data type and constraints of each logical attribute.
- To do justice to the balanced domain distribution of the deep Web [6], we want to test this method on interfaces from other less-explored domains.
- To improve the degree of automation, we want to investigate the use of the Baum-Welch training algorithm.
- To minimize zero emission probabilities, we want to investigate the use of Synset-HMM [20].
18. References
[1] Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.
[2] Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 31-36.
[3] Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.
[4] Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.
[5] He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California. 217-228.
[6] He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.
[7] He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.
[8] He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-integrator. The VLDB Journal, 13(3), 256-273.
[9] Kaljuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong, Hong Kong.
[10] Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.
[11] Kushmerick, N. (2003). Learning to invoke web forms. On the move to meaningful internet systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.
[12] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
19. References
[13] Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, 1(1), 684-694. Auckland, New Zealand.
[14] Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.
[15] Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia. 46-55.
[16] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
[17] Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 129-138.
[18] Russell, S. J., & Norvig, P. (2002). Artificial intelligence: A modern approach. Prentice Hall.
[19] Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 37-42.
[20] Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, simulation and optimization of complex processes (pp. 575-585). Springer Berlin Heidelberg.
[21] Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.
[22] Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 95-106.
[23] Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 107-118.
[24] Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence, 2006 (WI 2006), Hong Kong, China. 709-718.
20. Thank You
Questions, comments, ideas?

Two Layered HMMs for Search Interface Segmentation

  • 1. 2-Layered HMMs for Search Interface Segmentation. Ritu Khare (under the supervision of Dr. Yuan An, Assistant Professor, iSchool)
  • 2. Order of Presentation. Background: Deep Web; What is Search Interface Understanding?; What is Interface Segmentation?; Why is Segmentation Challenging? Our Approach for Segmentation: Interface Representation; HMM: The Artificial Designer; 2-Layered Approach Architecture. Experimentation: Parameters; Result. Contributions. Future Work. References.
  • 3. Background: Deep Web. What is the deep Web? It is the data that exists on the Web but is not returned by search engines through traditional crawling and indexing. The primary way to access this data is by filling in HTML forms on search interfaces. Characteristics [6]: a large proportion of structured databases, diversity of domains, and growing scale. Researchers have many goals for the deep Web: design intra-domain meta-search engines [22, 8, 15, 5, 21]; increase content visibility on existing search engines [17, 12]; derive ontologies from search interfaces [1]. A prerequisite for attaining these goals is an understanding of the search interfaces (slide 4). In this project, we propose an approach that addresses the segmentation portion (slide 5) of the problem of search interface understanding.
  • 4. Background: What is Search Interface Understanding? Understanding the semantics of a search interface (shown in the figure) is an intricate process [4]. It involves 4 stages. Representation: a suitable interface representation scheme is chosen, and the semantic labels (slide 8) to be assigned to interface components are decided. An interface component is any text or HTML form element (textbox, textarea, selection list, radio button, checkbox, file input) that exists inside an HTML form. Parsing: components are parsed into a suitable structure. Segmentation: the interface components are assigned semantic labels, and related components are grouped together. Questions such as “Which surrounding text is associated with which form element?” are also answered in this stage (in figure 2, “Gene ID” is associated with the textbox placed next to it). Segment-processing: additional information about each segment component, such as domain, constraints, and data type, is extracted.
  • 5. What is Interface Segmentation? This project focuses on segmentation, the 3rd stage of this process. The figure shows a segmented interface in which related components are grouped together. The left segment has 7 components. The right segment has 4 components (“cM Position:”, a selection list, a textbox, and “e.g., 10.0-40.0”).
  • 6. Why is Segmentation Challenging? From a user’s (or designer’s) standpoint, segmentation is easy: by looking at the visual arrangement of components, and based on past experience, the user draws a logical boundary around related components because they appear to belong to the same atomic query. A machine, on the other hand, is unable to “see” a segment for the following reasons: components that are visually close to each other might be located far apart in the HTML source code; and a machine does not implicitly have any search experience that can be leveraged to identify a segment boundary. This project aims to investigate whether a machine can “learn” how to understand and segment an interface. Existing works have two shortcomings: some [9, 13, 17] do not group all related components together, i.e., they do not create complete segments; others [23, 7] use rules and heuristics to segment a search interface, and these techniques have problems handling scalability and heterogeneity [10].
  • 7. Our Approach for Segmentation. We incorporate the first-hand implicit knowledge that a human designer is assumed to have used when designing an interface. This is accomplished by building an artificial designer using Hidden Markov Models (refer to week 9’s slides on HMM introduction). We view segmentation as a twofold problem: (1) identification of the boundaries of logical attributes (slide 9); (2) assignment of semantic labels (attribute-name, operator, and operand, described in slide 9) to interface components.
  • 8. Interface Representation. In the figure, each component of the lower segment is marked with a label, which we term a semantic label. The semantic label of a component denotes the meaning of the component from a user’s or designer’s standpoint. Figure labels: Search Entity; Logical Attribute; Operand; Operator; Attribute-name.
  • 9. Interface Representation. Attribute-name: denotes the criteria available for searching a particular entity, e.g., the entity “Genes” can be searched by “Gene ID” and by “Gene Name”. Operand: an attribute-name is usually associated with one or more operands, the values entered by the user that are matched against the corresponding field values in the underlying database. Operator: the user may also be given the option of specifying an operator that further qualifies an operand. Filling in an HTML form is similar to writing SQL queries. Assuming the underlying database table is named “Gene”, the SQL queries for the figure would be: SELECT * FROM Gene WHERE Gene_ID = 'PF11_0344'; SELECT * FROM Gene WHERE Gene_Name LIKE 'maggie'; Logical Attribute: the predicate in the WHERE clause of each query is created by a group of related components. We combine the semantic roles (attribute-name, operator(s), and operand(s)) of these components to create a composite semantic label called a logical attribute. Our approach assumes that a segment corresponds to a logical attribute.
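The correspondence between a logical attribute and a WHERE-clause predicate can be made concrete with a small Python sketch. The class `LogicalAttribute`, its field names, the default `=` operator, and the rule that turns “Gene ID” into the column `Gene_ID` are all assumptions made for illustration; the slides do not define a data structure. String building is used only to show the mapping; real code should use parameterized queries.

```python
from dataclasses import dataclass


@dataclass
class LogicalAttribute:
    """One segment: an attribute-name plus its operator and operand (hypothetical structure)."""
    attribute_name: str   # text component, e.g. "Gene ID"
    operand: str          # value the user typed or selected
    operator: str = "="   # assumed default when the form offers no operator choice

    def to_predicate(self) -> str:
        # Assumed naming rule: spaces in the label become underscores in the column name.
        column = self.attribute_name.strip().replace(" ", "_")
        value = self.operand.replace("'", "''")  # escape single quotes for display only
        return f"{column} {self.operator} '{value}'"


def to_sql(table: str, attr: LogicalAttribute) -> str:
    """Render the query that filling in this single segment would correspond to."""
    return f"SELECT * FROM {table} WHERE {attr.to_predicate()};"


print(to_sql("Gene", LogicalAttribute("Gene ID", "PF11_0344")))
print(to_sql("Gene", LogicalAttribute("Gene Name", "maggie", operator="LIKE")))
```

The two printed statements mirror the two queries on the slide, one per logical attribute.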
  • 10. HMM: The Artificial Designer. We assume that an HMM can act like a human designer who has the ability to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components. The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the Web page while keeping the semantic role (attribute-name, operand, or operator) of the component in mind. Figure labels: Knowledge of Semantic Labels; Bag of Components; Designing; Search Interface; 2-Layered HMM (Artificial Designer); Decoding; Segments & Tagged Components.
  • 11. HMM: The Artificial Designer. While the components are observable, their semantic roles are hidden from a machine. The succession of one semantic label by another is similar to the transitioning of HMM states. In the figure, ovals are states (semantic labels) and rectangles are emitted symbols (components). The designing ability is provided by training the HMM with suitable algorithms. Once an HMM is trained, it can be used for the decoding process, i.e., for explaining the design of a given search interface. Figure labels: states: Attribute Name, Operand, Operator; emitted symbols: Text (Gene ID), Textbox, Text (Gene Name), RadioButton Group, Textbox.
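A minimal sketch of training and decoding such an HMM in Python follows. It assumes supervised training by counting over manually tagged sequences (a maximum-likelihood estimate) and decoding with the standard Viterbi algorithm. The two toy training sequences, the symbol names (`text`, `textbox`, `radio_group`, `select`), and the function names are invented for illustration and are not taken from the authors' data or code.

```python
import math
from collections import Counter, defaultdict


def train_mle(tagged_sequences):
    """Estimate start, transition, and emission probabilities by relative frequency."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        prev = None
        for symbol, state in seq:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][symbol] += 1
            prev = state

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


def viterbi(symbols, start, trans, emit):
    """Return the most likely state sequence for the observed symbols (log-space)."""
    states = list(emit)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (log(start.get(s, 0)) + log(emit[s].get(symbols[0], 0)), None)
             for s in states}]
    for sym in symbols[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log(trans.get(p, {}).get(s, 0)), p) for p in states
            )
            column[s] = (score + log(emit[s].get(sym, 0)), prev)
        best.append(column)

    # Backtrack from the best final state through the stored previous states.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Toy, hand-tagged interfaces: (component, semantic label) pairs. Purely illustrative.
training = [
    [("text", "attribute-name"), ("textbox", "operand"),
     ("text", "attribute-name"), ("radio_group", "operator"), ("textbox", "operand")],
    [("text", "attribute-name"), ("select", "operator"), ("textbox", "operand"),
     ("text", "attribute-name"), ("textbox", "operand")],
]
model = train_mle(training)
print(viterbi(["text", "radio_group", "textbox", "text", "textbox"], *model))
# ['attribute-name', 'operator', 'operand', 'attribute-name', 'operand']
```

Note that a component never seen with a given label gets emission probability zero under plain counting, so an unseen component type can make every path impossible; the sketch does no smoothing.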
  • 12. 2-Layered HMM. The decoding problem that we address is twofold, involving segmentation as well as the assignment of semantic labels to components. Hence, we employ a layered HMM [14] with 2 layers. The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand). The second layer, S-HMM, segments the interface into logical attributes.
  • 13. 2-Layered Approach Architecture (figure). Interfaces are first parsed into DOM trees. T-HMM training takes the training interfaces and their manually tagged state sequences and produces the T-HMM specs; T-HMM testing applies these specs to the test interfaces and outputs predicted state sequences. S-HMM training takes manually tagged state sequences and produces the S-HMM specs; S-HMM testing applies them to the test interfaces and outputs predicted state sequences.
  • 14. Experimentation Parameters. Data set: 200 interfaces (NAR collection), http://www3.oup.co.uk/nar/database/c/. Parsing: DOM trees [3] of components; the trees were traversed in depth-first order. Testing and training data: the examples were randomly divided into 20 equal-sized sets. We conducted 20 experiments, each with 190 training and 10 testing examples. Testing and training algorithms: in both layers, training was performed using the maximum-likelihood method and testing using the Viterbi algorithm.
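The data split described above, 20 equal-sized sets with each set serving once as the test set, can be sketched in Python as follows. The function names, the fixed random seed, and the use of integer IDs in place of real interfaces are assumptions for illustration; the slides do not say how the random division was implemented.

```python
import random


def make_folds(items, n_folds=20, seed=0):
    """Shuffle the items and cut them into n_folds equal-sized sets.

    Assumes len(items) is divisible by n_folds, as with 200 interfaces and 20 folds.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    size = len(items) // n_folds
    return [items[i * size:(i + 1) * size] for i in range(n_folds)]


def experiments(items, n_folds=20, seed=0):
    """Yield one (training set, test set) pair per fold."""
    folds = make_folds(items, n_folds, seed)
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


interface_ids = range(200)  # stand-ins for the 200 interfaces
splits = list(experiments(interface_ids))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 20 190 10
```

Each interface appears in exactly one test set, so every example is predicted once by a model that did not see it during training.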
  • 16. Contributions. We studied a challenging stage (segmentation) of the process of search interface understanding. In the context of the deep Web, this is the third formal empirical study (after [23] and [7]) that groups together components belonging to the same logical attribute. We incorporated the first-hand knowledge of the designer for interface segmentation and component tagging. To the best of our knowledge, this is the first work to apply HMMs to deep Web search interfaces. The interface is represented in terms of the underlying database, which helped in extracting database querying semantics. Moreover, we tested our method on a less-explored domain (biology) and found promising results.
  • 17. Future Work. To recover the schema of deep Web databases, we want to extract finer details such as the data type and constraints of each logical attribute. To do justice to the balanced domain distribution of the deep Web [6], we want to test this method on interfaces from other less-explored domains. To improve the degree of automation, we want to investigate the use of the Baum-Welch training algorithm. To minimize zero emission probabilities, we want to investigate the use of Synset-HMM [20].
  • 18. References
      [1] Benslimane, S. M., Malki, M., Rahmouni, M. K., & Benslimane, D. (2007). Extracting personalised ontology from data-intensive web application: An HTML forms-based reverse engineering approach. Informatica, 18(4), 511-534.
      [2] Freitag, D., & McCallum, A. K. (1999). Information extraction with HMMs and shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 31-36.
      [3] Gupta, S., Kaiser, G. E., Grimm, P., Chiang, M. F., & Starren, J. (2005). Automating content extraction of HTML documents. World Wide Web, 8(2), 179-224.
      [4] Halevy, A. Y. (2005). Why your data won't mix: Semantic heterogeneity. Queue, 3, 50-58.
      [5] He, B., & Chang, K. C. (2003). Statistical schema matching across web query interfaces. 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California. 217-228.
      [6] He, B., Patel, M., Zhang, Z., & Chang, K. C. (2007a). Accessing the deep web. Communications of the ACM, 50(5), 94-101.
      [7] He, H., Meng, W., Lu, Y., Yu, C., & Wu, Z. (2007b). Towards deeper understanding of the search interfaces of the deep web. World Wide Web, 10(2), 133-155.
      [8] He, H., Meng, W., Yu, C., & Wu, Z. (2004). Automatic integration of web search interfaces with WISE-integrator. The VLDB Journal: The International Journal on Very Large Data Bases, 13(3), 256-273.
      [9] Kaljuvee, O., Buyukkokten, O., Garcia-Molina, H., & Paepcke, A. (2001). Efficient web form entry on PDAs. Proceedings of the 10th International Conference on World Wide Web, Hong Kong, Hong Kong.
      [10] Kushmerick, N. (2002). Finite-state approaches to web information extraction. 3rd Summer Convention on Information Extraction, 77-91.
      [11] Kushmerick, N. (2003). Learning to invoke web forms. On the move to meaningful internet systems 2003 (pp. 997-1013). Springer Berlin / Heidelberg.
      [12] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., & Halevy, A. Y. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
  • 19. References
      [13] Nguyen, H., Nguyen, T., & Freire, J. (2008). Learning to extract form labels. Proceedings of the VLDB Endowment, Auckland, New Zealand, 1(1), 684-694.
      [14] Oliver, N., Garg, A., & Horvitz, E. (2004). Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96(2), 163-180.
      [15] Pei, J., Hong, J., & Bell, D. (2006). A robust approach to schema matching over web query interfaces. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW'06), Atlanta, Georgia. 46-55.
      [16] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
      [17] Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden web. Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 129-138.
      [18] Russell, S. J., & Norvig, P. (2002). Artificial intelligence: A modern approach. Prentice Hall.
      [19] Seymore, K., McCallum, A. K., & Rosenfeld, R. (1999). Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, Florida. 37-42.
      [20] Tran-Le, M. S., Vo-Dang, T. T., Ho-Van, Q., & Dang, T. K. (2008). Automatic information extraction from the web: An HMM-based approach. Modeling, simulation and optimization of complex processes (pp. 575-585). Springer Berlin Heidelberg.
      [21] Wang, J., Wen, J., Lochovsky, F., & Ma, W. (2004). Instance-based schema matching for web databases by domain-specific query probing. Thirtieth International Conference on Very Large Data Bases, 30, 408-419.
      [22] Wu, W., Yu, C., Doan, A., & Meng, W. (2004). An interactive clustering-based approach to integrating source query interfaces on the deep web. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 95-106.
      [23] Zhang, Z., He, B., & Chang, K. C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France. 107-118.
      [24] Zhong, P., & Chen, J. (2006). A generalized hidden Markov model approach for web information extraction. Web Intelligence, 2006 (WI 2006), Hong Kong, China. 709-718.
  • 20. Thank You. Questions, comments, ideas?