Dissertation Defense Presentation
1. A Framework for Mapping User-designed Forms to Relational Databases
Dissertation Presentation
November 15, 2011
Ritu Khare
COMMITTEE:
Dr. Yuan An (Chair)
Dr. Jiexun Jason Li
Dr. Il-Yeol Song
Dr. Min Song
Dr. Christopher C. Yang
2. Presentation Order
1. Motivation
2. Problems
3. Solutions
4. Evaluation
5. Final Remarks
4. General Motivation: Database Usability (Sawyer, 1995)
Enable users to SEARCH and QUERY databases
  Information Retrieval Techniques (Liu et al., 2006; Hristidis et al., 2003; Catarci, 2000; Jayapandian and Jagadish, 2006)
Enable users to DESIGN databases (Jagadish et al., 2007)
  Form-based DIY and WYSIWYG paradigms
  FormAssembly, ZohoCreator, GoogleForms
Databases still remain unusable from the integration point of view (Gurses et al., 2009)
5. Precise Motivation: Integration of New Needs
New needs related to patient’s social habits
1) Building of new forms
2) Integration of new form into back-end
6. Research Objective
To develop a mechanism to automatically map and integrate a user-designed form into an existing structured database.
Assume that a user-designed form is already acquired.
Seek a framework that
  merges the semantically matching elements between forms and databases.
  creates new database elements corresponding to the unmatched form elements.
8. Problem #1: Form Understanding
A form template represents the semantic intentions of the designer.
Automatic extraction of the form semantics
  A machine can only read the syntactic patterns of form elements. A certain layout pattern cannot be associated with a semantic intention.
Existing Work
  Focus on search forms (Benslimane et al., 2007; Kaljuviee et al., 2001)
    shorter and simpler than the data-entry forms (empirical finding)
  Rules and heuristics (Zhang et al., 2004; He et al., 2007)
    not likely to circumvent the ever-broadening varieties in form topologies
9. Problem #2: Correspondence Discovery
Detect semantically matching elements between a form and an existing database
Challenges
  Variety of terms to denote the same concepts.
  Variety of concepts denoted by similar terms.
  Identify and eliminate the invalid correspondences.
Existing Work
  Schema and Ontology Mapping (Madhavan et al., 2001; Euzenat and Shvaiko, 2005; Rahm and Bernstein, 2001; An et al., 2005; An et al., 2006)
    Mostly semi-automatic
    Not applicable to form-to-database correspondence discovery
  Heterogeneity between forms and databases
  Correspondences are to be used for evolving the database; the discovery process has to take this requirement into consideration.
10. Problem #3: Form Integration
Problem #3a: Merging
  Merging into an existing database so that the same concept is not duplicated and the database remains compact.
  Merging increases the potential of having NULL values, i.e., a less optimized database.
  Judicious Decisions
Existing Work
  Form integration (Yang et al., 2008)
    largely manual
    expose the users to the technical details of the underlying data model.
  Database integration (Yang et al., 2003)
    provide guidelines.
11. Problem #3: Form Integration
Problem #3b: Birthing
  Extend the database for the unmatched form elements
  How to automatically derive the functional dependencies among the form elements?
  How to translate the complex form patterns?
  How to evaluate multiple design alternatives & pick one?
Existing Work: Form-based database design
  Several methods (Choobineh et al., 1988; Pavicevic et al., 2006; Choobineh and Venkatraman, 1992; Deklarit, 2008) and commercial tools (FormAssembly, Google Forms, ZohoCreator, Wufoo)
  No empirical evaluation of the resultant databases
  Few focus on designing a database with certain desirable properties, e.g., expressiveness (Yang et al., 2008; Choobineh et al., 1988; Lukovic et al., 2007).
  These properties do not reflect any compliance with the form semantics and are inadequate for evaluating the mapping process.
12. Research Questions and System Goals
Research Questions:
1. Form Understanding
  A model to capture the form semantics
  Extract this model from a given form
2. Correspondence Discovery
  Determine semantically equivalent elements b/w form & database
  Incorporate DB evolution requirement during discovery process
3. Form Integration
  Resolve merging conflicts while maintaining the original form semantics
  Given a form pattern, derive a relational database with “desirable” properties
System Goals:
1. To evolve a DB that is high-quality and optimized as per the form semantics, i.e., compliant to the principles (Wang and Strong, 1996; Ramakrishnan and Gehrke, 2002; Silberschatz et al., 2001; Batini and Scannapieco, 2006):
  Completeness: All form elements represented in database
  Correctness: Form semantics retained
  Compactness: Equivalent elements merged
  Normalization: 3NF w.r.t. form’s functional dependencies
  Minimize NULL values in FKs and descriptive attributes
2. To ensure minimalism in the required user intervention
14. Form Representation: Form Tree
The form tree accurately captures the designer's intentions, and
hence the semantic associations among the form elements.
Inspired by hierarchical modeling of forms in existing works
(Dragut et al. 2009, Wu et al. 2009)
15. Framework Outline
Form
  -> Form Understanding and Semantics Extraction
  -> Form Tree
  -> Correspondence Discovery and Validation
  -> Form Tree with Discovered Correspondences
  -> Database Design and Evolution
  -> Database
17. Method 1a: Form Tree Generation
1. Tag and Segment Phase
2. Derive Tree Phase (5 rules)
The approach leverages the probabilistic nature of form design and develops a 2-layered Hidden Markov Model (HMM) based artificial designer that has the ability to understand the semantics of any arbitrarily designed form.
T-HMM: Tagging HMM
S-HMM: Segmentation HMM
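As an illustration of how an HMM can tag a sequence of form elements, here is a minimal Viterbi decoding sketch. It is not the dissertation's T-HMM: the two states, the observation symbols, and all probabilities are hypothetical toy values chosen for the example, and a real model would learn them from tagged forms.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))


# Hypothetical toy model: hidden states are semantic tags,
# observations are the kinds of elements read off the form in order.
STATES = ["label", "field"]
START_P = {"label": 0.8, "field": 0.2}
TRANS_P = {"label": {"label": 0.3, "field": 0.7},
           "field": {"label": 0.6, "field": 0.4}}
EMIT_P = {"label": {"text": 0.9, "textbox": 0.1},
          "field": {"text": 0.1, "textbox": 0.9}}

tags = viterbi(["text", "textbox", "text", "textbox"],
               STATES, START_P, TRANS_P, EMIT_P)
print(tags)
```

Log-probabilities are used so that long forms do not underflow; the toy tables contain no zero entries, which keeps `math.log` well defined.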
18. Method 1b: Form Term Annotation
Refine semantics by annotating terms
Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), comprising 360,000 concepts belonging to various semantic categories.
  ConceptID   Description         Semantic Category
  0231832     Respiratory Rate    Observable Entity
  362508001   Both eyes, entire   Body Structure
Challenge: The same form term can be specified in multiple contexts, i.e., semantic categories. The key is to identify the semantic category for a given term.
We hypothesize that the term context can be derived from the structure of the form tree.
19. Method 1b: Form Term Annotation
Inputs: Form Tree, Term
Form Structure Analyzer -> Classification Model -> SNOMED CT semantic category
Choose the best match concept from this category (SNOMED CT search service) -> SNOMED CT Concept
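The two-step idea on this slide (predict a semantic category, then pick the best concept inside it) can be sketched as follows. This is an assumption-laden illustration: the category is passed in rather than predicted, plain string similarity from Python's `difflib` stands in for the SNOMED CT search service, and the candidate list mixes the two rows shown on the previous slide with one hypothetical entry.

```python
from difflib import SequenceMatcher

# Candidate concepts as a search service might return them.
# The first two come from the slide; the third is a hypothetical placeholder.
CANDIDATES = [
    {"concept": "Respiratory Rate", "category": "Observable Entity"},
    {"concept": "Both eyes, entire", "category": "Body Structure"},
    {"concept": "Eye examination", "category": "Procedure"},  # hypothetical
]


def choose_concept(term, predicted_category, candidates):
    """Keep candidates in the predicted category and return the closest one."""
    in_category = [c for c in candidates if c["category"] == predicted_category]
    if not in_category:
        return None  # nothing to map to in this category

    def similarity(candidate):
        return SequenceMatcher(None, term.lower(),
                               candidate["concept"].lower()).ratio()

    return max(in_category, key=similarity)["concept"]


# The category would come from the classification model; here it is given.
print(choose_concept("eyes", "Body Structure", CANDIDATES))
```

Filtering by category first is what lets an ambiguous term such as "eyes" resolve to a body structure rather than a procedure when the form tree context points that way.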
20. Method 2: Correspondence Discovery and Validation
1. Linguistic Matching
2. Exact Concept Matching
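21. Method 2: Validation Algorithm
Total Heuristics = 4
[Figure: example form-tree fragments compared with database elements. Recoverable labels: Past Medical History, History, Family Hx, History of present Illness (HPI), Meds, Medications, SocialHistory, with X marks on some correspondences; a radio group under Oral Hygiene / Appetite (good, poor) shown next to a look-up table with columns Id and Options (1 Good, 2 Fair, 3 Poor).]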
23. Method 3a: Birthing Algorithm
Total Patterns = 12
Principles: High Quality (Complete, Correct, Compact, Normalized) and Optimization (minimize NULLs)
Traverses the form tree in depth-first order
Example patterns: Textbox Pattern, Radiobutton Pattern, Extended RB Pattern, Category/subcategory Pattern
[Figure annotations: M:1; Tj.ID -> Tj.c]
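A depth-first traversal that turns form patterns into database elements could look roughly like the sketch below. It covers only three simplified patterns and the translation rules are assumptions made for illustration (a category becomes a table with a foreign key to its parent, a textbox becomes a column, a radio group becomes a look-up table referenced by foreign key); the actual algorithm defines 12 patterns that are not reproduced here. The `Node` class and the schema dictionary layout are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # "category", "textbox", or "radio"
    options: list = field(default_factory=list)
    children: list = field(default_factory=list)


def birth(node, parent_table=None, schema=None):
    """Walk the form tree depth first and collect tables, columns, and FKs.

    Assumes the root node is a category, so every field has a parent table.
    """
    if schema is None:
        schema = {}
    if node.kind == "category":
        table = schema.setdefault(
            node.name, {"columns": ["id"], "fks": [], "values": []})
        if parent_table is not None:
            table["fks"].append(parent_table)  # subcategory references its parent
        for child in node.children:
            birth(child, node.name, schema)
    elif node.kind == "textbox":
        schema[parent_table]["columns"].append(node.name)
    elif node.kind == "radio":
        lookup = node.name + "_lookup"
        schema[lookup] = {"columns": ["id", "option"], "fks": [],
                          "values": list(node.options)}
        schema[parent_table]["fks"].append(lookup)  # many rows share one option
    return schema


form = Node("visit", "category", children=[
    Node("complaint", "textbox"),
    Node("appetite", "radio", options=["Good", "Fair", "Poor"]),
    Node("social_history", "category", children=[Node("smoking", "textbox")]),
])
schema = birth(form)
for table_name, table in schema.items():
    print(table_name, table)
```

Storing radio options in a look-up table rather than as a nullable text column is one way to keep the result normalized, which matches the principles listed on the slide, though the real pattern rules may differ.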
26. Method 3b: Merging Algorithm
Tot. merging scenarios = 8
Each merger involves a trade-off between compactness and optimization (min. NULL values) principles.
Compactness Factor (CF): A configurable value in (0,1) that indicates the weightage given to compactness.
Null Value Ratio (NVR): A calculated value that indicates the potential of having NULL values in a given table.
Example (New DB merged with Existing DB): NVR = 2/5 = 0.4
  Case a: CF = 0.5 (CF > NVR): Final DB is More Compact
  Case b: CF = 0.3 (CF <= NVR): Final DB is More Optimized
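The merge decision in the example reduces to a comparison between CF and NVR, which can be sketched as below. The slide does not say how NVR is calculated beyond the 2/5 example, so the helper that divides two counts is an assumption, as are the function names.

```python
def null_value_ratio(potential_nulls, total):
    """Assumed form of NVR: share of entries that could end up NULL after a merge."""
    if total <= 0:
        raise ValueError("total must be positive")
    return potential_nulls / total


def should_merge(cf, nvr):
    """Merge (favor compactness) only when CF exceeds NVR, as in cases a and b."""
    if not 0 < cf < 1:
        raise ValueError("CF is expected to lie in the open interval (0, 1)")
    return cf > nvr


nvr = null_value_ratio(2, 5)            # the slide's example: 2/5 = 0.4
print(should_merge(0.5, nvr))           # case a: CF > NVR, merge, more compact
print(should_merge(0.3, nvr))           # case b: CF <= NVR, keep apart, more optimized
```

A higher CF makes the system more willing to accept potential NULL values in exchange for fewer duplicated concepts.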
28. System Goals: Principle Compliance & Min. Interventions
Evaluation Goals:
A. How well does the system meet the goals?
B. Impact of the framework in accomplishing the goals?
Implementation: Java, Tomcat, MySQL Server, yFiles, JSP
Experimental settings per framework component:
  HMM-based tree extraction (output: Form Tree): EM & Viterbi, cross-validation
  SNOMED CT Term Annotation: Naïve Bayes Classifier, Top-4 classes, SnAPI, cross-validation per dataset
  Corr. Discovery: Linguistic Similarity = Lucene’s Default Settings
  Validation Algorithm (output: Form Tree with Discovered Correspondences)
  Birthing Algorithm
  Merging Algorithm: CF = 0.7 (output: Database)
29. Data
(52 real world forms from 6 medical institutions)
Healthcare: Forms are prevalent, and information systems are unusable and inflexible.
  Dataset                                            Avg. Terms   Avg. Inputs   SNOMED CT Mappability
  1 Walk in clinic encounter forms (3 forms)         32.33        49.33         75.77%
  2 Nursing patient admission forms (6 forms)        17.17        33            63.98%
  3 Labor & delivery DB data-entry forms (7 forms)   16.14        37.29         58.8%
  4 Adult visit encounter forms (18 forms)           47.83        65.22         56.2%
  5 Family practice forms (13 forms)                 82.61        100.46        59.38%
  6 Child visit encounter forms (5 forms)            53           67.4          62.21%
Gold Benchmarks
  52 Gold Std Trees (using a DIY interface that captures designers’ on-the-fly semantic decisions)
  Gold Std Annotations (4235 form terms were manually studied & 2506 (59%) had corr. concept in SNOMED CT)
  3 pairs of Gold DBs (3 datasets were given to 2 experts. Each expert manually derived the 3 databases)
30. Experiment 1: Form Tree Extraction
97.85% of parent-child semantic associations captured correctly
An average tree with 135 edges gets generated in 0.08 seconds.
                Dataset1   Dataset2   Dataset3   Dataset4   Dataset5   Dataset6
  Total Edges   272        362        461        2606       2674       644
  Accuracy      95.22%     97.51%     100%       97.58%     98.46%     96.11%
Inaccuracies arise from greater hierarchical complexity, i.e., semantic grouping and sub-grouping.
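The summary figures on this slide can be cross-checked against the per-dataset table with a few lines of arithmetic. The sketch assumes the overall accuracy is an edge-weighted average of the per-dataset accuracies, which the slide does not state, and it uses the 52-form count from the Data slide.

```python
# Per-dataset figures copied from the table above.
EDGES = [272, 362, 461, 2606, 2674, 644]
ACCURACY = [0.9522, 0.9751, 1.0, 0.9758, 0.9846, 0.9611]
NUM_FORMS = 52  # from the Data slide

total_edges = sum(EDGES)
avg_edges_per_tree = total_edges / NUM_FORMS

# Assumption: overall accuracy weights each dataset by its number of edges.
weighted_accuracy = sum(e * a for e, a in zip(EDGES, ACCURACY)) / total_edges

print(total_edges, round(avg_edges_per_tree), round(weighted_accuracy, 3))
```

Under that assumption the result lands within rounding distance of the reported 97.85% and 135 edges per tree; small differences are expected because the per-dataset accuracies are themselves rounded.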
32. Exp. 2: Form Term Annotation
Avg time(s)/form: 1.28, 1.77, 2.31, 10.29, 8.12, 3.44
Baseline to Hybrid
  Avg. precision improved by 26%.
  Recall: no specific pattern
Hybrid to Hybrid++
  Avg. precision improved by 13%
  Avg. recall improved by 17%
  Hybrid++: precision 0.86, recall 0.6
Enhanced all versions by adding term processing: removal of special characters, clinical acronym expansion.
  Precision only slightly improved (3-5%)
  Recall majorly improved (25%).
  Final Precision = 0.89, Recall = 0.76
Structural knowledge can improve the overall performance.
Linguistic techniques can only impact the recall.
33. Experiment 3: Form to Database Mapping
3a. Linguistic-based Discovery
3b. Concept-based Discovery
3c. Hybrid Discovery
34. Exp 3: Description of evolved databases (35 to 450 tables), Linguistic-based Discovery
[Figure: per-database area charts; x: element type, y: # elements]
Mapping duration per form: a few ms to 200 s.
35. Exp 3: Comparison with Gold Datasets
[Figure: table-level comparison With Gold 1 and With Gold 2]
74% (avg.) of the system-generated tables "perfectly match" the tables in the gold databases.
Based on the principles of quality and optimization, the mismatches could be divided into: Negative and Positive
[Figure: a Form Pattern shown with the System Generated DB and a Gold DB, illustrating a Positive Mismatch and a Negative Mismatch]
36. Exp. 3: Measuring Principle Compliance
Correctness, Completeness, Normalization, Optimization, Compactness.
An approx. universal set of merging situations
[Figure: results for DB1 to DB6 under Linguistic Discovery, Concept Discovery, and Hybrid Discovery]
3a: Linguistic Discovery
  >= 75% compactness in 4 databases.
  Databases 4, 6: >= 20% rejected due to form features
  Datasets 4 and 6
    Format Diversity: Gender (textbox, radiobuttons - M, F); DOB (single vs. multiple textboxes)
    Section Scattering
3b: Concept-based Discovery
  >= 70% compactness in 3 databases.
  Datasets 5 & 6: >= 33% undetected
3c: Hybrid Discovery
  >= 80% compactness in 4 databases.
38. Results Summary & Implications
Exp1: Form tree generation (52 forms)
  Accuracy = 0.98
  0.08s/tree
  •Supervised
  •Intervention 10/tree for cardinality disambiguation
Exp2: Form term annotation (2500 forms)
  Precision = 0.89, Recall = 0.76
  1 to 11s/form
  Improve precision (43%) and recall (29%) over baseline
  •Sophisticated term techniques
  •SNOMED CT relationships
  •Unsupervised learning
Exp3: Form to DB Mapping (6 DBs: 35 to 450 tables, few ms to 200s)
  Components: Validation Algorithm, Birthing Algorithm, Merging Algorithm
  Hybrid approach improves scenario identification (19%) and compactness (13%) over pure approaches, but performs less well in terms of interventions & screen relevance.
  Interventions
    Intervention red. 61%
    Intervention/form: ling.: 10, con.: 8, Hyb.: 13
    Avg. screen rel. = 50%
  Principle Compliance
    84.5% identical, or superior to gold DBs
    74% compact (hybrid)
  •Tune validation/merging based on form features.
  •Birthing algorithm can be refined as per gold std.
  •Interventions & screen relevance can be improved by enhancing validation algorithm
40. Thesis Contributions:
Mapping user-designed form to relational database. (NEW problem)
Form Understanding
  New Solution: 2-layered HMM that encodes designers' knowledge. First work to apply HMMs on form understanding
  Highly accurate (98%) and efficient (0.08s per form)
Form Term Annotation (NEW Problem!)
  Context-based solution leveraging semantic structure
  Promising (0.89 precision, 0.76 recall) and efficient (1-11s); improves over baseline by 43% in precision and 29% in recall
Correspondence Validation Algorithm
  Heuristic-based solution relying on frequent observations
  Reduces interventions by avg. 61%.
Birthing Algorithm
  Intertwines quality and optimization principles
  4 medium (<65 tables) & 2 large (<500 tables)-scale DBs
  3 medium-scale DBs intersect (or superior) with gold by 84.5%.
Merging Algorithm
  Balance b/w compactness & optimization
  Merged >= 70% semantically matching elements in 11/18 cases.
Key Recommendations
  For term annotations, design hybrid approaches leveraging both linguistics and structural semantics.
  For improving database quality, design approaches leveraging both linguistic and semantic methods for correspondence discovery.
  Birthing algorithm could be further refined in terms of handling radio-button groups and extended check-boxes to improve database quality.
  Enhance validation algorithm to further reduce user interventions and improve screen relevance
41. Limitations - I
Techniques
  Form Understanding
    Weak entities, part./card. constraints.
  Form Term Annotations
    Post coordinated mapping
  Correspondence Discovery
    Concatenated matches
  Merging Algorithm
    Classification attributes
    Detect/eliminate circular references in database.
Technique Evaluation
  Compare with other learning models
    SVM, conditional random fields, Bayesian networks, CAR
  Completeness and Correctness of Heuristics
    Tree design rules, Heuristics for validation and merging, Birthing Form Patterns
  Assumptions
    Class conditional independence, Correctness of most linguistic matching concept
  Theoretical Validity of Birthing Algorithm
42. Limitations - II
Study
  Thorough User Studies
    Can users understand/select the right correspondences?
  Domain Expert Annotator
  Large Scale of Databases
    Result Evaluation, Gold DB
  Limited Time
    Implementation
    Experimentation
Experimental Design
  Map and merge forms from different sources
  Experiments involving both automatic form tree extraction and term annotation methods.
43. Future Directions
Electronic Health Record
  Can Clinicians
    Design Forms,
    Understand/Identify Correspondences
  Does this framework improve
    Data Quality, Patient Diagnosis
  Legal Perspective
    HIPAA regulations, Proprietary systems
  Customize for Form Categories
    Encounter, Walk-in, Regular Visit, Data-entry
  Use other UMLS terminologies
General
  Turn into an API
    Amazon SimpleDB
    Google Datastore
  Leveraging More Form-Related Information
    Past Mappings
    Usage frequency
    Designer’s/User’s Domain Expertise
  Mapping Maintenance and Record Conflict Resolution
44. Related Publications
Exploiting Semantic Structure for Mapping User-specified Form Terms to
SNOMED CT Concepts
Khare R., An Y., Li J., Song I-Y., Hu X. In the proceedings of 2nd International Health
Informatics Symposium (IHI 2012), Jan 28-30, 2012, Miami, FL, USA.
Automatically Mapping and Integrating Multiple Data Entry Forms into a
Database
An Y., Khare R., Song I-Y., Hu X. In the proceedings of 30th International Conference on
Conceptual Modeling (ER 2011), Oct 31-Nov 3, 2011, Brussels, Belgium.
Can Clinicians Create High-Quality Databases? A Study on A Flexible
Electronic Health Record (fEHR) System
Khare R., An Y., Song I-Y., Hu X., In the proceedings of 1st International Health Informatics
Symposium (IHI 2010), Nov 11-12, 2010, Arlington, VA, USA.
Understanding Deep Web Search Interfaces
Khare R., An Y., Song I-Y. Special Interest Group in Management of Data (SIGMOD) Record,
39(1):33-40, 2010.
An Empirical Study on using Hidden Markov Model for Search Interface
Segmentation
Khare R., and An Y., In the proceedings of 18th International Conference on Information and
Knowledge Management (CIKM 2009), Nov 3-5, 2009, Hong Kong.
A form is designed for human consumption. Search forms are 10 times shorter (studied on 50 forms from both categories) and simpler (hierarchy and representation of database tables: single vs. multiple). Explain what the problem is and why it is challenging. Syntactic means formatting and sequence. Patterns are infinite, and design is so arbitrary that a certain pattern can't be associated with a certain semantic intention. These approaches rely on rendering engines (Gecko, Trident), which makes them browser dependent and inefficient.
To link these elements to the corresponding semantically matching elements of the existing hidden database. Forms have values and longer terms.
Whether to merge or not to merge: whether the element in question becomes a new column in a new table corresponding to Diagnosis, linked through a foreign key, or whether we duplicate this column into the new table and reduce the number of joins.
Make sure everything, i.e., the rest of the presentation, aligns with this. We seek the answers to these research questions through the development of a system that automatically maps a user-designed form to an existing database.
Prepare obvious answers: how is a DOM tree different from a semantic tree? Why do we generate correspondences from the form tree and then transfer them to the new database? So that users are presented with correspondences in terms of the form they had designed. DB-to-DB integration could be done, but here we leverage semantic form properties as well.
1. The input form is represented as an equivalent semantic form tree using a form understanding algorithm. We adopt a proactive approach to mapping in that we also standardize the form terms using an annotation technique focusing on the healthcare domain. Our solutions to the form understanding and the term annotation algorithms are described in Chapter 9.
2. The generated semantic form tree is then studied with respect to the existing database; and the semantic correspondences between the form tree and the existing database elements are discovered and validated using user interventions and certain validation rules. This part is described in Chapter 10.
3. The form tree with discovered correspondences to the existing database elements is then mapped and merged with the existing database. In particular, the matching elements are merged to the target database elements and the new form elements are transformed into new database elements and the existing database is extended using the new database elements. The database design and evolution algorithms are described in Chapter 11.
Approach identifies semantic grouping
the widely used medical terminology.
The HMMs are tailored for data-entry forms and are aligned with the forms' hierarchical complexity, thereby providing high extraction accuracy (Khare and An, 2009).
Who designed the forms? Why not other domains – which other domains? Possible. Have some idea. – opportunity to study whether systems can be improved.
Why does recall decrease? When the number of correct predictions decreases on applying the hybrid method. Sometimes the linguistic approach returns a more accurate result.
Total number of screens wherein the user suggested merging the elements, over the total number of screens generated as a result of executing the validation algorithm. Amount of redundancy minimization performed by the algorithm.
Each area indicates the contribution of a form in generating the database elements. The peaks denote the general pattern of forms in a given dataset. Most of the datasets peak at columns, implying the prevalence of textbox fields in the forms. Database 2 peaks at values, implying the prevalence of select and radiobutton fields in the forms. Database 5 peaks at foreign keys, indicating the prevalence of categories and subcategories in the forms. The broad areas represent the presence of longer forms, and the narrower regions represent the presence of shorter, or mergeable, forms.
This does not include the form tree generation time, user intervention time, or the execution of database DDL statements. The duration follows no fixed pattern. It depends on multiple factors, including the size of the form and the size of the existing database. Lucene indexing helped in controlling the duration, which ranges from a few milliseconds to 200 seconds, even for large-scale databases such as the ones generated from datasets 4 and 5.
We performed a table-level comparison. We manually analyzed the mismatched tables.
At least 50% for all datasets. Huge reduction: many scenarios that could be validated were found. 5 options per screen. Screen relevance: very low. This denotes that most of the correspondences identified using the linguistic matching method adopted by Lucene were not semantically matching, and were hence rejected by the user. The screen relevance was particularly high (94%) for dataset 5, which represents the family practice forms. In these forms, linguistically matching and yet semantically differing terms were not very prevalent. Approved merger: for dataset 3, out of all the mergeable form elements identified by the validation algorithm, 97.29% were merged to a semantically matching database element.
And did we reach all system goals? Specify again. Clearly. Did we reach the system goals?
Our experience of tagging 52 data-entry forms suggests that the training samples can be constructed quickly and easily, as compared to the construction of an exhaustive set of rules or heuristics. To further test the performance of the mapping framework in a heterogeneous environment,