ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
Dissertation Proposal Presentation
1. Modeling and Mapping Forms over Databases: Empowering Users to Design Databases in Industrial Domains. Dissertation Proposal, October 07 2010, Ritu Khare
2. MOTIVATION: Database Design by Non-technical Users. Why have existing methods not reached the industrial domains?
4. Why are existing methods unfit for industrial domains?
  Existing Applications: No provision to modify or extend an existing database. The translation (forward engineering) method is not reported. Not tested on non-technical users.
  Features of Industrial Domains: Databases are required to evolve w.r.t. new user needs. Data and database quality is important, since quality leads to productivity (Batini and Scannapieco, 2006). Users have no background in data modeling and databases.
5. THE PROPOSAL: Proposed System and Research Goals; Opportunity: Forms; Example: Form to Database Mapping; Challenges in Mapping
6. Proposed System and Research Goals
  Proposed System: An application to model and map user needs into an existing database.
  Goals:
  Modeling: A “usable” medium for users to model needs (Efficiency, Effectiveness, Adoption).
  Mapping: The resultant database should be high-quality, i.e., it should satisfy Normalization, Completeness, Compactness, and Correctness (Silberschatz et al. 2001, Batini and Scannapieco, 2006, Batini et al. 1992).
7. Opportunity: Forms
  MODELING: Data-entry forms provide a good communication medium for users to specify their data collection needs (Choobineh et al. 1988, Embley, 1989).
  MAPPING: Important information on databases can be retrieved by analyzing forms (Choobineh and Mannino, 1988). Search forms provide a useful way of determining the underlying database (Benslimane, 2007) (covered in Candidacy Exam). Data-entry forms provide key guidelines for designing a prospective database (Mannino and Choobineh, 1984).
8. The Proposed Application: An Example
  [Figure labels: Clinician, New Needs, design, New User Designed Form, Patient, VitalSigns, Existing Database, Evolved Database; steps: Form Modeling, Form to Database Mapping]
  NEW PROBLEM!
9. Uniqueness of “Form to Database” Mapping
  Schema Mapping (Rahm and Bernstein 2001): The two structures are similar. Mapping involves only schema elements (no values). Schema/database evolution is not considered when there are unmapped elements.
  Form to Database Mapping:
  Semiautomatic Mapping Discovery: How to reconcile the differences in structures and semantics? How to detect the form (or need) components (including values) that already exist in the database?
  Database Evolution: How to extend the database based on new elements in the form? How to automatically determine functional dependencies and cardinalities from a form?
11. 1. Form Design Interface
  SIMPLE! 1. Terminology (intuitive) 2. Features (form patterns)
  Form terms shown on a Simple Form and an Advanced Form: Supporting Text, Format, Title, Unit, Category, Field, Subcategory, Extended Checkbox Option, Subfield, Condition
12. 1. Form Design Interface
  Input: User actions (based on data collection needs). Output: Form.
  Example actions: Enter the title “Patient Encounter Form”; enter the category “Patient”; enter the field “Name”; pick a format “textbox”; enter the field “Age”; …
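The action sequence on this slide can be replayed as a small sketch. Everything here is an assumption made for illustration: the class names (`Form`, `Category`, `FormField`), the `apply_actions` helper, and the `(kind, value)` encoding of a user action are hypothetical and are not taken from the proposed system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FormField:
    label: str
    fmt: Optional[str] = None  # e.g. "textbox"; None until the user picks a format


@dataclass
class Category:
    label: str
    fields: List[FormField] = field(default_factory=list)


@dataclass
class Form:
    title: str = ""
    categories: List[Category] = field(default_factory=list)


def apply_actions(actions: List[Tuple[str, str]]) -> Form:
    """Replay (kind, value) user actions into a Form (hypothetical encoding)."""
    form = Form()
    for kind, value in actions:
        if kind == "title":
            form.title = value
        elif kind == "category":
            form.categories.append(Category(value))
        elif kind == "field":
            # A field is attached to the most recently entered category.
            form.categories[-1].fields.append(FormField(value))
        elif kind == "format":
            # A format is attached to the most recently entered field.
            form.categories[-1].fields[-1].fmt = value
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return form


# The five actions listed on the slide.
actions = [
    ("title", "Patient Encounter Form"),
    ("category", "Patient"),
    ("field", "Name"),
    ("format", "textbox"),
    ("field", "Age"),
]
form = apply_actions(actions)
```

In this sketch the number of replayed actions, `len(actions)`, plays the role of the number of steps taken to design the form.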
13. Defining High-Quality: Guiding Principles (with respect to a given form)
  Completeness: Every form element has a place in the database.
  Correctness: For each correspondence, the form element and the database element refer to the same real-world element (they have matching labels and contexts).
  Compactness: Every database element occurs just once.
  Normalization: The database is in 3NF.
14. A Simple Approach
  1. Loses grouping information. 2. Loses form values. 3. Places heterogeneous attributes in the same relation.
  The generated database is incomplete and not in 3NF (low-quality)! So we propose a tree representation of the form.
15. 2. Tree Generation. Definition: Form Tree
  Input: Form. Output: Form Tree.
  Previous works have proposed a similar tree representation for search forms (Dragut et al. 09, Wu et al. 09). Our form tree differs in three ways: 1) it represents data-entry forms; 2) it adds format nodes to improve DB quality; 3) it uses different representations for checkboxes and radio buttons.
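A form tree of this kind could be held in a simple labeled tree. The sketch below is only an illustration under assumptions: the node kinds ("title", "category", "field", "format", "value"), the `Node` class, the `leaf_paths` helper, and the "Gender" field with its radio-button values are all hypothetical, since the slide's formal definition is not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str
    kind: str  # hypothetical kinds: "title", "category", "field", "format", "value"
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so calls can be nested."""
        self.children.append(child)
        return child


def leaf_paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return the label path from the root to every leaf, in document order."""
    path = prefix + (node.label,)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Hypothetical example: the grouping (category) and the format nodes are kept in the tree.
root = Node("Patient Encounter Form", "title")
patient = root.add(Node("Patient", "category"))
name = patient.add(Node("Name", "field"))
name.add(Node("textbox", "format"))
gender = patient.add(Node("Gender", "field"))
radio = gender.add(Node("radiobutton", "format"))
radio.add(Node("Male", "value"))
radio.add(Node("Female", "value"))

paths = leaf_paths(root)
```

Unlike a flat list of fields, each path keeps the category a field belongs to and any values offered on the form, which is the information the simple approach loses.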
16. Form to Database Mapping
  Existing Database + Form Tree: Map and Merge???
  Main challenges: 1. discovering a mapping between two heterogeneous structures; 2. merging new elements into the existing database.
  Steps: 3. Birthing (Form Tree to New Database Graph); 4. Classification (MAP: New Database Graph against Existing Database Graph); 5. Extension (MERGE into Existing Database).
18. Definition: Mapping Correspondences 18 Direct correspondence Indirect Correspondence (Value collected on form element is stored in database element)
19. 3. Birthing (term adopted from Jagadish et al. 2007)
  Input: Form Tree. Output: New Database Graph.
24. 4. Database Graph Classification
  Classify each node to determine whether it pre-exists in the existing database, i.e., whether it “maps” or not.
  [Figure: New Database Graph, Existing DB Graph, Existing DB]
25. 4. Database Graph Classification: Algorithm
  Problem: Finding matching nodes between the new database graph (DGn) and the existing database graph (DGe).
  Algorithm:
    For each table node tn in DGn
      Let te be the label-matching table node in DGe
      If the two table nodes tn and te “match” (TableMatch algorithm)
        Tag tn, i.e., mark this node as a matching/mapped node
        Tag all matching column and value nodes (ColumnMatch algorithm)
      Else
        Rename the table
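The classification loop above can be sketched in a few lines. This is a simplified illustration, not the proposed implementation: database graphs are reduced to dictionaries from table label to a set of column labels, value nodes are omitted, `table_match` is passed in as a callable, and the renaming scheme (appending a numeric suffix) is an assumption. The slide does not say what happens when no label-matching table exists; the sketch assumes such a table is simply left unmapped under its own name.

```python
def classify_tables(new_graph, existing_graph, table_match):
    """Split the tables of new_graph into mapped and unmapped ones.

    new_graph / existing_graph: dict mapping table label -> set of column labels.
    table_match: callable(new_columns, existing_columns) -> bool (stand-in for TableMatch).
    """
    mapped, unmapped = {}, {}
    for label, columns in new_graph.items():
        existing_columns = existing_graph.get(label)  # label-matching table, if any
        if existing_columns is not None and table_match(columns, existing_columns):
            # Tag the table and its matching columns (matched here by label only).
            mapped[label] = columns & existing_columns
        elif existing_columns is not None:
            # Same label but no match: rename (hypothetical suffix scheme).
            new_label, suffix = label, 2
            while new_label in existing_graph or new_label in unmapped:
                new_label = f"{label}_{suffix}"
                suffix += 1
            unmapped[new_label] = columns
        else:
            # Assumption: no label-matching table, so keep the name and leave it unmapped.
            unmapped[label] = columns
    return mapped, unmapped


# Hypothetical data; the stand-in matcher accepts tables sharing at least one column.
existing = {"Patient": {"Name", "Age"}, "Visit": {"Date"}}
new = {
    "Patient": {"Name", "Age", "Weight"},
    "Visit": {"Reason", "Doctor"},
    "Allergy": {"Substance"},
}
mapped, unmapped = classify_tables(new, existing, lambda a, b: len(a & b) >= 1)
```

The unmapped tables are the ones a later extension step would add to the existing database.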
26. 4. Database Graph Classification: TableMatch Algorithm
  Two table nodes “match” if:
  their labels match, and
  the null-value column ratio (NCR) < tolerance-threshold (efficiency consideration: minimize null value possibilities during data collection).
  NCR = (number of unmatched columns, as per ColumnMatch, in either table, whichever is higher) / (size of the union set of columns in both tables)
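The NCR definition translates directly into set arithmetic. The sketch below makes one simplifying assumption: columns are compared by label only, whereas the slides delegate that decision to ColumnMatch. The function names and the column labels are hypothetical.

```python
def null_column_ratio(new_columns, existing_columns):
    """NCR = max(unmatched in either table) / size of the union of columns."""
    union = new_columns | existing_columns
    if not union:
        return 0.0  # two empty tables: nothing can be null
    unmatched_new = new_columns - existing_columns
    unmatched_existing = existing_columns - new_columns
    return max(len(unmatched_new), len(unmatched_existing)) / len(union)


def table_match(new_label, new_columns, existing_label, existing_columns, threshold):
    """Two tables match if labels agree and NCR is strictly below the threshold."""
    return (
        new_label == existing_label
        and null_column_ratio(new_columns, existing_columns) < threshold
    )


# Hypothetical tables: 5 columns in the union, 1 unmatched on one side, 2 on the other.
t_new = {"Name", "Age", "Weight"}
t_existing = {"Name", "Age", "Phone", "Address"}

ncr = null_column_ratio(t_new, t_existing)  # max(1, 2) / 5
```

With these sets the ratio is 2/5, so the tables match under a tolerance-threshold of 0.5 and fail to match under 0.3.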
27. Example: Null-Value Column Ratio (NCR) Calculation
  NCR = 2/5 = 0.4
  If tolerance-threshold = 0.5 (high): 0.4 < 0.5, so the tables map.
  If tolerance-threshold = 0.3 (low): 0.4 is not below 0.3, so the tables do not map.
  When using Form 1, 2 columns will have null values. When using Form 2, 1 column will have null values.
28. 4. Database Graph Classification: ColumnMatch Algorithm
  Two non-key column nodes “match” if their labels/names are the same, their data types are the same, and their not-null constraints are the same.
  Two foreign key column nodes “match” if they both point to the same table nodes, as determined by the TableMatch algorithm.
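The two ColumnMatch rules can be written as a single predicate. This is a sketch under assumptions: the `Column` record, its field names, and the representation of TableMatch results as a set of (new table, existing table) pairs are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    not_null: bool
    references: Optional[str] = None  # label of the referenced table for a foreign key


def column_match(new_col, existing_col, matched_tables=frozenset()):
    """Apply the non-key rule or the foreign-key rule, depending on the columns.

    matched_tables: set of (new_table_label, existing_table_label) pairs that
    TableMatch has already accepted (hypothetical representation).
    """
    new_is_fk = new_col.references is not None
    existing_is_fk = existing_col.references is not None
    if new_is_fk or existing_is_fk:
        # Foreign keys match only if both point to tables that match each other.
        return (
            new_is_fk
            and existing_is_fk
            and (new_col.references, existing_col.references) in matched_tables
        )
    # Non-key columns: same name, same data type, same not-null constraint.
    return (
        new_col.name == existing_col.name
        and new_col.data_type == existing_col.data_type
        and new_col.not_null == existing_col.not_null
    )


age_new = Column("Age", "int", True)
age_existing = Column("Age", "int", True)
age_as_text = Column("Age", "text", True)
fk_new = Column("patient_id", "int", True, references="Patient")
fk_existing = Column("pid", "int", True, references="Patient")
```

Note that the foreign-key rule ignores column names: only the referenced tables matter, so `fk_new` and `fk_existing` match once ("Patient", "Patient") has been accepted by TableMatch.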
29. 5. Extension of the Existing Database 29 Add unmapped tables, columns, and values
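As a rough illustration of the extension step, the sketch below merges unmapped tables and columns into a copy of the existing schema. It is a simplification under assumptions: schemas are dictionaries from table label to a set of column labels, value nodes are left out, and the `extend_database` helper and the "Weight" and "Pulse" columns are hypothetical.

```python
def extend_database(existing, new_graph):
    """Return (evolved schema, what was added) without modifying the inputs."""
    evolved = {table: set(columns) for table, columns in existing.items()}
    added = {}
    for table, columns in new_graph.items():
        current = evolved.setdefault(table, set())  # creates unmapped tables
        missing = set(columns) - current            # unmapped columns only
        if missing:
            added[table] = missing
            current |= missing
    return evolved, added


existing = {"Patient": {"Name", "Age"}}
new_graph = {"Patient": {"Name", "Weight"}, "VitalSigns": {"Pulse"}}
evolved, added = extend_database(existing, new_graph)
```

Columns that already exist are left alone, so each database element still occurs just once after the merge.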
31. Usability Evaluation: User Study
  Participants and Tasks:
  5 nurse professionals: no knowledge of databases, moderate computer users, familiar with paper-based forms.
  2 tasks. Build task: replicate a paper-based form on the system. Model-and-build task: model and build a given need (stated in natural language) into a form using the system interface.
  Study Settings:
  2 rounds (form scale = no. of steps to design a form).
  Round 1, small-scale needs: avg. form scale = 17; generated avg. 4.2 relations, 5.8 non-key attributes, 1.8 values, and 3.2 foreign key references.
  Round 2, large-scale needs: avg. form scale = 47.4; generated avg. 6.2 relations, 13.8 attributes, 10.4 values, and 4.6 foreign key references.
32. MEASUREMENTS
  Duration Ratio = Time (in min) / Form Scale (# of steps to build form)
  Assistance Ratio = # of assistances sought / Form Scale (# of steps to build form)
  Outliers: P3 considered design alternatives (high duration ratio); P5 had difficulty with form terminology (needed more assistance).
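Both measurements normalize a raw count by the form scale, which can be sketched as follows. The function names and the sample numbers are hypothetical and are not study data.

```python
def duration_ratio(minutes, form_scale):
    """Time in minutes divided by the number of steps needed to build the form."""
    if form_scale <= 0:
        raise ValueError("form scale must be a positive number of steps")
    return minutes / form_scale


def assistance_ratio(assistances, form_scale):
    """Number of assistances sought divided by the number of steps."""
    if form_scale <= 0:
        raise ValueError("form scale must be a positive number of steps")
    return assistances / form_scale


# Hypothetical session: 8.5 minutes and 3 requests for help.
d = duration_ratio(8.5, 17)
a = assistance_ratio(3, 12)
```

Dividing by form scale lets sessions on small and large forms be compared on the same footing.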
33. Findings
  Effectiveness: In 19/20 cases, participants finished the tasks with 100% effectiveness. The unsuccessful case was a building error committed by a participant who skipped a component while building forms.
  Efficiency: Duration ranged from 1 to 9 minutes for simple small-scale needs, and 7 to 19 minutes for advanced large-scale needs. Exception: a participant who considered several design alternatives.
  System Adoption:
  Efficiency: consistently improved from round 1 to round 2.
  Confidence: Participants were very confident in specifying small-scale needs for both tasks. Confidence improved from round 1 to round 2 for the build task, but not for the model-and-build task.
  Understanding: improved greatly in round 2. Participants started synthesizing their knowledge of form concepts and domain knowledge to consider different design alternatives.
  Comparison with a Related Work: Appforge (Yang et al. 2008): Users are required to create forms and expressive views and are exposed to the existing schema. In our work, users only create forms, and the mapping is handled by the system.
35. Analyzing Inaccuracies and System Enhancement
  [Figure: two M:M relationships]
  Added another layer of interaction to disambiguate cardinality between 2 entities.
  Result: All the databases were identical to their respective gold standard databases.
  Inference: The mapping algorithms have the ability to generate databases in industrial domains.
36. Mapping Experiment Set 2
  For each domain, performed mapping experiments with at least 5 different sequences of forms (representing different merging situations).
  Result: All the databases generated from different sequences are identical to each other and to the gold standard databases.
  Inference: The mapping algorithms have the ability to evolve databases in industrial domains in a variety of merging situations.
37. Current and Predicted Contributions
  Introducing the Form to Database Mapping; algorithms driven by data-quality principles.
  Mapping experiments on 5 domains: The system has the potential to generate high-quality databases in industrial settings solely based on user-designed forms and user-provided domain knowledge, and to evolve existing databases in a variety of merging situations.
  Usability study: The system has the potential to be adopted by non-technical users while providing them efficiency and effectiveness in form modeling.
Using the Form Design Interface, users can design simple as well as advanced forms. To make it usable for non-technical users, we have kept the interface simple in terms of terminology as well as design. The terms used are simple and commonplace, and the features supported are present in various data-entry forms, which users might already be familiar with. E.g. terms used are