Abstract
Toward Data Cleaning on Date Field of Scanned Books
Jing Xie
Center for Intelligent Information Retrieval, School of Computer Science,
University of Massachusetts Amherst, 140 Governors Drive, Amherst, MA 01003-9264
Contact info: jingxie@cs.umass.edu
Introduction
Book digitization has been an important problem for many years. Existing large-scale scanned book collections have many shortcomings for data-driven research. In particular, researchers have been hampered by the inaccuracy of descriptive metadata in large-scale scanned book collections. For example, 12/19/1977, 12/19/77, and Dec 19, 1977 all point to the same publishing date of a book. However, books with 12/19/77 or Dec 19, 1977 in the date field might be unsearchable and effectively lost because of their non-standardized date format. How to clean this “dirty” date data and transform it into well-formed data is the main focus of this work.

Data cleaning is a relatively new research field. Although the current marketplace and technology for data cleaning are heavily focused on customer lists [1], some research groups are developing methods related to data cleaning. Galhardas proposed AJAX, a data cleaning tool that addresses the problem of duplicate identification and elimination [2]. Srikant developed data mining approaches that are relevant to, but not limited to, data cleaning [3]. There is very little basic research directly aimed at methods to support powerful tools that can automatically clean date data.

Our work focuses on cleaning dates with non-standardized formats, using metadata from a large Internet Archive scanned book collection. We first determined the error types of the date formats, and then created a method that automatically transforms these non-standardized dates into an accepted format. The results indicate that our method is very effective at cleaning the dirty date field.
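To make the example above concrete, the sketch below shows one way such surface variants could be mapped to a single canonical form. It is a minimal illustration rather than this work's implementation; the pattern list and the choice of ISO output are assumptions.

```python
# Minimal sketch (assumption, not this work's implementation): mapping several
# surface forms of the same publication date to one canonical ISO form, so a
# record stamped "12/19/77" stays searchable alongside "12/19/1977".
from datetime import datetime
from typing import Optional

# Illustrative pattern list; a real cleaner would need many more patterns.
PATTERNS = ["%m/%d/%Y", "%m/%d/%y", "%b %d, %Y"]

def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no pattern matches."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(raw.strip(), pattern).date().isoformat()
        except ValueError:
            continue
    return None

# All three variants from the introduction map to 1977-12-19.
for raw in ("12/19/1977", "12/19/77", "Dec 19, 1977"):
    print(raw, "->", normalize_date(raw))
```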
Formats of the “dirty” data
• Missing
• [c1915]
• [n.d.]
• 1841-1958
• 1911
• February 2-3, 1996
• [19--]
• 1912, c1904
• 1904-07, 1847-54
• June, 1890
• 18 -19
• [1900?]-[27]
• 200-?-
• 1793, 1773, 1731, 1789, 1763
• [187-?]-[19--]
• M DCC LVIII. [1758]
• M DCC LVIII
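Given the variety of forms listed above, an early step is recognizing which kind of “dirty” value a raw string is. The sketch below groups a few of the listed examples with regular expressions; the category labels and patterns are illustrative assumptions rather than the rule set actually used in this work.

```python
# Illustrative classification of a few of the "dirty" date forms listed above.
# The labels and regular expressions are assumptions for this sketch.
import re

RULES = [
    ("no_date",       re.compile(r"^\[?n\.d\.\]?$", re.IGNORECASE)),  # [n.d.]
    ("circa_year",    re.compile(r"^\[?c\d{4}\]?$")),                 # [c1915]
    ("partial_year",  re.compile(r"^\[?\d{2,3}-{1,2}\??\]?-?$")),     # [19--], 200-?-
    ("year_range",    re.compile(r"^\d{4}-\d{2,4}$")),                # 1841-1958, 1904-07
    ("single_year",   re.compile(r"^\d{4}$")),                        # 1911
    ("roman_numeral", re.compile(r"^[MDCLXVI .]+(\[\d{4}\])?$")),     # M DCC LVIII. [1758]
]

def classify(raw: str) -> str:
    raw = raw.strip()
    if not raw:
        return "missing"
    for label, pattern in RULES:
        if pattern.match(raw):
            return label
    return "other"

for sample in ["[c1915]", "[n.d.]", "1841-1958", "[19--]", "1911", "M DCC LVIII. [1758]"]:
    print(sample, "->", classify(sample))
```

Counting how often each category occurs in the collection would show where the cleaning effort pays off most.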
• Metadata source: Internet Archive scanned book collection, 50,000 metadata records from different contributors
• The date field appears in many different formats in the metadata; we call these “dirty” data
• “Dirty” data impedes information retrieval and affects the retrieval results
• Cleaning the “dirty” data is a big challenge in the real world
• Our research focuses on methods to support powerful tools that can automatically clean the date metadata
How to curate the date field
• Identify the different formats of the date
• Create “from” and “to” dates including year, month, and day
• Parse the date field and complete the missing fields
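As a rough illustration of these curation steps (not the actual implementation), the sketch below parses a few of the formats listed earlier and fills in the missing month and day to produce a “from”/“to” pair; the parsing rules are assumptions.

```python
# Hedged sketch of the curation steps: detect the format, parse it, and fill
# in missing month/day to get a ("from", "to") date pair. The rules shown are
# assumptions covering only a few of the formats listed earlier.
import calendar
import re
from datetime import date
from typing import Optional, Tuple

def curate_date(raw: str) -> Optional[Tuple[date, date]]:
    raw = raw.strip()

    # "1841-1958": a year range; expand to the first and last day.
    m = re.fullmatch(r"(\d{4})-(\d{4})", raw)
    if m:
        return date(int(m.group(1)), 1, 1), date(int(m.group(2)), 12, 31)

    # "[c1915]" or "1911": a single (possibly circa) year.
    m = re.fullmatch(r"\[?c?(\d{4})\]?", raw)
    if m:
        year = int(m.group(1))
        return date(year, 1, 1), date(year, 12, 31)

    # "June, 1890": month and year with the day missing.
    m = re.fullmatch(r"([A-Za-z]+),?\s+(\d{4})", raw)
    if m:
        month = list(calendar.month_name).index(m.group(1).capitalize())
        year = int(m.group(2))
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)

    return None  # format not handled by this sketch

for sample in ["1841-1958", "[c1915]", "June, 1890"]:
    print(sample, "->", curate_date(sample))
```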
How to narrow down the date range
• Build a model associated with the title to calculate P(year|title)
• Calculate P(year|title) using (year, title) statistics
• Select a standard to judge whether the estimated P(year|title) is good enough; if not, change the year to the one with the highest P(year|title)
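A minimal sketch of the narrowing idea follows, under the assumption that P(year|title) is estimated with a smoothed unigram model over title words and Bayes' rule; the actual model used in this work may differ.

```python
# Sketch (assumption) of estimating P(year | title) from (year, title)
# statistics with a unigram title model and Bayes' rule, then picking the
# most probable year inside an uncertain range such as "[187-?]-[19--]".
import math
from collections import Counter, defaultdict

class YearTitleModel:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # year -> title-word counts
        self.year_counts = Counter()             # year -> number of titles
        self.vocab = set()

    def train(self, records):
        """records: iterable of (year, title) pairs with trusted dates."""
        for year, title in records:
            words = title.lower().split()
            self.year_counts[year] += 1
            self.word_counts[year].update(words)
            self.vocab.update(words)

    def log_p_year_given_title(self, year, title):
        """log P(year) + sum_w log P(w | year), with add-one smoothing."""
        total = sum(self.year_counts.values())
        score = math.log((self.year_counts[year] + 1) / (total + len(self.year_counts) + 1))
        counts = self.word_counts[year]
        denom = sum(counts.values()) + len(self.vocab) + 1
        for word in title.lower().split():
            score += math.log((counts[word] + 1) / denom)
        return score

    def best_year(self, title, candidate_years):
        """Return the candidate year with the highest P(year | title)."""
        return max(candidate_years, key=lambda y: self.log_p_year_given_title(y, title))
```

A record whose date field gives only a wide range (for example 1841-1958) could then be narrowed to the year in that range with the highest P(year|title), subject to the quality threshold mentioned above.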
Conclusion
1. Standardize the date format and make all metadata retrievable.
2. Build a model to narrow down the date range, which makes the curated date more accurate.

References
1. R. Kimball, “Dealing with Dirty Data,” DBMS, vol. 9, no. 10, Sept. 1996, p. 55.
2. H. Galhardas, D. Florescu, D. Shasha, and E. Simon, “AJAX: An Extensible Data Cleaning Tool,” Proc. ACM SIGMOD Conf., 2000, p. 590.
3. R. Srikant and R. Agrawal, “Mining Generalized Association Rules,” Proc. 21st VLDB Conf., 1995.
Acknowledgement
This work is advised by Prof. James Allan.