Abstract
Toward Data Cleaning on Date Field of Scanned Books
Jing Xie
Center for Intelligent Information Retrieval, School of Computer Science,
University of Massachusetts Amherst, 140 Governors Drive, Amherst, MA 01003-9264
Contact info: jingxie@cs.umass.edu
Introduction
Book digitization has been an important problem for many years. Existing large-scale scanned book collections have many shortcomings for data-driven research. In particular, researchers have been hampered by the inaccuracy of descriptive metadata in large-scale scanned book collections. For example, 12/19/1977, 12/19/77, and Dec 19, 1977 all point to the same publishing date of a book. However, books with 12/19/77 or Dec 19, 1977 in the date field might be unsearchable and effectively lost because of their non-standardized date format. How to clean this “dirty” date data and transform it into well-formed data is the main focus of this work.

Data cleaning is a relatively new research field. Although the current marketplace and technology for data cleaning are heavily focused on customer lists [1], some research groups are developing methods related to data cleaning. Galhardas proposed AJAX, a data cleaning tool that addresses the problem of duplicate identification and elimination [2]. Srikant developed data mining approaches that are relevant to, but not limited to, data cleaning [3]. There is very little basic research directly aimed at methods to support powerful tools that can automatically clean date data.

Our work focuses on cleaning dates with non-standardized formats, using metadata from a large Internet Archive scanned book collection. We first determined the error types of the date formats, and then created a method that automatically transforms these non-standardized dates into an accepted format. The results indicate that our method is very effective at cleaning the dirty date field.
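To make the example above concrete, the sketch below shows one way such surface variants could be mapped to a single canonical form. It is a minimal illustration rather than this work's implementation; the pattern list and the choice of ISO output are assumptions.

```python
# Minimal sketch (assumption, not this work's implementation): mapping several
# surface forms of the same publication date to one canonical ISO form, so a
# record stamped "12/19/77" stays searchable alongside "12/19/1977".
from datetime import datetime
from typing import Optional

# Illustrative pattern list; a real cleaner would need many more patterns.
PATTERNS = ["%m/%d/%Y", "%m/%d/%y", "%b %d, %Y"]

def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no pattern matches."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(raw.strip(), pattern).date().isoformat()
        except ValueError:
            continue
    return None

# All three variants from the introduction map to 1977-12-19.
for raw in ("12/19/1977", "12/19/77", "Dec 19, 1977"):
    print(raw, "->", normalize_date(raw))
```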
Formats of the “dirty” data
• Missing
• [c1915]
• [n.d.]
• 1841-1958
• 1911
• February 2-3, 1996
• [19--]
• 1912, c1904
• 1904-07, 1847-54
• June, 1890
• 18 -19
• [1900?]-[27]
• 200-?-
• 1793, 1773, 1731, 1789, 1763
• [187-?]-[19--]
• M DCC LVIII. [1758]
• M DCC LVIII
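Given the variety of forms listed above, an early step is recognizing which kind of “dirty” value a raw string is. The sketch below groups a few of the listed examples with regular expressions; the category labels and patterns are illustrative assumptions rather than the rule set actually used in this work.

```python
# Illustrative classification of a few of the "dirty" date forms listed above.
# The labels and regular expressions are assumptions for this sketch.
import re

RULES = [
    ("no_date",       re.compile(r"^\[?n\.d\.\]?$", re.IGNORECASE)),  # [n.d.]
    ("circa_year",    re.compile(r"^\[?c\d{4}\]?$")),                 # [c1915]
    ("partial_year",  re.compile(r"^\[?\d{2,3}-{1,2}\??\]?-?$")),     # [19--], 200-?-
    ("year_range",    re.compile(r"^\d{4}-\d{2,4}$")),                # 1841-1958, 1904-07
    ("single_year",   re.compile(r"^\d{4}$")),                        # 1911
    ("roman_numeral", re.compile(r"^[MDCLXVI .]+(\[\d{4}\])?$")),     # M DCC LVIII. [1758]
]

def classify(raw: str) -> str:
    raw = raw.strip()
    if not raw:
        return "missing"
    for label, pattern in RULES:
        if pattern.match(raw):
            return label
    return "other"

for sample in ["[c1915]", "[n.d.]", "1841-1958", "[19--]", "1911", "M DCC LVIII. [1758]"]:
    print(sample, "->", classify(sample))
```

Counting how often each category occurs in the collection would show where the cleaning effort pays off most.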
• Metadata source: Internet Archive scanned book collection, 50,000 metadata records from different contributors
• The date field appears in many different formats in the metadata; we call these “dirty” data
• “Dirty” data impedes information retrieval and affects the retrieval results
• Cleaning the “dirty” data is a big challenge in the real world
• Our research focuses on methods to support powerful tools that can automatically clean the date metadata
How to curate the date field
• Identify the different formats of the date
• Create “from” and “to” dates including year, month, and day
• Parse the date field and complete the missing fields
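As a rough illustration of these curation steps (not the actual implementation), the sketch below parses a few of the formats listed earlier and fills in the missing month and day to produce a “from”/“to” pair; the parsing rules are assumptions.

```python
# Hedged sketch of the curation steps: detect the format, parse it, and fill
# in missing month/day to get a ("from", "to") date pair. The rules shown are
# assumptions covering only a few of the formats listed earlier.
import calendar
import re
from datetime import date
from typing import Optional, Tuple

def curate_date(raw: str) -> Optional[Tuple[date, date]]:
    raw = raw.strip()

    # "1841-1958": a year range; expand to the first and last day.
    m = re.fullmatch(r"(\d{4})-(\d{4})", raw)
    if m:
        return date(int(m.group(1)), 1, 1), date(int(m.group(2)), 12, 31)

    # "[c1915]" or "1911": a single (possibly circa) year.
    m = re.fullmatch(r"\[?c?(\d{4})\]?", raw)
    if m:
        year = int(m.group(1))
        return date(year, 1, 1), date(year, 12, 31)

    # "June, 1890": month and year with the day missing.
    m = re.fullmatch(r"([A-Za-z]+),?\s+(\d{4})", raw)
    if m:
        month = list(calendar.month_name).index(m.group(1).capitalize())
        year = int(m.group(2))
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)

    return None  # format not handled by this sketch

for sample in ["1841-1958", "[c1915]", "June, 1890"]:
    print(sample, "->", curate_date(sample))
```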
How to narrow down the date range
• Build a model associated with the title to calculate P(year|title)
• Calculate P(year|title) using (year, title) statistics
• Select a standard to judge whether the estimated P(year|title) is good enough; if not, change the year to the one with the highest P(year|title)
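A minimal sketch of the narrowing idea follows, under the assumption that P(year|title) is estimated with a smoothed unigram model over title words and Bayes' rule; the actual model used in this work may differ.

```python
# Sketch (assumption) of estimating P(year | title) from (year, title)
# statistics with a unigram title model and Bayes' rule, then picking the
# most probable year inside an uncertain range such as "[187-?]-[19--]".
import math
from collections import Counter, defaultdict

class YearTitleModel:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # year -> title-word counts
        self.year_counts = Counter()             # year -> number of titles
        self.vocab = set()

    def train(self, records):
        """records: iterable of (year, title) pairs with trusted dates."""
        for year, title in records:
            words = title.lower().split()
            self.year_counts[year] += 1
            self.word_counts[year].update(words)
            self.vocab.update(words)

    def log_p_year_given_title(self, year, title):
        """log P(year) + sum_w log P(w | year), with add-one smoothing."""
        total = sum(self.year_counts.values())
        score = math.log((self.year_counts[year] + 1) / (total + len(self.year_counts) + 1))
        counts = self.word_counts[year]
        denom = sum(counts.values()) + len(self.vocab) + 1
        for word in title.lower().split():
            score += math.log((counts[word] + 1) / denom)
        return score

    def best_year(self, title, candidate_years):
        """Return the candidate year with the highest P(year | title)."""
        return max(candidate_years, key=lambda y: self.log_p_year_given_title(y, title))
```

A record whose date field gives only a wide range (for example 1841-1958) could then be narrowed to the year in that range with the highest P(year|title), subject to the quality threshold mentioned above.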
Conclusion
1. Standardize the date format and make all metadata retrievable.
2. Build a model to narrow down the date range, which makes the curated date more accurate.

References
1. R. Kimball, “Dealing with Dirty Data,” DBMS, vol. 9, no. 10, Sept. 1996, p. 55.
2. H. Galhardas, D. Florescu, D. Shasha, and E. Simon, “AJAX: An Extensible Data Cleaning Tool,” Proc. ACM SIGMOD Conf., 2000, p. 590.
3. R. Srikant and R. Agrawal, “Mining Generalized Association Rules,” Proc. 21st VLDB Conf., 1995.
Acknowledgement
This work is advised by Prof. James Allan.