A poster presented at the LDAV 2011 IEEE Symposium on Large-Scale Data Analysis and Visualization, Providence, RI, October 24, 2011. Authors: Jefferson Heard, RENCI, and Richard Marciano, UNC / SALT lab. Research funded by the NARA Information Services / Applied Research division (continuing grant NSF/OCI-0848296).
A system for scalable visualization of geographic archival records
Jefferson R. Heard and Richard J. Marciano
Renaissance Computing Institute (RENCI) and Sustainable Archives & Leveraging Technologies lab (SALT)
University of North Carolina at Chapel Hill
ABSTRACT

We present a system that visualizes large collections of archival geographic records. The system comprises a data grid containing a 60TB test collection gleaned from the US National Archives, and three web applications: an indexer and two web- and mobile-device-based visualizations focused on collection understanding in a geographic context.

KEYWORDS: Visualization, archival records, large collections.

INDEX TERMS: Big data, data-intensive research, preservation.

1 INTRODUCTION

The visualization of large collections of documents has received a significant amount of attention over the last few years. The problem of indexing and visually browsing archival records, while it can be said to include that problem, is more complex. Archival metadata includes file attributes, location, provenance, etc. Archival records are thus complex semi-structured data, and scaling to millions or billions of records is not trivial.
An important special case of archiving is that of large archives of geographic records. These are common in the governmental collections we have studied in the CI-BER (CyberInfrastructure for Billions of Electronic Records) project [1]. Each geographic record may contain large amounts of metadata within it that are not readily indexed by common methods. In this paper, we present a system for indexing and web-based visualization of this kind of archive in a scalable fashion using RENCI's Geoanalytics cyber-infrastructure [2].

2 PROBLEM DESCRIPTION

The CI-BER project is about scaling archiving systems to handle archives of billions of electronic records. We have built a testbed collection, the CI-BER Testbed [1], that currently contains over 60 terabytes of archival records from the US Government's National Archives and Records Administration (NARA). These cover hundreds of different agencies and currently comprise roughly 60 million archival records. Throughout these archives are large chunks of geographic data.
Geographic data falls into roughly two categories, vector and raster, spread across several dozen file formats. Some formats are no longer readable, but many can be opened using open-source tools such as GDAL [3]. In addition to the variety of formats, thousands of different geographic projections are used by different datasets. Our problem is to interactively visualize the metadata from these records and get a clear picture of what physical areas these collections cover, allowing a user to "drill down" to the actual file if desired.
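To make the format problem concrete, the sketch below (our illustration; the paper itself contains no code) probes a candidate file with GDAL's Python bindings, distinguishing readable raster from vector data and reporting the native projection. The file name is hypothetical.

```python
# A minimal probe using the GDAL/OGR Python bindings (illustrative only).
from osgeo import gdal, ogr

def probe(path):
    """Return ('raster'|'vector'|None, projection WKT or None)."""
    ds = gdal.Open(path)              # try raster drivers first
    if ds is not None:
        return 'raster', ds.GetProjection()
    ds = ogr.Open(path)               # fall back to vector drivers
    if ds is not None:
        srs = ds.GetLayer(0).GetSpatialRef()
        return 'vector', srs.ExportToWkt() if srs else None
    return None, None                 # unreadable with available drivers

print(probe('hydrology.shp'))         # hypothetical candidate file
```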
3 APPROACH

The following subsections discuss our approaches to archiving, indexing, and visualization.

3.1 Archiving

Archiving is done through the iRODS system [4]. iRODS provides the ability to write rules that run on files and directories as they enter the system, and it allows extensible metadata to be collocated with the file or directory itself. iRODS additionally forms a Data Grid [5] that can be federated and expanded as requirements grow or policies require.

All of our record groups were copied into a central iRODS repository, built on top of a DataDirect Networks DDN9900 storage rack and managed by a metadata catalog, iCAT. We considered using a federated grid, but determined that, for performance in visualization and indexing, it was best to collocate compute and data resources.
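As an illustration of this ingest path (ours, not the project's actual scripts), the sketch below uploads a record group with the standard iRODS icommands and attaches extensible metadata to the resulting collection; the zone, paths, and attribute name are hypothetical.

```python
# Sketch: register a record group in iRODS and tag it in the iCAT catalog.
import subprocess

def ingest(local_dir, irods_coll, record_group):
    # Recursively upload the record group into the data grid; any
    # server-side rules configured for ingest fire as objects arrive.
    subprocess.check_call(['iput', '-r', local_dir, irods_coll])
    # Attach extensible metadata to the collection (-C selects collections).
    subprocess.check_call(['imeta', 'add', '-C', irods_coll,
                           'record_group', record_group])

ingest('/data/nara/rg77', '/ciberZone/home/ciber/rg77', 'RG-77')
```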
3.2 Indexing

Indexing happens through RENCI's Geoanalytics [2] cyber-infrastructure, chosen because it provides facilities for managing large amounts of geographic data. Its architecture is briefly described in Figure 1. We take advantage of its distributed task queue, Celery [6], and its document-oriented data store, MongoDB [7], to handle our indexing process. The indexing process is started through a web application.

Figure 1. Geoanalytics Architecture.
Our indexer has thus far indexed the largest of the geographic data collections, around 12TB of data. The indexer is incremental in nature and can be run on new collections as they are incorporated into the archive; incremental indexing does not affect the availability of visualizations over the already-indexed data.

The indexing architecture shown in Figure 2 scales to multiple machines and CPUs. Our current indexer uses five four-CPU machines, each with a single 1Gbit network interface to the grid, to index data.

Figure 2. Indexer architecture.

The indexing process is as follows (a sketch of the corresponding tasks appears after the list):
1. A request is made to index a collection stored in iRODS.
2. The indexer identifies a set of nodes in the Geoanalytics cluster to perform the indexing, and has them start a new iRODS session.
3. The indexer asks one node to perform the "crawl" task, which recursively iterates over the collection.
4. The "crawl" task marks potential GIS files and archives containing them (tarballs, zip files), and queues them with Celery to be indexed.
5. All other nodes pull items off the indexing queue and perform the following:
   a. iget the resource.
   b. Optionally unarchive the resource.
   c. Identify GIS files.
   d. Identify a program that opens the file, transform its extent to lat/lon, and index it.
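The sketch below condenses steps 3-5 into two Celery tasks. It is our reconstruction, not the project's code: the task names, broker URL, ils output parsing, and suffix list are illustrative, it handles only the raster case, and it omits unarchiving (step 5b) and GDAL 3 axis-order subtleties.

```python
# Illustrative crawl/index tasks; assumes the icommands and the schema
# sketched in Section 3.2 are available.
import os, subprocess, tempfile
from celery import Celery
from osgeo import gdal, osr
from pymongo import MongoClient

app = Celery('indexer', broker='amqp://localhost')        # assumed broker
geo_index = MongoClient('localhost')['ciber']['geo_index']
GIS_SUFFIXES = ('.shp', '.tif', '.e00', '.dem')           # partial list

@app.task
def crawl(collection):
    """Steps 3-4: walk an iRODS collection and queue GIS candidates."""
    for line in subprocess.check_output(['ils', '-r', collection],
                                        text=True).splitlines():
        name = line.strip()
        if name.lower().endswith(GIS_SUFFIXES):
            index_file.delay(collection + '/' + name)      # enqueue (step 4)

@app.task
def index_file(irods_path):
    """Step 5: iget, open with GDAL, transform extent to lat/lon, index."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(irods_path))
        subprocess.check_call(['iget', irods_path, local])  # step 5a
        ds = gdal.Open(local)                               # steps 5c-5d
        if ds is None:
            return
        gt = ds.GetGeoTransform()                 # native-coordinate extent
        minx, maxy = gt[0], gt[3]
        maxx = minx + gt[1] * ds.RasterXSize
        miny = maxy + gt[5] * ds.RasterYSize
        src = osr.SpatialReference(wkt=ds.GetProjection())
        dst = osr.SpatialReference()
        dst.ImportFromEPSG(4326)                  # WGS84 lat/lon
        tx = osr.CoordinateTransformation(src, dst)
        minx, miny, _ = tx.TransformPoint(minx, miny)
        maxx, maxy, _ = tx.TransformPoint(maxx, maxy)
        geo_index.insert_one({'path': irods_path, 'kind': 'raster',
                              'directory': os.path.dirname(irods_path),
                              'minx': minx, 'miny': miny,
                              'maxx': maxx, 'maxy': maxy})
```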
3.3 Visualization

For archival purposes, understanding the context of a document is critical. Collection understanding [8] is the task of developing tools that help the user comprehend a collection as a whole and contextualize each document's place in that whole. We chose to focus on the collection-understanding task because of the size of the collections we were given and because we wanted to build tools that would be broadly applicable to other collections.

To create tools that can be used by a wide audience, we chose to build web-based visualizations that can be reformatted to appear on mobile devices such as the iPad and iPhone 4. We have created two visualizations that represent "bottom-up" and "top-down" views of a geographic collection.

The "bottom-up" visualization, shown in Figure 3, allows a user to start with a collection, shown as a tree-map [9] similar to the visualization in [10]. The user is presented with a tree-map containing grey, red, yellow, and blue boxes. Each box corresponds to a directory in the collection, which may itself contain a number of subdirectories, and each box is scaled to the number of files it contains. The colors correspond to entries containing vector records only (red), raster records only (blue), both raster and vector (yellow), and no geographic files (grey). Next to the tree-map is a physical map, provided by OpenLayers [11]. Tapping (on a touch-enabled mobile device) or clicking on a box in the tree-map shows the bounding box of all the files in that box and lists all the geographic metadata for its directory or, in the case of a single record, the metadata for that record.

Figure 3. The bottom-up visualization.
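The color rules above amount to a per-directory rollup of the index; a minimal sketch (ours, using the schema assumed earlier):

```python
# Rollup producing a tree-map node's size and color from the index.
from pymongo import MongoClient

geo_index = MongoClient('localhost')['ciber']['geo_index']

def treemap_node(directory):
    docs = list(geo_index.find({'directory': directory}))
    kinds = {d['kind'] for d in docs}
    if kinds == {'vector'}:
        color = 'red'
    elif kinds == {'raster'}:
        color = 'blue'
    elif {'vector', 'raster'} <= kinds:
        color = 'yellow'
    else:
        color = 'grey'                       # no geographic files
    return {'directory': directory, 'size': len(docs), 'color': color}
```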
The "top-down" visualization begins with an OpenLayers physical map and allows the user to navigate, pan, and zoom, then draw a bounding box. Once the box is drawn, the collections it covers are listed on the left. The user can then tap on a collection to see its subdirectories, and can continue to "drill down" until he or she reaches an actual metadata record. If the user taps on a metadata record, a detailed accounting of the metadata for that record replaces the map.

Figure 4. The "top-down" visualization on the iPad.
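Behind that interaction is a rectangle-intersection query over the index; a sketch under the same assumed schema:

```python
# Find indexed records whose extent intersects a user-drawn box.
from pymongo import MongoClient

geo_index = MongoClient('localhost')['ciber']['geo_index']

def records_in_box(minx, miny, maxx, maxy):
    # Two boxes intersect iff each starts before the other ends, on both axes.
    return geo_index.find({'minx': {'$lte': maxx}, 'maxx': {'$gte': minx},
                           'miny': {'$lte': maxy}, 'maxy': {'$gte': miny}})

for rec in records_in_box(-80.0, 33.5, -75.0, 37.0):  # roughly the Carolinas
    print(rec['path'])
```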
4 CONCLUSION

We have presented a system that indexes and visualizes large archival record sets containing geographic data. We have an indexer that can scale to use multiple CPUs on a cluster of machines, and two web-based interactive visualizations that show this index in a geographic context. Our future work will include unifying these visual interfaces and providing statistics on the scalability of the indexer relative to data grid size.

This project is funded by NSF/OCI grant 0848296 as part of a cooperative research agreement between NARA's Applied Research division, the National Science Foundation (NSF), and the University of North Carolina at Chapel Hill. The Project Director is Richard Marciano, with visualization expert Jeff Heard. Project collaborators include Stan Ahalt, Leesa Brieger, Chien-Yi Hou, Arcot Rajasekar, Sarah Lippincott, Brendan O'Connell, and Sheau-Yen Chen.
REFERENCES
[1] CI-BER: CyberInfrastructure for Billions of Electronic Records. http://ci-ber.blogspot.com/
[2] J. Heard. The Geoanalytics System. Technical Note, Renaissance Computing Institute. http://www.renci.org/. 2011.
[3] The Open Source Geospatial Foundation. GDAL/OGR. http://gdal.org. 2011.
[4] Introduction to iRODS. https://www.irods.org/index.php/Introduction_to_iRODS.
[5] R. Moore, C. Baru, R. Marciano, A. Rajasekar, and M. Wan. Data-Intensive Computing. Chapter 5 in I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, 1999.
[6] The Celery Group. Celery. http://celeryproject.org. 2011.
[7] The MongoDB Group. MongoDB. http://www.mongodb.org. 2011.
[8] M. Chang, J.J. Leggett, R. Furuta, A. Kerne, J.P. Williams, S.A. Burns, and R.G. Blas. Collection Understanding [visualization tools in information retrieval]. Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, pages 334-342. IEEE Press, June 2004.
[9] B. Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics, 11(1), pages 91-99. 1992.
[10] W. Jiu, M. Esteva, and S.J. Dott. Visualization for Archival Appraisal of Large Digital Collections. Proceedings of the IS&T Archiving Conference 2010 (The Hague), pages 157-162. 2010.
[11] The Open Source Geospatial Foundation. OpenLayers. http://openlayers.org. 2011.