SlideShare a Scribd company logo
1 of 3
Download to read offline
A system for scalable visualization of geographic archival records

                                               Jefferson R. Heard and Richard J. Marciano
           Renaissance Computing Institute (RENCI) and Sustainable Archives & Leveraging Technologies lab (SALT)
                                                 University of North Carolina at Chapel Hill


ABSTRACT                                                                3                           APPROACH
We present a system that visualizes large collections of archival       The next section will discuss our approaches to archiving,
geographic records. This system is comprised of a data grid             indexing, and visualization.
containing a 60TB test collection gleaned from the US National
Archives, and three web-applications: an indexer and two web and        3.1      Archiving
mobile-device based visualizations focusing on collection               Archiving is done through the IRODS system [4]. It provides the
understanding in a geographic context.                                  ability to write rules that are run on files and directories as they
                                                                        are entered into the system, and it allows for extensible metadata
KEYWORDS: Visualization, archival records, large collections.           collocated with a file or directory itself. IRODS additionally
                                                                        forms a Data Grid [5] that can be federated and expanded as
INDEX TERMS: Big data, data-intensive research, preservation            requirements grow or policies require.
                                                                           All of our record groups were copied into a central iRODS
1     INTRODUCTION                                                      repository, built on top of a DataDirect Networks DDN9900
The visualization of large collections of documents has had a           storage rack and managed by a metadata catalog, iCAT. We
significant amount of attention over the last few years. The            considered using a federated grid, but it was determined that for
problem of indexing and visually browsing archival records, while       performance in visualization and indexing, it was best to collocate
it can be said to include the above problem, is more complex.           compute and data resources.
Archival metadata includes file attributes, location, provenance,
etc. Thus archival records are complex semi-structured data, and        3.2                          Indexing
scaling to millions or billions of records is not trivial.
   An important special case of archiving is that of large archives                                                  !9$"&!#("<*&                           89+(#"&@A*"<)(9<)&

of geographic records. These are common in the governmental                                            71,"$%&       4;"<=/%"$)&      71,"$%:2&        71,"$%89+(#"&       >?9<"6/;&
                                                                                                                                                                                         !#("<*&5(-"&
                                                                            N&
collections we have studied in the CI-BER project,
CyberInfrastructure for Billions of Electronic Records [1]. Each
                                                                                                                                                             G7GHIJ"+&C#("<*&$"K,")*)&
geographic record may contain large amounts of metadata within                                                                                                                           J"+/;;&5"$L"$&
                                                                            O&                       G;;I9<#(<"&/</#%*(C)&)LC)&     39#")IGCC"))&    JM5&        J85&     J!5&   '5.0&
that is not readily indexed by common methods. In this paper, we
present a system for indexing and web-based visualization of this                                                                                            89-"#&K,"$(")&
                                                                                 F/*C?&G</#%*(C)&




kind of archive in a scalable fashion using RENCI’s Geoanalytics
                                                                                                        .2D&        5*/*(C&'/*/&    '/*/!,+"&       >%$/B(-&         5"<)9$!9##"C*(9<&
cyber-infrastructure [2].                                                                                                                                                                '()*$(+,*"-&
                                                                         P&                                                                                                              '/*/&89-"#)&
                                                                                                               >9)*625&                                89<E9'F&
2     PROBLEM DESCRIPTION
The CI-BER project is about scaling archiving systems to handle                                                G</#%*(C)&;$9C"))")&                          G$C?(L(<E&
archives of billions of electronic records. We have built a testbed                                                            !"#"$%&'()*$(+,*"-&./)0&1,","&&
                                                                                                                                                                                             @A*"$</#&
                                                                                                                                                                                              '/*/&
collection called the CI-BER Testbed [1] that currently contains
                                                                            Q&
over 60 Terabytes of archival records from the US Government’s
                                                                                                               !9B;,*"&3")9,$C")&                           234'5&'/*/&6$(-&
National Archives and Records Administration (NARA). These
cover hundreds of different agencies and currently comprise
roughly 60M archival records. Throughout these archives are
                                                                                                               Figure 1. Geoanalytics Architecture
large chunks of geographic data.
   Geographic data falls into roughly two categories: vector and
raster. For these there are several dozen file formats. Some are        Indexing happens through RENCI’s Geoanalytics[2] cyber-
no longer readable, but many can be opened using open source            infrastructure, chosen because it provides facilities for managing
tools like GDAL[3]. In addition to different formats, there are         large amounts of geographic data. Its architecture is briefly
thousands of geographic projections be used by different datasets.      described in Figure 1. We take advantage of its distributed task
   Our problem is to be able to interactively visualize the metadata    queue, Celery [6], and its document-oriented data store,
from these records and get a clear picture of what physical areas       MongoDB [1] to handle our indexing process. The indexing
these collections cover, allowing a user to “drill-down” to the         process is started through a web-application.
actual file if desired.                                                    Our indexer has thus far indexed the largest of the geographic
                                                                        data collections, around 12TB of data. The indexer is incremental
                                                                        in nature and can be run on new collections as they are
                                                                        incorporated into the archive. Incremental indexing does not
                                                                        effect on the availability of visualizations on the already indexed
                                                                        data.
The indexing architecture in Figure 2 scales to multiple                                                       touch-enabled mobile device) or clicking on a box in the tree-map
machines and CPUs. Our current indexer uses five four-CPU                                                         shows the bounding box of all the files in that box and shows a
machines, each with a single 1GBit network interface to the grid                                                  listing of all the geographic metadata for its directory, or in the
to index data. The indexing process is thus:                                                                      case of a single record, the metadata for that record.

                                                                                       789%48:;8<2=%%
                                                          !%                       >5984?<>1@A>BCC8>DB@%
                                                          )%                           E5+8<B;4>8F&%

                       0>1@%                              *%
                                                          )%
                                                          +%                                     -@68H%
                                                          ,%

                      G5C8<%2B%5@68H%
                                                         G5C284%1@6%5@68H%3-0%!1@656128<%


              !"#%             !"#%               !"#%              !"#%              !"#%                 !"#%
    !"#$%              &%               !"#$%                  &%          !"#$%                 &%
               (%               '%                 (%                '%                (%                   '%




     5C<%    5I82%


                                                                                                                         Figure 3. The bottom-up visualization.
                                                -+./0%/121%3456%
                                                                                                                  The “top down” visualization begins with an OpenLayers physical
                                                                                                                  map, and allows the user to navigate, pan, and zoom, then draw a
                     Figure 2. Indexer architecture                                                               bounding box. Once drawn, the bounding box lists the collections
                                                                                                                  in a list on the left. The user can then tap on a collection and see
            1.       Request to index a collection stored in IRODS.
                                                                                                                  the subdirectories in that collection, and can continue to “drill
            2.       The indexer identifies a set of nodes in the
                                                                                                                  down” until he or she hits an actual metadata record. If the user
                     Geoanalytics cluster to perform the indexing, and has
                                                                                                                  taps on a metadata record, the detailed accounting of the metadata
                     them start a new IRODS session.
                                                                                                                  for that record replaces the map.
            3.       The indexer asks one node to perform the “crawl”
                     task, which recursively iterates the collection.
            4.       The “crawl” task marks potential GIS files and
                     archives containing them (tarballs, zipfiles), and
                     queues them with Celery to be indexed.
            5.       All other nodes pull items of the indexing queue and
                     perform the following:
                           a. iget the resource
                           b. Optionally unarchive the resource
                           c. Identify GIS files.
                           d. Identify a program that opens the file,
                               transform it to lat/lon, and index.

3.3      Visualization
For archival purposes, understanding the context of a document is
critical. Collection understanding [7] is the task of developing
tools that help the user comprehend the collection as a whole and
contextualize documents’ place in that whole. We chose to focus
on the collection understanding task because of the size of the                                                            Figure 4. The “top-down” visuaiization on the iPad
collections we were given and because we wanted to build tools
that would be broadly applicable to other collections.                                                            4     CONCLUSION
   To create tools that can be used by a wide audience, we chose                                                     We have presented a system that indexes and visualizes large
to create web-based visualizations that can be reformatted to                                                     archival record sets containing geographic data. We have an
appear on mobile devices, such as the iPad and iPhone 4. We have                                                  indexer that can scale to use multiple CPUs on a cluster of
created two visualizations which represent “bottom up” and “top
                                                                                                                  machines and two web-based interactive visualizations that show
down” views of a geographic collection.                                                                           this index in a geographic context. Our future work will include
   The “bottom up” visualization shown in Figure 3 allows a user                                                  unifying these visual interfaces and providing statistics on the
to start with a collection, shown as a tree-map similar to the                                                    scalability of the indexer relative to data grid size. This project is
visualization in [9]. The user is presented with a tree-map                                                       funded by NSF/OCI grant 0848296 as part a cooperative research
containing grey, red, yellow, and blue boxes.              Each box                                               agreement between the NARA’s Applied Research division, the
corresponds to a directory in the collection, which may contain a                                                 National Science Foundation (NSF), and the University of North
number of subdirectories. Each box is scaled to the number of                                                     Carolina at Chapel Hill. Project Director is Richard Marciano
files it contains. The colors correspond to entries containing                                                    with visualization expert Jeff Heard. Project collaborators include
vector records only (red), raster only (blue), both raster and vector                                             Stan Ahalt, Leesa Brieger, Chien-Yi Hou, Arcot Rajasekar, Sarah
(yellow), and no geographic files (gray). Next to the tree-map is a                                               Lippincott, Brendan O’Connell, and Sheau-Yen Chen.
physical map, provided by OpenLayers [10]. Tapping (on a
REFERENCES
[1]  CI-BER: CyberInfrastructure for Billions of Electronic Records,
     http://ci-ber.blogspot.com/
[2] J. Heard. The Geoanalytics system. http://www.renci.org/. Technical
     Note, Renaissance Computing Institute. 2011.
[3] The Open Source Geospatial Foundation. GDAL/OGR.
     http://gdal.org. 2011.
[4] Introduction          to      iRODS.           https://www.irods.org/
     index.php/Introduction_to_iRODS.
[5] “The Grid: Blueprint for a New Computing”: A book edited by I.
     Foster, C. Kesselman, Pub. Morgan Kaufmann, San Francisco, 1999.
     Chapter 5, “Data Intensive Computing”, R. Moore, C. Baru, R.
     Marciano, A. Rajasekar, M. Wan.
[6] The Celery Group. Celery. http://celeryproject.org. 2011
[7] The MongoDB Group. MongoDB. http://www.mongodb.org.
     2011Chang, M., Leggett, J.J., Furuta, R., Kerne, A., Williams, J.P.,
     Burns, S.A., and Blas, R.G. Collection Understanding [visualization
     tools in information retrieval]. Proceedings of the 2004 Joint
     ACM/IEEE Conference on Digital Libraries, pages 334-342. IEEE
     Press, June 2004.
[8] B Shneiderman. Tree visualization with tree maps: 2-d space-filling
     approach. ACM Transactions on Graphics (TOG), Volume 11,
     Number 1, pages 91-99. 1992.
[9] Jiu, W., Esteva, M., and Dott S.J. Visualization for Archival
     Appraisal of Large Digital Collections. Proceedings of the IS&T
     Archiving Conference 2010 (The Hague), pages 157-162. 2010.
[10] The Open Source Geospatial Foundation. OpenLayers.
     http://openlayers.org. 2011.

More Related Content

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

A system for scalable visualization of geographic archival records

  • 1. A system for scalable visualization of geographic archival records Jefferson R. Heard and Richard J. Marciano Renaissance Computing Institute (RENCI) and Sustainable Archives & Leveraging Technologies lab (SALT) University of North Carolina at Chapel Hill ABSTRACT 3 APPROACH We present a system that visualizes large collections of archival The next section will discuss our approaches to archiving, geographic records. This system is comprised of a data grid indexing, and visualization. containing a 60TB test collection gleaned from the US National Archives, and three web-applications: an indexer and two web and 3.1 Archiving mobile-device based visualizations focusing on collection Archiving is done through the IRODS system [4]. It provides the understanding in a geographic context. ability to write rules that are run on files and directories as they are entered into the system, and it allows for extensible metadata KEYWORDS: Visualization, archival records, large collections. collocated with a file or directory itself. IRODS additionally forms a Data Grid [5] that can be federated and expanded as INDEX TERMS: Big data, data-intensive research, preservation requirements grow or policies require. All of our record groups were copied into a central iRODS 1 INTRODUCTION repository, built on top of a DataDirect Networks DDN9900 The visualization of large collections of documents has had a storage rack and managed by a metadata catalog, iCAT. We significant amount of attention over the last few years. The considered using a federated grid, but it was determined that for problem of indexing and visually browsing archival records, while performance in visualization and indexing, it was best to collocate it can be said to include the above problem, is more complex. compute and data resources. Archival metadata includes file attributes, location, provenance, etc. Thus archival records are complex semi-structured data, and 3.2 Indexing scaling to millions or billions of records is not trivial. An important special case of archiving is that of large archives !9$"&!#("<*& 89+(#"&@A*"<)(9<)& of geographic records. These are common in the governmental 71,"$%& 4;"<=/%"$)& 71,"$%:2& 71,"$%89+(#"& >?9<"6/;& !#("<*&5(-"& N& collections we have studied in the CI-BER project, CyberInfrastructure for Billions of Electronic Records [1]. Each G7GHIJ"+&C#("<*&$"K,")*)& geographic record may contain large amounts of metadata within J"+/;;&5"$L"$& O& G;;I9<#(<"&/</#%*(C)&)LC)& 39#")IGCC"))& JM5& J85& J!5& '5.0& that is not readily indexed by common methods. In this paper, we present a system for indexing and web-based visualization of this 89-"#&K,"$(")& F/*C?&G</#%*(C)& kind of archive in a scalable fashion using RENCI’s Geoanalytics .2D& 5*/*(C&'/*/& '/*/!,+"& >%$/B(-& 5"<)9$!9##"C*(9<& cyber-infrastructure [2]. '()*$(+,*"-& P& '/*/&89-"#)& >9)*625& 89<E9'F& 2 PROBLEM DESCRIPTION The CI-BER project is about scaling archiving systems to handle G</#%*(C)&;$9C"))")& G$C?(L(<E& archives of billions of electronic records. We have built a testbed !"#"$%&'()*$(+,*"-&./)0&1,","&& @A*"$</#& '/*/& collection called the CI-BER Testbed [1] that currently contains Q& over 60 Terabytes of archival records from the US Government’s !9B;,*"&3")9,$C")& 234'5&'/*/&6$(-& National Archives and Records Administration (NARA). These cover hundreds of different agencies and currently comprise roughly 60M archival records. Throughout these archives are Figure 1. Geoanalytics Architecture large chunks of geographic data. Geographic data falls into roughly two categories: vector and raster. For these there are several dozen file formats. Some are Indexing happens through RENCI’s Geoanalytics[2] cyber- no longer readable, but many can be opened using open source infrastructure, chosen because it provides facilities for managing tools like GDAL[3]. In addition to different formats, there are large amounts of geographic data. Its architecture is briefly thousands of geographic projections be used by different datasets. described in Figure 1. We take advantage of its distributed task Our problem is to be able to interactively visualize the metadata queue, Celery [6], and its document-oriented data store, from these records and get a clear picture of what physical areas MongoDB [1] to handle our indexing process. The indexing these collections cover, allowing a user to “drill-down” to the process is started through a web-application. actual file if desired. Our indexer has thus far indexed the largest of the geographic data collections, around 12TB of data. The indexer is incremental in nature and can be run on new collections as they are incorporated into the archive. Incremental indexing does not effect on the availability of visualizations on the already indexed data.
  • 2. The indexing architecture in Figure 2 scales to multiple touch-enabled mobile device) or clicking on a box in the tree-map machines and CPUs. Our current indexer uses five four-CPU shows the bounding box of all the files in that box and shows a machines, each with a single 1GBit network interface to the grid listing of all the geographic metadata for its directory, or in the to index data. The indexing process is thus: case of a single record, the metadata for that record. 789%48:;8<2=%% !% >5984?<>1@A>BCC8>DB@% )% E5+8<B;4>8F&% 0>1@% *% )% +% -@68H% ,% G5C8<%2B%5@68H% G5C284%1@6%5@68H%3-0%!1@656128<% !"#% !"#% !"#% !"#% !"#% !"#% !"#$% &% !"#$% &% !"#$% &% (% '% (% '% (% '% 5C<% 5I82% Figure 3. The bottom-up visualization. -+./0%/121%3456% The “top down” visualization begins with an OpenLayers physical map, and allows the user to navigate, pan, and zoom, then draw a Figure 2. Indexer architecture bounding box. Once drawn, the bounding box lists the collections in a list on the left. The user can then tap on a collection and see 1. Request to index a collection stored in IRODS. the subdirectories in that collection, and can continue to “drill 2. The indexer identifies a set of nodes in the down” until he or she hits an actual metadata record. If the user Geoanalytics cluster to perform the indexing, and has taps on a metadata record, the detailed accounting of the metadata them start a new IRODS session. for that record replaces the map. 3. The indexer asks one node to perform the “crawl” task, which recursively iterates the collection. 4. The “crawl” task marks potential GIS files and archives containing them (tarballs, zipfiles), and queues them with Celery to be indexed. 5. All other nodes pull items of the indexing queue and perform the following: a. iget the resource b. Optionally unarchive the resource c. Identify GIS files. d. Identify a program that opens the file, transform it to lat/lon, and index. 3.3 Visualization For archival purposes, understanding the context of a document is critical. Collection understanding [7] is the task of developing tools that help the user comprehend the collection as a whole and contextualize documents’ place in that whole. We chose to focus on the collection understanding task because of the size of the Figure 4. The “top-down” visuaiization on the iPad collections we were given and because we wanted to build tools that would be broadly applicable to other collections. 4 CONCLUSION To create tools that can be used by a wide audience, we chose We have presented a system that indexes and visualizes large to create web-based visualizations that can be reformatted to archival record sets containing geographic data. We have an appear on mobile devices, such as the iPad and iPhone 4. We have indexer that can scale to use multiple CPUs on a cluster of created two visualizations which represent “bottom up” and “top machines and two web-based interactive visualizations that show down” views of a geographic collection. this index in a geographic context. Our future work will include The “bottom up” visualization shown in Figure 3 allows a user unifying these visual interfaces and providing statistics on the to start with a collection, shown as a tree-map similar to the scalability of the indexer relative to data grid size. This project is visualization in [9]. The user is presented with a tree-map funded by NSF/OCI grant 0848296 as part a cooperative research containing grey, red, yellow, and blue boxes. Each box agreement between the NARA’s Applied Research division, the corresponds to a directory in the collection, which may contain a National Science Foundation (NSF), and the University of North number of subdirectories. Each box is scaled to the number of Carolina at Chapel Hill. Project Director is Richard Marciano files it contains. The colors correspond to entries containing with visualization expert Jeff Heard. Project collaborators include vector records only (red), raster only (blue), both raster and vector Stan Ahalt, Leesa Brieger, Chien-Yi Hou, Arcot Rajasekar, Sarah (yellow), and no geographic files (gray). Next to the tree-map is a Lippincott, Brendan O’Connell, and Sheau-Yen Chen. physical map, provided by OpenLayers [10]. Tapping (on a
  • 3. REFERENCES [1] CI-BER: CyberInfrastructure for Billions of Electronic Records, http://ci-ber.blogspot.com/ [2] J. Heard. The Geoanalytics system. http://www.renci.org/. Technical Note, Renaissance Computing Institute. 2011. [3] The Open Source Geospatial Foundation. GDAL/OGR. http://gdal.org. 2011. [4] Introduction to iRODS. https://www.irods.org/ index.php/Introduction_to_iRODS. [5] “The Grid: Blueprint for a New Computing”: A book edited by I. Foster, C. Kesselman, Pub. Morgan Kaufmann, San Francisco, 1999. Chapter 5, “Data Intensive Computing”, R. Moore, C. Baru, R. Marciano, A. Rajasekar, M. Wan. [6] The Celery Group. Celery. http://celeryproject.org. 2011 [7] The MongoDB Group. MongoDB. http://www.mongodb.org. 2011Chang, M., Leggett, J.J., Furuta, R., Kerne, A., Williams, J.P., Burns, S.A., and Blas, R.G. Collection Understanding [visualization tools in information retrieval]. Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, pages 334-342. IEEE Press, June 2004. [8] B Shneiderman. Tree visualization with tree maps: 2-d space-filling approach. ACM Transactions on Graphics (TOG), Volume 11, Number 1, pages 91-99. 1992. [9] Jiu, W., Esteva, M., and Dott S.J. Visualization for Archival Appraisal of Large Digital Collections. Proceedings of the IS&T Archiving Conference 2010 (The Hague), pages 157-162. 2010. [10] The Open Source Geospatial Foundation. OpenLayers. http://openlayers.org. 2011.