Visualizing Digital Collections at
               Archive-It
                    Kalpesh Padia

            Director:      Michele C. Weigle
            Committee:     Michael L. Nelson
                           Ravi Mukkamala


7/20/2012            MS Thesis - August 2012   1
Agenda
    Introduction
    Motivation
    Related Work
    Collection Retrieval and Processing
    Visualizations
    Case Studies
    Future Work
    Conclusion
INTRODUCTION AND
   MOTIVATION
Digital Archives




http://www.loc.gov/index.html
http://digitalcollections.library.yale.edu/
Archive-It




            http://archive-it.org/
Archive-It Collection Hierarchy
                Root                    Collection title
                Level 1                 Category 1 ... Category n
                Level 2                 Web page 1 ... Web page n
                Level 3 (leaf nodes)    Archived version 1 ... Archived version n
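
The four-level hierarchy above can be sketched as nested records that serialize directly to JSON. This is a minimal illustration, not the thesis implementation: the class names, field names, and the timestamp format are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Memento:
    # Level 3 (leaf): one archived version of a web page.
    timestamp: str  # hypothetical capture timestamp string


@dataclass
class WebPage:
    # Level 2: a web page and its archived versions.
    uri: str
    mementos: List[Memento] = field(default_factory=list)


@dataclass
class Category:
    # Level 1: a group of web pages.
    name: str
    pages: List[WebPage] = field(default_factory=list)


@dataclass
class Collection:
    # Root: the collection title.
    title: str
    categories: List[Category] = field(default_factory=list)

    def count_mementos(self) -> int:
        """Total number of archived versions across all pages."""
        return sum(len(p.mementos) for c in self.categories for p in c.pages)


collection = Collection(
    title="Example collection",
    categories=[
        Category(
            name="Category 1",
            pages=[
                WebPage(
                    uri="http://example.org/",
                    mementos=[Memento("20100801000000"), Memento("20100901000000")],
                )
            ],
        )
    ],
)

# asdict() walks the nested dataclasses, so the tree maps onto nested JSON.
print(json.dumps(asdict(collection), indent=2))
```

Keeping archived versions as leaves means per-page and per-category counts are simple sums over the tree.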




Exploring Archive-It Collections




            http://archive-it.org/collections/1068
Exploring Archive-It Collections




            http://wayback.archive-it.org/1068/*/http://acda.co/
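
The URL on this slide suggests a simple pattern: collection ID, then `*`, then the original page URI. A small sketch of building such a listing URL follows; the function name is hypothetical, and reading `*` as "all captures" is an assumption based on the URL shape rather than something the slides state.

```python
def memento_list_url(collection_id: int, page_uri: str) -> str:
    """Build the Wayback listing URL for one page within one collection.

    Assumption: the "*" segment takes the place of a capture timestamp.
    """
    return f"http://wayback.archive-it.org/{collection_id}/*/{page_uri}"


print(memento_list_url(1068, "http://acda.co/"))
```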
Exploring Archive-It Collections




            http://archive-it.org/collections/1068
Exploring Archive-It Collections




            http://archive-it.org/collections/2836
Drawbacks
    No visual feedback

    Discovering individual pages is difficult

    Optional metadata and categorization

    Collection structure known only to curator

Contribution
    Interactive visualizations
       Treemap
       Time cloud
       Bubble chart
       Image plot
       Wordle
       Timeline
    Temporal exploration of collections
    Uncover collection structure
RELATED WORK


Microsoft Pivot




            http://www.microsoft.com/silverlight/pivotviewer/
Page History Explorer




            A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th International Conference on World Wide Web, 2008.
3D Wall




            http://www.webarchive.org.uk/ukwa/wall/Blogs
Treemap




            Johnson and Shneiderman, “Tree-Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures,” in Proceedings of the 2nd Conference on Visualization '91
Series Browser




            M. Whitelaw, “Visualising Archival Collections: The Visible Archive
            Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.
A1 Explorer




            M. Whitelaw, “Visualising Archival Collections: The Visible Archive
            Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009.

EASY




            Scharnhorst et al., “Looking at a digital research data archive - Visual interfaces to EASY,” CoRR, 2012, http://arxiv.org/abs/1204.3200
Wordle




            Jonathan Feinberg, http://wordle.net/, Dogear
DATA RETRIEVAL AND
   PROCESSING
11 Collections, 2K+ Web pages, 70K+ Mementos




Data Retrieval & Processing
    Retrieval:
       Screen scrape
       Copy collection hierarchy
       Store page content


    Processing:
       Calculate TF and TF-IDF
       Generate screenshots
       Generate wordles
       Rule-based categorization
       Construct JSON
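
The "Calculate TF and TF-IDF" step can be sketched as follows. The slides do not say which weighting variant was used, so this example assumes the common form tf(t, d) × log(N / df(t)), with tf as the term count divided by document length. The function names and the sample documents are illustrative only.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Count of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def tf_idf(documents):
    """Return one {term: score} dict per document.

    idf(t) = log(N / df(t)), where df(t) is the number of documents
    that contain t at least once.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        tf = term_frequencies(tokens)
        scores.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        )
    return scores


docs = [
    "flood relief in pakistan".split(),
    "flood damage report".split(),
    "relief camps open".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

Terms that occur in fewer documents receive a higher weight, which is what makes TF-IDF a better source of distinguishing words for a wordle than raw TF.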

No categorization





                                      http://www.archive-it.org/collections/2836
Improper Categorization




                                         http://www.archive-it.org/collections/2323
Rule-based categorization



                                                     News Web pages

                                                    Blogs



                                                   Social Media

                                                    Videos


                                          http://www.archive-it.org/collections/2836
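
One way to picture rule-based categorization into the four groups named on this slide is a list of hostname patterns checked in order. The slides do not list the actual rules, so the patterns, the fallback label, and the function name below are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical rules: (category, substrings looked for in the hostname).
# Order matters: the first matching rule wins.
RULES = [
    ("Videos", ("youtube.", "vimeo.")),
    ("Social Media", ("twitter.", "facebook.")),
    ("Blogs", ("blogspot.", "wordpress.")),
    ("News Web pages", ("bbc.", "cnn.", "news")),
]


def categorize(uri: str, default: str = "Uncategorized") -> str:
    """Return the first category whose pattern occurs in the URI's hostname."""
    host = (urlparse(uri).hostname or "").lower()
    for category, needles in RULES:
        if any(needle in host for needle in needles):
            return category
    return default


for uri in (
    "http://www.youtube.com/watch?v=abc",
    "http://example.blogspot.com/",
    "http://news.example.org/story",
    "http://example.org/",
):
    print(uri, "->", categorize(uri))
```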
Special URI and TLD-based categorization




                                                 Pakistani news web
                                                 pages





                                       http://www.archive-it.org/collections/2836
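
The TLD part of this rule can be illustrated with a short sketch that flags pages under `.pk`, the country-code TLD for Pakistan. The slides do not define what counts as a "special URI", so that part is not modeled here; the function names are hypothetical.

```python
from urllib.parse import urlparse


def top_level_domain(uri: str) -> str:
    """Return the last label of the hostname, or "" if there is none."""
    host = (urlparse(uri).hostname or "").lower()
    return host.rsplit(".", 1)[-1] if "." in host else ""


def is_pakistani_site(uri: str) -> bool:
    """True when the hostname ends in the .pk country-code TLD."""
    return top_level_domain(uri) == "pk"


print(is_pakistani_site("http://www.example.com.pk/news"))
print(is_pakistani_site("http://example.org/"))
```

A check like this would typically be combined with a topical rule, so that only news pages under `.pk` land in the "Pakistani news web pages" group.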
VISUALIZATIONS


Treemap




Time Cloud




Bubble Chart, Image Plot & Timeline




CASE STUDIES


1. Collection Building and Growth




2. Re-Categorization
   (Pakistan Flood: no categorization)




2. Re-Categorization
     (Pakistan Flood: after categorization)




3. Collection Synopsis




4. Theme Tracking




Informal User Evaluation
    Alex Thurman, Columbia University Libraries
    Feedback on
       ease of browsing and obtaining information
       user-friendliness of the interface
       preference for a textual or graphical interface
       most effective visualization
       effectiveness of the rule-based categorization in exploring archives
Feedback
    Effective visualizations:
       Treemap – color coding useful for identifying newer additions
       Image plot – screenshots with mouse-over wordles allow for good navigation
       Timeline – useful for visualizing development of groups in collection
    Suggestions
       Broader timescale for treemaps
       Include stop words from other languages

FUTURE WORK AND
   CONCLUSION
Future Work
    N-Gram wordles
    Term expansion
    Krovetz stemmer (dictionary-based stemmer)
    Integration with Archive-It
    Detailed user evaluation
    Implementation for other archives


Conclusion
    Identified metrics for collections
    Visualizations
       Treemap
       Time cloud
       Bubble chart
       Image plot
       Wordle
       Timeline
    Rule-based categorization
BACKUP


Time Span
            Size      Time span
            Small     1 Day - 2 Weeks
            Medium    2 Weeks - 4 Months
            Large     > 4 Months




            http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/

Groups
            Size      Groups
            Small     1
            Medium    2 - 5
            Large     > 5




            http://www.archive-it.org/collections/1068

URI Domains
            Size      URI Domains
            Small     1 - 10
            Medium    11 - 20
            Large     > 20




                                http://www.archive-it.org/collections/2836
Number of Web Pages
            Size      # of Web Pages
            Small     1 - 10
            Medium    11 - 99
            Large     > 99




                                  http://www.archive-it.org/collections/2836
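
The count-based size classes on the Groups, URI Domains, and Number of Web Pages slides can be expressed as a small lookup. This sketch uses the thresholds shown on those slides; the metric keys and function name are hypothetical. Time span is left out because the slide does not say which class a boundary value (exactly 2 weeks or exactly 4 months) belongs to.

```python
# Inclusive upper bounds of the Small and Medium classes, per metric.
THRESHOLDS = {
    "groups": (1, 5),         # Small: 1, Medium: 2-5, Large: > 5
    "uri_domains": (10, 20),  # Small: 1-10, Medium: 11-20, Large: > 20
    "web_pages": (10, 99),    # Small: 1-10, Medium: 11-99, Large: > 99
}


def size_class(metric: str, value: int) -> str:
    """Map a collection metric to Small, Medium, or Large.

    Assumption: values below the first listed range (for example 0)
    are treated as Small; negative values are rejected.
    """
    if value < 0:
        raise ValueError("metric values must be non-negative")
    small_max, medium_max = THRESHOLDS[metric]
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


print(size_class("groups", 4))
print(size_class("web_pages", 250))
```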

Jigsaw




                                      Stasko et al., IEEE VAST 2007
Themeriver




                                       Wei et al., SIGKDD, 2010
Time Cloud




Bubble Chart





                             http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
Image Plot with Wordle





                                 http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
Timeline





                           http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068

MS Thesis Defense, Aug 2012 - Visualizing Digital Collections at Archive-It

  • 1. Visualizing Digital Collections at Archive-It Kalpesh Padia Director: Michele C. Weigle Committee: Michael L. Nelson Ravi Mukkamala 7/20/2012 MS Thesis - August 2012 1
  • 2. Agenda Introduction Motivation Related Work Collection Retrieval and Processing Visualizations Case Studies Future Work Conclusion 7/20/2012 MS Thesis - August 2012 2
  • 3. INTRODUCTION AND MOTIVATION 7/20/2012 MS Thesis - August 2012 3
  • 5. Archive-It http://archive-it.org/ 7/20/2012 MS Thesis - August 2012 5
  • 6. Archive-It Collection Hierarchy Collection Root Title Level 1 Category 1 Category n Level 2 Web page 1 Web page n Level 3 (Leaf Archived Archived Nodes) Version 1 Version n 7/20/2012 MS Thesis - August 2012 6
  • 7. Exploring Archive-It Collections http://archive-it.org/collections/1068 7/20/2012 MS Thesis - August 2012 7
  • 8. Exploring Archive-It Collections http://archive-it.org/collections/1068 7/20/2012 MS Thesis - August 2012 8
  • 9. Exploring Archive-It Collections http://archive-it.org/collections/1068 7/20/2012 MS Thesis - August 2012 9
  • 10. Exploring Archive-It Collections http://wayback.archive-it.org/1068/*/http://acda.co/ 7/20/2012 MS Thesis - August 2012 10
  • 11. Exploring Archive-It Collections http://archive-it.org/collections/1068 7/20/2012 MS Thesis - August 2012 11
  • 12. Exploring Archive-It Collections http://archive-it.org/collections/2836 7/20/2012 MS Thesis - August 2012 12
  • 13. Drawbacks No visual feedback Discovering individual pages is difficult Optional metadata and categorization Collection structure known only to curator 7/20/2012 MS Thesis - August 2012 13
  • 14. Contribution Interactive visualizations  Treemap  Time cloud  Bubble chart  Image plot  Wordle  Timeline Temporal exploration of collections Uncover collection structure 7/20/2012 MS Thesis - August 2012 14
  • 15. RELATED WORK 7/20/2012 MS Thesis - August 2012 15
  • 16. Microsoft Pivot http://www.microsoft.com/silverlight/pivotviewer/ 7/20/2012 MS Thesis - August 2012 16
  • 17. Page History Explorer A. Jatowt, Y. Kawai, and K. Tanaka, “Visualizing Historical Content of Web Pages,” in Proceedings of the 17th international conference on World Wide Web,2008. 7/20/2012 MS Thesis - August 2012 17
  • 18. 3D Wall http://www.webarchive.org.uk/ukwa/wall/Blogs 7/20/2012 MS Thesis - August 2012 18
  • 19. Treemap Johnson and Shneiderman, “Space-Filling Approach to the Visualization of Hierarchical Information Structures” in proceedings of the 2nd conference on Visualization '91 7/20/2012 MS Thesis - August 2012 19
  • 20. Series Browser M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009. 7/20/2012 MS Thesis - August 2012 20
  • 21. A1 Explorer M. Whitelaw, “Visualising Archival Collections: The Visible Archive Project,” in Archives and Manuscripts, vol. 37, Issue 2, 2009. 7/20/2012 MS Thesis - August 2012 21
  • 22. EASY Scharnhorst et.al. “Looking at a digital research data archive Visual interfaces to EASY,” in CORR, 2012, http://arxiv.org/abs/1204.3200 7/20/2012 MS Thesis - August 2012 22
  • 23. Wordle . Jonathan Feinberg, http://wordle.net/ , Dogear 7/20/2012 MS Thesis - August 2012 23
  • 24. DATA RETRIEVAL AND PROCESSING
  • 25. 11 Collections, 2K+ Web pages, 70K+ Mementos
  • 26. Data Retrieval & Processing. Retrieval: screen scrape; copy collection hierarchy; store page content. Processing: calculate TF and TF-IDF; generate screenshots; generate wordles; rule-based categorization; construct JSON.
  • 27. No categorization http://www.archive-it.org/collections/2836
  • 28. Improper categorization http://www.archive-it.org/collections/2323
  • 29. Rule-based categorization: News Web pages Blogs Social Media Videos http://www.archive-it.org/collections/2836
  • 30. Special URI and TLD-based categorization: Pakistani news web pages http://www.archive-it.org/collections/2836
  • 31. VISUALIZATIONS
  • 32. Treemap
  • 33. Time Cloud
  • 34. Bubble Chart, Image Plot & Timeline
  • 35. CASE STUDIES
  • 36. 1. Collection Building and Growth
  • 37. 2. Re-Categorization (Pakistan Flood: no categorization)
  • 38. 2. Re-Categorization (Pakistan Flood: after categorization)
  • 39. 3. Collection Synopsis
  • 40. 3. Collection Synopsis
  • 41. 3. Collection Synopsis
  • 42. 3. Collection Synopsis
  • 43. 3. Collection Synopsis
  • 44. 4. Theme Tracking
  • 45. 4. Theme Tracking
  • 46. 4. Theme Tracking
  • 47. 4. Theme Tracking
  • 48. Informal User Evaluation: Alex Thurman, Columbia University Libraries. Feedback on: ease of browsing and obtaining information; user-friendliness of the interface; whether they prefer a textual or graphical interface; most effective visualization; effectiveness of the rule-based categorization in exploring archives.
  • 49. Feedback. Effective visualizations: Treemap – color coding useful for identifying newer additions; Image plot – screenshots with mouse-over wordles allow for good navigation; Timeline – useful for visualizing development of groups in a collection. Suggestions: broader timescale for treemaps; include stop words from other languages.
  • 50. FUTURE WORK AND CONCLUSION
  • 51. Future Work: n-gram wordles; term expansion; Krovetz stemmer (dictionary-based stemmer); integration with Archive-It; detailed user evaluation; implementation for other archives.
  • 52. Conclusion: identified metrics for collections.
  • 53. Conclusion: identified metrics for collections; visualizations (treemap).
  • 54. Conclusion: identified metrics for collections; visualizations (treemap, time cloud).
  • 55. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart).
  • 56. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot).
  • 57. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle).
  • 58. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline).
  • 59. Conclusion: identified metrics for collections; visualizations (treemap, time cloud, bubble chart, image plot, wordle, timeline); rule-based categorization.
  • 60. BACKUP
  • 61. Time Span: Small, 1 day to 2 weeks; Medium, 2 weeks to 4 months; Large, more than 4 months. http://wayback.archive-it.org/1068/*/http://amigosdemujeres.org/
  • 62. Groups: Small, 1; Medium, 2-5; Large, more than 5. http://www.archive-it.org/collections/1068
  • 63. URI Domains: Small, 1-10; Medium, 11-20; Large, more than 20. http://www.archive-it.org/collections/2836
  • 64. Number of Web Pages: Small, 1-10; Medium, 11-99; Large, more than 99. http://www.archive-it.org/collections/2836
  • 65. Jigsaw: Stasko et al., IEEE VAST 2007
  • 66. Themeriver: Wei et al., in SIGKDD, 2010
  • 67. Time Cloud
  • 68. Bubble Chart http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
  • 69. Image Plot with Wordle http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
  • 70. Timeline http://tmix.cs.odu.edu:8080/project/test.php?coll_id=1068
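
The size bands on the backup slides (61 to 64) read as simple threshold rules, and can be sketched roughly as follows. This is only an illustration, not the thesis code: the function names, the dictionary keys, and the choice of 120 days for "4 months" are assumptions, and the slides do not say which band a boundary value such as exactly 2 weeks falls into, so upper bounds are treated as inclusive here.

```python
from datetime import timedelta

# Inclusive upper bounds for the "Small" and "Medium" bands (slides 61-64).
# Assumption: "2 weeks" = 14 days and "4 months" ~ 120 days.
BANDS = {
    "time_span_days": (14, 120),
    "groups": (1, 5),
    "uri_domains": (10, 20),
    "web_pages": (10, 99),
}


def size_band(metric: str, value: float) -> str:
    """Map a metric value to Small / Medium / Large using the bands above."""
    small_max, medium_max = BANDS[metric]
    if value <= small_max:
        return "Small"
    if value <= medium_max:
        return "Medium"
    return "Large"


def describe_collection(time_span: timedelta, groups: int,
                        uri_domains: int, web_pages: int) -> dict:
    """Summarize one collection (hypothetical helper) across all four metrics."""
    days = time_span.total_seconds() / 86400
    return {
        "time_span": size_band("time_span_days", days),
        "groups": size_band("groups", groups),
        "uri_domains": size_band("uri_domains", uri_domains),
        "web_pages": size_band("web_pages", web_pages),
    }
```

For example, `describe_collection(timedelta(days=30), 3, 25, 8)` labels a collection with a one-month time span, 3 groups, 25 URI domains, and 8 web pages on each of the four metrics.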

Editor's notes

  1. Collections used for development; they present a good mix of the various metrics.
  2. Many Archive-It collections are not curated well: some lack categorization and others are improperly categorized. Suggest categorization for organizing collections and for properly categorizing articles that already have a categorization.
  3. If the domain is a news web site, such as cnn, abc, or bbc, put the page into the news web site category.
  4. Domains and subdomains
  5. Number of articles, or sites
  6. Extend stacked bar charts to represent independent values as images. All values in a sample are given equal weight. The height of each stack represents the size of each sample. Categories are samples; articles are values. Hover over each article to reveal its number of mementos, its time span, and a wordle summarizing the article's content.
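
The rule-based categorization described in the notes and on slides 29 and 30 can be sketched as a lookup on the host name of each URI. This is a minimal sketch under stated assumptions, not the rules used in the thesis: only cnn, abc, and bbc are named in the talk; the other name lists, the "news" substring rule, the use of the `.pk` TLD to detect Pakistani news pages, and the `Uncategorized` fallback are all hypothetical.

```python
from urllib.parse import urlparse

# cnn, abc and bbc come from the talk; every other entry is an illustrative assumption.
NEWS_NAMES = {"cnn", "abc", "bbc"}
BLOG_NAMES = {"blogspot", "wordpress"}
SOCIAL_NAMES = {"twitter", "facebook"}
VIDEO_NAMES = {"youtube", "vimeo"}


def categorize(uri: str) -> str:
    """Assign a category from the host name of an absolute URI."""
    host = (urlparse(uri).hostname or "").lower()
    labels = host.split(".")
    names = set(labels)

    # Assumed news rule: a known news name, or "news" inside any host label.
    is_news = bool(names & NEWS_NAMES) or any("news" in label for label in labels)
    if is_news:
        # Assumed TLD rule: a .pk host marks a Pakistani news page.
        if labels[-1] == "pk":
            return "Pakistani news web pages"
        return "News web pages"
    if names & BLOG_NAMES:
        return "Blogs"
    if names & SOCIAL_NAMES:
        return "Social Media"
    if names & VIDEO_NAMES:
        return "Videos"
    return "Uncategorized"
```

Checking the news rule first means a news site hosted on a blogging platform is filed under news; a different rule order would be an equally reasonable choice.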