
Tamu big data-conf-1b

Big Data in the Humanities via eMOP and OCRing two proprietary document collections

Published in: Education

  1. The Early Modern OCR Project: Big Data in the Humanities. Matthew Christy, Laura Mandell, Elizabeth Grumbach
  2. eMOP Info
       • eMOP Website: emop.tamu.edu/
       • Texas A&M Big Data Workshop: emop.tamu.edu/TAMU-BigData
       • eMOP Workflows: emop.tamu.edu/workflows
       • Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
     More eMOP
       • Facebook: The Early Modern OCR Project
       • Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
  3. The Numbers
     Page Images
       • Early English Books Online (ProQuest), EEBO: ~125,000 documents, ~13 million page images (1475-1700)
       • Eighteenth Century Collections Online (Gale Cengage), ECCO: ~182,000 documents, ~32 million page images (1700-1800)
       • Total: >300,000 documents and 45 million page images
     Ground Truth
       • Text Creation Partnership, TCP: ~46,000 double-keyed, hand-transcribed documents (44,000 EEBO, 2,200 ECCO)
  4. http://emop.tamu.edu
  5. Our Partners
       • PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK
       • SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign
       • PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University
       • The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University
       • The Brazos High Performance Computing Cluster (HPCC)
  6. The Problems: Early Modern Printing
       • Individual, hand-made typefaces
       • Worn and broken type
       • Poor-quality equipment and paper
       • Inconsistent line bases
       • Unusual page layouts and decorative page elements
       • Special characters and ligatures
       • Spelling variations
       • Mixed typefaces and languages
       • Over- and under-inking
       • Old, low-quality, small TIFF files
       • Noise, skew, warp, bleed-through
  7. Page Images. DH2014: Diagnosing Page Image Problems with Post-OCR Triage for eMOP
  8. Workflows: Controller
       • Powered by the eMOP DB
       • Collection processing is managed via the online Dashboard: emop-dashboard.tamu.edu
       • Run by emop-controller.py
  9. Post-Processing Triage
  10. Brazos HPCC
       • 128 processors as a stakeholder
       • Access to background queues
       • Estimated at over 2 months of constant processing
       • We will have to reprocess some files that fail due to timeouts or that require pre-processing
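
As a rough cross-check of the figures on the "The Numbers" and "Brazos HPCC" slides, the sketch below adds up the collection sizes and derives an approximate per-page time budget. The counts and the 128 processors come from the slides; the 60-day duration is only an assumed floor for "over 2 months", and the model (all processors busy the whole time, ignoring reprocessing and background queues) is an illustrative simplification, not the project's own estimate.

```python
# Approximate figures as quoted on the slides.
EEBO_DOCS, EEBO_PAGES = 125_000, 13_000_000
ECCO_DOCS, ECCO_PAGES = 182_000, 32_000_000
PROCESSORS = 128

# Assumptions for illustration only (not stated on the slides).
ASSUMED_DAYS = 60        # "over 2 months" treated as a 60-day floor
SECONDS_PER_DAY = 86_400

total_docs = EEBO_DOCS + ECCO_DOCS
total_pages = EEBO_PAGES + ECCO_PAGES

# Processor-seconds available if every processor runs continuously.
processor_seconds = PROCESSORS * ASSUMED_DAYS * SECONDS_PER_DAY
seconds_per_page = processor_seconds / total_pages

print(f"Documents: {total_docs:,}")
print(f"Page images: {total_pages:,}")
print(f"Per-page budget: {seconds_per_page:.1f} processor-seconds")
```

Under these assumptions the budget works out to roughly 15 processor-seconds per page image; a longer run or reprocessed files would raise the real figure.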
