Digital History and Text Mining


For HIST 511, Topics in Public History: Digital History
                October 13, 2011

                By Roger Bilisoly, Ph.D.
       Department of Mathematical Sciences, CCSU
Overview of Talk


•       Some complexities of text:
    –     Tagging
    –     Multiple versions of a work
    –     Some free visualization tools

•       An extended example analyzing the Dictionary of
        Canadian Biography. This example was inspired by
        Associate Professor William Turkel’s discussion posted
        in his (now inactive) blog “Digital History Hacks.”
Humans use text in many ways …




           Top: My photo of a decorative wall pattern in the Miracle Mile Shops, Las Vegas, NV
           http://www.flickr.com/photos/66082566@N00/4281884117/

           Right: an English Hieroglyphic Bible published by Isaiah Thomas in 1788. This is from
           the “Early American Imprints” database.

           Left: King Ashur-nasir-pal at Brooklyn Museum, photo by wallyg
           http://www.flickr.com/photos/wallyg/2440285854/sizes/m/
Online visitors can add tags.


Anyone can join the “Posse” and contribute.
http://www.guerrillagirls.com/

Tags are given by users and vetted by staff.

More on tagging:
http://www.brooklynmuseum.org/community/blogosphere/2008/07/15/collection-preview-and-re-thinking-tagging/
http://www.steve.museum/
Unfortunately, access to tag data seems limited.
Using visitor feedback to improve tags …




                                User gets to choose
                                one of the following:
                                keep it (green),
                                trash it (red),
                                not sure (yellow).
Folksonomy: Study of tags

• There’s tension between the “controlled vocabulary” approach of library
  science and the individualism of tagging.
• Analysis of tags is still young: tag lists and tag clouds are still common
  at Flickr.com, Delicious.com, etc.



                                        Possible future analyses:

                                        • Collocations, which are popular in
                                        linguistics and would be more informative.

                                        • Short phrases instead of single
                                        words.
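
One way to look beyond single tags is to count which tags occur together on the same item, a rough tagging analogue of a collocation. The Perl sketch below is only an illustration: the items and tags are made up, and count_tag_pairs is a hypothetical helper, not something offered by Flickr or Delicious.

```perl
use strict;
use warnings;

# Hypothetical data: each item (e.g., a photo) with the tags visitors gave it.
my %tags_for = (
    photo1 => [qw(statue egypt stone)],
    photo2 => [qw(egypt stone relief)],
    photo3 => [qw(statue stone)],
);

# Count how often each unordered pair of tags appears on the same item.
sub count_tag_pairs {
    my ($items) = @_;
    my %count;
    for my $item (keys %$items) {
        my @tags = sort @{ $items->{$item} };   # sort so "a b" and "b a" match
        for my $i (0 .. $#tags - 1) {
            for my $j ($i + 1 .. $#tags) {
                $count{"$tags[$i] $tags[$j]"}++;
            }
        }
    }
    return \%count;
}

# List the pairs, most frequent first.
my $pairs = count_tag_pairs(\%tags_for);
for my $pair (sort { $pairs->{$b} <=> $pairs->{$a} || $a cmp $b } keys %$pairs) {
    print "$pair: $pairs->{$pair}\n";
}
```

Pairs that show up on many items hint at how visitors group ideas, which a plain tag cloud cannot show.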
It takes several steps to obtain
an electronic version of a text.

Original manuscript is at the Morgan Library and Museum in New York City,
and this facsimile is available online at NYTimes.com.




     http://documents.nytimes.com/looking-over-the-shoulder-of-charles-dickens-the-man-who-wrote-of-a-christmas-carol
     http://www.themorgan.org/home.asp
This image is from an 1845 edition
scanned by Google and available
from: http://books.google.com/books
Pick an edition and then
       convert to electronic text.




“Many of our most popular eBooks started out with huge error
levels--only later did they come to the more polished levels
seen today. In fact, many of our eBooks were done totally
without any supervision--by people who had never heard of
Project Gutenberg--and only sent to us after the fact.”

This quote by Michael Hart is from:
http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Mission_Statement_by_Michael_Hart
Gutenberg.org EBook #46,
which includes scans of John Leech’s illustrations.
Raw Text

Marley was dead: to begin with. There is no doubt whatever about that. The register of his
burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge
signed it: and Scrooge’s name was good upon ’Change, for anything he chose to put his hand
to. Old Marley was as dead as a door-nail.

Mind! I don’t mean to say that I know, of my own knowledge, what there is particularly dead
about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest
piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my
unhallowed hands shall not disturb it, or the Country’s done for. You will therefore permit me
to repeat, emphatically, that Marley was as dead as a door-nail.

Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he
were partners for I don’t know how many years. Scrooge was his sole executor, his sole
administrator, his sole assign, his sole residuary legatee, his sole friend, and sole mourner.
And even Scrooge was not so dreadfully cut up by the sad event, but that he was an
excellent man of business on the very day of the funeral, and solemnised it with an
undoubted bargain.
A Christmas Carol is relatively simple …

• It’s a novella
• Written in only six weeks
   – Dickens wanted to publish in time for the Christmas of 1843.
• Modifications were made later:
   – Dickens modified the text for his readings, which started in 1858.
     (See the example on the next slide.)
   – There have also been many adaptations by others. The first was for
     the theater in 1844 (though not by Dickens).
   – See http://en.wikipedia.org/wiki/List_of_A_Christmas_Carol_adaptations
A Christmas Carol:
         1868 reading version vs. 1843 original version

MARLEY was dead, to begin with. There is no doubt whatever about       Marley was dead: to begin with. There is no doubt whatever about
that. The register of his burial was signed by the clergyman, the      that. The register of his burial was signed by the clergyman, the
clerk, the undertaker, and the chief mourner. Scrooge signed it. And   clerk, the undertaker, and the chief mourner. Scrooge signed it: and
Scrooge's name was good upon 'Change for anything he chose to          Scrooge’s name was good upon ’Change, for anything he chose to
put his hand to.                                                       put his hand to. Old Marley was as dead as a door-nail.

Old Marley was as dead as a door-nail.                                 Mind! I don’t mean to say that I know, of my own knowledge, what
                                                                       there is particularly dead about a door-nail. I might have been
                                                                       inclined, myself, to regard a coffin-nail as the deadest piece of
                                                                       ironmongery in the trade. But the wisdom of our ancestors is in the
                                                                       simile; and my unhallowed hands shall not disturb it, or the Country’s
                                                                       done for. You will therefore permit me to repeat, emphatically, that
                                                                       Marley was as dead as a door-nail.

Scrooge knew he was dead? Of course he did. How could it be            Scrooge knew he was dead? Of course he did. How could it be
otherwise? Scrooge and he were partners for I don't know how many      otherwise? Scrooge and he were partners for I don’t know how many
years. Scrooge was his sole executor, his sole administrator, his      years. Scrooge was his sole executor, his sole administrator, his
sole assign, his sole residuary legatee, his sole friend, his sole     sole assign, his sole residuary legatee, his sole friend, and sole
mourner.                                                               mourner. And even Scrooge was not so dreadfully cut up by the sad
                                                                       event, but that he was an excellent man of business on the very day
                                                                       of the funeral, and solemnised it with an undoubted bargain.



1868 version condensed by Dickens for                                  The 1845 version seen earlier.
his readings.                                                          http://www.gutenberg.org/files/46/46-h/46-h.htm
http://gaslight.mtroyal.ca/carol.htm
For contrast consider Chaucer’s
            “The Wife of Bath” from The Canterbury Tales,
                 for which no original manuscript survives.
Ellesmere ms.                                       Lansdowne ms.
Experience / though noon Auctoritee                 Experiment þouhe none auctorite
Were in this world / were right ynogh to me         Where in þis werlde is riht y-nouhe for me
To speke of wo / that is in mariage                 To speke of woo þat is in mariage
ffor lordynges / sith I .xij. yeer was of Age       For lordeinges sen .I. twelue ȝere was of Age

Hengwrt ms.                                         Harleian ms.
Experience / thogh noon Auctoritee                  Experiens þough noon auctorite
Were in this world / is right ynogh for me          were in þis world. it were ynough for me
To speke of wo / that is in mariage                 To speke of wo þat is in mariage
ffor lordynges / sith þat I twelf yeer was of age   For lordyngs syns I twelf ȝer was of age

Cambridge ms.                                       Petworth ms.
Experyment / þough none auctoryte                   Experience thouȝe noon autorite
Were in þis worlde is riȝt/ ynouȝe for me           were in þis world riȝt ynouȝe for me
To speke of woo þat ys in mariage                   To speke of woo þat is in mariage
ffor lordynges siþen I twelfe yere was of age       ffor lordingges siþ I twelue ȝere was of age

Corpus ms.                                          The Cambridge ms. completed by Egerton ms.
Experiment þough non auctorite.                     Experience / though noon auctoritee
Were in þis world is right ynough for me            Were in this world / is right I-now for me
To speke of wo þat is in mariage                    To speken of woo / that is in mariage
ffor lordynges syn I twelue ȝeer was of age /       ffor lordynges / syn I twelue ȝer was of age

   Manuscripts available at http://www.kankedort.net/ECT_manuscripts.htm
Many applications exist for analyzing text …

• Google Labs’ “Books Ngram Viewer”
• Trendistic’s analysis of tweets
• Many Eyes project by IBM
Google Labs’ “Books Ngram Viewer”




                 http://ngrams.googlelabs.com/
Free tool: http://Trendistic.indextank.com/




Michele Bachmann’s Gardasil claim occurred on 9/12/2011.
IBM’s “Many Eyes” at
http://www-958.ibm.com/software/data/cognos/manyeyes/
Four visualization tools available as of 10-2011



                                 There are already over
                                 230,000 data sets available,
                                 and you can upload yours.

                                 This Jane Austen data set
                                 was already uploaded.
Example of “Phrase Net”
Visualization of Jane Austen’s Sense and Sensibility
                            by Matthew Hurst, posted on his blog,
                  Text Mining, Visualization and Social Media on 9-24-2011.


Lucy (Steele) is highlighted on the left.
Common keywords are on the right.
Each line represents one chapter.




                                            From http://datamining.typepad.com/
A Taste of Text Mining:
         Analyzing Text with Computers
•  Extracting information from the Web
  – Power of regular expressions
  – Example used here inspired by William Turkel
     (Associate Professor of History at the University of
     Western Ontario)
• Concordancing
  – A powerful technique from corpus linguistics
  – Example here uses corpus obtained by Turkel’s
     approach
  – Introduction of some information retrieval (IR) ideas
Extracting Information from the Web

• This is done continuously by spiders written by
  companies like Google to update their search engines.
• Crawling the Web requires sophisticated programming,
  but scraping info from a particular site is not so hard.
• The following example is based on ideas given in six blog posts
  at “Digital History Hacks” by William Turkel:
   –   http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-1.html
   –   http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-2.html
   –   http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-3.html
   –   http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-4.html
   –   http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-5.html
   –   http://digitalhistoryhacks.blogspot.com/2006/03/text-mining-dcb-part-6.html
Turkel harvests the online
   Dictionary of Canadian Biography (DCB).

This Web site allows
searches using a form.

Its terms of use,
however, do not
forbid downloading all
of its records.

What is not forbidden
must be done!
A Browser Trick


Form submission can be
automated because:
(1) The queries are
shown in the URL box.
(2) These queries have
patterns.

Two stages:
(1) Obtain all the 592
    links to Canadians.
(2) For each link, access
    it via a program and
    grab its HTML.
Below are the URLs for the 1st, 2nd, 3rd, and 4th
         requests for 20 Canadian records.
•   http://www.biographi.ca/009004-110.01-e.php?PHPSESSID=06378al70rmt2mu4ho8mf7c952&q2=&q3=&q10=I&q7=&q5=&q1=&interval=20
•   http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=21&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
•   http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=41&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
•   http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=61&&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952

In each URL, interval gives the number of records and sk gives the starting point.
Only eight lines of Perl code download
     all 592 DCB URLs for Canadians flourishing prior to 1700.

use LWP::Simple;

open (OUT, ">canadian_bio.txt") or die "Cannot open canadian_bio.txt: $!";

$url_part1 = 'http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=100&sk=';
$url_part2 = '&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7';

# Request 100 records at a time, starting at records 1, 101, ..., 501.
for ($i = 1; $i < 601; $i += 100) {
    $doc = get "$url_part1$i$url_part2";
    print OUT "$doc\n\n\n";   # Separate the pages with blank lines
}

close(OUT);



   The reason this is so short is that the module LWP::Simple supplies commands
   for working with the Web.

   $doc = get "$url_part1$i$url_part2";

   This line of code queries the DCB Web page, and the returned HTML is stored
   in the variable $doc.
A Small Sample of the Downloaded DCB Links

<td>
<a href="009004-110.01-e.php?&q10=I&sk=1&s=3&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">Descending</a></td
 </tr>
 <tr>
    <td class="td_data">1.</td>
    <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">ABRAHAM, JOHN</a></td>
    <td class="td_data">1000-1700 (Volume I)</td>
</tr>
 <tr>
    <td class="td_data">2.</td>
    <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=2&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AERNOUTSZ, JURRIAEN</a></td>
    <td class="td_data">1000-1700 (Volume I)</td>
</tr>
 <tr>
    <td class="td_data">3.</td>
    <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=3&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGARIATA</a></td>
    <td class="td_data">1000-1700 (Volume I)</td>
</tr>
 <tr>
    <td class="td_data">4.</td>
    <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=4&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGRAMONTE, JUAN DE</a></td>
    <td class="td_data">1000-1700 (Volume I)</td>
</tr>


          We want the URL for each individual Canadian, which is contained in the
          <a href="…"> lines above (the URLs are bolded and in red on the slide). These are
          extracted (with Perl) and then used to download each biography (again
          with Perl).
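
The extraction step is not shown on the slides, so here is a minimal sketch of how it might look. It assumes that biography links are the ones containing id_nbr= and that they are relative to http://www.biographi.ca/ (both read off the sample above). The sub biography_urls is a hypothetical helper, and the sample HTML is shortened by leaving out the session ID.

```perl
use strict;
use warnings;

# A shortened sample of the listing HTML (session ID omitted for brevity).
my $html = <<'END_HTML';
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100">ABRAHAM, JOHN</a></td>
<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=2&interval=100">AERNOUTSZ, JURRIAEN</a></td>
<a href="009004-110.01-e.php?&q10=I&sk=1&s=3">Descending</a>
END_HTML

my $base = 'http://www.biographi.ca/';   # assumed site root for relative links

# Return absolute URLs for biography pages only (those containing id_nbr=).
sub biography_urls {
    my ($page) = @_;
    my @urls;
    while ($page =~ /<a\s+href="([^"]*id_nbr=\d+[^"]*)"/gi) {
        push @urls, $base . $1;
    }
    return @urls;
}

print "$_\n" for biography_urls($html);
```

Printing one URL per line gives a file in the form the download program expects to read.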
Now we download the biographies themselves and
     print them to canadian_bios.txt.


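use LWP::Simple;

open (IN, "canadian_biography_urls.txt") or die "Cannot open URL list: $!";
open (OUT, ">canadian_bios.txt") or die "Cannot open output file: $!";

while (<IN>) {
   chomp;
   sleep(1);   # Be a polite spider
   $doc = get($_);   # Download biographies
   next unless defined $doc;   # Skip pages that failed to download
   @lines = split(/\n/, $doc);
   $flag = 0;
   foreach $x (@lines) {
      if ($x =~ /<\/BODY>/) { $flag = 0 }   # Stop at the end of the body
      if ($flag) { print OUT "$x\n" }   # Print out biography
      if ($x =~ /<BODY>/) { $flag = 1 }   # Start after the opening tag
   }
}

close(IN);
close(OUT);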
The results are still in HTML,
            but another program can remove the HTML tags.

                                                 <P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, governor of Port Nelson; fl.
1672–89.</P>
                                                   <P CLASS="ParagraphFormat">  He joined the HBC about 1672 and served in James Bay 1672–75
and 1676–78 under Governor Charles B<SPAN CLASS="SmallCaps">ayly</SPAN>,
against whom he brought charges of mismanagement. In 1679 Abraham was appointed
second to John N<SPAN CLASS="SmallCaps">ixon</SPAN>, Bayly’s
successor and although he absconded with an advance of salary at sailing time,
he was engaged in 1681 as mate of the <I>Diligence</I>
(Capt. N<SPAN CLASS="SmallCaps">ehemiah </SPAN>W<SPAN CLASS="SmallCaps">alker</SPAN>) and wintered in James Bay.</SPAN></P>


            All the HTML tags are in <>, which can be removed by a program:
                                                  ABRAHAM, JOHN, governor of Port Nelson; fl.
 1672–89.
                                                       He joined the HBC about 1672 and served in James Bay 1672–75
 and 1676–78 under Governor Charles Bayly,
 against whom he brought charges of mismanagement. In 1679 Abraham was appointed
 second to John Nixon, Bayly’s
 successor and although he absconded with an advance of salary at sailing time,
 he was engaged in 1681 as mate of the Diligence
 (Capt. Nehemiah Walker) and wintered in James Bay.
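
The tag-removal program is not shown, so the following is a minimal sketch of the idea rather than the program used for this talk. It assumes that every tag is a simple <...> run with no ">" inside an attribute value; strip_tags is a hypothetical name, and the sample uses a plain hyphen in place of the en dash.

```perl
use strict;
use warnings;

# Remove anything between < and >, then tidy the leftover whitespace.
sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # delete every <...> tag
    $text =~ s/[ \t]+/ /g;                # collapse runs of spaces and tabs
    $text =~ s/^ +| +$//mg;               # trim each line
    return $text;
}

my $sample = '<P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, '
           . 'governor of Port Nelson; fl. 1672-89.</P>';
print strip_tags($sample), "\n";
```

Because the small-caps tags sit inside names, removing them also rejoins B and ayly into Bayly.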




       Note that the top version is
            still valid HTML:
Now extract dates with a concordancing program.

The key is constructing regular
expressions (regexes) to find a
text pattern of interest, which is a
four-digit number starting with 1 in this
case.

$target = '(\D1\d\d\d\D)';

\D stands for non-digit
\d stands for digit

At right, all the matches of the regex
above are shown after sorting. By
looking at this concordance, a variety
of patterns emerge.
  The concordancing program is from
     Chapter 6 of Bilisoly’s PTMP.
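
For readers without the book, here is a much smaller sketch of the same idea. It is not the PTMP program. It swaps the \D on each side for lookarounds, so that a year at the very start or end of the text, and both years of a range, are still found. The sub year_concordance and its width parameter are hypothetical.

```perl
use strict;
use warnings;

# Return one "keyword in context" line per match of a four-digit year
# beginning with 1, with $width characters of context on each side.
sub year_concordance {
    my ($text, $width) = @_;
    $text =~ s/\s+/ /g;                      # treat line breaks as spaces
    my @lines;
    while ($text =~ /(?<!\d)(1\d\d\d)(?!\d)/g) {
        my $start = $-[1];                   # offset where the year begins
        my $from  = $start > $width ? $start - $width : 0;
        my $left  = substr($text, $from, $start - $from);
        my $right = substr($text, $start + 4, $width);
        push @lines, sprintf("%*s [%s] %-*s", $width, $left, $1, $width, $right);
    }
    # Sort by the year, which starts at a fixed column in every line.
    return sort { substr($a, $width + 2, 4) cmp substr($b, $width + 2, 4) } @lines;
}

my $bio = 'He joined the HBC about 1672 and served in James Bay. '
        . 'In 1679 Abraham was appointed second to John Nixon.';
print "$_\n" for year_concordance($bio, 20);
```

Lining the years up in one column is what makes the patterns easy to spot by eye.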
Complication: Dates have more than one use.


Dates have many uses and
conventions, which complicates
their analysis:

• Although volume 1 of the DCB
covers 1000-1700, there are
references to modern texts,
hence dates in the 20th century
appear.
• There are ranges of dates (e.g.,
1495-1521)
• Dates followed by question
marks (e.g., 1522?)
• Dates in square brackets (none
shown here).
• And so forth …
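
One way to cope with these conventions is to try the more specific patterns first and fall back to a plain year last. The sketch below is an illustration only: the labels and the sub classify_years are my own naming, ranges are assumed to use a plain hyphen, and a year in parentheses after a comma is assumed to be a publication year, as in (London, 1962).

```perl
use strict;
use warnings;

# A four-digit year starting with 1, not embedded in a longer number.
my $year = qr/(?<!\d)1\d\d\d(?!\d)/;

# Label each year by the convention it appears in.
# Returns a list of [year, label] pairs in order of appearance.
sub classify_years {
    my ($text) = @_;
    my @found;
    while ($text =~ /\[($year)\]|($year)-($year)|($year)\?|\([^()]*,\s*($year)\)|($year)/g) {
        if    (defined $1) { push @found, [$1, 'bracketed'] }
        elsif (defined $2) { push @found, [$2, 'range start'], [$3, 'range end'] }
        elsif (defined $4) { push @found, [$4, 'uncertain'] }
        elsif (defined $5) { push @found, [$5, 'publication'] }
        else               { push @found, [$6, 'plain'] }
    }
    return @found;
}

my $text = 'voyages of 1495-1521; born 1522?; see [1534]; (London, 1962); died 1540.';
printf "%s  %s\n", @$_ for classify_years($text);
```

With labels attached, publication years can be dropped before counting, so that modern references do not show up as events.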
Years Appearing in Volume 1 of the DCB:
           My Results (top) v. Turkel’s (bottom)




Turkel points out that many of
the date-peaks do correspond
to notable events in early
Canadian history. For
example, 1498 was Cabot’s
second voyage, and 1666 was
the first census of New France.


                            Top produced by me using SAS.
                            Bottom from http://photos1.blogger.com/blogger/4745/1988/1600/dcbo-vol1-dates.jpg
Term-Document Matrix For The DCB

• Here documents are the biographies and terms
  are years.
• Each person’s biography is searched for years.
• Remember that there are complications.
  – Range of years: 1495-1521
  – Years in doubt use a question mark: 1522?
  – Years of publication for references: (London, 1962)
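
As a sketch of how such a matrix can be built, the code below stores counts in a hash of hashes, so that $matrix{name}{year} is the number of times a year occurs in a biography. It ignores the complications just listed. The first text is condensed from the Abraham biography shown earlier (with plain hyphens), the second entry is invented purely for illustration, and year_matrix is a hypothetical helper.

```perl
use strict;
use warnings;

my %bio = (
    'ABRAHAM, JOHN' => 'fl. 1672-89. He joined the HBC about 1672 and served in '
                     . 'James Bay 1672-75 and 1676-78. In 1679 Abraham was appointed '
                     . 'second to John Nixon; he was engaged in 1681 as mate.',
    'EXAMPLE, A.'   => 'Arrived in 1672; left in 1681 and returned in 1681.',   # invented
);

# Count how many times each year occurs in each biography.
# Missing entries are implicit zeros.
sub year_matrix {
    my ($docs) = @_;
    my %matrix;
    for my $name (keys %$docs) {
        $matrix{$name}{$1}++ while $docs->{$name} =~ /(?<!\d)(1\d\d\d)(?!\d)/g;
    }
    return \%matrix;
}

my $m = year_matrix(\%bio);

# The columns are all years seen in any biography, in order.
my %seen;
$seen{$_} = 1 for map { keys %$_ } values %$m;
my @years = sort keys %seen;

printf "%-16s %s\n", 'Name', join(' ', @years);
for my $name (sort keys %$m) {
    printf "%-16s %s\n", $name,
        join(' ', map { sprintf '%4d', $m->{$name}{$_} // 0 } @years);
}
```

Each row is then a vector of year counts for one person, which is what makes geometric comparisons between people possible.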
Part of the DCB Name-Year Matrix
Angles between Canadians in the DCB




           This output is from the programming language Mathematica.
Which Canadians are the most alike
       with respect to years noted in DCB?

• Louis Gaudais-Dupont and Mézy de Saffray
  – Gaudais-Dupont: 1661, 1662, 1663 (6 times), 1664
  – Mézy: 1663 (8 times), 1664 (3 times), 1665 (twice)
  – Angle between them is 21.5°


• Thalour du Perron and Sieur de Monts
  – Du Perron: 1662 (twice), 1663 (twice), 1668
  – De Monts: 1662 (3 times), 1663 (twice)
  – Angle between them is 22.4°
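
These angles can be checked by hand or with a few lines of code. The sketch below treats each person's year counts as a vector and computes the angle from the usual cosine formula (dot product divided by the product of the lengths). It assumes the two lists of years on this slide are in the order the people are named; since the angle is symmetric, the result does not depend on that. The sub angle_between is a hypothetical helper, not the Mathematica code used for the talk.

```perl
use strict;
use warnings;

# Angle in degrees between two year-count vectors, each a hash ref
# mapping year => count. Years absent from a hash count as zero.
sub angle_between {
    my ($u, $v) = @_;
    my ($dot, $uu, $vv) = (0, 0, 0);
    $uu += $_ ** 2 for values %$u;
    $vv += $_ ** 2 for values %$v;
    for my $year (keys %$u) {
        $dot += $u->{$year} * $v->{$year} if exists $v->{$year};
    }
    my $cos = $dot / sqrt($uu * $vv);
    $cos = 1 if $cos > 1;                       # guard against rounding error
    my $pi = 4 * atan2(1, 1);
    return atan2(sqrt(1 - $cos ** 2), $cos) * 180 / $pi;
}

my %gaudais = (1661 => 1, 1662 => 1, 1663 => 6, 1664 => 1);
my %mezy    = (1663 => 8, 1664 => 3, 1665 => 2);
my %perron  = (1662 => 2, 1663 => 2, 1668 => 1);
my %monts   = (1662 => 3, 1663 => 2);

printf "%.1f\n", angle_between(\%gaudais, \%mezy);    # prints 21.5
printf "%.1f\n", angle_between(\%perron,  \%monts);   # prints 22.4
```

For the first pair the dot product is 6×8 + 1×3 = 51 and the lengths are √39 and √77, so the cosine is about 0.931, which gives 21.5°.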
Wordle.net word cloud using the DCB




                  Easier to do, but less informative.
References
•   Language and Computers: A Practical Introduction to the Computer Analysis of Language
     – Geoff Barnbrook
•   Practical Text Mining with Perl (PTMP)
     – Roger Bilisoly
•   Corpus Linguistics: Investigating Language Structure and Use
     – Biber, Conrad and Reppen
•   Concept Data Analysis: Theory and Applications
     – Claudio Carpineto and Giovanni Romano (a more technical book)
•   Programming for Linguists: Perl for Language Researchers
     – Michael Hammond
•   Corpora in Applied Linguistics
     – Susan Hunston
•   Beginning Regular Expressions
     – Andrew Watt
•   Text Mining: Predictive Methods for Analyzing Unstructured Information
     – Sholom Weiss, Nitin Indurkhya, Tong Zhang and Fred Damerau (a more technical book)
•   Geometry and Meaning
     – Dominic Widdows
Learning to Program

• In my experience teaching STAT 527, students vary in how much they like
  programming. However, it’s powerful, so it’s worth trying if it
  sounds interesting to you.
• Try “The Programming Historian”
   – Teaches Python
   – By William J. Turkel, Adam Crymble and Alan MacEachern
   – http://niche-canada.org/programming-historian/
        • NICHE = Network in Canadian History & Environment
        • NICHE = Nouvelle initiative canadienne en histoire de l’environnement



  Also see William Turkel’s home page: http://history.uwo.ca/faculty/turkel/,
         which links to his now defunct blog, “Digital History Hacks.”
eXtensible Markup Language (XML)
                              and the Text Encoding Initiative (TEI)
<text> <body><div0>

<head>The following is a Copy of a LETTER sent by the
Author's Master to the Publisher.
</head><div1>

<p n="1"> <name reg="Wheatley, Phillis" type="personal">PHILLIS</name>
was brought from <name rend="italic" type="geographical">Africa</name> to
<name type="geographical" key="italic">America</name>, in the Year 1761,
between Seven and Eight Years of Age. Without any Assistance from School Education,
and by only what she was taught in the Family, she, in sixteen Months Time from her Arrival,
attained the English Language, to which she was an utter Stranger before, to such a Degree,
as to read any, the most difficult Parts of the Sacred Writings, to the great Astonishment of all
who heard her. </p>

<p n="2">As to her WRITING, her own Curiosity led her to it;
and this she learnt in so short a Time, that in the Year 1765, she wrote a Letter to the
<name reg="Occum, Samson" type="personal">Rev. Mr. OCCOM</name>
<note resp="editor" type="biographical">Samson Occum (1723-1792) was a converted
Mohegan Indian who became a Christian minister. He was a friend of Susanna Wheatley,
Phillis Wheatley's mistress, and a friend and correspondent of Phillis Wheatley.</note>,
the <name type="ethnological" rend="italic">Indian</name> Minister, while in
<name type="geographical" rend="italic">England</name>. </p>
<p n="3">She has a great Inclination to learn the Latin Tongue, and has made some
Progress in it. This Relation is given by her Master who bought her, and with whom
she now lives. </p>

<signed><name reg="Wheatley, John" type="personal">JOHN WHEATLEY</name>.</signed>
<dateline rend="italic"><name rend="italic" type="geographical">Boston</name>,
<date><distinct rend="italic">Nov.</distinct> 14, 1772</date>.</dateline> </div1>

</div0></body></text> </TEI.2>

Letter from John Wheatley to the Publisher sent Nov. 14, 1772.
From the Early Americas Digital Archive (http://www.mith2.umd.edu/eada/)
Supported by Maryland Institute for Technology in the Humanities (MITH)
http://www.mith2.umd.edu/eada/html/display.php?docs=wheatley_letter.xml&action=show

No year tags!

The DCB uses HTML, whose tags tell a Web browser how to display its biographies.
It would be useful to have additional tags that encode information for human
consumption. A protocol called XML (a form of SGML) was created to do just this.
The XML tags are red on the slide; in the excerpt above they are the parts in
angle brackets.

The TEI Consortium (http://www.tei-c.org/index.xml) produces standards and
encourages the encoding of information in literary and linguistic texts.

Unfortunately, this kind of tagging is done by humans at present (see the example
above), which is labor intensive.
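
Once a text is tagged like the TEI excerpt on this slide, pulling out information becomes much easier than it was with the DCB's HTML. As a rough sketch, the code below uses a regex to list every name of a given type. It assumes name elements are never nested, which holds for this excerpt; for real work an XML parser is the safer choice. The sub names_of_type is a hypothetical helper.

```perl
use strict;
use warnings;

# The opening lines of the excerpt, copied from the slide.
my $tei = <<'END_TEI';
<p n="1"> <name reg="Wheatley, Phillis" type="personal">PHILLIS</name>
was brought from <name rend="italic" type="geographical">Africa</name> to
<name type="geographical" key="italic">America</name>, in the Year 1761,
END_TEI

# Collect the text of every <name> element whose type attribute equals $type.
sub names_of_type {
    my ($xml, $type) = @_;
    my @names;
    while ($xml =~ /<name\b([^>]*)>(.*?)<\/name>/gs) {
        my ($attrs, $content) = ($1, $2);
        push @names, $content if $attrs =~ /\btype="\Q$type\E"/;
    }
    return @names;
}

print join(', ', names_of_type($tei, 'geographical')), "\n";   # prints Africa, America
```

The regex only has to ask for type="geographical"; the human tagger has already done the hard part of deciding which words are places.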

  • 3. Humans use text in many ways … Top: My photo of wall decorative pattern in the Miracle Mile Shops, Las Vegas, NV http://www.flickr.com/photos/66082566@N00/4281884117/ Right: an English Hieroglyphic Bible published by Isaiah Thomas in 1788. This is from the “Early American Imprints” database. Left: King Ashur-nasir-pal at Brooklyn Museum, photo by wallyg http://www.flickr.com/photos/wallyg/2440285854/sizes/m/
  • 4. Online visitors can add tags. Anyone can join the “Posse” and contribute. http://www.guerrillagirls.com/ Tags are given by users and vetted by staff. More on tagging: http://www.brooklynmuseum.org/community/blogosphere/2008/07/15/collection-preview-and-re-thinking-tagging/ http://www.steve.museum/
  • 5. Unfortunately, access to tag data seems limited.
  • 6. Using visitor feedback to improve tags … User gets to choose one of the following: keep it (green), trash it (red), not sure (yellow).
  • 7. Folksonomy: Study of tags • There’s tension between the “controlled vocabulary” approach of library science and the individualism of tagging. • Analysis of tags is still young: tag listings and clouds are still common at Flickr.com, Delicious.com, etc. Possible future analyses: • Collocations, which are popular in linguistics and would be more informative. • Short phrases instead of single words.
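One way to move beyond single-tag counts is to count which tags are applied together. The sketch below is in Python (the language the talk recommends for learners) and uses invented tag sets; it counts simple co-occurrence, which is only a first step toward the statistical collocation measures used in linguistics.

```python
from collections import Counter
from itertools import combinations

def tag_pair_counts(tagged_items):
    """Count how often each unordered pair of tags lands on the same item."""
    counts = Counter()
    for tags in tagged_items:
        # Sort and de-duplicate so (art, poster) and (poster, art) are one pair.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts

# Hypothetical tag sets, invented for illustration only.
items = [
    ["poster", "feminist", "art"],
    ["poster", "art"],
    ["sculpture", "art"],
]
pairs = tag_pair_counts(items)
print(pairs.most_common(1))
```

Pairs that recur across many items are candidates for the "short phrases" mentioned on the slide.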
  • 8. It takes several steps to obtain an electronic version of a text. Original manuscript is at the Morgan Library and Museum in New York City, and this facsimile is available online at the NYTimes.com. http://documents.nytimes.com/looking-over-the-shoulder-of-charles-dickens-the-man-who-wrote-of-a-christmas-carol http://www.themorgan.org/home.asp
  • 9. This image is from an 1845 edition scanned by Google and available from: http://books.google.com/books
  • 10. Pick an edition and then convert to electronic text. “Many of our most popular eBooks started out with huge error levels--only later did they come to the more polished levels seen today. In fact, many of our eBooks were done totally without any supervision--by people who had never heard of Project Gutenberg--and only sent to us after the fact.” This quote by Michael Hart is from: http://www.gutenberg.org/wiki/Gutenberg:Project_Gutenberg_Mission_Statement_by_Michael_Hart
  • 11. Gutenberg.org EBook #46, which includes scans of John Leech’s illustrations.
  • 12. Raw Text Marley was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge’s name was good upon ’Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don’t mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country’s done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail. Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don’t know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, and sole mourner. And even Scrooge was not so dreadfully cut up by the sad event, but that he was an excellent man of business on the very day of the funeral, and solemnised it with an undoubted bargain.
  • 13. A Christmas Carol is relatively simple … • It’s a novella • Written in only six weeks – Dickens wanted to publish in time for the Christmas of 1843. • Modifications were made later: – Dickens modified text for his readings, which started in 1858 (See example next slide.) – However, there have been many adaptations. The first was for the theater in 1844 (not by Dickens, though.) – See http://en.wikipedia.org/wiki/List_of_A_Christmas_Carol_adaptations
  • 14. A Christmas Carol: 1868 reading version vs. 1843 original version
    1868 version condensed by Dickens for his readings:
      MARLEY was dead, to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it. And Scrooge's name was good upon 'Change for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don't know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, his sole mourner.
    The 1845 version seen earlier:
      Marley was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrooge signed it: and Scrooge’s name was good upon ’Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail. Mind! I don’t mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country’s done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail. Scrooge knew he was dead? Of course he did. How could it be otherwise? Scrooge and he were partners for I don’t know how many years. Scrooge was his sole executor, his sole administrator, his sole assign, his sole residuary legatee, his sole friend, and sole mourner. And even Scrooge was not so dreadfully cut up by the sad event, but that he was an excellent man of business on the very day of the funeral, and solemnised it with an undoubted bargain.
http://www.gutenberg.org/files/46/46-h/46-h.htm http://gaslight.mtroyal.ca/carol.htm
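Differences between two versions of a text can also be found mechanically. The following is a minimal sketch in Python using the standard `difflib` module on one clause quoted on the slide; the function name `word_changes` is made up for this example, and splitting on whitespace is a simplifying assumption.

```python
import difflib

original_1843 = "his sole friend, and sole mourner."
reading_1868 = "his sole friend, his sole mourner."

def word_changes(old, new):
    """Return (operation, old_words, new_words) for each span that differs."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"]

for change in word_changes(original_1843, reading_1868):
    print(change)
```

Run on whole texts, the same approach would also report the long passages Dickens cut for his readings as deletions.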
  • 15. For contrast consider Chaucer’s “The Wife of Bath” from The Canterbury Tales, which has no original manuscripts.
    Ellesmere ms.
      Experience / though noon Auctoritee
      Were in this world / were right ynogh to me
      To speke of wo / that is in mariage
      ffor lordynges / sith I .xij. yeer was of Age
    Lansdowne ms.
      Experiment þouhe none auctorite
      Where in þis werlde is riht y-nouhe for me
      To speke of woo þat is in mariage
      For lordeinges sen .I. twelue ȝere was of Age
    Hengwrt ms.
      Experience / thogh noon Auctoritee
      Were in this world / is right ynogh for me
      To speke of wo / that is in mariage
      ffor lordynges / sith þat I twelf yeer was of age
    Harleian ms.
      Experiens þough noon auctorite
      were in þis world. it were ynough for me
      To speke of wo þat is in mariage
      For lordyngs syns I twelf ȝer was of age
    Cambridge ms.
      Experyment / þough none auctoryte
      Were in þis worlde is riȝt/ ynouȝe for me
      To speke of woo þat ys in mariage
      ffor lordynges siþen I twelfe yere was of age
    Petworth ms.
      Experience thouȝe noon autorite
      were in þis world riȝt ynouȝe for me
      To speke of woo þat is in mariage
      ffor lordingges siþ I twelue ȝere was of age
    Corpus ms.
      Experiment þough non auctorite.
      Were in þis world is right ynough for me
      To speke of wo þat is in mariage
      ffor lordynges syn I twelue ȝeer was of age /
    The Cambridge ms. completed by Egerton ms.
      Experience / though noon auctoritee
      Were in this world / is right I-now for me
      To speken of woo / that is in mariage
      ffor lordynges / syn I twelue ȝer was of age
    Manuscripts available at http://www.kankedort.net/ECT_manuscripts.htm
  • 16. Many applications exist for analyzing text … • Google Labs’ “Books Ngram Viewer” • Trendistic’s analysis of tweets • Many Eyes project by IBM
  • 17. Google Labs’ “Books Ngram Viewer” http://ngrams.googlelabs.com/
  • 18. Free tool: http://Trendistic.indextank.com/ Michele Bachmann’s Gardasil claim occurred on 9/12/2011.
  • 19. IBM’s “Many Eyes” at http://www-958.ibm.com/software/data/cognos/manyeyes/
  • 20. Four visualization tools available as of 10-2011 There are already over 230,000 data sets available, and you can upload yours. This Jane Austen data set was already uploaded.
  • 22. Visualization of Jane Austen’s Sense and Sensibility by Matthew Hurst, posted on his blog, Text Mining, Visualization and Social Media on 9-24-2011. Lucy (Steele) is highlighted on the left. Common keywords are on the right. Each line represents one chapter. From http://datamining.typepad.com/
  • 23. A Taste of Text Mining: Analyzing Text with Computers • Extracting information from the Web – Power of regular expressions – Example used here inspired by William Turkel (Associate Professor of History at the University of Western Ontario) • Concordancing – A powerful technique from corpus linguistics – Example here uses corpus obtained by Turkel’s approach – Introduction of some information retrieval (IR) ideas
  • 24. Extracting Information from the Web • This is done continuously by spiders written by companies like Google to update their search engines. • Crawling the Web requires sophisticated programming, but scraping info from a particular site is not so hard. • Following example based on ideas given in 6 blog posts at “Digital History Hacks” by William Turkel: – http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-1.html – http://digitalhistoryhacks.blogspot.com/2006/01/text-mining-dcb-part-2.html – http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-3.html – http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-4.html – http://digitalhistoryhacks.blogspot.com/2006/02/text-mining-dcb-part-5.html – http://digitalhistoryhacks.blogspot.com/2006/03/text-mining-dcb-part-6.html
  • 25. Turkel harvests the online Dictionary of Canadian Biography (DCB). This Web site allows searches using a form. Their terms of use, however, do not forbid downloading all their records. What is not forbidden must be done!
  • 26. A Browser Trick Form submission can be automated because: (1) the queries are shown in the URL box (2) These queries have patterns Two stages: (1) Obtain all the 592 links to Canadians. (2) For each link, access it via a program and grab its HTML.
  • 27. Below are the URLs for the 1st, 2nd, 3rd, and 4th requests for 20 Canadian records. (Slide callouts: interval is the number of records, sk is the starting point.)
    • http://www.biographi.ca/009004-110.01-e.php?PHPSESSID=06378al70rmt2mu4ho8mf7c952&q2=&q3=&q10=I&q7=&q5=&q1=&interval=20
    • http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=21&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
    • http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=41&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
    • http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=20&sk=61&&&&PHPSESSID=06378al70rmt2mu4ho8mf7c952
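The pattern in these URLs (a fixed query string plus a page size and a starting record) is what makes automation possible. As a rough sketch in Python, the requests could be generated as follows; `page_urls` is a hypothetical helper, and leaving out the `PHPSESSID` parameter is an assumption about what the server would accept, not something the slides confirm.

```python
BASE = "http://www.biographi.ca/009004-110.01-e.php"

def page_urls(total_records, interval):
    """Build one query URL per page of search results.

    sk is the 1-based record at which a page starts and interval is the
    page size, mirroring the URLs on the slide. The session ID is omitted
    (an assumption made for this sketch).
    """
    query = "q2=&q3=&q10=I&q7=&q5=&q1=&interval={}&sk={}"
    return [BASE + "?" + query.format(interval, start)
            for start in range(1, total_records + 1, interval)]

urls = page_urls(592, 20)
print(len(urls), urls[1])
```

With a page size of 100 instead of 20, the same function yields six requests starting at records 1, 101, and so on up to 501.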
  • 28. Only eight lines of Perl code download all 592 DCB URLs for Canadians flourishing prior to 1700.
      use LWP::Simple;
      open (OUT, ">canadian_bio.txt");
      $url_part1 = 'http://www.biographi.ca/009004-110.01-e.php?q2=&q3=&q10=I&q7=&q5=&q1=&interval=100&sk=';
      $url_part2 = '&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7';
      for ($i = 1; $i < 601; $i += 100) {        # sk = 1, 101, ..., 501
          $doc = get "$url_part1$i$url_part2";   # Fetch one page of 100 results
          print OUT "$doc\n\n\n";
      }
      close(OUT);
    The reason this is so short is that there is a module, LWP, that has commands to work with the Web. The line
      $doc = get "$url_part1$i$url_part2";
    queries the DCB Web page, and the returned HTML is stored in the variable $doc.
  • 29. A Small Sample of the Downloaded DCB Links
      <td> <a href="009004-110.01-e.php?&q10=I&sk=1&s=3&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">Descending</a></td </tr>
      <tr> <td class="td_data">1.</td> <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">ABRAHAM, JOHN</a></td> <td class="td_data">1000-1700 (Volume I)</td> </tr>
      <tr> <td class="td_data">2.</td> <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=2&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AERNOUTSZ, JURRIAEN</a></td> <td class="td_data">1000-1700 (Volume I)</td> </tr>
      <tr> <td class="td_data">3.</td> <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=3&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGARIATA</a></td> <td class="td_data">1000-1700 (Volume I)</td> </tr>
      <tr> <td class="td_data">4.</td> <td class="td_data"><a href="009004-119.01-e.php?&id_nbr=4&interval=100&&PHPSESSID=s7drhd5m5ac4dem8vgqq2ppgh7">AGRAMONTE, JUAN DE</a></td> <td class="td_data">1000-1700 (Volume I)</td> </tr>
    We want the URL for each individual Canadian, which is contained in the <a href="…"> lines above (on the slide, the URLs are bolded and in red). These are extracted (with Perl) and then used to download each biography (again with Perl).
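The extraction step is not shown on the slides. A minimal sketch of one way to do it, in Python rather than the talk's Perl, is given below. The sample HTML is abbreviated from the slide (session IDs dropped), and both the helper name `biography_links` and the base URL used to make the links absolute are assumptions.

```python
import re

# Biography links point at 009004-119.01-e.php; the sort-order links
# ("Descending") point at 009004-110.01-e.php and are skipped.
LINK = re.compile(r'<a href="(009004-119\.01-e\.php\?[^"]*)">([^<]*)</a>')

def biography_links(html, base="http://www.biographi.ca/"):
    """Return (name, absolute_url) pairs for each biography link in html."""
    return [(name, base + href) for href, name in LINK.findall(html)]

sample = (
    '<td><a href="009004-110.01-e.php?&q10=I&sk=1&s=3">Descending</a></td>'
    '<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=1&interval=100">'
    'ABRAHAM, JOHN</a></td>'
    '<td class="td_data"><a href="009004-119.01-e.php?&id_nbr=3&interval=100">'
    'AGARIATA</a></td>'
)
links = biography_links(sample)
for name, url in links:
    print(name, url)
```

Writing one URL per line to a file would give the input that the download program on the next slide reads.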
  • 30. Now we download the biographies themselves and print them to canadian_bios.txt.
      use LWP::Simple;
      open (IN, "canadian_biography_urls.txt");
      open (OUT, ">canadian_bios.txt");
      while (<IN>) {
          chomp;
          sleep(1);                      # Be a polite spider
          $doc = get "$_";               # Download biographies
          @lines = split(/\n/, $doc);
          $flag = 0;                     # 1 while inside <BODY> ... </BODY>
          foreach $x (@lines) {
              if ($x =~ /<\/BODY>/) { $flag = 0 }
              if ($flag) { print OUT "$x\n" }   # Print out biography
              if ($x =~ /<BODY>/) { $flag = 1 }
          }
      }
  • 31. The results are still in HTML, but another program can remove the HTML tags.
      <P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, governor of Port Nelson; fl. 1672–89.</P>
      <P CLASS="ParagraphFormat"> He joined the HBC about 1672 and served in James Bay 1672–75 and 1676–78 under Governor Charles B<SPAN CLASS="SmallCaps">ayly</SPAN>, against whom he brought charges of mismanagement. In 1679 Abraham was appointed second to John N<SPAN CLASS="SmallCaps">ixon</SPAN>, Bayly’s successor and although he absconded with an advance of salary at sailing time, he was engaged in 1681 as mate of the <I>Diligence</I> (Capt. N<SPAN CLASS="SmallCaps">ehemiah </SPAN>W<SPAN CLASS="SmallCaps">alker</SPAN>) and wintered in James Bay.</SPAN></P>
    All the HTML tags are in <>, which can be removed by a program:
      ABRAHAM, JOHN, governor of Port Nelson; fl. 1672–89. He joined the HBC about 1672 and served in James Bay 1672–75 and 1676–78 under Governor Charles Bayly, against whom he brought charges of mismanagement. In 1679 Abraham was appointed second to John Nixon, Bayly’s successor and although he absconded with an advance of salary at sailing time, he was engaged in 1681 as mate of the Diligence (Capt. Nehemiah Walker) and wintered in James Bay.
    Note that the top version is still valid HTML.
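The tag-removal program itself is not shown. Because every tag sits between < and >, a one-line regular expression is enough for this corpus. The sketch below is in Python and is an illustration, not the program used in the talk; a regex like this would mishandle a > inside an attribute value and does not decode entities such as &amp;amp;.

```python
import re

TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Delete everything between < and >, then collapse runs of whitespace."""
    text = TAG.sub("", html)
    return re.sub(r"\s+", " ", text).strip()

sample = ('<P CLASS="ParagraphFormat"><B>ABRAHAM</B>, <B>JOHN</B>, '
          'governor of Port Nelson; fl. 1672–89.</P>')
print(strip_tags(sample))
```

Deleting tags without inserting a space is what rejoins small-caps names such as B + ayly into Bayly.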
  • 32. Now extract dates with a concordancing program. The key is constructing regular expressions (regexes) to find a text pattern of interest, which is a 4-digit number starting with 1 in this case. $target = '(\D1\d\d\d\D)'; \D stands for non-digit; \d stands for digit. At right, all the matches of the regex above are shown after sorting. By looking at this concordance, a variety of patterns emerge. The concordancing program is from Chapter 6 of Bilisoly’s PTMP.
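A rough Python analogue of such a concordance is sketched below; it is not the PTMP program. It looks for the same kind of pattern (a four-digit number starting with 1, bounded by non-digits), but uses lookarounds rather than consuming the neighbouring characters, so that both years in a range like 1495-1521 are found. The function name and the context width are arbitrary choices.

```python
import re

# A year: 1 followed by three digits, with no digit directly before or after.
YEAR = re.compile(r"(?<!\d)1\d\d\d(?!\d)")

def concordance(text, width=20):
    """Return one 'left context [year] right context' line per match, sorted by year."""
    rows = []
    for m in YEAR.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((m.group(), f"{left:>{width}}[{m.group()}]{right}"))
    return [line for _, line in sorted(rows)]

text = ("He joined the HBC about 1672 and served in James Bay "
        "1672–75 and 1676–78; compare 1495-1521.")
for line in concordance(text):
    print(line)
```

Sorting by the matched year groups similar uses together, which is what makes patterns visible when reading down the list.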
  • 33. Complication: Dates have more than one use. Dates have many uses and conventions, which complicates their analysis: • Although volume 1 of the DCB covers 1000-1700, there are references to modern texts, hence dates in the 20th century appear. • There are ranges of dates (e.g., 1495-1521). • Dates followed by question marks (e.g., 1522?). • Dates in square brackets (none shown here). • And so forth …
  • 34. Years Appearing in Volume 1 of the DCB: My Results (top) v. Turkel’s (bottom) Turkel points out that many of the date-peaks do correspond to notable events in early Canadian history. For example, 1498 was Cabot’s second voyage, and 1666 was the first census of New France. Top produced by me using SAS. Bottom from http://photos1.blogger.com/blogger/4745/1988/1600/dcbo-vol1-dates.jpg
  • 35. Term-Document Matrix For The DCB • Here documents are the biographies and terms are years. • Each person’s biography is searched for years. • Remember that there are complications. – Range of years: 1495-1521 – Years in doubt use a question mark: 1522? – Years of publication for references: (London, 1962)
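To make the idea concrete, here is a small Python sketch that builds such a matrix from invented snippets. The policy for the complications is an assumption chosen for illustration: both ends of a range are counted, uncertain years are counted, and years after 1700 are dropped on the guess that they are publication dates. The talk does not say how these cases were handled.

```python
import re
from collections import Counter

YEAR = re.compile(r"(?<!\d)1\d\d\d(?!\d)")

def year_counts(text, latest=1700):
    """Count years up to `latest`; later years are assumed to be citations."""
    return Counter(int(y) for y in YEAR.findall(text) if int(y) <= latest)

def term_document_matrix(bios):
    """bios maps a name to biography text; returns (sorted years, rows by name)."""
    counts = {name: year_counts(text) for name, text in bios.items()}
    years = sorted({y for c in counts.values() for y in c})
    rows = {name: [c[y] for y in years] for name, c in counts.items()}
    return years, rows

# Invented snippets, not quotations from the DCB.
bios = {
    "A": "Sailed 1495-1521; died 1522? (London, 1962).",
    "B": "Arrived in 1521 and again in 1521.",
}
years, rows = term_document_matrix(bios)
print(years)
for name, row in rows.items():
    print(name, row)
```

Each row is one person's vector of year counts, which is the form needed for comparing biographies.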
  • 36. Part of the DCB Name-Year Matrix
  • 37. Angles between Canadians in the DCB This output is from the programming language Mathematica.
  • 38. Which Canadians are the most alike with respect to years noted in DCB? • Louis Gaudais-Dupont and Mézy de Saffray – 1661,1662,1663 (6 times), 1664 – 1663 (8 times), 1664 (3 times), 1665 (twice) – Angle between them is 21.5° • Thalour du Perron and Sieur de Monts – 1662 (twice), 1663 (twice), 1668 – 1662 (3 times), 1663 (twice) – Angle between them is 22.4°
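The angles can be checked with a few lines of code. This sketch assumes the angle is the usual one between two count vectors, the arccosine of the dot product divided by the product of the lengths; the slides do not spell out the Mathematica computation, but this definition reproduces both reported values to one decimal place. The assignment of each count list to the first or second name is also an assumption, though the angle is symmetric so it does not affect the result.

```python
import math
from collections import Counter

def angle_degrees(a, b):
    """Angle in degrees between two year-count vectors stored as Counters."""
    dot = sum(a[y] * b[y] for y in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Clamp to 1.0 to guard against rounding error for identical vectors.
    return math.degrees(math.acos(min(1.0, dot / (norm_a * norm_b))))

# Year counts as listed on the slide.
gaudais = Counter({1661: 1, 1662: 1, 1663: 6, 1664: 1})
mezy = Counter({1663: 8, 1664: 3, 1665: 2})
du_perron = Counter({1662: 2, 1663: 2, 1668: 1})
de_monts = Counter({1662: 3, 1663: 2})

print(round(angle_degrees(gaudais, mezy), 1))
print(round(angle_degrees(du_perron, de_monts), 1))
```

A small angle means two biographies mention largely the same years in similar proportions; biographies with no years in common are 90° apart.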
  • 39. Wordle.net word cloud using the DCB Easier to do, but less informative.
  • 40. References • Language and Computers: A Practical Introduction to the Computer Analysis of Language – Geoff Barnbrook • Practical Text Mining with Perl (PTMP) – Roger Bilisoly • Corpus Linguistics: Investigating Language Structure and Use – Biber, Conrad and Reppen • Concept Data Analysis: Theory and Applications – Claudio Carpineto and Giovanni Romano (a more technical book) • Programming for Linguists: Perl for Language Researchers – Michael Hammond • Corpora in Applied Linguistics – Susan Hunston • Beginning Regular Expressions – Andrew Watt • Text Mining: Predictive Methods for Analyzing Unstructured Information – Sholom Weiss, Nitin Indurkhya, Tong Zhang and Fred Damerau (a more technical book) • Geometry and Meaning – Dominic Widdows
  • 41. Learning to Program • From teaching STAT 527, I know that students vary in how much they like programming. However, it’s powerful, so it is worth trying if it sounds interesting to you. • Try “The Programming Historian” – Teaches Python – By William J. Turkel, Adam Crymble and Alan MacEachern – http://niche-canada.org/programming-historian/ • NICHE = Network in Canadian History & Environment • NICHE = Nouvelle initiative canadienne en histoire de l’environnement Also see William Turkel’s home page: http://history.uwo.ca/faculty/turkel/, which links to his now defunct blog, “Digital History Hacks.”
  • 42. eXtensible Markup Language (XML) and the Text Encoding Initiative (TEI)
    The DCB used HTML, which uses tags to inform a Web browser how to display its biographies. No year tags! It would be useful to have additional tags that encode information for human consumption. A protocol called XML (a form of SGML) was created to do just this. The XML tags are red at left.
    The TEI Consortium (http://www.tei-c.org/index.xml) produces standards and encourages the encoding of information in literary and linguistic texts. Unfortunately, this kind of tagging is done by humans at present (see example at left), which is labor intensive.
      <text> <body><div0>
      <head>The following is a Copy of a LETTER sent by the Author's Master to the Publisher.</head><div1>
      <p n="1"> <name reg="Wheatley, Phillis" type="personal">PHILLIS</name> was brought from <name rend="italic" type="geographical">Africa</name> to <name type="geographical" key="italic">America</name>, in the Year 1761, between Seven and Eight Years of Age. Without any Assistance from School Education, and by only what she was taught in the Family, she, in sixteen Months Time from her Arrival, attained the English Language, to which she was an utter Stranger before, to such a Degree, as to read any, the most difficult Parts of the Sacred Writings, to the great Astonishment of all who heard her. </p>
      <p n="2">As to her WRITING, her own Curiosity led her to it; and this she learnt in so short a Time, that in the Year 1765, she wrote a Letter to the <name reg="Occum, Samson" type="personal">Rev. Mr. OCCOM</name> <note resp="editor" type="biographical">Samson Occum (1723-1792) was a converted Mohegan Indian who became a Christian minister. He was a friend of Susanna Wheatley, Phillis Wheatley's mistress, and a friend and correspondent of Phillis Wheatley.</note>, the <name type="ethnological" rend="italic">Indian</name> Minister, while in <name type="geographical" rend="italic">England</name>. </p>
      <p n="3">She has a great Inclination to learn the Latin Tongue, and has made some Progress in it. This Relation is given by her Master who bought her, and with whom she now lives. </p>
      <signed><name reg="Wheatley, John" type="personal">JOHN WHEATLEY</name>.</signed>
      <dateline rend="italic"><name rend="italic" type="geographical">Boston</name>, <date><distinct rend="italic">Nov.</distinct> 14, 1772</date>.</dateline>
      </div1> </div0></body></text> </TEI.2>
    Letter from John Wheatley to the Publisher sent Nov. 14, 1772. From the Early Americas Digital Archive (http://www.mith2.umd.edu/eada/), supported by the Maryland Institute for Technology in the Humanities (MITH). http://www.mith2.umd.edu/eada/html/display.php?docs=wheatley_letter.xml&action=show