Digital Text and
Data-Intensive Research
Nina Tahmasebi, Associate Professor
University of Gothenburg
Digital Literacy | 2020-2021
Nina Tahmasebi, Digital Literacy, Sept. 2020
Centre for
Digital Humanities
(2018-2019)
Mathematics
(B.Sc & M.Sc)
2003-2008
Computer/ Data Science
(PhD + Postdoc)
2008-2014
NLP /
Language Technology
(Researcher, Associate
Professor) 2014→
Views on text
Language
1010011010010
1001010010101
0011010010101
Data
Based on
• Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for
Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227.
• Tahmasebi, Nina, Hagen, Niclas, Brodén, Daniel, & Malm, Mats. (2019). "A Convergence of
Methodologies: Notes on a Data-intensive research methodology." DHN2019. p. 437-449.
When do we benefit from
computational methods?
A single physical
piece can be
studied in detail.
A few physical pieces
can be studied and
compared in detail.
Too many physical
pieces cannot be
treated manually.
From text to answers
text
text mining
method
research question
results
From text to answers
text
research question
text mining
method
results
Today’s outline
1. Digital Text
2. Data-intensive research methodology
3. Research results and interpretation
Digital Text
A book:
• Empty pages in the
beginning / end
• Large letter at the
beginning of each chapter
• Images?
A single physical
piece can be
studied in detail.
A few physical pieces
can be studied and
compared in detail.
Too many physical
pieces cannot be
treated manually.
Too many physical
pieces cannot be
treated manually.
Digital Text
Too many digital texts cannot
be studied in too much
detail either!
We need to ignore a lot of formatting
• White pages
• White space
• Fonts
• Capitalization of letters
• Etc…
Digital text
Printed texts
not available digitally
Printed texts
born digital
Other digital
publications
User generated text
Edited text
Fewer errors:
• Fewer OCR errors, thanks to modern fonts
• Cleaner pages, younger age
• Modern language
Data of the kind:
• News
• Professional blogs
• Reviews
A lot of errors
• Spelling errors
• Grammatical errors
• Abbreviations
• Smileys
(automatic) Metadata
The older the text, the more
errors
• Paper of poor quality
• Different fonts
• Skewed columns
• (Spelling variations)
Corpus /dataset
• Corpus → linguistically oriented
• Dataset → any collection of text!
• Thematic
• Time periods
• Media types
• Genre
• …
• There are certain types of
questions that cannot be
answered by any text
Digital text
Individual
Individual text
With individual intent
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Researcher/group
analyzing in detail
Smart search scenario
Smart search/selection
• All interpretation and analysis is
left to the human
• Often, the correctness of each
individual bit is simple to verify
• But what happens when we have
millions of bits and pieces?
→ We still cannot study them manually
Researcher/group
analyzing in detail
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Sources of error:
• We made a bad model:
• E.g. Lost formatting
• Too many OCR errors in the text
→ We cannot find what we are looking
for
→ We find much more than we need
• What we are looking for
semantically is not covered by the
terms we use for search:
kvinna ≠ quinna
• Other sources of error?
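The kvinna ≠ quinna problem can be sketched in a few lines: an exact keyword search misses the older spelling, while a similarity-based search finds it. This is only an illustration. The token list, the function names and the 0.6 threshold are assumptions, and character similarity is just one of several possible ways to handle spelling variation.

```python
from difflib import SequenceMatcher


def exact_hits(query, tokens):
    """Return tokens identical to the query (plain keyword search)."""
    return [t for t in tokens if t == query]


def fuzzy_hits(query, tokens, threshold=0.6):
    """Return tokens whose character similarity to the query reaches the threshold.

    The threshold is an illustrative assumption: a lower value finds more
    spelling variants but also more false hits.
    """
    return [t for t in tokens
            if SequenceMatcher(None, query, t).ratio() >= threshold]


# Hypothetical tokens from a historical Swedish text with spelling variation.
tokens = ["quinna", "kvinna", "man", "kvinnan"]

print(exact_hits("kvinna", tokens))  # the older spelling "quinna" is missed
print(fuzzy_hits("kvinna", tokens))  # spelling variants are included
```

Note the trade-off from the list above: loosening the match finds what we were missing, but it can also return much more than we need.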
Researcher/group
analyzing in detail
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Researcher/group
analyzing in detail
Individual
Individual text
With individual intent
Signal change
Signal
topic, cluster, vector…
Multiple texts –
dataset/corpus
Researcher/group
analyzing in detail
Text mining scenario
NLP
Evaluation
Information extraction
ICALL
Meaning change
Change in grammar, sentiment, argumentation
Temporal Information Retrieval
Macro analysis and signal change
Word sense induction
Word sense disambiguation
Role labeling
Event detection
IR
IE/signal/information
Word vectors/word matrices
tfIdf/mutual information
Models of grammar
Language Models
Language Technology /NLP
Lemmatization
Part of speech tagging
Parsing
Semantic enrichment (e.g., word sense disambiguation)
Extract temporal references in text
Link data
Filter (e.g., boilerplate, ads, recurrent data)
Clean words from noise
Normalize/remove stop words
Temporal references based on metadata
Pre-processing
Gather dataset
Include links
How long was each passage viewed
Metadata
Data collection
Text mining scenario
Smart search
scenario
Clean much – keep much information
Tokenize
Remove low-frequency words
Remove veeeery high-frequency words
Tokens with little information
• Numbers, punctuation marks etc.
Remove capitalization
Normalize (é → e, eeee→e)
→ Choices all depend on application
and research question
Matter of economy:
• We cannot afford
to keep it all
• So we keep what gives us
most value (= information)
frequency
information
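A minimal sketch of the cleaning steps above, using only the Python standard library. The function names, the two example documents and the thresholds (min_count, max_share) are assumptions for illustration. As the slide says, the right choices depend on the application and the research question.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents (é -> e) and collapse letter runs (eeee -> e)."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse three or more repeats of the same character into one.
    return re.sub(r"(\w)\1{2,}", r"\1", text)


def tokenize(text):
    """Keep runs of letters only, which drops numbers and punctuation."""
    return re.findall(r"[^\W\d_]+", text)


def preprocess(docs, min_count=2, max_share=0.2):
    """Tokenize, then drop rare tokens and very frequent tokens."""
    tokenized = [tokenize(normalize(d)) for d in docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    total = sum(counts.values())
    keep = {t for t, c in counts.items()
            if c >= min_count and c / total <= max_share}
    return [[t for t in doc if t in keep] for doc in tokenized]


# Hypothetical example documents.
docs = [
    "The café was veeeery nice, the room was nice.",
    "The room had 2 windows; the view was nice!",
]
print(preprocess(docs))
```

Every threshold here throws information away. Raising min_count or lowering max_share gives a smaller, cheaper vocabulary at the cost of detail.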
I like the room but not the sheet. (only verbs)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheets.
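The different views of the example sentence can be reproduced with a small sketch. The stop word list, lemma table and part-of-speech table below are hand-made assumptions for this one sentence. A real pipeline would use a trained lemmatizer and tagger. Frequency filtering is left out because it needs counts from a whole corpus.

```python
# Hypothetical hand-made resources covering only this sentence.
STOP_WORDS = {"i", "the", "but", "not"}
LEMMAS = {"sheets": "sheet"}
POS = {"i": "PRON", "like": "VERB", "the": "DET", "room": "NOUN",
       "but": "CONJ", "not": "ADV", "sheet": "NOUN"}

sentence = "I like the room but not the sheets."
tokens = sentence.lower().rstrip(".").split()

# Each step keeps less of the sentence than the one before.
no_stop_words = [t for t in tokens if t not in STOP_WORDS]
lemmatized = [LEMMAS.get(t, t) for t in no_stop_words]
only_nouns = [t for t in lemmatized if POS.get(t) == "NOUN"]
only_verbs = [t for t in lemmatized if POS.get(t) == "VERB"]

print("after stop word filtering:", no_stop_words)
print("after lemmatization:      ", lemmatized)
print("only nouns:               ", only_nouns)
print("only verbs:               ", only_verbs)
```

Note that stop word filtering removes "not" here, so the negative opinion about the sheets is no longer visible in any of the later views.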
3. Nouns. After a series of experiments, it was determined that the thematic
information in this corpus could best be captured by modeling only the remaining
nouns. Using the Stanford POS tagger, each word in each segment was marked up with
a part of speech indicator and all but the nouns were removed.
Jockers and Mimno, Significant Themes in
19th-Century Literature
When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday
with a party of special magnificence, there was much talk and excitement in Hobbiton.
Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his
remarkable disappearance and unexpected return. The riches he had brought back from his travels had now
become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was
full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to
marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at
fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark.
There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that
anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth.
‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’
But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to
forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course,
the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant
families. But he had no close friends, until some of his younger cousins began to grow up.
The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted
Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally
dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. ‘You had better come and live
here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At
that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and
coming of age at thirty-three.
Amount of
information
Amount of text
Text mining
method
In short, ladies and gentlemen, my
message today is that
data is gold. … Let's start mining it.
Neelie Kroes
Vice-President of the European Commission responsible for the
Digital Agenda, SPEECH/11/872 , 2011
Is it true that data is gold?
same data
+ different methods
= different answers
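A toy illustration of this point: asking "which word is most important?" of the same three documents gives different answers depending on whether we count raw frequency or use a tf-idf weighting. The documents and function names are hypothetical, and the tf-idf variant used here (raw count times log(N / document frequency), summed over documents) is one of several common formulations.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    "the room the view the bed".split(),
    "the room the breakfast".split(),
    "the pool the pool".split(),
]


def top_by_frequency(docs):
    """Method A: the word with the highest total count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return counts.most_common(1)[0][0]


def top_by_tfidf(docs):
    """Method B: the word with the highest summed tf-idf score."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    scores = Counter()
    for doc in docs:
        for tok, tf in Counter(doc).items():
            scores[tok] += tf * math.log(n_docs / doc_freq[tok])
    return scores.most_common(1)[0][0]


print(top_by_frequency(docs))
print(top_by_tfidf(docs))
```

Neither answer is wrong. Each method simply exposes a different part of the information in the text.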
Since there is an infinite amount
of information in the text,
the text becomes infinitely
complex.
→ Currently, there are no
methods to mine all the
information
Data-intensive
research methodology
Traditional research methodology
Research
question
Text
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Hypothesis
Data Hypothesis
Data Hypothesis
Hypothesis
Data-intensive research methodology
Text mining
method
Text
(digital large-scale text)
Text-mining method
Dimensions
Filtering: Function words
Filtering: Stopwords
Part-of-speech tagging
Lemmatization
Tokenization
NLP pipeline: From text to result
Hypothesis
Data-intensive research methodology
Text mining
method
results
Text
(digital large-scale text)
Results as a window to the text
Viewpoint on the data
The better your method
(WRT the information related to
your research question)
→ the better the pieces
Amount of information
Amount of text
Text mining
method
Hypothesis
Data-intensive research methodology
Text mining
method
results
Text
(digital large-scale text)
Research
question
Data-intensive research methodology
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Research
question
Data-intensive research methodology
results
results
results
Text mining
method
Text
(digital large-scale text)
Research
question
Results and research questions
Research
question
Sometimes the results
do not answer
the research question in full
Image: https://ipec.co.zw
Research questions
Evidence
• Attack/demonstrations
• Homicide investigation
• Financial irregularities
• Data breach
Majority
• How well is our product received
• Which of our issues are
most/least attractive to our
voters?
• How will people vote?
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Truths about data-
intensive research
Not all methods fit all data
Not all data fit all questions
Not all methods can answer all questions
Nothing lives separately,
it must be evaluated together:
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Truths about data-
intensive research (II)
Gives us the possibility to ask
new kinds of questions
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Truths about data-
intensive research (II)
Gives us the possibility to ask
new kinds of questions
Which kind of questions fit
your purposes?
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Results and
research questions
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Reduction vs. representation
digitization
preprocessing
method
hypothesis
choice
?
Store
Writer A
Male authors
Journal 1
Written language
Pharmacy
Writer B
Female authors
Journal 2
Spoken language
Are these different (significantly) or the same?
[Figure: Sample 1 and Sample 2]
H1 H0
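One way to make the question "different or the same?" concrete is an exact permutation test: under H0 the group labels do not matter, so we check how often a relabeling of the pooled values gives a gap in means at least as large as the observed one. This is a sketch with invented numbers (for example, a word's frequency in three texts per group). The function name is hypothetical, and the test is only meaningful if the samples were selected in a way that supports inference.

```python
from itertools import combinations


def exact_permutation_p(sample_1, sample_2):
    """Two-sided exact permutation test for a difference in means."""
    pooled = sample_1 + sample_2
    n1 = len(sample_1)
    n2 = len(sample_2)
    total = sum(pooled)

    def mean_gap(sum_1):
        # Absolute difference between the two group means.
        return abs(sum_1 / n1 - (total - sum_1) / n2)

    observed = mean_gap(sum(sample_1))
    # Enumerate every way of assigning n1 of the pooled values to group 1.
    gaps = [mean_gap(sum(pooled[i] for i in idx))
            for idx in combinations(range(len(pooled)), n1)]
    extreme = sum(1 for g in gaps if g >= observed - 1e-12)
    return extreme / len(gaps)


# Hypothetical measurements for two small groups of texts.
sample_1 = [1, 2, 3]
sample_2 = [7, 8, 9]
print(exact_permutation_p(sample_1, sample_2))
```

Full enumeration is only feasible for small samples. For larger ones a random subset of relabelings is commonly used instead.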
Inference requires
random selection
• Only if the selection is random,
can we use the sample to draw
conclusions about the world
• We almost NEVER have a random
sample in a textual corpus!
→ We cannot draw conclusions
about the world
[Figure: Sample 1 and Sample 2]
random
inference
When we have little data, the uncertainty
is large:
• Is A larger than B?
But when we have large data, we are more
certain about our observations. Still, our
errors can be much larger
• Because our selection is biased
[Figure: Sample 1 and Sample 2]
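The point can be illustrated with a small simulation. All numbers are invented: a population where 30% of texts have some property, a small random sample, and a large collection that over-represents texts with the property. The standard error formula is the usual one for a proportion, and the variable names are assumptions.

```python
import math
import random


def share_with_standard_error(sample):
    """Estimated share of 1s and its standard error, sqrt(p(1-p)/n)."""
    n = len(sample)
    p = sum(sample) / n
    return p, math.sqrt(p * (1 - p) / n)


# Hypothetical population: 30% of 10,000 texts have the property (coded 1).
population = [1] * 3000 + [0] * 7000
true_share = sum(population) / len(population)

# Small but random sample.
rng = random.Random(0)
small_random = rng.sample(population, 50)

# Large but biased collection: every text with the property, yet only
# 2,000 of the 7,000 texts without it (e.g. because of what got digitized).
large_biased = [1] * 3000 + [0] * 2000

p_small, se_small = share_with_standard_error(small_random)
p_biased, se_biased = share_with_standard_error(large_biased)

print(f"true share:   {true_share:.2f}")
print(f"small random: {p_small:.2f} (standard error {se_small:.3f})")
print(f"large biased: {p_biased:.2f} (standard error {se_biased:.3f})")
```

The large collection reports a very small standard error, which looks like certainty, yet its estimate is far from the true share. More data shrinks the uncertainty, not the bias.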
In corpus studies, we frequently do have enough data, so
the fact that a relation between two phenomena is
demonstrably non-random, does not support the
inference that it is not arbitrary. Language is never,
ever, ever, random.
Adam Kilgarriff, 2005
Method + Data = Results
result
result
hypothesis
Reject:
1. Data
2. Method / Preprocessing
3. Hypothesis
result
hypothesis
Accept:
1. Method
2. Correct interpretation of the results
result
hypothesis
Math results, average difference
Men
Women
Source: Factfulness
Men
Women
Math results, average difference
Source: Factfulness
NUMBER OF INDIVIDUALS WITH
DIFFERENT MATH SCORES 2016
Men
Women
Range of math scores
Source: Factfulness
Men
Women
Comparison of the same data
NUMBER OF INDIVIDUALS WITH
DIFFERENT MATH SCORES 2016
Men
Women
Source: Factfulness
Men
Women
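The contrast between the two views of the same data can be sketched with invented numbers. The scores below are hypothetical and are not the data behind the Factfulness charts. The overlap measure (share of one group's values that fall inside the other group's range) is a simple assumption chosen for illustration.

```python
from statistics import mean

# Hypothetical scores, invented for illustration only.
group_a = [400, 450, 500, 500, 550, 600]
group_b = [420, 470, 520, 520, 570, 620]


def share_within_range(values, other):
    """Share of `values` that lie between the min and max of `other`."""
    lo, hi = min(other), max(other)
    return sum(lo <= v <= hi for v in values) / len(values)


mean_gap = mean(group_b) - mean(group_a)
overlap_a = share_within_range(group_a, group_b)
overlap_b = share_within_range(group_b, group_a)

print(f"difference in means: {mean_gap}")
print(f"share of group A inside group B's range: {overlap_a:.2f}")
print(f"share of group B inside group A's range: {overlap_b:.2f}")
```

Reporting only the difference in means suggests two separate groups. Looking at the full range of scores shows that most individuals overlap.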
result
hypothesis
1. Method
2. Correct interpretation of the results
3. Where do the results live?
Experimental design
Even when the math is right, we need to question the
selection and the grounds on which our conclusions rest.
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
Evaluation
Evaluation
individual
individual text
signal
topic, cluster, vector…
signal change
collective text
minimum optimum medium
Representativeness
Conclusions
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Experimental design
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
Prof. Hans Rosling
You can’t understand
the world without
numbers…
Factfulness
… and you cannot
understand it
only with numbers.
Thank you!
Nina.tahmasebi@gu.se
nina@tahmasebi.se
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Workshop on Digital Literacy - Digital text and data-intensive research

  • 1. Digital Text and Data-Intensive Research Nina Tahmasebi, Associate Professor University of Gothenburg Digital Literacy | 2020-2021 Nina Tahmasebi, Digital Literacy, Sept. 2020
  • 2. Centre for Digital Humanities (2018-2019); Mathematics (B.Sc. & M.Sc.), 2003-2008; Computer/Data Science (PhD + postdoc), 2008-2014; NLP / Language Technology (Researcher, Associate Professor), 2014→
  • 5. Based on • Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227. • Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive Research Methodology." DHN2019 (2019): 437-449.
  • 6. When do we benefit from computational methods?
  • 7. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  • 10. From text to answers: text, text mining method, research question, results
  • 11. From text to answers: text, research question, text mining method, results
  • 12. Today’s outline: 1. Digital Text 2. Data-intensive research methodology 3. Research results and interpretation
  • 13. Digital Text
  • 14. A book: • Empty pages in the beginning / end • Large letter at the beginning of each chapter • Images?
  • 15. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  • 16. Too many physical pieces cannot be treated manually. Digital Text
  • 17. Too many digital texts cannot be studied in too much detail either! We need to ignore a lot of formatting: • Empty pages • White space • Fonts • Capitalization of letters • Etc.
  • 19. Digital text. Categories: printed texts not available digitally; printed texts born digital; other digital publications; user-generated text; edited text. Fewer errors of the kind: • OCR errors (due to modern fonts, less dirty pages, younger age) • Modern language. Data of the kind: • News • Professional blogs • Reviews. A lot of errors: • Spelling errors • Grammatical errors • Abbreviations • Smileys. (Automatic) metadata. The older the text, the more errors: • Paper in bad quality • Different fonts • Skewed columns • (Spelling variations)
  • 21. Digital text: corpus / dataset • Corpus → linguistically oriented • Dataset → any collection of text! • Thematic • Time periods • Media types • Genre • … • There are certain types of questions that cannot be answered by any text
  • 22. Smart search scenario: individual; individual text, with individual intent; multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
  • 23. Smart search/selection • All interpretation and analysis is left to the human • Often, the correctness of each individual bit is simple to verify • But what happens when we have millions of bits and pieces? → We still cannot study them manually. Diagram labels: multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
  • 25. Sources of error: • We made a bad model, e.g. lost formatting • Too many OCR errors in the text → We cannot find what we are looking for → We find much more than we need • What we are looking for semantically is not covered by the terms we use for search: kvinna ≠ quinna • Other sources of error? Diagram labels: multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
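
The search problem on this slide (kvinna ≠ quinna) can be illustrated with a small sketch. It swaps in plain string similarity from Python's standard `difflib` module, which is not a method named in the talk, and the vocabulary is a hypothetical example: `quinna` is the spelling variant from the slide, `kvjnna` stands in for an OCR error, and `kanna` is an unrelated word.

```python
from difflib import get_close_matches

# Hypothetical vocabulary extracted from a digitized corpus (an assumption).
vocabulary = ["kvinna", "quinna", "kvinnan", "kvjnna", "kanna", "man"]
query = "kvinna"

# Exact search: misses the spelling variant and the OCR error.
exact_hits = [word for word in vocabulary if word == query]

# Fuzzy search: also returns words that merely look similar to the query.
fuzzy_hits = get_close_matches(query, vocabulary, n=len(vocabulary), cutoff=0.6)

print("exact:", exact_hits)
print("fuzzy:", fuzzy_hits)
```

Exact matching illustrates "we cannot find what we are looking for", while a loose similarity threshold illustrates "we find much more than we need", since the unrelated `kanna` is returned as well. The cutoff of 0.6 is an arbitrary choice for this sketch.
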
  • 26. Text mining scenario: individual; individual text, with individual intent; multiple texts (dataset/corpus); signal (topic, cluster, vector…); signal change; researcher/group analyzing in detail
  • 27. NLP; Evaluation; Information extraction; ICALL; Meaning change; Change in grammar, sentiment, argumentation; Temporal information retrieval; Macro analysis and signal change; Word sense induction; Word sense disambiguation; Role labeling; Event detection; IR; IE/signal/information; Word vectors/word matrices; tf-idf/mutual information; Models of grammar; Language models; Language Technology / NLP; Lemmatization; Part-of-speech tagging; Parsing; Semantic enrichment (e.g., word sense disambiguation); Extract temporal references in text; Link data; Filter (text boilerplate, ads, recurrent data); Clean words from noise; Normalize/remove stop words; Temporal references based on metadata; Pre-processing; Gather dataset; Include links; How long was each passage viewed; Metadata; Data collection
  • 28. The same overview as on slide 27, with two scenarios marked: text mining scenario; smart search scenario
  • 30. Clean much – keep much information: • Tokenize • Remove low-frequency words • Remove veeeery high-frequency words • Remove tokens with little information (numbers, punctuation marks, etc.) • Remove capitalization • Normalize (é → e, eeee → e) → The choices all depend on the application and the research question. A matter of economy: • We cannot afford to keep it all • So we keep what gives us the most value (= information). Plot axes: frequency, information
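
The cleaning steps listed on this slide can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the pipeline used in the talk: the function names, the thresholds `min_count` and `max_share`, and the three example texts are all hypothetical.

```python
import re
import unicodedata
from collections import Counter


def strip_accents(text):
    """Map accented letters to their base letter (é -> e). Lossy by design."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase, keep only alphabetic tokens, squeeze 3+ repeated letters."""
    text = strip_accents(text).lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # drops numbers and punctuation
    return [re.sub(r"(.)\1{2,}", r"\1", tok) for tok in tokens]


def filter_by_frequency(docs, min_count=2, max_share=0.2):
    """Drop tokens that are too rare or too dominant in the whole dataset."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    keep = {
        tok for tok, count in counts.items()
        if count >= min_count and count / total <= max_share
    }
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Hypothetical mini-dataset of three "documents".
texts = [
    "The room was nice.",
    "The room was small.",
    "The sheets were dirty and the room was cold.",
]
docs = [tokenize(text) for text in texts]
filtered = filter_by_frequency(docs)
print(filtered)
```

With these thresholds the very frequent "the" and all words that occur only once are dropped, so the words that distinguish the documents (nice, small, dirty, cold) disappear too. That is the slide's point in miniature: every threshold is a choice that has to match the application and the research question.
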
  • 31. Original: I like the room but not the sheets. After stop word filtering: I like the room but not the sheets. After lemmatization: I like the room but not the sheet. Only nouns: I like the room but not the sheet. Frequency filtering: I like the room but not the sheet. Only verbs: I like the room but not the sheet.
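
What each of these filters keeps can be made concrete with a toy example. The sketch below uses a hand-made lexicon and stop word list, both hypothetical and only large enough for this one sentence; a real study would use a trained lemmatizer and part-of-speech tagger instead.

```python
# Hypothetical hand-made lexicon: token -> (lemma, part-of-speech tag).
LEXICON = {
    "i": ("i", "PRON"),
    "like": ("like", "VERB"),
    "the": ("the", "DET"),
    "room": ("room", "NOUN"),
    "but": ("but", "CCONJ"),
    "not": ("not", "PART"),
    "sheets": ("sheet", "NOUN"),
}
# Hypothetical stop word list for this sentence.
STOP_WORDS = {"i", "the", "but", "not"}

sentence = "I like the room but not the sheets."
tokens = [tok.strip(".,!?").lower() for tok in sentence.split()]

without_stop_words = [tok for tok in tokens if tok not in STOP_WORDS]
lemmas = [LEXICON[tok][0] for tok in tokens]
only_nouns = [LEXICON[tok][0] for tok in tokens if LEXICON[tok][1] == "NOUN"]
only_verbs = [LEXICON[tok][0] for tok in tokens if LEXICON[tok][1] == "VERB"]

print("stop words removed:", without_stop_words)
print("lemmatized:        ", lemmas)
print("only nouns:        ", only_nouns)
print("only verbs:        ", only_verbs)
```

Note that treating "not" as a stop word removes the negation, so after filtering the sentence no longer says that the sheets were disliked.
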
  • 32. "3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed." Jockers and Mimno, Significant Themes in 19th-Century Literature
  • 33. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark. There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth. ‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’ But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course, the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant families. But he had no close friends, until some of his younger cousins began to grow up. The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. 
‘You had better come and live here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and coming of age at thirty-three.
  • 36. Amount of information; amount of text; text mining method
  • 37. "In short, ladies and gentlemen, my message today is that data is gold. … Let's start mining it." Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda, SPEECH/11/872, 2011
  • 38. Is it true that data is gold?
  • 39. same data + different methods = different answers
  • 40. Since there is an infinite amount of information in the text, the text becomes infinitely complex. → Currently, there are no methods to mine all the information
  • 42. Traditional research methodology: research question, text
  • 43. Data-intensive research methodology: research question, text (digital large-scale text)
  • 44. Data-intensive research methodology: research question, text (digital large-scale text), hypothesis
  • 45. Data, hypothesis; data, hypothesis
  • 46. Data-intensive research methodology: hypothesis, text mining method, text (digital large-scale text)
  • 47. NLP pipeline, from text to result: tokenization, lemmatization, part-of-speech tagging, filtering (stopwords), filtering (function words), dimensions, text-mining method
  • 48. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text)
  • 49. Results as a window to the text
  • 50. Viewpoint on the data
  • 55. The better your method (with respect to the information related to your research question) → the better the pieces. Diagram labels: amount of information; amount of text; text mining method
  • 56. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text), research question
  • 57. Data-intensive research methodology: hypothesis, text mining method, results, text (digital large-scale text), research question
  • 58. Data-intensive research methodology: results, results, results, text mining method, text (digital large-scale text), research question
  • 59. Results and research questions: sometimes the results do not answer the research question in full
  • 61. Image: https://ipec.co.zw
  • 62. Research questions. Evidence: • Attack/demonstrations • Homicide investigation • Financial irregularities • Data breach. Majority: • How well is our product received? • Which of our issues are most/least attractive to our voters? • How will people vote?
  • 63. Digital research needs to be evaluated on the combination of data, method, and research question
  • 64. Truths about data-intensive research: • Not all methods fit all data • Not all data fit all questions • Not all methods can answer all questions. Nothing lives separately; everything must be evaluated together: hypothesis, text mining method, results, text (digital large-scale text)
  • 65. Truths about data-intensive research (II): gives us the possibility to ask new kinds of questions. Diagram labels: hypothesis, text mining method, results, text (digital large-scale text)
  • 68. Truths about data-intensive research (II): gives us the possibility to ask new kinds of questions. Which kinds of questions fit your purposes? Diagram labels: hypothesis, text mining method, results, text (digital large-scale text)
  • 69. Results and research questions: hypothesis, text mining method, results, text (digital large-scale text)
  • 71. Are these (significantly) different or the same? Sample 1 and Sample 2, e.g.: store and pharmacy; writer A and writer B; male authors and female authors; journal 1 and journal 2; written language and spoken language. H0, H1
  • 72. Inference requires random selection • Only if the selection is random can we use the sample to draw conclusions about the world • We almost NEVER have a random sample in a textual corpus! → We cannot draw conclusions about the world. Diagram labels: Sample 1, Sample 2, random, inference
  • 73. When we have little data, the uncertainty is large: • Is A larger than B? When we have large data, we are more certain about our observations. STILL, our errors can be much larger • Because our selection is biased
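
The contrast between uncertainty and error can be simulated. The sketch below is an illustration under assumptions that are not from the talk: the population is synthetic (normally distributed values), and the biased selection rule (only values above 45 enter the sample) is hypothetical.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic "world": 100,000 values with mean 50 and standard deviation 10.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def summarize(sample):
    """Return (mean, standard error of the mean) for a sample."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, std_err


# Small but random sample: large uncertainty, no systematic error.
small_random = random.sample(population, 30)
# Large but biased selection: only values above 45 were ever "collected".
large_biased = [x for x in population if x > 45]

small_mean, small_se = summarize(small_random)
biased_mean, biased_se = summarize(large_biased)

print(f"true mean:    {true_mean:.2f}")
print(f"small random: {small_mean:.2f} +/- {small_se:.2f} (n={len(small_random)})")
print(f"large biased: {biased_mean:.2f} +/- {biased_se:.2f} (n={len(large_biased)})")
```

The biased sample reports a much smaller standard error than the small random sample, yet its mean lies several points away from the true mean. More data made the estimate more certain, not more correct.
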
  • 74. "In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary." Adam Kilgarriff, "Language is never, ever, ever, random", 2005
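
A worked calculation shows why "demonstrably non-random" is cheap with enough data. The sketch computes the Pearson chi-square statistic for a 2×2 table by hand; the word frequencies (100 versus 105 occurrences per million tokens) and the corpus sizes are hypothetical, while 3.84 is the standard critical value for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_VALUE = 3.84  # chi-square, 1 degree of freedom, p = 0.05


def compare(tokens_per_corpus):
    """Chi-square for a hypothetical word seen 100 vs. 105 times per million tokens."""
    hits_1 = 100 * tokens_per_corpus // 1_000_000
    hits_2 = 105 * tokens_per_corpus // 1_000_000
    return chi_square_2x2(
        hits_1, tokens_per_corpus - hits_1,
        hits_2, tokens_per_corpus - hits_2,
    )


small = compare(1_000_000)    # two corpora of 1 million tokens each
large = compare(100_000_000)  # two corpora of 100 million tokens each

print(f"1M tokens each:   chi2 = {small:.2f}, significant: {small > CRITICAL_VALUE}")
print(f"100M tokens each: chi2 = {large:.2f}, significant: {large > CRITICAL_VALUE}")
```

The relative difference between the corpora is the same 5 percent in both runs. Only the corpus size changes, and the statistic grows in proportion to it, so a large enough corpus makes almost any difference "significant".
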
  • 75. Method + Data = Results. Diagram label: result
  • 76. Diagram labels: result, hypothesis
  • 77. Reject: 1. Data 2. Method / preprocessing 3. Hypothesis. Diagram labels: result, hypothesis
  • 78. Accept: 1. Method 2. Correct interpretation of the results. Diagram labels: result, hypothesis
  • 79. Math results, average difference: men, women. Source: Factfulness
  • 80. Math results, average difference: men, women. Source: Factfulness
  • 81. Number of individuals with different math scores, 2016: men, women. Axis: range of math scores. Source: Factfulness
  • 82. Comparison of the same data: number of individuals with different math scores, 2016; men, women. Source: Factfulness
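
The difference between comparing averages and comparing whole distributions can be checked numerically. The sketch below uses two hypothetical normal distributions; the means and standard deviations are made up for illustration and are NOT the figures behind the slides. `NormalDist.overlap` from the Python standard library returns the shared area under the two density curves.

```python
from statistics import NormalDist

# Hypothetical score distributions for two groups (assumed values).
group_a = NormalDist(mu=500, sigma=100)
group_b = NormalDist(mu=510, sigma=100)

# View 1: only the averages. The groups differ by 10 points.
mean_difference = group_b.mean - group_a.mean

# View 2: the full distributions. Share of area the two curves have in common.
overlap = group_a.overlap(group_b)

print(f"difference in averages:   {mean_difference:.0f} points")
print(f"overlap of distributions: {overlap:.1%}")
```

With these assumed numbers the averages differ, but about 96 percent of the two distributions overlap. Both views describe the same data; which one is shown decides what the reader concludes.
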
  • 83. 1. Method 2. Correct interpretation of the results 3. Where do the results live? Diagram labels: result, hypothesis
  • 85. Experimental design: even when the math is right, we need to question the selection and the grounds on which our conclusions rest. • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 86. Evaluation
  • 87. Evaluation: individual; individual text; signal (topic, cluster, vector…); signal change; collective text; minimum, optimum, medium
  • 89. Conclusions
  • 91. Digital research needs to be evaluated on the combination of data, method, and research question
  • 92. Experimental design • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 93. "You can’t understand the world without numbers … and you cannot understand it only with numbers." Prof. Hans Rosling, Factfulness