Positive words carry less information than negative words
D. Garcia, A. Garas and F. Schweitzer
EPJ Data Science, 2012
JClub 2014.4.23
by Kazutoshi Sasahara
Introduction
• Is human language biased towards positive emotion, or is it neutral?
• Statistical properties of word frequency and length
  - Word frequency ∝ (word rank)^-1 (Zipf 1949)
  - Word frequency predicts word length, as a result of the principle of least effort
  - Word length increases with information content for efficient communication (Piantadosi et al. 2011)
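As a rough illustration of the Zipf relation above, the sketch below ranks the words of a toy count table and compares each observed count with the ideal prediction C / rank. The counts and the helper name `zipf_table` are made up for illustration and are not data from the paper.

```python
def zipf_table(counts):
    """Return (word, rank, observed, predicted) rows, with predicted = top_count / rank."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    return [(w, r, c, top / r) for r, (w, c) in enumerate(ranked, start=1)]


# Hypothetical counts, chosen to follow f = 1200 / rank exactly.
toy_counts = {"the": 1200, "of": 600, "and": 400, "to": 300}
rows = zipf_table(toy_counts)
for word, rank, observed, predicted in rows:
    print(f"{word:>4} rank={rank} observed={observed} predicted={predicted:.0f}")
```

Real corpora follow the relation only approximately, so the observed and predicted columns would differ there.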
Introduction (cont.)
• Pollyanna hypothesis (Boucher & Osgood 1969): a universal human tendency to use evaluatively positive words more frequently than evaluatively negative words in communicating.
• Previous research reported an emotional bias, but lacked controls
  - Problems in the use of Amazon Mechanical Turk
  - Possible biases: acquiescence bias, social desirability bias, framing effects
Data Analysis
• This paper examined emotional bias in three major languages on the Internet.
  - English (56.6%), German (6.5%), Spanish (4.6%)
• Data
  - Established lexica of affective word usage: English: 1,034; German: 2,902; Spanish: 1,034
  - Google N-gram dataset: 10^12 tokens
• Valence (v): the degree of pleasure induced by the affective word, rescaled to lie between -1 and 1.
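The rescaling of valence to the range from -1 to 1 can be sketched as a linear map. The slide does not give the raw rating scales, so the 1 to 9 default below is an assumption (a common range for affective norms), and the function name `rescale_valence` is hypothetical; lexica with other raw scales would pass their own `low` and `high`.

```python
def rescale_valence(rating, low=1.0, high=9.0):
    """Map a rating on [low, high] linearly onto [-1, 1].

    The default 1-9 range is an assumption, not taken from the slides.
    """
    midpoint = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return (rating - midpoint) / half_range


print(rescale_valence(1.0))  # most negative rating on the assumed scale
print(rescale_valence(5.0))  # neutral midpoint
print(rescale_valence(9.0))  # most positive rating
```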
Results: Frequency of emotional words
• Positive words predominate on the Internet.
Figure 1 Emotion word clouds with frequencies calculated from Google’s crawl. In each word cloud for English (left), German (middle), and Spanish (right), the size of a word is proportional to its frequency of appearance in the trillion-token Google N-gram dataset [26]. Word colors are chosen from red (negative) to green (positive) in the valence range from psychology studies [7–9]. For the three languages, positive words predominate on the Internet.
Color scale: v = -1 red, v = +1 green. Annotation on the figure: "Exception".
Source: Garcia et al., EPJ Data Science 2012, 1:3 (www.epjdatascience.com/content/1/1/3)
Results: Distribution of emotional words
• The median shifts significantly towards positive values (≈ 0.3).
• 95% confidence intervals (Wilcoxon tests):
  - English: 0.257 ± 0.032
  - German: 0.167 ± 0.017
  - Spanish: 0.287 ± 0.035
• Empirical evidence of positive bias
Figure panels: No control, Control
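One way to see how a median can shift towards positive values is to compare the median valence when every lexicon word counts once with the median when each word is weighted by how often it is used. That this is the comparison behind the slide is an assumption, the numbers below are invented, and the sketch does not reproduce the Wilcoxon test itself.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("empty input")


# Hypothetical valences and usage counts (not data from the paper).
valences = [-0.8, -0.4, 0.0, 0.3, 0.7]
counts = [10, 20, 30, 200, 140]

unweighted = weighted_median(valences, [1] * len(valences))
by_usage = weighted_median(valences, counts)
print(unweighted, by_usage)
```

In this toy table the positive words are the frequent ones, so the usage-weighted median lands on a positive valence while the unweighted one stays at zero.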
Table 1 Correlations between word valence and information measurements.
             English     German      Spanish
ρ(v,f)        0.222**     0.144**     0.236**
ρ(v,I)       –0.368**    –0.325**    –0.402**
ρ(v,I′)      –0.294**    –0.222**    –0.311**
ρ(v,I2)      –0.332**    –0.301**    –0.359**
ρ(v,I3)      –0.313**    –0.201**    –0.340**
ρ(v,I4)      –0.254**    –0.049*     –0.162**
Correlation coefficients of the valence (v), frequency f, self-information I, and information content measured for 2-grams I2, 3-grams I3, and 4-grams I4, and with self-information I′ measured from the frequencies reported in [42–44]. Significance levels: *p < 0.01, **p < 0.001.
[...] ones, but this nonlinear mapping between frequency and self-information makes the latter more closely related to word valence than the former. The first two lines of Table 1 show Pearson’s correlation coefficient of word valence and frequency, ρ(v,f), followed by the correlation coefficient between word valence and self-information, ρ(v,I). For all three languages, the absolute value of the correlation coefficient with I is larger than with f, showing that self-information provides more knowledge about word valence than plain [...]
Results: Relation between information and valence (1)
• Information content is measured by self-information, $I(w) = -\log_2 P(w)$, which tells us more about valence than frequency does.
• Negative correlation between v and I: positive words carry less information than negative words.
• The correlation coefficient becomes smaller in magnitude for larger context sizes (N).
• Slide annotation pointing at Table 1: control analysis.
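The self-information defined on this slide can be computed directly from word counts. The sketch below estimates P(w) as count / total over a toy table; the counts and the function name `self_information` are illustrative assumptions, not values from the Google N-gram dataset.

```python
import math


def self_information(counts):
    """Return I(w) = -log2 P(w) in bits, with P(w) estimated as count / total."""
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


# Hypothetical counts: the frequent words carry fewer bits.
toy_counts = {"good": 8, "fine": 4, "bad": 2, "awful": 2}
info = self_information(toy_counts)
for word, bits in info.items():
    print(f"{word}: {bits:.1f} bits")
```

Because the map from frequency to I(w) is logarithmic, halving a word's frequency always adds exactly one bit, which is the nonlinearity the paper excerpt refers to.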
Results: Relation between information and valence (2)
• For all languages and context sizes, valence decreases with information content.
Figure legend:
(Left) Color: valence (v); size: self-information (I)
(Right) Average self-information: $-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(W = w \mid C = c_i)$
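The context-dependent measure above averages the surprisal of a word over the N contexts in which it occurs. A minimal sketch for the 2-gram case follows, assuming the context c_i is simply the preceding word and that conditional probabilities are estimated from bigram counts in a toy token list. The function name `information_content` and the example sentence are hypothetical.

```python
import math
from collections import Counter


def information_content(tokens, target):
    """Average -log2 P(target | previous word) over all occurrences of target."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter(tokens[:-1])  # how often each word opens a bigram
    surprisals = []
    for prev, word in zip(tokens, tokens[1:]):
        if word == target:
            p = bigrams[(prev, word)] / context_totals[prev]
            surprisals.append(-math.log2(p))
    if not surprisals:
        raise ValueError("target never appears after a context word")
    return sum(surprisals) / len(surprisals)


tokens = "a good day a bad day a good idea".split()
print(information_content(tokens, "day"))
print(information_content(tokens, "good"))
```

Longer contexts (3-grams, 4-grams) would replace the single preceding word with the preceding two or three words; the averaging step stays the same.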
Results: Additional analysis of valence, length, and self-info (1)
• The sign of valence matters
• Word length ∝ I(w)
• Valence ∝ (word length)^-1? (ρ(v,l) is very low or not significant)
• Combined influence of valence and length on I(w)
  - An additional dimension in the communication process, related to emotional content (v) rather than communication efficiency (l)
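The combined influence of valence and length is examined in the paper through the partial correlations ρ(v,I|l) and ρ(l,I|v). A minimal sketch of a first-order partial correlation is given below, using the standard formula built from the three pairwise Pearson coefficients. Whether the authors used exactly this estimator is an assumption, and the toy vectors are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data in which x and y are related only through z.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]
y = [2, -1, 6, 3]
print(pearson(x, y))                 # raw correlation
print(partial_correlation(x, y, z))  # correlation once z is controlled for
```

If a partial coefficient stays close to the raw one, as Table 2 reports for ρ(v,I|l) versus ρ(v,I), the controlled variable explains little of the relation.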
From the paper:

[...] but this trend is not so clear for German. These trends are properly quantified by Pearson’s correlation coefficients between valence and information content for [...] size (Table 1). Each correlation coefficient becomes smaller for larger sizes of [...] as the information content estimation includes a larger context but becomes [...]

Additional analysis of valence, length and self-information
In order to provide additional support for our results, we tested different hypotheses impacting the relation between word usage and valence. First, we calculated Pearson’s and Spearman’s correlation coefficients between the absolute value of the valence and the self-information of a word, ρ(abs(v),I) (see Table 2). We found both correlation coefficients [...] around [...] for German and Spanish, while they are not significant for English. The dependence between valence and self-information disappears if we ignore the sign of the valence, which means, indeed, that the usage frequency of a word is not just related to the overall emotional intensity, but to the positive or negative emotion expressed by the word. Subsequently, we found that the correlation coefficient between word length and self-information (ρ(l,I)) is positive, showing that word length increases with self-information. These values of ρ(l,I) are consistent with previous results [, ]. Pearson’s and Spearman’s [...]

Table 2 Additional correlations between valence, self-information and length.
               English      German       Spanish
ρ(abs(v),I)     0.032◦      0.109***     0.135***
ρ(l,I)          0.378***    0.143***     0.361***
ρ(v,l)         –0.044◦     –0.071***    –0.112***
ρ(v,I|l)       –0.379***   –0.319***    –0.399***
ρ(l,I|v)        0.389***    0.126***     0.357***
Correlation coefficients of the valence (v), absolute value of the valence (abs(v)), and word length (l) versus self-information (I). Partial correlations are calculated for both variables (ρ(v,I|l), ρ(l,I|v)), and correlation between valence and length (ρ(v,l)). Significance levels: ◦p < 0.3, *p < 0.1, **p < 0.01, ***p < 0.001.

Results: Additional analysis of valence, length, and self-info (2)

Table 3 Partial correlation coefficients between valence and information content.
               English      German       Spanish
ρ(v,I2|I)      –0.034◦     –0.100***    –0.058*
ρ(v,I3|I)      –0.101**    –0.070***    –0.149***
ρ(v,I4|I)      –0.134***   –0.020*      –0.084**
Correlation coefficients of the valence (v) and information content measured on different context sizes (I2, I3, I4) controlling for self-information (I). Significance levels: ◦p < 0.3, *p < 0.1, **p < 0.01, ***p < 0.001.

From the paper:

[...] correlation coefficients between valence and length ρ(v,l) are very low or not significant. In order to test the combined influence of valence and length on self-information, we calculated the partial correlation coefficients ρ(v,I|l) and ρ(l,I|v). The results are shown in Table 2, and are within the [...]% confidence intervals of the original correlation coefficients ρ(v,I) and ρ(l,I). This provides support for the existence of an additional dimension in the communication process closely related to emotional content rather than communication efficiency. This is consistent with the known result that word lengths adapt to information [...] independent semantic feature of valence. Valence is also [...] but not to the symbolic representation of the word through [...] influence of context by controlling for word frequency. In Table 3 [...] correlation coefficients of valence with information content for [...] controlling for self-information. We find that most of the [...] of negative sign, with the exception of I2 for English. The [...] is probably related to two word constructions such [...]

• Most of the correlations remain significant and negative, except I2 for English.
• Knowing the possible contexts of a word (N = 2, 3, 4) provides more information about word valence than self-information alone.
Summary
• Empirical evidence for a positive bias in language
  - Positive words are used more frequently.
  - Pollyanna hypothesis
  - Facilitation of social links
• Negative words convey more information content than positive words.
• Word frequency is determined
  - not only by word length and information content
  - but also by emotional content
