SlideShare ist ein Scribd-Unternehmen logo
1 von 19
DATA MINING SCIENTISTS NAMES TO MEASURE CULTURAL BIASES IN SCIENTIFIC RESEARCH 
Presenting NamSor Software at ICOS2014 
1 
2014-08-28
Introduction 
2 
Personal names are markers of Identity 
How immune are researchers to cultural biases? 
PubMed is one of the large publications databases covering Medical Research, Life Sciences, etc. 
Cultural biases could cause valuable medical research to be overlooked, poor research to be overrated
Names & Identity 
3 
●Names go with identities and identities go with names (Brendler, 2012); 
●Names have been utilised, for instance, as markers of cultural, social, ethnic or national identity (Härtel 1997; Virkkula 2001; Postles 2002; Beechetal 2002; Hagström 2006)
●We do accept that the notion of identity is a dynamic construct which cannot be viewed as a fixed entity under investigation 
●However, the combination of forename and surname might often help us to identify majority of names as indicators of national identity 
Identity as a dynamic construct
‘Teaching’ Names & Identity to NamSor 
5 
FN 
LN 
Mette 
Andersen 
Lene 
Andersson 
Eva 
Arndt-Riise 
Heidi 
Astrup 
Mie 
Augustesen 
Margot 
Bærentzen 
Louise 
Bager Nørgaard 
Marie 
Bagger Rasmussen 
Yutta 
Barding 
Ulla 
Barding-Poulsen 
FN 
LN 
Xian 
Dongmei 
Zheng 
Dongmei 
Jin 
Dongxiang 
Xu 
Dongxiang 
Li 
Dongxiao 
Qin 
Dongya 
Li 
Dongying 
Han 
Duan 
Li 
Duihong 
Jiang 
Fan 
Training set : Athletes 
Step 1 – Learn stereotypes 
bitao 
gong 
biwang 
jiang 
birgitta 
agerberth 
birgitte l. 
eriksen 
bitao 
gong 
bitten 
thorengaard 
biwang 
Jiang 
birgitta 
agerberth 
birgitte l. 
eriksen 
bitten 
thorengaard 
Data set : Scientists 
Step 2 – Classify 
Onomastic Classes
Breaking down 3.3 million names 
6 
16% 
8% 
5% 
5% 
3% 
3% 
3% 
3% 
3% 
3% 
3% 
2% 
2% 
2% 
2% 
2% 
2% 
2% 
1% 
1% 
28% 
PubMed onomastic classes (top20) Source: PubMed PMC, sample: 3.3M names 
(GB,LATIN) 
(FR,LATIN) 
(DE,LATIN) 
(IT,LATIN) 
(IN,LATIN) 
(ES,LATIN) 
(NL,LATIN) 
(JP,LATIN) 
(CN,LATIN) 
(IE,LATIN) 
(AT,LATIN) 
(PL,LATIN) 
(KR,LATIN) 
(SE,LATIN) 
(PT,LATIN) 
(GR,LATIN) 
(FI,LATIN) 
(TW,LATIN) 
(DK,LATIN)
Number of publications and number of citations, by onomastic class (top 20) 
7 
Onoma 
A 
C 
Ratio (C/A) 
(GB,LATIN) 
557,177 
1,664,415 
3.0 
(FR,LATIN) 
272,150 
743,471 
2.7 
(DE,LATIN) 
192,778 
448,103 
2.3 
(JP,LATIN) 
172,866 
361,682 
2.1 
(IT,LATIN) 
187,564 
323,771 
1.7 
(IE,LATIN) 
86,161 
422,103 
4.9 
(NL,LATIN) 
102,982 
321,787 
3.1 
(AT,LATIN) 
78,199 
339,819 
4.3 
(CN,LATIN)* 
219,040 
186,464 
0.9 
(IN,LATIN) 
153,555 
221,332 
1.4 
(ES,LATIN) 
113,407 
228,650 
2.0 
(PL,LATIN) 
47,961 
268,115 
5.6 
(SE,LATIN) 
65,717 
237,017 
3.6 
(FI,LATIN) 
35,533 
247,231 
7.0 
(KR,LATIN) 
146,444 
105,605 
0.7 
(TW,LATIN)* 
88,822 
162,132 
1.8 
(GR,LATIN) 
51,564 
196,056 
3.8 
(DK,LATIN) 
42,403 
181,199 
4.3 
(BE,LATIN) 
44,647 
162,146 
3.6 
(CH,LATIN) 
32,295 
162,495 
5.0 
*CN+TW 
307,862 
348,596 
1.1 
Scientists with British names have published 557 thousand articles in PubMed and have been cited 1.6 million times in other PubMed articles: the ratio is 3. 
Articles written by authors with Italian names have been relatively less cited (with a ratio of 1.7) while the articles written by authors with Irish names or Finnish names have been more cited (with ratios respectively 4.9 and 7).
Chinese names in PubMed : x3 in 2008 
8
But, the more CN publications, the less citations 
9 
- 
0.5 
1.0 
1.5 
2.0 
2.5 
3.0 
0 
10000 
20000 
30000 
40000 
50000 
60000 
70000 
80000 
90000 
2002 
2003 
2004 
2005 
2006 
2007 
2008 
2009 
2010 
2011 
2012 
Evolution of the number of publication authored by scientists with Chinese names in PubMed/PMC and the Citation/Authorship ratio between 2002 and 2012 
A (Authors in Publications) 
Ratio (C/A)
Citing scientists within the same onomastic class 
10 
Onomastic Class 
Onoma Authored % 
Onoma Self Citations % 
Bias Factor 
(GB,LATIN) 
16.6% 
17.0% 
1.02 
(FR,LATIN) 
8.1% 
7.6% 
0.94 
(IT,LATIN) 
5.6% 
3.8% 
0.68 
(DE,LATIN) 
5.8% 
6.1% 
1.05 
(CN+TW,LATIN) 
9.2% 
12.1% 
1.32 
(ES,LATIN) 
3.4% 
3.8% 
1.13 
(JP,LATIN) 
5.2% 
19.3% 
3.73 
(IE,LATIN) 
2.6% 
4.4% 
1.73 
(NL,LATIN) 
3.1% 
5.6% 
1.83 
(AT,LATIN) 
2.3% 
4.2% 
1.79 
(SE,LATIN) 
2.0% 
3.5% 
1.76 
(IN,LATIN) 
4.6% 
4.1% 
0.89 
(PT,LATIN) 
1.9% 
2.3% 
1.17 
(GR,LATIN) 
1.5% 
2.8% 
1.82 
(KR,LATIN) 
4.4% 
3.0% 
0.68 
(BE,LATIN) 
1.3% 
2.6% 
1.98 
(DK,LATIN) 
1.3% 
3.4% 
2.65 
Authors with British names represent 16.6% of publications, but 17% of their citations : a bias factor of 1.02 (almost no bias). 
Authors with French names represent 8.1% of publications, but only 7.6% of their citations : a bias factor of 0.94 indicating that authors with French names tend to cite authors with foreign names more. 
As for authors with Chinese names, they represent 9.2% of the publications, but 12.1% of their citations : a bias factor of 1.32 indicating that they tend to cite authors with Chinese names more.
The ‘rest of the world’ has a strong negative bias in citing scientists with Chinese names 
11 
Onomastic class 
Chinese Onoma Citation Pct% 
Bias Factor 
(GB,LATIN) 
3.9% 
0.43 
(FR,LATIN) 
3.9% 
0.42 
(IT,LATIN) 
3.9% 
0.43 
(DE,LATIN) 
4.1% 
0.44 
(CN+TW,LATIN) 
12.1% 
1.32 
(ES,LATIN) 
4.0% 
0.43 
(JP,LATIN) 
5.2% 
0.56 
(IE,LATIN) 
4.0% 
0.44 
(NL,LATIN) 
3.5% 
0.38 
(AT,LATIN) 
4.1% 
0.44 
(SE,LATIN) 
3.6% 
0.40 
(IN,LATIN) 
5.9% 
0.65 
(PT,LATIN) 
4.0% 
0.43 
(GR,LATIN) 
3.9% 
0.42 
(KR,LATIN) 
6.8% 
0.74 
(BE,LATIN) 
3.8% 
0.42 
(DK,LATIN) 
3.9% 
0.42 
Authors with a Chinese name tend to cite authors with a Chinese name more. 
Comparatively, scientists with non Chinese names (British, French, Italian, German etc.) have a bias factor of 0.46 and are 3 times less likely to cite publications authored by a scientist with a Chinese name.
The negative bias towards CN research is growing bigger 
12 
- 
0.20 
0.40 
0.60 
0.80 
1.00 
1.20 
1.40 
1.60 
2002 
2003 
2004 
2005 
2006 
2007 
2008 
2009 
2010 
2011 
2012 
Scientists with Chinese/non Chinese names: evolution of the bias factor towards citing scientists with Chinese names 
CN Bias Factor 
No Bias 
Non CN Bias Factor
Manual controls – method one 
13 
1280 names recognized as Chinese by NamSor software were checked manually 
83% of names the software recognized as Chinese were manually verified as Chinese; 2% unknown; 15% as non-Chinese (ie. mis- classifications). 
76% of the names were classified with positive confidence. For the names recognized as Chinese with a positive confidence, 94% were manually verified as Chinese; 1% unknown; 4% as non-Chinese (ie. mis- classification) 
Initials/First names of just one or two character (ex. J or JH) accounted for 90% of mis-classifications. With full names, the accuracy is 99%.
Chinese surnames 
●Chinese (mainland); 
●Cantonese (+Hong Kong); 
●Taiwanese;
Difficulties: Korean and Vietnamese surnames
Manual controls – method two 
17 
10172 scientists names were manually classified as Chinese/non Chinese by a Human independently, then compared to the Computer output 
Chinese Names 
Non Chinese Names 
Confidence>0 && Len(firstName)>2 
Computer 
Human 
Computer 
Human 
Precision 
97% 
72% 
96% 
100% 
Recall 
72% 
51% 
100% 
48%
Conclusions 
18 
Applied onomastics can help measure cultural biases; such biases should be accounted for in bibliometrics/scientometrics 
As it is, valuable medical research could be overlooked, poor research could be overrated 
International comparisons of performance in Research & Higher Education are biaised (ex. Shanghai Ranking of Top Universities) 
Future work : use larger commercial bibliographic databases ; compare Diasporas esp. in US Research vs. Local Affiliations
Thank you! 
Elian.Carsenat@namsor.com 
Phone : +33 6 52 77 99 07 
Twitter : @ElianCarsenat 
Dr. Evgeny Shokhenmayer 
Twitter : @eug1979 
e-onomastics.blogspot.de 
19 
Read the full paper on NamSor.com blog http://namesorts.com/2014/08/20/is-china-really-becoming-a-science-and-technology-superpower/

Weitere ähnliche Inhalte

Mehr von Elian CARSENAT

NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)
NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)
NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)Elian CARSENAT
 
HomeComing for Develoment in Africa
HomeComing for Develoment in AfricaHomeComing for Develoment in Africa
HomeComing for Develoment in AfricaElian CARSENAT
 
#APIDays Paris - NamSor API for 'Gender Gap Grader'
#APIDays Paris - NamSor API for 'Gender Gap Grader'#APIDays Paris - NamSor API for 'Gender Gap Grader'
#APIDays Paris - NamSor API for 'Gender Gap Grader'Elian CARSENAT
 
Data Geeks Paris - Cherchez la Femme
Data Geeks Paris - Cherchez la FemmeData Geeks Paris - Cherchez la Femme
Data Geeks Paris - Cherchez la FemmeElian CARSENAT
 
Text mining names in ‘Big Data’ to recognize migration trends
Text mining names in ‘Big Data’ to recognize migration trendsText mining names in ‘Big Data’ to recognize migration trends
Text mining names in ‘Big Data’ to recognize migration trendsElian CARSENAT
 
BigData Paris 2014 - Enjeux Sociaux
BigData Paris 2014 - Enjeux SociauxBigData Paris 2014 - Enjeux Sociaux
BigData Paris 2014 - Enjeux SociauxElian CARSENAT
 
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...Elian CARSENAT
 

Mehr von Elian CARSENAT (7)

NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)
NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)
NamSor at RapidMiner Wisdom 2015 (Ljubljana, Slovenia)
 
HomeComing for Develoment in Africa
HomeComing for Develoment in AfricaHomeComing for Develoment in Africa
HomeComing for Develoment in Africa
 
#APIDays Paris - NamSor API for 'Gender Gap Grader'
#APIDays Paris - NamSor API for 'Gender Gap Grader'#APIDays Paris - NamSor API for 'Gender Gap Grader'
#APIDays Paris - NamSor API for 'Gender Gap Grader'
 
Data Geeks Paris - Cherchez la Femme
Data Geeks Paris - Cherchez la FemmeData Geeks Paris - Cherchez la Femme
Data Geeks Paris - Cherchez la Femme
 
Text mining names in ‘Big Data’ to recognize migration trends
Text mining names in ‘Big Data’ to recognize migration trendsText mining names in ‘Big Data’ to recognize migration trends
Text mining names in ‘Big Data’ to recognize migration trends
 
BigData Paris 2014 - Enjeux Sociaux
BigData Paris 2014 - Enjeux SociauxBigData Paris 2014 - Enjeux Sociaux
BigData Paris 2014 - Enjeux Sociaux
 
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...
Rôle des français de l'étranger pour faire rayonner la 'Marque France', les M...
 

Kürzlich hochgeladen

Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Salam Al-Karadaghi
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...NETWAYS
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...NETWAYS
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxFamilyWorshipCenterD
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptxBasil Achie
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfhenrik385807
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...NETWAYS
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...NETWAYS
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)Basil Achie
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...NETWAYS
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxaryanv1753
 
Philippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptPhilippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptssuser319dad
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 

Kürzlich hochgeladen (20)

Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptx
 
Philippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.pptPhilippine History cavite Mutiny Report.ppt
Philippine History cavite Mutiny Report.ppt
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
 

Measuring the cultural biases in medical research using scientists personal names

  • 1. DATA MINING SCIENTISTS NAMES TO MEASURE CULTURAL BIASES IN SCIENTIFIC RESEARCH Presenting NamSor Software at ICOS2014 1 2014-08-28
  • 2. Introduction 2 Personal names are markers of Identity How immune are researchers to cultural biases? PubMed is one of the large publications databases covering Medical Research, Life Sciences, etc. Cultural biases could cause valuable medical research to be overlooked, poor research to be overrated
  • 3. Names & Identity 3 ●Names go with identities and identities go with names (Brendler, 2012); ●Names have been utilised, for instance, as markers of cultural, social, ethnic or national identity (Härtel 1997; Virkkula 2001; Postles 2002; Beechetal 2002; Hagström 2006)
  • 4. ●We do accept that the notion of identity is a dynamic construct which cannot be viewed as a fixed entity under investigation ●However, the combination of forename and surname might often help us to identify majority of names as indicators of national identity Identity as a dynamic construct
  • 5. ‘Teaching’ Names & Identity to NamSor 5 FN LN Mette Andersen Lene Andersson Eva Arndt-Riise Heidi Astrup Mie Augustesen Margot Bærentzen Louise Bager Nørgaard Marie Bagger Rasmussen Yutta Barding Ulla Barding-Poulsen FN LN Xian Dongmei Zheng Dongmei Jin Dongxiang Xu Dongxiang Li Dongxiao Qin Dongya Li Dongying Han Duan Li Duihong Jiang Fan Training set : Athletes Step 1 – Learn stereotypes bitao gong biwang jiang birgitta agerberth birgitte l. eriksen bitao gong bitten thorengaard biwang Jiang birgitta agerberth birgitte l. eriksen bitten thorengaard Data set : Scientists Step 2 – Classify Onomastic Classes
  • 6. Breaking down 3.3 million names 6 16% 8% 5% 5% 3% 3% 3% 3% 3% 3% 3% 2% 2% 2% 2% 2% 2% 2% 1% 1% 28% PubMed onomastic classes (top20) Source: PubMed PMC, sample: 3.3M names (GB,LATIN) (FR,LATIN) (DE,LATIN) (IT,LATIN) (IN,LATIN) (ES,LATIN) (NL,LATIN) (JP,LATIN) (CN,LATIN) (IE,LATIN) (AT,LATIN) (PL,LATIN) (KR,LATIN) (SE,LATIN) (PT,LATIN) (GR,LATIN) (FI,LATIN) (TW,LATIN) (DK,LATIN)
  • 7. Number of publications and number of citations, by onomastic class (top 20) 7 Onoma A C Ratio (C/A) (GB,LATIN) 557,177 1,664,415 3.0 (FR,LATIN) 272,150 743,471 2.7 (DE,LATIN) 192,778 448,103 2.3 (JP,LATIN) 172,866 361,682 2.1 (IT,LATIN) 187,564 323,771 1.7 (IE,LATIN) 86,161 422,103 4.9 (NL,LATIN) 102,982 321,787 3.1 (AT,LATIN) 78,199 339,819 4.3 (CN,LATIN)* 219,040 186,464 0.9 (IN,LATIN) 153,555 221,332 1.4 (ES,LATIN) 113,407 228,650 2.0 (PL,LATIN) 47,961 268,115 5.6 (SE,LATIN) 65,717 237,017 3.6 (FI,LATIN) 35,533 247,231 7.0 (KR,LATIN) 146,444 105,605 0.7 (TW,LATIN)* 88,822 162,132 1.8 (GR,LATIN) 51,564 196,056 3.8 (DK,LATIN) 42,403 181,199 4.3 (BE,LATIN) 44,647 162,146 3.6 (CH,LATIN) 32,295 162,495 5.0 *CN+TW 307,862 348,596 1.1 Scientists with British names have published 557 thousand articles in PubMed and have been cited 1.6 million times in other PubMed articles: the ratio is 3. Articles written by authors with Italian names have been relatively less cited (with a ratio of 1.7) while the articles written by authors with Irish names or Finnish names have been more cited (with ratios respectively 4.9 and 7).
  • 8. Chinese names in PubMed : x3 in 2008 8
  • 9. But, the more CN publications, the less citations 9 - 0.5 1.0 1.5 2.0 2.5 3.0 0 10000 20000 30000 40000 50000 60000 70000 80000 90000 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 Evolution of the number of publication authored by scientists with Chinese names in PubMed/PMC and the Citation/Authorship ratio between 2002 and 2012 A (Authors in Publications) Ratio (C/A)
  • 10. Citing scientists within the same onomastic class 10 Onomastic Class Onoma Authored % Onoma Self Citations % Bias Factor (GB,LATIN) 16.6% 17.0% 1.02 (FR,LATIN) 8.1% 7.6% 0.94 (IT,LATIN) 5.6% 3.8% 0.68 (DE,LATIN) 5.8% 6.1% 1.05 (CN+TW,LATIN) 9.2% 12.1% 1.32 (ES,LATIN) 3.4% 3.8% 1.13 (JP,LATIN) 5.2% 19.3% 3.73 (IE,LATIN) 2.6% 4.4% 1.73 (NL,LATIN) 3.1% 5.6% 1.83 (AT,LATIN) 2.3% 4.2% 1.79 (SE,LATIN) 2.0% 3.5% 1.76 (IN,LATIN) 4.6% 4.1% 0.89 (PT,LATIN) 1.9% 2.3% 1.17 (GR,LATIN) 1.5% 2.8% 1.82 (KR,LATIN) 4.4% 3.0% 0.68 (BE,LATIN) 1.3% 2.6% 1.98 (DK,LATIN) 1.3% 3.4% 2.65 Authors with British names represent 16.6% of publications, but 17% of their citations : a bias factor of 1.02 (almost no bias). Authors with French names represent 8.1% of publications, but only 7.6% of their citations : a bias factor of 0.94 indicating that authors with French names tend to cite authors with foreign names more. As for authors with Chinese names, they represent 9.2% of the publications, but 12.1% of their citations : a bias factor of 1.32 indicating that they tend to cite authors with Chinese names more.
  • 11. The ‘rest of the world’ has a strong negative bias in citing scientists with Chinese names 11 Onomastic class Chinese Onoma Citation Pct% Bias Factor (GB,LATIN) 3.9% 0.43 (FR,LATIN) 3.9% 0.42 (IT,LATIN) 3.9% 0.43 (DE,LATIN) 4.1% 0.44 (CN+TW,LATIN) 12.1% 1.32 (ES,LATIN) 4.0% 0.43 (JP,LATIN) 5.2% 0.56 (IE,LATIN) 4.0% 0.44 (NL,LATIN) 3.5% 0.38 (AT,LATIN) 4.1% 0.44 (SE,LATIN) 3.6% 0.40 (IN,LATIN) 5.9% 0.65 (PT,LATIN) 4.0% 0.43 (GR,LATIN) 3.9% 0.42 (KR,LATIN) 6.8% 0.74 (BE,LATIN) 3.8% 0.42 (DK,LATIN) 3.9% 0.42 Authors with a Chinese name tend to cite authors with a Chinese name more. Comparatively, scientists with non Chinese names (British, French, Italian, German etc.) have a bias factor of 0.46 and are 3 times less likely to cite publications authored by a scientist with a Chinese name.
  • 12. The negative bias towards CN research is growing bigger 12 - 0.20 0.40 0.60 0.80 1.00 1.20 1.40 1.60 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 Scientists with Chinese/non Chinese names: evolution of the bias factor towards citing scientists with Chinese names CN Bias Factor No Bias Non CN Bias Factor
  • 13. Manual controls – method one 13 1280 names recognized as Chinese by NamSor software were checked manually 83% of names the software recognized as Chinese were manually verified as Chinese; 2% unknown; 15% as non-Chinese (ie. mis- classifications). 76% of the names were classified with positive confidence. For the names recognized as Chinese with a positive confidence, 94% were manually verified as Chinese; 1% unknown; 4% as non-Chinese (ie. mis- classification) Initials/First names of just one or two character (ex. J or JH) accounted for 90% of mis-classifications. With full names, the accuracy is 99%.
  • 14. Chinese surnames ●Chinese (mainland); ●Cantonese (+Hong Kong); ●Taiwanese;
  • 15.
  • 16. Difficulties: Korean and Vietnamese surnames
  • 17. Manual controls – method two 17 10172 scientists names were manually classified as Chinese/non Chinese by a Human independently, then compared to the Computer output Chinese Names Non Chinese Names Confidence>0 && Len(firstName)>2 Computer Human Computer Human Precision 97% 72% 96% 100% Recall 72% 51% 100% 48%
  • 18. Conclusions 18 Applied onomastics can help measure cultural biases; such biases should be accounted for in bibliometrics/scientometrics As it is, valuable medical research could be overlooked, poor research could be overrated International comparisons of performance in Research & Higher Education are biaised (ex. Shanghai Ranking of Top Universities) Future work : use larger commercial bibliographic databases ; compare Diasporas esp. in US Research vs. Local Affiliations
  • 19. Thank you! Elian.Carsenat@namsor.com Phone : +33 6 52 77 99 07 Twitter : @ElianCarsenat Dr. Evgeny Shokhenmayer Twitter : @eug1979 e-onomastics.blogspot.de 19 Read the full paper on NamSor.com blog http://namesorts.com/2014/08/20/is-china-really-becoming-a-science-and-technology-superpower/