This document discusses two methods for determining the geographic origin of surnames based on bibliographic data. The first method uses Kullback-Leibler divergence to assign each surname to the country where it is most frequent, subject to certain levels of assurance. The second method uses the Gini index to assign each surname to the country where it is most concentrated. Both methods are validated against a "golden list" of surnames assigned to countries based on language, ethnicity, and historical origin. Results show the Gini index method has higher accuracy for some countries but lower coverage overall. Further validation over different time periods is needed to account for migratory movements.
Can we track the geography of surnames based on bibliographic data?
1. Can we track the geographic origin of surnames based on bibliographic data?
Nicolas Robinson-Garcia, Ed Noyons & Rodrigo Costas
15th INTERNATIONAL CONFERENCE
ON SCIENTOMETRICS & INFORMETRICS
29 June – 3 July, 2015,
Bogazici University, Istanbul, Turkey
EC3metrics spin off CWTS
Leiden University
3. Background
“the use of surnames in human population biology dates back to
1875, when George Darwin used frequency of occurrences of the
same surname in married couples to study in-breeding”
Kissin, 2011
WHAT IS IN A SURNAME?
o Proxy for genetic/ethnic origin
-> Epidemiology, Biomedical research
o Proxy for country origin
-> Demographic studies, migratory movements
4. Background
… in the field of bibliometrics
o The representation of Jewish surnames in biomedical journals and US patents
Kissin, 2011; Kissin & Bradley, 2013
o Relation between ethnic mix collaboration and citation impact
Freeman & Huang, 2014
5. Background
HOW CAN WE DETERMINE THE GEOGRAPHIC ORIGIN OF SURNAMES?
METHODS
o Manually curated lists
o Probability and Bayesian methods
o Clustering techniques
DATA SOURCES
o National census
o Dispersion of sources
o Lack of international coverage
6. Bibliographic data
o Scientific databases as international surnames data sources
› Regional restrictions
› Temporal restrictions
o Establishing ‘trusted’ linkages between surnames and countries
› Reprint address
› First author-First address
› One country publications
› Author-address linkages (2008)
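The four linkage rules on this slide could be applied to publication records roughly as follows. This is a minimal sketch and not the authors' pipeline: the record layout (`authors`, `addresses`, `reprint`, `linked_country`), the helper name `trusted_links`, and the reading of each rule are all assumptions made for illustration.

```python
def trusted_links(record):
    """Return the set of (surname, country) pairs treated as 'trusted'.

    `record` is a hypothetical dict with:
      authors   - list of dicts with 'surname' and optional 'linked_country'
      addresses - list of country names, in address order
      reprint   - optional dict with 'surname' and 'country'
    """
    pairs = set()
    authors = record["authors"]
    countries = record["addresses"]

    # Reprint address: the reprint author is tied to the reprint country.
    reprint = record.get("reprint")
    if reprint:
        pairs.add((reprint["surname"], reprint["country"]))

    # First author-first address: assumed to belong together.
    if authors and countries:
        pairs.add((authors[0]["surname"], countries[0]))

    # One-country publications: every author is tied to that country.
    if len(set(countries)) == 1:
        for author in authors:
            pairs.add((author["surname"], countries[0]))

    # Explicit author-address linkages, where the record provides them.
    for author in authors:
        if author.get("linked_country"):
            pairs.add((author["surname"], author["linked_country"]))

    return pairs


# Invented example record, for illustration only.
example = {
    "authors": [{"surname": "NOYONS"},
                {"surname": "GARCIA", "linked_country": "SPAIN"}],
    "addresses": ["NETHERLANDS", "SPAIN"],
    "reprint": {"surname": "GARCIA", "country": "SPAIN"},
}
example_links = trusted_links(example)
```

Counting these pairs over all records would give the surname-by-country frequencies that the later methods take as input.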
7. Bibliographic data
o Establishing ‘trusted’ linkages between surnames and countries
Some figures:
-> 1,568,052 distinct surnames assigned to 119 countries
-> France 8.8%; Germany 8.0%; Russia 7.1%; Spain 4.9%
8. Assumptions
HYPOTHESIS 1
A surname should be assigned to the country where that surname has the highest frequency.
HYPOTHESIS 2
A surname should be assigned to the country where that surname has the greatest concentration.
9. Method 1. Kullback-Leibler
OPERATIONALIZATION
A surname will be assigned to a country if 1) it has the highest frequency there, and 2) there are “certain levels of assurance”.
METHOD 1
Kullback-Leibler divergence indicates the (dis)similarity between the global surname distribution and its distribution in each country.
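One way to read this operationalization is sketched below. The slides do not give the exact formula or threshold, so everything here is an assumption: P_c(s) is taken as the share of surname s among the linkages of country c, Q(s) as its global share, and the "assurance" check is approximated by requiring the surname's term in D_KL(P_c || Q) = sum_s P_c(s) log(P_c(s) / Q(s)) to exceed a hypothetical cut-off `min_term`. The counts are invented toy values.

```python
import math
from collections import defaultdict

# Invented toy counts of (surname, country) linkages, for illustration only.
counts = {
    ("GARCIA", "SPAIN"): 900, ("GARCIA", "USA"): 400, ("GARCIA", "CUBA"): 150,
    ("SMITH", "USA"): 5000, ("SMITH", "SPAIN"): 100,
    ("SANTOS", "USA"): 200, ("SANTOS", "PORTUGAL"): 180, ("SANTOS", "BRAZIL"): 150,
}


def kl_assign(counts, min_term=0.0):
    """Assign each surname to its most frequent country if the surname's
    contribution to KL(country || global) exceeds `min_term` (assumed rule)."""
    country_total = defaultdict(int)
    surname_total = defaultdict(int)
    by_surname = defaultdict(dict)
    for (surname, country), n in counts.items():
        country_total[country] += n
        surname_total[surname] += n
        by_surname[surname][country] = n
    grand_total = sum(country_total.values())

    assigned = {}
    for surname, per_country in by_surname.items():
        # Condition 1: country with the highest raw frequency.
        best = max(per_country, key=per_country.get)
        # Condition 2 (assumed): surname over-represented in that country.
        p = per_country[best] / country_total[best]   # share within country
        q = surname_total[surname] / grand_total      # share worldwide
        term = p * math.log(p / q)                    # pointwise KL term
        if term > min_term:
            assigned[surname] = best
    return assigned


kl_result = kl_assign(counts)
```

With these invented counts, GARCIA goes to Spain and SMITH to the USA, while SANTOS stays unassigned: its most frequent country is the USA, but its share there is below its global share, so the assumed assurance check fails.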
10. Method 2. Gini Index
OPERATIONALIZATION
A surname will be assigned to the country with the highest concentration of that surname.
METHOD 2
The Gini Index is an inequality indicator already employed for other purposes in bibliometrics. It scores the concentration of a surname in a country on a scale from 0 to 1.
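The concentration idea can be sketched as follows. The slides do not spell out how the Gini index is applied, so this is one possible reading and not the authors' procedure: each surname's share within every country is computed, the surname is assigned to the country with the largest share, and the Gini coefficient of those shares is reported as a concentration score. The function names and the toy counts are invented for illustration.

```python
from collections import defaultdict

# Invented toy counts of (surname, country) linkages, for illustration only.
counts = {
    ("GARCIA", "SPAIN"): 900, ("GARCIA", "USA"): 400, ("GARCIA", "CUBA"): 150,
    ("LOPEZ", "SPAIN"): 2000, ("LOPEZ", "CUBA"): 100,
    ("SMITH", "USA"): 5000, ("SMITH", "SPAIN"): 100,
}


def gini(values):
    """Gini coefficient of non-negative values (0 = evenly spread)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def gini_assign(counts):
    """Map surname -> (country with the largest within-country share,
    Gini coefficient of the surname's shares across all countries)."""
    country_total = defaultdict(int)
    by_surname = defaultdict(dict)
    for (surname, country), n in counts.items():
        country_total[country] += n
        by_surname[surname][country] = n

    result = {}
    for surname, per_country in by_surname.items():
        # Share of this surname among each country's linkages (0 if absent).
        shares = {c: per_country.get(c, 0) / country_total[c]
                  for c in country_total}
        best = max(shares, key=shares.get)
        result[surname] = (best, gini(shares.values()))
    return result


gini_result = gini_assign(counts)
```

In this toy data GARCIA is most frequent in Spain in absolute terms but makes up a larger share of the Cuban linkages, so the concentration rule assigns it to Cuba. For a finite number of countries n, the coefficient as computed here tops out at (n - 1) / n instead of reaching 1.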
11. Kullback-Leibler vs. Gini index
Top 10 countries with the highest number of surnames assigned
Method 1. Kullback-Leibler      Method 2. Gini index
Country       No. surnames      Country   No. surnames
FRANCE        138349            USA       310739
GERMANY       112445            FRANCE    117938
RUSSIA        111716            GERMANY   111375
SPAIN         83529             RUSSIA    94369
USA           76219             ITALY     65699
ITALY         69637             JAPAN     52399
ENGLAND       63885             ENGLAND   47521
JAPAN         56345             CANADA    46146
CANADA        49775             POLAND    44087
NETHERLANDS   41306             INDIA     42897
12. Kullback-Leibler vs. Gini index
Surname    Method 1. Kullback-Leibler   Method 2. Gini index
CLINTON    USA                          USA
EGGHE      BELGIUM                      BELGIUM
GARFIELD   USA                          USA
HERRERA    SPAIN                        CUBA
GARCIA     SPAIN                        CUBA
EINSTEIN   USA                          ISRAEL
NOYONS     NETHERLANDS                  NETHERLANDS
PEREIRA    BRAZIL                       PORTUGAL
13. The ‘golden list’
Validating the methods proposed
SEARCHING FOR A ‘GOLDEN LIST’ TO VALIDATE THE RESULTS
o Coverage
o Criteria
› Language
› Ethnicity
› Historical origin
o Reliability and double assignments
15. The ‘golden list’
Validating the methods proposed
In search of a ‘golden list’ of surnames assigned to countries/languages/ethnicities
Unified country   Languages
Denmark           Danish
England           Celtic; Anglo-Cornish; English; Scottish; Irish
Finland           Finnish
France            Breton; French
Germany           German
Greece            Greek
Iceland           Icelandic
Italy             Italian
Japan             Japanese
Netherlands       Afrikaans; Dutch
Portugal          Portuguese
Spain             Basque; Catalan; Galician;
http://en.wikipedia.org/wiki/Category:Surnames_by_language
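A comparison of a method's output against such a golden list might be scored as in the sketch below. The slides do not define the measures, so the definitions are assumptions: coverage is the share of golden-list surnames that received any assignment, and accuracy is the share of those covered surnames whose assigned country is among the accepted ones (a set, to allow for double assignments). The function name and all entries are invented for illustration.

```python
def validate(assigned, golden):
    """Score assignments against a golden list.

    assigned: dict surname -> country chosen by a method
    golden:   dict surname -> set of acceptable countries
    Returns (coverage, accuracy) under the assumed definitions above.
    """
    covered = [s for s in golden if s in assigned]
    correct = [s for s in covered if assigned[s] in golden[s]]
    coverage = len(covered) / len(golden) if golden else 0.0
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Invented golden-list entries and method output, for illustration only.
golden = {
    "GARCIA": {"SPAIN"},
    "NOYONS": {"NETHERLANDS"},
    "SANTOS": {"SPAIN", "PORTUGAL"},  # double assignment
    "ROSSI": {"ITALY"},
}
assigned = {"GARCIA": "CUBA", "NOYONS": "NETHERLANDS", "SANTOS": "PORTUGAL"}

coverage, accuracy = validate(assigned, golden)
```

Here three of the four golden-list surnames are covered, and two of those three match an accepted country.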
17. Next or previous steps
o Is the Web of Science a good sample of the world population?
› Country census crossed with the WoS
o Time frames and migratory movements
› Apply methods to different periods
o Validation and comparison with other techniques
› Bayesian, probability, clustering
o Multiple assignments of countries (e.g., Lee, Santos)
18. Thank you! elrobin@ugr.es