Political Science and Machine Learning - Neural Ideal Point Estimation Network
Copyright © 2018 by Kyungwoo Song, Dept. of Industrial and Systems Engineering, KAIST
Apr. 10, 2018
Kyungwoo Song
kyungwoo.song@gmail.com / gtshs2@kaist.ac.kr
Political Science and Machine Learning
Contents
• Neural Ideal Point Estimation Network
• Kyungwoo Song, Wonsung Lee, and Il-Chul Moon. Neural Ideal Point
Estimation Network. AAAI Conference on Artificial Intelligence (AAAI 2018).
New Orleans. Feb 2-7
• Etc.
• Flaxman, Seth R., Yu-Xiang Wang, and Alexander J. Smola. "Who
supported Obama in 2012?: Ecological inference through distribution
regression." Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2015.
• Xing, Zhengming, Sunshine Hillygus, and Lawrence Carin. "Evaluating US
Electoral Representation with a Joint Statistical Model of Congressional
Roll-Calls, Legislative Text, and Voter Registration Data." Proceedings of
the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM, 2017.
• Lahoti, Preethi, Kiran Garimella, and Aristides Gionis. "Joint Non-negative
Matrix Factorization for Learning Ideological Leaning on Twitter." arXiv
preprint arXiv:1711.10251 (2017). (WSDM 18)
• Research Idea
Neural Ideal Point Estimation Network
Kyungwoo Song, Wonsung Lee, and Il-Chul Moon. Neural Ideal Point Estimation Network. AAAI Conference on Artificial Intelligence (AAAI 2018). New Orleans. Feb 2-7
Computational Political Science
• Legislative bill monitoring, tracking, and analysis
• Voting prediction
• Provide legislative information to the relevant companies in real time
• Catalist stores, and dynamically updates data on
over 240 million unique voting-age
individuals across all 50 states
• Build membership, target persuasive messaging
• Roll call data
• Focus on bill tracking
• Visualize the ideology-leadership score chart
• Organizing and disclosing data related to the
Korean Parliament
• National Assembly
• Central Election Commission
Researchers are increasingly turning to computational methods to study the dynamic properties of
political and economic systems. Politicians, citizens, interest groups, and organizations interact in
dynamic, complex environments. - Computational Models in Political Economy -
Roll Call Data
• Roll call data
• The recorded votes of deliberative bodies (e.g., legislatures)
• Bill’s feature (Text, Year, Type, …)
• Legislator’s feature (Name, Type, Region, …)
• Rating (“YEA”, “NAY”, “Not Voting”)
• Analysis of roll call data
• Make conjectures about legislative behavior
• Quantitative analysis
• Helping make the study of legislative politics
• Primary goal: Estimation of ideal point
The American Health Care Act of 2017 (AHCA) was a leading
proposal in the first half of 2017 by House Republicans to "repeal
and replace" the Affordable Care Act (aka Obamacare, but we'll
abbreviate it ACA) and "defund" Planned Parenthood. It was also
the vehicle for passage of the Senate Republicans' leading proposals.
…
H.R. 1628: American Health Care Act of 2017
Vote  Party  Representative   District
No    R      Biggs, Andy      AZ 5th
Aye   R      Byrne, Bradley   AL 1st
Aye   R      LaHood, Darin    IL 18th
…
Ideal Point
• Ideal point is a measure of preference
• For legislator (or Bill)
• For each topic (or Global ideal point)
• Sign of ideal point represents the preferred direction
• The size of ideal points represents the preference strength.
• Importance of ideal point estimation [APSR 2004]
• Ideal points are estimated for both legislators and bills
• Estimates from roll call analysis can be used in research on legislative behavior (comparative politics, international relations, ...)
Example of Ideal Point estimation [KDD 14]
Motivation
• Political ideal points shape behavior patterns
• Newspaper / TV program
• Facebook / Twitter
• Demonstration / Protest
• Importance of legislator’s ideal points
• Kind of measure of nation-wide average ideal points
• Ideal points control their speech / behavior
• Pass a bill or not
• There has been a lot of research based on roll-call data
• Roll-call data : Records of bills and voting results
• Understanding of the Congress
• Bill and legislator’s voting
• Depends on bill contents and ideal points
• Depends on the decision of the party (or a particular group) or other reasons
• Necessity of understanding the Congress
Previous Research
• One-dimensional ideal point [AJPS 1985]
• Each legislator and bill has a single ideal point
• $p(v_{ud} = 1) = \sigma(x_u \cdot a_d + b_d)$
• $b_d$ : popularity
• Higher-dimensional ideal point [APSR 2004]
• $p(v_{ud} = 1) = \sigma(\mathbf{x}_u \cdot \mathbf{a}_d + b_d)$
• Map each legislator and bill into a $K$-dimensional space
• IA-IPM [NIPS 2012]
• Utilize a predetermined topic
• Labeled LDA [EMNLP 2009]
• Combine global ideal point and topic-specific ideal point
• TFIPM [KDD 2014]
• Utilize a topic model
• PLSA [UAI 1999]
• Remove a global ideal point
• Global ideal point constrains the change of topic-specific ideal point
PLSA : Probabilistic Latent Semantic Analysis
LDA : Latent Dirichlet Allocation
IA-IPM : Issue-adjusted ideal point model
TFIPM : Topic-Factorized Ideal Point Estimation Model
NOTE (Topic Modeling)
• Each word is generated from a single topic
(or mixture of topic)
• Different words in a document may be
generated from different topics
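The one-dimensional and K-dimensional ideal point models listed under Previous Research share the same form, $\sigma(\mathbf{x}_u \cdot \mathbf{a}_d + b_d)$. A minimal Python sketch of that formula follows; the function names and the numeric values are made up for illustration and are not taken from the cited papers.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def vote_probability(x_u: list[float], a_d: list[float], b_d: float) -> float:
    """p(v_ud = 1) = sigma(x_u . a_d + b_d); K = len(x_u), K = 1 is the 1-D model."""
    if len(x_u) != len(a_d):
        raise ValueError("legislator and bill vectors must share dimension K")
    return sigmoid(sum(x * a for x, a in zip(x_u, a_d)) + b_d)


# Hypothetical values: a bill with a_d = (2, -1) and neutral popularity b_d = 0.
p_aligned = vote_probability([1.0, -0.5], [2.0, -1.0], 0.0)   # score = +2.5
p_opposed = vote_probability([-1.0, 0.5], [2.0, -1.0], 0.0)   # score = -2.5
```

A legislator whose ideal point has the same sign pattern as the bill gets a probability above 0.5, and the mirrored legislator gets the complementary probability.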
Modeling Assumption
• The modeling assumptions of NIPEN are based on theories established in political science.
• Ideal point is important in the legislative process (Poole and Rosenthal,
AJPS 1991)
• Multi-dimensional representation of the ideal point is necessary
(Clinton, APSR 2004)
• The legislative process must be influenced by the social network
between the legislators (Kirkland, The journal of politics 2011)
• Relation could be asymmetric (Fowler, Political analysis 2006)
• The voting is relevant to the ideal point as well as the network
(Jackson, AJPS 1992)
Research Questions
RQ 1) Quantifying the ideal points of bills / legislators
RQ 2) Quantification of trust between legislators
RQ 3) Modeling the behavior of individual legislators, taking into account ideal points and trust
RQ 4) Voting predictions for individual legislators
[Figure: from BILL to VOTE via individual ideal points, legislator's trust, and the ratio of contents and network]
Let's create a model that can predict voting results and explain them clearly!
Methodology – Overall Structure
• Factors affecting a legislator's vote of "YEA" or "NAY" on a bill
• Contents Part
• 1) Information on the bill (Topic proportion / Bill ideal point for each topic)
• 2) Information on the legislator (Legislator ideal point for each topic)
• Network Part
• 1) Similarity between legislator’s voting behavior
• Voting prediction modeling with contents and a network part
• 𝛼, 𝛽 control the strength of contents and a network part
• 𝛼 : The extent to which the contents part is affected (𝑈 × 1 vector)
• 𝛽 : The extent to which the network part is affected (𝑈 × 1 vector)
Methodology – Overall Structure
• We adopted VAE / SDAE to model the bills and combined it with various
causalities to create a model with high explanatory power
• In order to consider the correlation between the topics in the bills, we
adopted a tensor-based operation
• NIPEN-Tensor is a more generalized model than the existing models (including NIPEN-VAE/SDAE).
[Figure: NIPEN-VAE graphical model linking the legislative bill ($w_{dv}$, $z_{dk}$), latent of bill ($y_{dk}$), bill ideal points ($a_{dk}$), $\eta_d$, legislator ideal points ($x_{uk}$), legislator network ($\tau_{uu'}$, $r_{u'd}$), strength ($\alpha_u$, $\beta_u$), and voting ($r_{ud}$)]
[Figure: NIPEN-Tensor architecture; autoencoder layers |V| → 512 → 128 → |K| → 128 → 512 → |V|; blocks $E$, $C$, $N$ with sizes U × D × K and U × D × 1; output $r_{ud}$]
Methodology – SDAE and VAE
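SDAE [JMLR 2010]
[Figure: SDAE diagram with labels $q_D$, $f_\theta$, $g_\theta$, $X$, $\tilde{X}$, $Z$]
• Representation should be robust to the introduction of noise ($p(\tilde{X} \mid X)$)
• Random assignment of a subset of the input to 0
• Reconstruction of $X$ computed from the corrupted input $\tilde{X}$
• SDAE randomly inserts noise at the input level.
• SDAE captures the structure of the data-generating density implicitly
Graphical model of VAE [ICLR 2014]
[Figure: graphical model with latent $\mathbf{z}_n$ and observation $\mathbf{x}_n$ in a plate of size $N$, with parameters $\theta$ and $\phi$]
• Generative model
• $p(\mathbf{y}_n \mid \mathbf{z}_n) = \mathrm{Bernoulli}(\mathbf{y}_n \mid p = \mathrm{NN}(\mathbf{z}_n; \theta))$ for binary values
• $p(\mathbf{y}_n \mid \mathbf{z}_n) = \mathrm{Normal}(\mathbf{y}_n \mid (\mu, \sigma^2) = \mathrm{NN}(\mathbf{z}_n; \theta))$ for continuous values
• Recognition model
• $q(\mathbf{z} \mid \mathbf{y}; \phi) = \prod_{n=1}^{N} q(\mathbf{z}_n \mid \mathbf{y}_n; \lambda_n) = \prod_{n=1}^{N} q(\mathbf{z}_n; \lambda_n = \mathrm{NN}(\mathbf{y}_n; \phi))$
• VAEs learn what noise to insert at the code level
• VAEs are explicitly designed to form a generative model (generate data by sampling)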
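The contrast on this slide (SDAE corrupts the input, a VAE injects learned noise at the code) can be sketched in a few lines of Python. This is a minimal illustration only: the masking probability, the diagonal-Gaussian code, and all function names are assumptions, and no encoder or decoder network is trained here.

```python
import math
import random


def mask_corrupt(x, drop_prob, rng):
    """SDAE-style corruption: set a random subset of the inputs to 0."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def reparameterize(mu, log_var, rng):
    """VAE-style code noise: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE regulariser."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]           # hypothetical bag-of-words indicator
x_tilde = mask_corrupt(x, 0.3, rng)           # corrupted input fed to the SDAE
z = reparameterize([0.2, -0.1], [-2.0, -2.0], rng)  # sampled VAE code
```

In an SDAE the noise level (`drop_prob`) is fixed by hand, whereas in a VAE the noise scale `exp(0.5 * log_var)` is an output of the recognition network and is therefore learned.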
Methodology I – NIPEN VAE
$D$ : # of bills / $U$, $U'$ : # of legislators
$V$ : # of vocabulary words / $K$ : # of topics
[Figure: graphical model of NIPEN-VAE]
• Factors to consider in the bill
• BOW : $w_{dv}$
• Topic Proportion : $z_{dk}$
• Latent of Bill : $y_{dk}$
• $z_{dk}$ is a latent variable concentrated on the text itself, whereas $y_{dk}$ is a task-specific latent variable
• Bill's ideal point : $a_{dk}$
• How liberal or conservative the bill is on each topic
• Popularity of Bill : $\eta_d$
• Factors to consider in the legislator
• Voting record from legislator $u$ to bill $d$ : $r_{ud}$
• Ideal point for each legislator ($u$) and topic ($k$) : $x_{uk}$
• Trust network between legislator $u$ and $u'$ : $\tau_{uu'}$
• Contents (Network) scaling parameter for legislator : $\alpha_u$ ($\beta_u$)
BOW : Bag Of Words
Methodology I – NIPEN VAE
[Figure: graphical model of NIPEN-VAE]
Assumption :
• $x_{uk} a_{dk} > 0$ and $y_{dk} \uparrow$ ⇒ $p(r_{ud} = 1) \uparrow$
• $\tau_{uu'} \uparrow$ and $r_{u'd} = 1$ ⇒ $p(r_{ud} = 1) \uparrow$
• $C_{ud}, N_{ud} > 0$ and $\alpha_u, \beta_u \uparrow$ ⇒ $p(r_{ud} = 1) \uparrow$
$p(r_{ud} = 1) = \sigma\Big(\alpha_u \underbrace{\Big(\sum_k x_{uk} y_{dk} a_{dk} + \eta_d\Big)}_{= C_{ud}} + \beta_u \underbrace{\sum_{u' \in I_u} \tau_{uu'} r_{u'd}}_{= N_{ud}}\Big)$
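A small Python sketch of the vote equation may help. It assumes the reading $p(r_{ud} = 1) = \sigma(\alpha_u C_{ud} + \beta_u N_{ud})$ with $C_{ud} = \sum_k x_{uk} y_{dk} a_{dk} + \eta_d$ and $N_{ud} = \sum_{u' \in I_u} \tau_{uu'} r_{u'd}$, and votes coded as +1 (YEA) and -1 (NAY). The function names, the dictionary layout, and all numbers are hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def content_score(x_u, y_d, a_d, eta_d):
    """C_ud = sum_k x_uk * y_dk * a_dk + eta_d (contents part)."""
    return sum(x * y * a for x, y, a in zip(x_u, y_d, a_d)) + eta_d


def network_score(tau_u, r_d):
    """N_ud = sum over trusted neighbours u' of tau_uu' * r_u'd (network part)."""
    return sum(trust * r_d[v] for v, trust in tau_u.items() if v in r_d)


def nipen_vote_probability(alpha_u, beta_u, c_ud, n_ud):
    """p(r_ud = 1) = sigma(alpha_u * C_ud + beta_u * N_ud)."""
    return sigmoid(alpha_u * c_ud + beta_u * n_ud)


# Hypothetical legislator u with two topics and two trusted colleagues.
c_ud = content_score([1.0, -1.0], [0.5, 0.5], [1.0, -1.0], eta_d=0.0)  # 1.0
n_ud = network_score({"v1": 0.8, "v2": 0.2}, {"v1": 1, "v2": -1})      # 0.6
p_yea = nipen_vote_probability(alpha_u=1.0, beta_u=1.0, c_ud=c_ud, n_ud=n_ud)
```

With both scores positive, raising either `alpha_u` or `beta_u` raises `p_yea`, which is the third assumption on the slide.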
Methodology I – NIPEN VAE
[Figure: graphical model of NIPEN-VAE]
$p(r_{ud} = 1) = \sigma\Big(\alpha_u \Big(\sum_k x_{uk} y_{dk} a_{dk} + \eta_d\Big) + \beta_u \sum_{u' \in I_u} \tau_{uu'} r_{u'd}\Big)$
$L_{NIPEN} = -D_{KL}\big(q_\phi(z \mid w) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(w \mid z^{l})$  (1)
$\quad + \frac{\lambda_f}{2} \sum_{u,d,\, r_{ud} \neq 0} \frac{1 + r_{ud}}{2} \log p(r_{ud} = 1)$  (2)
$\quad + \frac{\lambda_f}{2} \sum_{u,d,\, r_{ud} \neq 0} \frac{1 - r_{ud}}{2} \log p(r_{ud} = -1)$  (3)
$\quad - \frac{\lambda_y}{2} \sum_{d=1}^{D} \| y_d - z_d \|_2^2$  (4)
$\quad - \frac{\lambda_\alpha}{2} \big( \|\alpha\|_2^2 + \|\beta\|_2^2 \big) - \frac{\lambda_u}{2} \big( \|a\|_F^2 + \|x\|_F^2 \big) - \frac{\lambda_\tau}{2} \|\tau\|_F^2$  (5)
(1) Loss w.r.t. topic modeling of the bill (VAE)
(2), (3) Modeling the voting record $r_{ud}$
(4) Minimize the difference between bill's latent itself and bill's task-specific latent
(5) Regularization Loss
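The voting-record terms (2) and (3) of the objective can be sketched as follows. This is an illustration under stated assumptions: votes are coded as +1 (YEA), -1 (NAY), and 0 (not voting, skipped), $p(r_{ud} = -1)$ is taken to be $1 - p(r_{ud} = 1)$, and the function name, the dictionary layout, and the clipping constant are hypothetical.

```python
import math


def vote_log_likelihood(votes, probs, lambda_f=1.0):
    """(lambda_f / 2) * sum over observed votes of
    (1 + r)/2 * log p(r = 1) + (1 - r)/2 * log p(r = -1).

    votes: dict mapping (legislator, bill) -> r in {+1, -1, 0}
    probs: dict mapping (legislator, bill) -> predicted p(r = 1)
    """
    total = 0.0
    for key, r in votes.items():
        if r == 0:          # r_ud = 0 means no recorded vote; excluded from the sum
            continue
        p = min(max(probs[key], 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += (1 + r) / 2 * math.log(p) + (1 - r) / 2 * math.log(1.0 - p)
    return lambda_f / 2 * total


# Hypothetical votes on one bill d1.
votes = {("u1", "d1"): 1, ("u2", "d1"): -1, ("u3", "d1"): 0}
probs = {("u1", "d1"): 0.9, ("u2", "d1"): 0.2, ("u3", "d1"): 0.5}
ll = vote_log_likelihood(votes, probs)
```

For a YEA vote only the first factor is non-zero and for a NAY vote only the second, so the two sums together form a Bernoulli log-likelihood over the observed votes.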
Methodology II – NIPEN Tensor
• If multiple topics are combined, the ideal point for that topic combination may vary
• Existing models derive the ideal point for a combination of topics by simple summation.
• NIPEN-Tensor incorporates the cross-topic influence in casting a vote, and it is a generalized version of the existing model
[Figure: NIPEN-Tensor architecture; autoencoder layers |V| → 512 → 128 → |K| → 128 → 512 → |V|; blocks $E$, $C$, $N$ with sizes U × D × K and U × D × 1; output $r_{ud}$]
• $E_{udk} = x_{uk} y_{dk} a_{dk}$
• $\tilde{E}_{udl} = \sum_k E_{udk} W^{T_1}_{kl} + b^{T_1}_l$
• $C_{ud} = \sum_l \tilde{E}_{udl} W^{T_2}_{l1} + \eta_d$
• $N_{ud} = \sum_{u' \in I_u} \tau_{uu'} r_{u'd}$
• NIPEN-Tensor : $p(r_{ud} = 1) = \sigma(\alpha_u C_{ud} + \beta_u N_{ud})$
• NIPEN-VAE : $p(r_{ud} = 1) = \sigma\Big(\alpha_u \Big(\sum_k x_{uk} y_{dk} a_{dk} + \eta_d\Big) + \beta_u \sum_{u' \in I_u} \tau_{uu'} r_{u'd}\Big)$
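The claim that NIPEN-Tensor generalizes the simple summation over topics can be checked with a short Python sketch of the contents part. It follows the equations on this slide ($E$, $\tilde{E}$, $C$) without any nonlinearity; the function name, the list-of-lists weight layout, and the numbers are assumptions for illustration.

```python
def tensor_content_score(x_u, y_d, a_d, W1, b1, W2, eta_d):
    """C_ud for NIPEN-Tensor.

    E_udk  = x_uk * y_dk * a_dk
    E~_udl = sum_k E_udk * W1[k][l] + b1[l]
    C_ud   = sum_l E~_udl * W2[l] + eta_d
    """
    K = len(x_u)
    e = [x_u[k] * y_d[k] * a_d[k] for k in range(K)]
    e_tilde = [sum(e[k] * W1[k][l] for k in range(K)) + b1[l]
               for l in range(len(b1))]
    return sum(e_tilde[l] * W2[l] for l in range(len(b1))) + eta_d


# Hypothetical two-topic example.
x_u, y_d, a_d, eta_d = [1.0, -1.0], [0.5, 0.5], [1.0, -1.0], 0.1

# Identity W1, zero bias, all-ones W2: reduces to sum_k x*y*a + eta_d.
c_plain = tensor_content_score(
    x_u, y_d, a_d, W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    W2=[1.0, 1.0], eta_d=eta_d)

# An off-diagonal weight lets topic 0 also contribute through slot 1.
c_cross = tensor_content_score(
    x_u, y_d, a_d, W1=[[1.0, 0.5], [0.0, 1.0]], b1=[0.0, 0.0],
    W2=[1.0, 1.0], eta_d=eta_d)
```

With the identity weights the score equals the NIPEN-VAE contents part, so the summation model is one particular setting of $W^{T_1}$, $b^{T_1}$, and $W^{T_2}$.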
Dataset
• Roll call data : The recorded
votes of deliberative bodies
• Politic2013 and Politic2016
include records 1990~2013 and
1990~2016 respectively
• Politic2013 is sparser than Politic2016 in both the ratings and the vocabulary size.
Politic2013 : Thomas [Yupeng Gu, 2014] / Politic2016 : Govtrack.com
NOTE
• We only use the latest version of a bill and its summary text
• Remove stopwords and keep the top-n most frequent words
• Consider both House representatives and senators
Results
• The major topic of H.Res.794 (114th) is "Business and Finance" with negative ideal points
• The greatest disagreement between the Republicans and the Democrats is on that topic
• The voting was very partisan (92.2% of Republicans voted YEA and 90.3% of Democrats voted NAY)
RQ 1) Quantifying the ideal points of bills / legislators
[Figure: ideal points for H.Res.794 (114th) : "Making appropriations for financial services…"; legend: Republican, Democrat, Third Party]
Results
• In general, legislators have a strong positive relationship when they share the same district and party
• The closest relation is between 'Thomas E. Petri' and 'Jim Sensenbrenner' (Republican representatives from Wisconsin)
• J. Duncan Jr and Dana Rohrabacher have the greatest network impact on the Republican party. (Duncan started as a congressman in Tennessee in 1988 and Rohrabacher as a California congressman in 1989.)
RQ 2) Quantification of trust between legislators
Results
• Top-five legislators who are most affected by contents or network factors
• The majority of legislators vote based on contents rather than the network effect
• A small number of legislators are highly dependent on the network effect.
RQ 3) Modeling the behavior of individual legislators, taking into
account ideal points and trust
Results
• Variations of NIPEN show the best performance in every metric and dataset
• NIPEN-Tensor is a model that considers the correlation between topics,
and NIPEN-Tensor may have a better performance when a bill text has
multiple topics with complex and rich textual information
RQ 4) Voting predictions for individual legislators
Conclusion
• We proposed two versions of machine learning models, NIPEN-PGM and NIPEN-Tensor, to analyze ideology in the legislative process.
• The variations of NIPEN show state-of-the-art performance in all measures on Politic2013 and Politic2016. Furthermore, NIPEN provides various interpretations of why a YEA or NAY vote is cast by illustrating
• 1) the ideal point estimation of individual legislators and bills;
• 2) the trust network between legislators;
• 3) the content and network influence for each legislator.
Who Supported Obama in 2012?
Flaxman, Seth R., Yu-Xiang Wang, and Alexander J. Smola. "Who supported Obama in 2012?: Ecological inference through distribution regression." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
Example Results
Exit poll results for women
Ecological regression results for women
Gender gap : Obama support among women
- Obama support among men
How much larger was Obama’s vote
share among women than among men?
How can we estimate the support of a particular subgroup using only group-level data?
• Given : 1) Overall support in districts A, B, C, D, 2) Individual features for each district
• Objective : Estimate the support among women for each district (A, B, C, D)
The estimates from this research exactly matched the ground truth from national-level exit polls
• 48% support for Obama among men, 55% among women
Preliminary – Ecological Inference
• Ecological Inference : inferring the unobserved behavior of subgroups
based on the aggregate behavior of groups
• Application
• Political Science
• Who Supported Obama in 2012?
• Marketing
• What types of people buy your products?
• Epidemiology
• Does radon cause lung cancer?
• Education / Economics / …
Supervised Learning / Learning from label proportion (LLP) / Ecological Inference
Simpson's paradox : A trend appears in several different groups of data but disappears or reverses when these groups are combined.
< Overall acceptance rate (UC-Berkeley) >
        Applicants  Admitted  Ratio
Male    1,150       815       70.9%
Female  820         185       22.6%
Preliminary – Ecological Inference
< A Dept. acceptance rate (UC-Berkeley) >
        Applicants  Admitted  Ratio
Male    1,000       800       80.0%
Female  120         100       83.3%
< B Dept. acceptance rate (UC-Berkeley) >
        Applicants  Admitted  Ratio
Male    150         15        10.0%
Female  700         85        12.1%
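The reversal can be verified directly from the department-level counts above. The short Python check below uses only those counts; the variable and function names are made up for this sketch.

```python
# (applicants, admitted) per department and gender, copied from the tables.
admissions = {
    "A": {"Male": (1000, 800), "Female": (120, 100)},
    "B": {"Male": (150, 15), "Female": (700, 85)},
}


def rate(applicants: int, admitted: int) -> float:
    """Acceptance rate as a fraction."""
    return admitted / applicants


def overall_rate(gender: str) -> float:
    """Pool both departments before dividing."""
    applicants = sum(dept[gender][0] for dept in admissions.values())
    admitted = sum(dept[gender][1] for dept in admissions.values())
    return rate(applicants, admitted)


dept_rates = {
    name: {g: rate(*counts) for g, counts in dept.items()}
    for name, dept in admissions.items()
}
overall = {g: overall_rate(g) for g in ("Male", "Female")}
```

Women have the higher rate within each department, yet the pooled rate is higher for men, because most women applied to the department with the low acceptance rate.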
Preliminary – GPR
Consider a point as a Gaussian distribution
⇒ Each point has its own mean and sigma
When a new point is observed
⇒ Consider it as a normal distribution as well
⇒ Can predict its mean and sigma
∵ The conditional of a Gaussian is also Gaussian
$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$
[Figure: GP regression demo; panels: points as Gaussian distributions, new observations, calculated mean and covariance] [Rasmussen 2006]
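The Gaussian conditioning step can be written out in a few lines. The sketch below is a pure-Python illustration for one-dimensional inputs; the squared-exponential kernel, the small noise term, and the toy data are assumptions chosen for the example, not details from the slides.

```python
import math


def rbf(a: float, b: float, length: float = 1.0, var: float = 1.0) -> float:
    """Squared-exponential kernel (an assumed choice for this sketch)."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(X, f, x_star, noise=1e-6):
    """Posterior mean and variance of f(x_star) given observations (X, f).

    mean = k_*^T K^{-1} f,  var = k(x_*, x_*) - k_*^T K^{-1} k_*
    """
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_star) for a in X]
    alpha = solve(K, f)
    v = solve(K, k_star)
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)


X_train, f_train = [0.0, 1.0, 2.0], [0.0, 1.0, 0.5]   # toy observations
mean_near, var_near = gp_predict(X_train, f_train, 1.0)
mean_far, var_far = gp_predict(X_train, f_train, 100.0)
```

At an observed input the posterior collapses onto the observation, while far from the data it returns to the prior mean 0 and prior variance 1.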
Methodology – Overall Structure
• Project the individuals from each group into feature space using $\phi(x)$
• $x_3^1, x_3^2, \ldots, x_3^5 \rightarrow \phi(x_3^1), \phi(x_3^2), \ldots, \phi(x_3^5)$
• Take the mean by group
• $\mu_3 = \frac{1}{5}\big(\phi(x_3^1) + \phi(x_3^2) + \cdots + \phi(x_3^5)\big)$
• Learn a function $f : \mu \rightarrow y$
• $f \sim GP\big(0, \sigma_x^2 \langle \hat{\mu}_i, \hat{\mu}_j \rangle + k_s(s_i, s_j)\big)$
• $s_1, \ldots, s_n$ : locations / $k_s$ : Matérn kernel
• Subgroup prediction
• $f(\mu_3^m)$ and $f(\mu_3^w)$
• $\mu_3^m = \frac{1}{2}\big(\phi(x_3^3) + \phi(x_3^4)\big)$
• $\mu_3^w = \frac{1}{3}\big(\phi(x_3^1) + \phi(x_3^2) + \phi(x_3^5)\big)$
Given :
$\big(\{x_1^j\}_{j=1}^{N_1}, y_1\big), \big(\{x_2^j\}_{j=1}^{N_2}, y_2\big), \ldots, \big(\{x_i^j\}_{j=1}^{N_i}, y_i\big), \ldots, \big(\{x_n^j\}_{j=1}^{N_n}, y_n\big)$
Group $i$ has a single real-valued label $y_i$
Group $i$ has $N_i$ individual observations $x_i^j \in R^d$ (gender, race, income, …)
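The embedding steps for group 3 (five individuals, of whom individuals 3 and 4 are men and 1, 2, 5 are women) can be sketched in Python. Everything concrete here is an assumption for illustration: the feature map `phi`, the individual records, and the weights of the stand-in function `f_linear`, which replaces the learned GP so that the example stays short.

```python
def phi(x):
    """Hypothetical feature map for an individual x = (is_woman, income)."""
    is_woman, income = x
    return [1.0, float(is_woman), income, is_woman * income]


def mean_embedding(individuals):
    """mu = (1 / N) * sum_j phi(x^j): the group's mean in feature space."""
    feats = [phi(x) for x in individuals]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


def f_linear(mu, weights=(0.4, 0.1, -0.2, 0.05)):
    """Stand-in for the learned f : mu -> y (weights are made up)."""
    return sum(w * m for w, m in zip(weights, mu))


# Group 3: (is_woman, income in units of $100k) for x_3^1 ... x_3^5.
group3 = [(1, 0.4), (1, 0.6), (0, 0.5), (0, 0.9), (1, 0.8)]
mu_3 = mean_embedding(group3)
mu_3_m = mean_embedding([group3[2], group3[3]])             # men: x_3^3, x_3^4
mu_3_w = mean_embedding([group3[0], group3[1], group3[4]])  # women: x_3^1, x_3^2, x_3^5

support_men, support_women = f_linear(mu_3_m), f_linear(mu_3_w)
```

The group embedding is the size-weighted average of the subgroup embeddings, `mu_3 = (2 * mu_3_m + 3 * mu_3_w) / 5`, which is why a function fitted on group means can be evaluated on subgroup means.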
Results - Gender
Exit poll results for women
Ecological regression results for women
Gender gap : Obama support among women
- Obama support among men
How much larger was Obama’s vote
share among women than among men?
Given only group-level data, what is the support among each subgroup (gender)?
The estimates from this research exactly matched the ground truth from national-level exit polls
• 48% support for Obama among men, 55% among women
Results – Income / Age
Given only group-level data, what is the support among each subgroup (income and age)?
Blue line : 95% uncertainty interval (exit poll's margin of error for n = 600)
[Figure: estimation of Obama's support by income section; panels: Income ≤ $50,000, $50,000 < Income ≤ $100,000, $100,000 ≤ Income; axes: Exit poll, Estimated]
Correlation
• Low incomes : 0.85
• Medium incomes : 0.90
• High incomes : 0.67
[Figure: estimation of Obama's support by age section; panels: 18-29 year olds, 30-44 year olds, 45-64 year olds, 65 years or older; axes: Exit poll, Estimated]
Correlation
• (18-29) : 0.60 / (30-44) : 0.90 / (45-64) : 0.92 / (65- ) : 0.90
Joint Non-negative Matrix Factorization
for Learning Ideological Leaning on Twitter
Lahoti, Preethi, Kiran Garimella, and Aristides Gionis. "Joint Non-negative Matrix Factorization for Learning Ideological Leaning on Twitter." arXiv preprint arXiv:1711.10251 (2017). (WSDM 18)
Joint Non-negative Matrix Factorization
for Learning Ideological Leaning on Twitter
• Problems
• Shift from traditional news sources to online news
• Online news platforms tend to recommend only news similar to the user's point of view (⇒ leads to polarization of opinions)
• Tackle the filter bubble problem
• Approach
• Infer the ideological stances
of users and media sources
• Joint MF
• Results
Option to choose
how willing the
user is to explore
the other side
• Dataset
• Twitter from 2011 to 2016
• 1) Gun control 2) abortion 3) Obamacare
• 6,361 users / 19 million tweets
Evaluating U.S. Electoral Representation with a Joint
Statistical Model of Congressional Roll-Calls,
Legislative Text, and Voter Registration Data
Xing, Zhengming, Sunshine Hillygus, and Lawrence Carin. "Evaluating US Electoral Representation with a Joint Statistical Model of Congressional Roll-Calls, Legislative Text, and Voter Registration Data." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.
Evaluating U.S. Electoral Representation with a Joint Statistical Model of
Congressional Roll-Calls, Legislative Text, and Voter Registration Data
• Problems
• Linking 1) constituency data, 2) the legislative voting record from each district, and 3) the text of the legislation
• Extent to which elected officials represent the preferences of the citizens who
elect them
• Approach
• Constituent information and legislative text modeling (MF + HDP)
• Constituent information ⇒ Legislator’s feature
• legislative text modeling ⇒ Legislation’s feature
• Dataset
• A 1% random sample of Catalist database (3 million cases) in 2012
• Legislative votes (U.S. House of Representatives) on legislation from 2009-2011
• Results
• $|\xi_j|$ : how far legislator $j$ deviates from the characteristics of his/her constituents
• Legislators who are aligned with their constituents received a 15% larger share of the election vote
• High-income Democrats (SF, LA, DC, ..)
Research Idea
• Dynamic Ideal Point Estimation
• Estimation of ideal point change for
each legislator and the Party
• Dynamic EI
• Existing ML-based EI research solves the problem only at a single timestamp
• Dynamic EI with transition
• Interpretable dynamic EI model with
transition
• EI with Active Learning
• Given the current EI environment,
what are the data and labels to
collect for more accurate estimation?
• EX) A more efficient exit survey
2008 Election
        Democrat  Republican  Etc.   Total
Male    ?         ?           ?      1,400
Female  ?         ?           ?      2,100
Total   2,200     1,250       900
2012 Election
        Democrat  Republican  Etc.   Total
Male    ?         ?           ?      1,200
Female  ?         ?           ?      2,700
Total   2,500     1,200       1,000
2016 Election
        Democrat  Republican  Etc.   Total
Male    ?         ?           ?      1,100
Female  ?         ?           ?      2,600
Total   2,200     1,000       1,400
EI : Ecological Inference
Reference
• Poole, Keith T., and Howard Rosenthal. "Patterns of congressional voting." American Journal of Political Science (1991): 228-278.
• Clinton, Joshua, Simon Jackman, and Douglas Rivers. "The statistical analysis of roll call data." American Political Science Review 98.2 (2004):
355-370.
• Kirkland, Justin H. "The relational determinants of legislative outcomes: Strong and weak ties between legislators." The Journal of Politics 73.3
(2011): 887-898.
• Fowler, James H. "Connecting the Congress: A study of cosponsorship networks." Political Analysis 14.4 (2006): 456-487.
• Jackson, John E., and John W. Kingdon. "Ideology, interest group scores, and legislative votes." American Journal of Political Science (1992): 805-
823.
• Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in
a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11:3371–3408.
• Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
• Guo, Guibing, Jie Zhang, and Neil Yorke-Smith. "TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of
Item Ratings." AAAI. Vol. 15. 2015.
• Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web.
ACM, 2015.
• Wu, Yao, et al. "Collaborative denoising auto-encoders for top-n recommender systems." Proceedings of the Ninth ACM International Conference
on Web Search and Data Mining. ACM, 2016.
• Gu, Yupeng, et al. "Topic-factorized ideal point estimation model for legislative voting network." Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining. ACM, 2014.
• Wang, Hao, Naiyan Wang, and Dit-Yan Yeung. "Collaborative deep learning for recommender systems." Proceedings of the 21th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
• Flaxman, Seth R., Yu-Xiang Wang, and Alexander J. Smola. "Who supported Obama in 2012?: Ecological inference through distribution
regression." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
• Simone Zhang, “Ecological Inference”, 2017 (https://scholar.princeton.edu/sites/default/files/bstewart/files/ecological_inference_slides.pdf)
• Colin Wood, “Predicting the Future of State Legislation”, 2016 (http://www.govtech.com/state/Online-Services-Predict-the-Legislative-Future.html)
• Rasmussen, Carl Edward. "Advances in Gaussian processes." Advances in Neural Information Processing Systems 19 (2006).
• Le, Quoc, Tamás Sarlós, and Alex Smola. "Fastfood-approximating kernel expansions in loglinear time." Proceedings of the international
conference on machine learning. Vol. 85. 2013.
• Xing, Zhengming, Sunshine Hillygus, and Lawrence Carin. "Evaluating US Electoral Representation with a Joint Statistical Model of Congressional
Roll-Calls, Legislative Text, and Voter Registration Data." Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, 2017.
• Lahoti, Preethi, Kiran Garimella, and Aristides Gionis. "Joint Non-negative Matrix Factorization for Learning Ideological Leaning on Twitter." arXiv
preprint arXiv:1711.10251 (2017). (WSDM 18)