A paper review. This presentation introduces "Abductive Commonsense Reasoning", a paper published at ICLR 2020. In this paper, the authors use commonsense reasoning to generate plausible hypotheses. They build a new dataset, ART, and propose new models for the αNLI and αNLG tasks using BERT and GPT.
1. Abductive Commonsense
Reasoning
San Kim
2020.07.03
Chandra Bhagavatula(1), Ronan Le Bras(1), Chaitanya Malaviya(1), Keisuke Sakaguchi(1),
Ari Holtzman(1), Hannah Rashkin(1), Doug Downey(1), Scott Wen-tau Yih(2), Yejin Choi(1,3)
1. Allen Institute for AI
2. Facebook AI
3. Paul G. Allen School of Computer Science & Engineering
3. Abductive Commonsense Reasoning
Abductive Natural Language Inference
αNLI is formulated as a multiple-choice problem consisting of a pair of observations as context and a pair of hypothesis choices.
• O1: the observation at time t1.
• O2: the observation at time t2 > t1.
• h+: a plausible hypothesis that explains the two observations O1 and O2.
• h−: an implausible (or less plausible) hypothesis for observations O1 and O2.
Abductive Natural Language Generation
αNLG is the task of generating a valid hypothesis h+ given the two observations O1 and O2. Formally, the task requires maximizing P(h+ | O1, O2).
5. A Probabilistic Framework for 𝛼𝑁𝐿𝐼
The αNLI task is to select the hypothesis h* that is most probable given the observations:
h* = arg max_{h_i} P(H = h_i | O1, O2)
Rewriting the objective using Bayes' Rule conditioned on O1, we have:
P(h_i | O1, O2) ∝ P(O2 | h_i, O1) P(h_i | O1)
Illustration of the graphical models described in the probabilistic framework.
Adopted from [1]
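The decision rule above can be sketched in a few lines of Python. All probabilities below are made-up toy numbers, not model outputs; `best_hypothesis` is an illustrative name, not a function from the paper's code.

```python
# Sketch of the aNLI decision rule h* = argmax_i P(O2 | h_i, O1) * P(h_i | O1),
# following the Bayes factorization conditioned on O1.
def best_hypothesis(candidates):
    """candidates maps each hypothesis to a pair (P(h | O1), P(O2 | h, O1))."""
    return max(candidates, key=lambda h: candidates[h][0] * candidates[h][1])

# Hypothetical scores for two competing hypotheses:
scores = {
    "h1": (0.40, 0.10),  # likely given O1, but does not explain O2
    "h2": (0.35, 0.60),  # slightly less likely a priori, but explains O2
}
print(best_hypothesis(scores))  # -> h2
```

Even though h1 is more probable given O1 alone, h2 wins because it also explains the second observation.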
6. A Probabilistic Framework for 𝛼𝑁𝐿𝐼
(a) Hypothesis Only
• The strongest assumption: the hypothesis is entirely independent of both observations, i.e. (H ⊥ O1, O2).
• Maximize the marginal P(H).
(b, c) First (or Second) Observation Only
• A weaker assumption: the hypothesis depends on only one observation, the first O1 or the second O2.
• Maximize the conditional probability P(H | O1) or P(O2 | H).
Adopted from [1]
7. A Probabilistic Framework for 𝛼𝑁𝐿𝐼
(d) Linear Chain
• Uses both observations, but considers each observation's influence on the hypothesis independently.
• The model assumes that the three variables ⟨O1, H, O2⟩ form a linear Markov chain: the second observation is conditionally independent of the first given the hypothesis (i.e. O1 ⊥ O2 | H).
• h* = arg max_{h_i} P(O2 | h_i) P(h_i | O1), where O1 ⊥ O2 | H
(e) Fully Connected
• Jointly models all three random variables.
• Combines information across both observations to choose the correct hypothesis.
• h* = arg max_{h_i} P(O2 | h_i, O1) P(h_i | O1)
Adopted from [1]
8. Difference between the Linear Chain and Fully Connected models
O1: Carl went to the store desperately searching for flour tortillas for a recipe.
O2: Carl left the store very frustrated.
h1: The cashier was rude. h2: The store had corn tortillas, but not flour ones.
[Figure: timeline t0 < t1 < t2. Under the Linear Chain model, P(h | O1) and P(O2 | h) each look plausible for h1, while P(O2 | h2) looks only weakly plausible without O1, so the chain wrongly ranks h1 > h2. Under the Fully Connected model, P(O2 | h1, O1) is implausible while both factors are plausible for h2, so it correctly ranks h1 < h2.]
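The contrast in the tortilla example can be mimicked with made-up toy probabilities: scoring P(O2 | h) without seeing O1, the rude-cashier hypothesis h1 looks at least as good as h2, while conditioning on both O1 and h reveals that only h2 explains Carl's frustration. All numbers below are invented for illustration.

```python
# h1 = "the cashier was rude", h2 = "the store had corn tortillas, not flour ones".
p_h_given_o1 = {"h1": 0.5, "h2": 0.5}     # P(h | O1): both plausible after O1
p_o2_given_h = {"h1": 0.6, "h2": 0.4}     # Linear Chain: P(O2 | h), O1 unseen
p_o2_given_h_o1 = {"h1": 0.1, "h2": 0.7}  # Fully Connected: P(O2 | h, O1)

# Linear Chain score P(O2 | h) * P(h | O1) -- wrongly favors h1.
chain = {h: p_o2_given_h[h] * p_h_given_o1[h] for h in ("h1", "h2")}
# Fully Connected score P(O2 | h, O1) * P(h | O1) -- correctly favors h2.
full = {h: p_o2_given_h_o1[h] * p_h_given_o1[h] for h in ("h1", "h2")}
```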
9. αNLG model
Given h+ = w_1^h ⋯ w_l^h, O1 = w_1^o1 ⋯ w_m^o1, and O2 = w_1^o2 ⋯ w_n^o2 as sequences of tokens, the αNLG task can be modeled as
P(h+ | O1, O2) = ∏_i P(w_i^h | w_<i^h, w_1^o1 ⋯ w_m^o1, w_1^o2 ⋯ w_n^o2)
Optionally, the model can also be conditioned on background knowledge 𝒦, giving the training loss
ℒ = − ∑_{i=1}^{N} log P(w_i^h | w_<i^h, w_1^o1 ⋯ w_m^o1, w_1^o2 ⋯ w_n^o2, 𝒦)
Adopted from [1]
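The training objective ℒ is a standard token-level negative log-likelihood. A minimal sketch in plain Python, where the per-token probabilities P(w_i^h | …) are supplied as placeholder numbers rather than produced by a real conditional language model such as GPT:

```python
import math

def nll_loss(token_probs):
    """Sum of -log P(w_i^h | w_<i^h, O1, O2, K) over the gold hypothesis tokens.

    token_probs: the probability the model assigns to each gold token of h+;
    toy placeholders here for the outputs of a conditional LM.
    """
    return -sum(math.log(p) for p in token_probs)

# The loss shrinks as the model assigns higher probability to the gold tokens:
confident = nll_loss([0.9, 0.8, 0.95])
uncertain = nll_loss([0.5, 0.4, 0.55])
```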
10. ConceptNet [5]
• ConceptNet is a multilingual knowledge base
• It is constructed of:
  • Nodes (a word or a short phrase)
  • Edges (a relation and a weight)
Birds are not capable of … {bark, chew their food, breathe water}
• Related work
  • WordNet
  • Microsoft Concept Graph
  • Google Knowledge Graph
Adopted from [6]
11. ATOMIC [2]
1. xIntent: Why does X cause the event?
2. xNeed: What does X need to do before the event?
3. xAttr: How would X be described?
4. xEffect: What effects does the event have on X?
5. xWant: What would X likely want to do after the event?
6. xReaction: How does X feel after the event?
7. oReact: How do others feel after the event?
8. oWant: What would others likely want to do after the event?
9. oEffect: What effects does the event have on others?
If-Then relation types (inference dimensions):
• If-Event-Then-Mental-State
• If-Event-Then-Event
• If-Event-Then-Persona
Adopted from [2]
12. ATOMIC
• If-Event-Then-Mental-State: three relations relating to the mental pre- and post-conditions of an event.
Example: "X compliments Y"
• xIntent (likely intent of the event): X wants to be nice
• xReaction (likely (emotional) reaction of the event's subject): X feels good
• oReaction (likely (emotional) reactions of others): Y feels flattered
Adopted from [2]
13. ATOMIC
• If-Event-Then-Event: five relations relating to events that constitute probable pre- and post-conditions of
a given event.
Example: "X calls the police"
• xNeed: X needs to dial 911
• xEffect: X starts to panic
• xWant: X wants to explain everything to the police
• oWant: Others want to dispatch some officers
Example: "X pays Y a compliment"
• oEffect: Y will smile
• If-Event-Then-Persona: a stative relation that describes how the subject of an event is described or perceived.
Example: "X pays Y a compliment"
• xAttr: X is flattering
• xAttr: X is caring
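The if-then examples above can be stored as (event, relation, inference) triples. A minimal sketch, with the triples taken from the slides and the list/query scheme purely illustrative:

```python
# ATOMIC-style knowledge as (event, relation, inference) triples.
TRIPLES = [
    ("X compliments Y", "xIntent", "X wants to be nice"),
    ("X compliments Y", "xReaction", "X feels good"),
    ("X compliments Y", "oReaction", "Y feels flattered"),
    ("X calls the police", "xNeed", "X needs to dial 911"),
    ("X calls the police", "xEffect", "X starts to panic"),
    ("X calls the police", "xWant", "X wants to explain everything to the police"),
]

def query(event, relation):
    """Return all inferences stored for a given event and relation type."""
    return [inf for ev, rel, inf in TRIPLES if ev == event and rel == relation]

print(query("X compliments Y", "oReaction"))  # -> ['Y feels flattered']
```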
19. Adversarial Filtering Algorithm
• A significant challenge in creating datasets is avoiding annotation artifacts.
• Annotation artifacts: unintentional patterns in the data that leak information about the
target label.
In each iteration i:
• M_i: the adversarial model
• 𝒯_i: a random training subset
• 𝒱_i: the validation set
• h_k^+, h_k^−: the plausible and implausible hypotheses for an instance k
• δ = Δ_{M_i}(h_k^+, h_k^−): the difference in the model's evaluation of h_k^+ and h_k^−
With probability t_i, update each instance k that M_i gets correct with a pair (h^+, h^−) ∈ ℋ_k^+ × ℋ_k^− of hypotheses that reduces the value of δ, where ℋ_k^+ (resp. ℋ_k^−) is the pool of plausible (resp. implausible) hypotheses for instance k.
Adopted from [1]
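The filtering loop can be sketched as follows. Everything here is illustrative: the paper trains a BERT-based adversary on 𝒯_i and evaluates on 𝒱_i, whereas this toy uses a deliberately length-biased adversary to show how AF swaps in hypothesis pairs that shrink the adversary's margin and thereby removes an annotation artifact.

```python
import random

def adversarial_filtering(instances, pools, fit_adversary, n_iters=3, t=1.0, seed=0):
    """Toy sketch of Adversarial Filtering (names are illustrative).

    instances: list of dicts {'h_pos': ..., 'h_neg': ...}
    pools: per-instance (plausible_pool, implausible_pool) of candidates
    fit_adversary: trains on a random subset, returns delta(h+, h-) = margin
    """
    rng = random.Random(seed)
    for _ in range(n_iters):
        train = rng.sample(instances, max(1, len(instances) // 2))
        delta = fit_adversary(train)
        for k, inst in enumerate(instances):
            d = delta(inst["h_pos"], inst["h_neg"])
            # With probability t, replace instances the adversary gets right
            # by the candidate pair that minimizes its margin delta.
            if d > 0 and rng.random() < t:
                pos_pool, neg_pool = pools[k]
                best = min(((p, n) for p in pos_pool for n in neg_pool),
                           key=lambda pair: delta(*pair))
                if delta(*best) < d:
                    inst["h_pos"], inst["h_neg"] = best
    return instances

# A biased adversary exploiting a length artifact: it scores the longer
# hypothesis as the plausible one (the training subset is ignored in this toy).
def length_adversary(_train):
    return lambda h_pos, h_neg: len(h_pos) - len(h_neg)
```

Running the loop on an instance whose negative hypothesis is suspiciously short replaces it with a pool candidate the length heuristic can no longer separate.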
23. BERT-Score
Evaluating Text Generation with BERT: an automatic similarity metric for generated text, analogous to the Inception Score, Fréchet Inception Distance, and Kernel Inception Distance used in image generation.
Adopted from [4]
26. Abductive Commonsense Reasoning
1. O1 ↛ h−: h− is unlikely to follow the first observation O1.
2. h− ↛ O2: h− is plausible after O1 but unlikely to precede the second observation O2.
3. Plausible: ⟨O1, h−, O2⟩ is a coherent narrative and forms a plausible alternative, but it is less plausible than ⟨O1, h+, O2⟩.
Adopted from [1]
[Figure: BERT performance]