Soft cardinality is a parameterized text similarity function that uses a "soft count" instead of a crisp count to calculate the cardinality of the intersection between two texts. It includes a parameter p that controls the softness and supports element weights, such as tf-idf for words. It is combined with a parameterized resemblance coefficient that balances the sizes of the two texts being compared. The approach was tested on semantic textual similarity tasks and scored higher than baselines such as cosine similarity with tf-idf.
Soft Cardinality: A Parameterized Similarity Function for Text Comparison
1. Soft Cardinality: A Parameterized Similarity Function for Text Comparison
Sergio Jimenez, Claudia Becerra, Alexander Gelbukh
Center for Computing Research, Instituto Politécnico Nacional (National Polytechnic (Technical) Institute), Mexico
2. Outline
• Cardinality-based similarity functions
• What is Soft Cardinality?
• Parameterized resemblance coefficient
• Building text similarity functions
• Optimizing parameters
• Results in STS SemEval-2012
• Conclusions
3. Cardinality-based similarity functions
Jaccard (1905): $SIM(A,B) = \frac{|A \cap B|}{|A \cup B|}$
Dice (1945): $SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$
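Only two things are needed:
1. Cardinality function
2. Resemblance coefficient
The resemblance coefficient is a ratio: $SIM = \frac{\text{"commonalities"}}{\text{"referent"}}$
Used here: soft cardinality as the cardinality function, and a parameterized resemblance coefficient.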
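The two classic coefficients on this slide can be sketched with crisp Python sets. This is only an illustration of the formulas; the function names and the example sentences are assumptions, not material from the slides.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: |A ∩ B| / (0.5|A| + 0.5|B|) (0.0 for two empty sets)."""
    referent = 0.5 * len(a) + 0.5 * len(b)
    return len(a & b) / referent if referent else 0.0


# Hypothetical example texts, treated as sets of words.
A = set("the cat sat on the mat".split())   # 5 distinct words
B = set("the cat lay on the rug".split())   # 5 distinct words

print(jaccard(A, B))  # 3 shared words over 7 distinct words
print(dice(A, B))     # 3 shared words over an average size of 5
```

Both functions share the same numerator (the crisp cardinality of the intersection) and differ only in the referent, which is the part the later slides parameterize.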
8. An extended soft cardinality model
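$|A|' = \sum_{i=1}^{n} w_{a_i} \left( \sum_{j=1}^{n} sim(a_i, a_j)^p \right)^{-1}$
$w_{a_i}$: weights for the elements (words), e.g. tf-idf
$p$: controls the "softness" of the soft cardinality, $p \ge 0$
For large $p$, if $sim(a_i, a_i) = 1$ and $sim(a_i, a_j) < 1$ for distinct elements, $|A|' \to \sum_{i=1}^{n} w_{a_i}$, which is the crisp $|A|$ when all weights are 1.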
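A minimal sketch of this weighted soft cardinality, assuming a similarity function with sim(a, a) = 1. The name `soft_cardinality`, the `toy_sim` function, and the example words are hypothetical and only serve to show how p changes the count.

```python
from typing import Callable, Dict, Optional, Sequence


def soft_cardinality(
    elements: Sequence[str],
    sim: Callable[[str, str], float],
    p: float = 1.0,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum over elements of w_i / sum_j sim(a_i, a_j)^p.

    `elements` is expected to hold distinct items; missing weights default to 1.
    """
    total = 0.0
    for a_i in elements:
        w_i = 1.0 if weights is None else weights.get(a_i, 1.0)
        inner = sum(sim(a_i, a_j) ** p for a_j in elements)
        total += w_i / inner
    return total


def toy_sim(x: str, y: str) -> float:
    """Hypothetical word similarity: 1 if equal, 0.8 if equal up to a trailing 's', else 0."""
    if x == y:
        return 1.0
    return 0.8 if x.rstrip("s") == y.rstrip("s") else 0.0


words = ["sunday", "sundays", "monday"]
print(soft_cardinality(words, toy_sim, p=1))   # between 2 and 3: two words are near-duplicates
print(soft_cardinality(words, toy_sim, p=50))  # close to the crisp count of 3
```

With p = 1 the two near-duplicate words count for less than two full elements; with a large p the similarities below 1 vanish and the result approaches the crisp count.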
11. Parameterized resemblance coefficient
$SIM(A,B) = \frac{|A \cap B|}{0.5|A| + 0.5|B|}$
The referent is a balance between the "sizes" of A and B
Tversky (1977)
“the son resembles the father” not
“the father resembles the son”
“an ellipse is like a circle” not
“a circle is like an ellipse”
“North Korea is like Red China” not
“Red China is like North Korea”
In general “the variant is more similar to the prototype”
$SIM(A,B) = \frac{|A \cap B|}{|A|}$
Overlap coefficient: $SIM(A,B) = \frac{|A \cap B|}{\min(|A|, |B|)}$
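The effect of the referent can be illustrated on crisp sets. In this sketch the `referent` argument and the integer "feature" sets are assumptions made for the example; the three referents are the ones shown on this slide.

```python
def resemblance(a: set, b: set, referent: str = "dice") -> float:
    """|A ∩ B| divided by one of three referents from the slide."""
    common = len(a & b)
    if referent == "dice":        # balance between the sizes of A and B
        denom = 0.5 * len(a) + 0.5 * len(b)
    elif referent == "first":     # |A| only, hence asymmetric
        denom = len(a)
    elif referent == "min":       # overlap coefficient
        denom = min(len(a), len(b))
    else:
        raise ValueError(f"unknown referent: {referent}")
    return common / denom if denom else 0.0


# Hypothetical feature sets: the variant has a subset of the prototype's features.
variant = {1, 2, 3}
prototype = {1, 2, 3, 4, 5, 6}

print(resemblance(variant, prototype, "first"))   # 1.0: the variant resembles the prototype
print(resemblance(prototype, variant, "first"))   # 0.5: the reverse direction scores lower
print(resemblance(variant, prototype, "min"))     # 1.0: overlap coefficient
print(resemblance(variant, prototype, "dice"))    # between the two directions above
```

Using |A| alone as the referent reproduces the asymmetry in Tversky's examples, while the Dice and overlap referents are symmetric.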
14. Building text similarity functions
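$SIM(A,B) = \frac{|A \cap B|' + bias}{\alpha \max(|A|', |B|') + (1-\alpha) \min(|A|', |B|')}$
$SIM(A,B)$ compares two texts as sets of words.
Soft cardinality:
$|A|' = \sum_{i=1}^{n} w_{a_i} \left( \sum_{j=1}^{n} sim(a_i, a_j)^p \right)^{-1}$
$sim(a_i, b_j) = \frac{|a_i \cap b_j| + bias_{sim}}{\alpha_{sim} \max(|a_i|, |b_j|) + (1-\alpha_{sim}) \min(|a_i|, |b_j|)}$
$sim(a_i, b_j)$ compares two words as sets of q-grams.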
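The two levels can be put together in a short sketch: words are compared as sets of character q-grams, and texts are compared as sets of words through the soft cardinality. Several details here are assumptions that the slides do not state: the q-gram padding, unit word weights instead of tf-idf, the default parameter values, clamping the word similarity at 0, and computing the soft intersection as |A|' + |B|' - |A ∪ B|'. All function names are hypothetical.

```python
from typing import List, Set


def qgrams(word: str, q: int = 3) -> Set[str]:
    """Character q-grams of a word; '#' padding (an assumption) keeps short words usable."""
    padded = "#" * (q - 1) + word + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def coefficient(common: float, size_a: float, size_b: float, alpha: float, bias: float) -> float:
    """(common + bias) / (alpha * max + (1 - alpha) * min); 0.0 if the referent is 0."""
    referent = alpha * max(size_a, size_b) + (1 - alpha) * min(size_a, size_b)
    return (common + bias) / referent if referent else 0.0


def word_sim(a: str, b: str, q: int, alpha_sim: float, bias_sim: float) -> float:
    """Compares two words as sets of q-grams; clamped at 0 so that sim ** p stays real."""
    qa, qb = qgrams(a, q), qgrams(b, q)
    return max(0.0, coefficient(len(qa & qb), len(qa), len(qb), alpha_sim, bias_sim))


def soft_card(words: List[str], p: float, q: int, alpha_sim: float, bias_sim: float) -> float:
    """Soft cardinality with unit weights (tf-idf weights are omitted in this sketch)."""
    return sum(
        1.0 / sum(word_sim(a, b, q, alpha_sim, bias_sim) ** p for b in words)
        for a in words
    )


def text_sim(text_a: str, text_b: str, p: float = 2.0, alpha: float = 0.5, bias: float = 0.0,
             q: int = 3, alpha_sim: float = 0.5, bias_sim: float = 0.0) -> float:
    """Compares two texts as sets of words using the parameterized coefficient."""
    a = sorted(set(text_a.lower().split()))
    b = sorted(set(text_b.lower().split()))
    union = sorted(set(a) | set(b))
    card_a = soft_card(a, p, q, alpha_sim, bias_sim)
    card_b = soft_card(b, p, q, alpha_sim, bias_sim)
    card_union = soft_card(union, p, q, alpha_sim, bias_sim)
    common = card_a + card_b - card_union  # assumed definition of the soft intersection
    return coefficient(common, card_a, card_b, alpha, bias)


print(text_sim("the cat sat on the mat", "the cats sat on a mat"))
print(text_sim("the cat sat on the mat", "stock prices fell sharply today"))
```

The same `coefficient` helper serves both levels, which mirrors how the slide reuses one parameterized form for texts (alpha, bias) and for words (alpha_sim, bias_sim). With the default bias_sim of 0 a word is fully similar to itself; strongly negative bias values would need extra guarding.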
16. Used Resources
Run: baer/task6-UKP-run2
Used resources: KB similarity, Lemmatizer, String Similarity, Dictionaries, Distributional thesaurus, Monolingual corpora, Multilingual corpora, Wikipedia, WordNet, Distributional Similarity, POS tagger, SMT, Textual Entailment, Other
Mean (r): 0.6773, RankMean: 1st, Difference: 0.969%

Run: jan_snajder/task6-takelab-simple
Used resources: KB similarity, Lemmatizer, Dictionaries, Distributional thesaurus, Monolingual corpora, Stop words, Wikipedia, WordNet, Distributional Similarity, Lexical Substitution, Machine Learning, POS tagger, Other
Mean (r): 0.6753, RankMean: 2nd, Difference: 0.671%

Run: sgjimenezv/task6-SOFT-CARDINALITY
Used resources: KB similarity, Lemmatizer, String Similarity
Mean (r): 0.6708, RankMean: 3rd, Difference: 0%
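The Difference figures are consistent with the relative gap between each run's Mean (r) and that of the Soft Cardinality run. That reading is an inference from the numbers rather than something the slide states; the short check below uses a hypothetical helper name.

```python
def relative_gap_percent(score: float, reference: float) -> float:
    """Relative difference of `score` over `reference`, in percent."""
    return (score - reference) / reference * 100.0


# Mean (r) values as listed on the slide.
means = {
    "baer/task6-UKP-run2": 0.6773,
    "jan_snajder/task6-takelab-simple": 0.6753,
    "sgjimenezv/task6-SOFT-CARDINALITY": 0.6708,
}
reference = means["sgjimenezv/task6-SOFT-CARDINALITY"]

for run, mean in means.items():
    # Prints 0.969%, 0.671% and 0.000% for the three runs, in this order.
    print(f"{run}: {relative_gap_percent(mean, reference):.3f}%")
```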
17. Results for other measures
Cosine tf-idf + lemmatizer: RankMean=0.6326 (10th)
SoftTFIDF + lemmatizer: RankMean=0.6415 (7th)
Soft Cardinality + lemmatizer + Parameterized Resemblance Coefficient + Hill climbing: RankMean=0.6788
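The slides name hill climbing as the parameter search but give no details. The following is a generic coordinate-wise hill climber with step halving, offered only as a sketch of the idea. The objective here is a toy stand-in with a known maximum; in practice it would presumably be something like the correlation between predicted and gold similarity scores on training data. The function name, step schedule, and starting values are all assumptions.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def hill_climb(
    objective: Callable[[Params], float],
    start: Params,
    step: float = 0.1,
    min_step: float = 1e-3,
) -> Tuple[Params, float]:
    """Move one parameter at a time by +/- step; halve the step when nothing improves."""
    current = dict(start)
    best = objective(current)
    while step >= min_step:
        improved = False
        for name in list(start):
            for delta in (step, -step):
                candidate = dict(current)
                candidate[name] += delta
                score = objective(candidate)
                if score > best:
                    current, best, improved = candidate, score, True
        if not improved:
            step /= 2
    return current, best


def toy_objective(params: Params) -> float:
    """Stand-in objective with its maximum at p = 2.0, alpha = 0.7 (not real data)."""
    return -((params["p"] - 2.0) ** 2) - (params["alpha"] - 0.7) ** 2


best_params, best_score = hill_climb(toy_objective, {"p": 1.0, "alpha": 0.5})
print(best_params, best_score)
```

Hill climbing only finds a local optimum, so the result can depend on the starting point when the objective is less well behaved than this toy function.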
18. Conclusions
1. The soft cardinality approach proved to be an effective and low-cost text similarity function, even in Semantic Textual Similarity scenarios.
2. The parameters of the proposed function were meaningful, and their optimal values are easy to find when training data is available.