Big Data Palooza Talk: Aspects of Semantic Processing

Knowledgent Big Data-palooza:
Aspects of Semantic Processing
Na’im R. Tyson, PhD
February 6, 2014

Discussion Topics

• Semantic Processing
– What is Semantics?
– What is Pragmatics?
• Lexical Semantics
– Computing Semantic Similarity
∗ WordNet
∗ Vector Space Modeling
• Ontology Basics
• Text Mining: Basics

Semantic Processing

• What is Semantics?
– Study of literal meanings of words and sentences
∗ Lexical Semantics - word meanings & word relations
– Sometimes stated formally using some logical form
∗ Example: ∀x∃y loves(x, y)
• What is Pragmatics?
– Study of language use and its situational contexts (discourse, deixis,
presupposition, etc.)
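The logical form in the example above, ∀x∃y loves(x, y), can be checked mechanically against a small finite model. The sketch below is only an illustration: the domain and the `loves` pairs are hypothetical, and the second function is included to show that swapping the quantifiers gives a different literal meaning.

```python
# Hypothetical toy model: a finite domain and a set of (lover, loved) pairs.
domain = {"alice", "bob", "carol"}
loves = {("alice", "bob"), ("bob", "carol"), ("carol", "carol")}

def everyone_loves_someone(domain, loves):
    """Evaluate: for all x there exists y such that loves(x, y)."""
    return all(any((x, y) in loves for y in domain) for x in domain)

def someone_is_loved_by_everyone(domain, loves):
    """Evaluate the other scoping: there exists y such that for all x, loves(x, y)."""
    return any(all((x, y) in loves for x in domain) for y in domain)

print(everyone_loves_someone(domain, loves))        # True
print(someone_is_loved_by_everyone(domain, loves))  # False
```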

Lexical Semantics
WordNet: Description
• Word relation database
• Created by George Miller & Christiane Fellbaum at Princeton University (Miller, 1995; Fellbaum, 1998)
• Types of Relationships
Synonymy - word pair similarity
Antonymy - word pair dissimilarity
Meronymy - part-of relation
– Example: 'engine' and 'car'
Hyponymy - subordinate relation between words (i.e., a type-of relation)
– Example: 'red' is a hyponym of 'color' ('red' is a type of color)
Hypernymy - superordinate relation between words
– Example: 'color' is a hypernym of 'red'
Question: What’s the relationship between a hyponym and a hypernym?
• 150K words w/ 115k synsets and approx. 200k word-sense pairs
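One way to see how hyponymy and hypernymy relate is to store only one of them and derive the other by inverting it. The sketch below assumes a hypothetical mini-taxonomy (the entries are illustrative, not WordNet data) and the function names are made up for this example.

```python
# Hypothetical mini-taxonomy: each word maps to its direct hypernyms.
# These entries are illustrative only, not WordNet data.
HYPERNYMS = {
    "red": ["color"],
    "blue": ["color"],
    "color": ["visual_property"],
    "visual_property": [],
}

def hypernyms(word):
    """Direct superordinate terms of `word`."""
    return HYPERNYMS.get(word, [])

def hyponyms(word):
    """Direct subordinate terms: every entry that lists `word` as a hypernym."""
    return sorted(w for w, parents in HYPERNYMS.items() if word in parents)

print(hypernyms("red"))   # ['color']
print(hyponyms("color"))  # ['blue', 'red']
```

If 'color' is a hypernym of 'red', then 'red' is a hyponym of 'color': the two relations are inverses of each other.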

Lexical Semantics

• Adapted from Python Text Processing with NLTK 2.0 Cookbook (Perkins,
2010)
>>> from nltk.corpus import wordnet as wn
>>> # NLTK 3.x exposes name() and definition() as methods (NLTK 2.0 used attributes)
>>> word_synset = wn.synsets('cookbook')[0]
>>> word_synset.name()
'cookbook.n.01'
>>> word_synset.definition()
'a book of recipes and cooking directions'

Lexical Semantics

• Antonymy:
>>> # NLTK 3.x method-call syntax; antonymy is defined on lemmas, not synsets
>>> ga1 = wn.synset('good.a.01')
>>> ga1.definition()
'having desirable or positive qualities especially those suitable for a thing specified'
>>> bad = ga1.lemmas()[0].antonyms()[0]
>>> bad.name()
'bad'
>>> bad.synset().definition()
'having undesirable or negative qualities'

Lexical Semantics

• Hyponymy & Hypernymy:
>>> word_synset.hyponyms()
>>> word_synset.hypernyms()

Computing Similarity by WordNet

• Similarity by Path Length (see Perkins, 2010, p. 19)
>>> from nltk.corpus import wordnet as wn
>>> cb = wn.synset('cookbook.n.01')
>>> ib = wn.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)  # Wu-Palmer Similarity
0.91666666666666663
• For path similarity explanations, see Jaganadhg (2010)
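To make the Wu-Palmer score less of a black box, here is a minimal sketch on a hypothetical single-parent taxonomy. It uses the usual formula 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest shared ancestor. The taxonomy, the function names, and the convention that the root has depth 1 are assumptions for illustration; WordNet's hierarchy is much deeper and allows multiple hypernyms, so these numbers will not match NLTK's output.

```python
# Hypothetical single-parent taxonomy (child -> parent); not WordNet data.
PARENT = {
    "entity": None,
    "publication": "entity",
    "book": "publication",
    "magazine": "publication",
    "cookbook": "book",
    "instruction_book": "book",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Number of nodes from the root down to `node` (assumption: root has depth 1)."""
    return len(path_to_root(node))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    return None

def wup_similarity(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("cookbook", "instruction_book"))  # 0.75
print(round(wup_similarity("cookbook", "magazine"), 4))
```

Siblings under 'book' score higher than 'cookbook' and 'magazine', whose only shared ancestor sits closer to the root.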

Advantages & Disadvantages

• Advantages
Quality: developed and maintained by researchers
Practice: applications can use WordNet
Software: SenseRelate (Perl) - http://senserelate.sourceforge.net
• Disadvantages
Coverage: technical terms may be missing
Irregularity: path lengths can be irregular across hierarchies
Relatedness: related terms may not be in the same hierarchies
Example: Tennis Problem
– 'player', 'racquet', 'ball' and 'net'

Computing Word Similarity by Vector Space Modeling

• Computing Similarity from a Document Corpus
Goal: determine distributional properties of a word
Steps: In general...
– Create vector of size n for each word of interest
– Think of them as points in some n-dimensional space
– Use a similarity metric to compute distance
Algorithm: Brown et al. (1992)
– C(x) - vector with properties of x (context of 'x')
– C(w) = ⟨#(w_1), #(w_2), ..., #(w_k)⟩, where #(w_i) is the number of times w_i followed w in a corpus
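The context vector C(w) described above can be built with a single pass over adjacent token pairs. This is a minimal sketch, not the full class-based model of Brown et al. (1992): the toy corpus, the whitespace tokenization, and the function name `context_vector` are assumptions for illustration.

```python
from collections import Counter

def context_vector(word, tokens, vocabulary):
    """C(word): for each w_i in `vocabulary`, count how often w_i directly follows `word`."""
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word
    )
    return [followers[w] for w in vocabulary]

# Hypothetical toy corpus, already tokenized on whitespace.
tokens = "the cat sat on the mat and the cat slept on the sofa".split()
vocabulary = sorted(set(tokens))

print(vocabulary)
print(context_vector("the", tokens, vocabulary))
print(context_vector("cat", tokens, vocabulary))
```

Each word of interest gets one such vector of length k, which can then be compared with any similarity metric.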

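Similarity Measure: Cosine

Cosine: \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| |\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}, where \vec{x} = (x_1, \ldots, x_n)

               cosmonaut  astronaut  moon  car  truck
Soviet             1          0        0    1     1
American           0          1        0    1     1
spacewalking       1          1        0    0     0
red                0          0        0    1     1
full               0          0        1    0     0
old                0          0        0    1     1

\cos(cosm, astr) = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2 + 0^2} \sqrt{0^2 + 1^2 + 1^2 + 0^2 + 0^2 + 0^2}}

Figure 1: Cosine Similarity Comparison from Collins (2007)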
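The cosine formula can be applied to every column of the table in Figure 1 with a few lines of plain Python. This is a minimal sketch: the dictionary name `VECTORS` and the function name are made up for this example, and the vectors are read off the table column by column.

```python
import math

# One co-occurrence vector per word, ordered as
# (Soviet, American, spacewalking, red, full, old).
VECTORS = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 1, 0, 0, 0],
    "moon":      [0, 0, 0, 0, 1, 0],
    "car":       [1, 1, 0, 1, 0, 1],
    "truck":     [1, 1, 0, 1, 0, 1],
}

def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    sum_sq_x = sum(xi * xi for xi in x)
    sum_sq_y = sum(yi * yi for yi in y)
    return dot / math.sqrt(sum_sq_x * sum_sq_y)

print(round(cosine_similarity(VECTORS["cosmonaut"], VECTORS["astronaut"]), 3))  # 0.5
print(round(cosine_similarity(VECTORS["car"], VECTORS["truck"]), 3))            # 1.0
print(round(cosine_similarity(VECTORS["cosmonaut"], VECTORS["moon"]), 3))       # 0.0
```

For 'cosmonaut' and 'astronaut' the numerator is 1 and both vector lengths are √2, so the similarity is 1/2.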

Similarity Measure: Euclidean

Euclidean: |\vec{x}, \vec{y}| = |\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

               cosmonaut  astronaut  moon  car  truck
Soviet             1          0        0    1     1
American           0          1        0    1     1
spacewalking       1          1        0    0     0
red                0          0        0    1     1
full               0          0        1    0     0
old                0          0        0    1     1

euclidean(cosm, astr) = \sqrt{(1 - 0)^2 + (0 - 1)^2 + (1 - 1)^2 + (0 - 0)^2 + (0 - 0)^2 + (0 - 0)^2}

Figure 2: Euclidean Similarity Comparison from Collins (2007)
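Evaluating the Euclidean expression by hand gives √2 for 'cosmonaut' and 'astronaut'. The sketch below, with illustrative variable and function names, does the same arithmetic in plain Python for three of the table's columns. Note that Euclidean is a distance, so smaller values mean more similar words.

```python
import math

# Co-occurrence vectors read off the table, ordered as
# (Soviet, American, spacewalking, red, full, old).
cosmonaut = [1, 0, 1, 0, 0, 0]
astronaut = [0, 1, 1, 0, 0, 0]
moon = [0, 0, 0, 0, 1, 0]

def euclidean_distance(x, y):
    """Square root of the summed squared differences between components."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(round(euclidean_distance(cosmonaut, astronaut), 4))  # 1.4142
print(round(euclidean_distance(cosmonaut, moon), 4))       # 1.7321
print(euclidean_distance(cosmonaut, cosmonaut))            # 0.0
```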

Cosine & Euclidean Similarity in Python

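>>> import numpy as np
>>> from scipy.spatial import distance as dist
>>> cosm = np.array([1, 0, 1, 0, 0, 0])
>>> astr = np.array([0, 1, 1, 0, 0, 0])
>>> # SciPy returns cosine *distance*, i.e. 1 - cosine similarity
>>> round(float(dist.cosine(cosm, astr)), 6)
0.5
>>> # Euclidean distance between the two vectors is sqrt(2)
>>> round(float(dist.euclidean(cosm, astr)), 6)
1.414214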

Computing Word Similarity by Vector Space Modeling

• Advantages & Disadvantages
– Advantage: requires no database lookups
– Disadvantage: semantic similarity doesn't imply synonymy, antonymy, meronymy, hyponymy, hypernymy, etc.

Ontology Basics

• Semantic Web Technologies
– Data Models
– Ontology Language
– Distributed Query Language
– Applications
∗ Large knowledge bases
∗ Business Intelligence

Ontology Basics

Figure 3: Cambridge Semantics’ simplified view of Semantic Web solutions.

Ontology Basics
• W3C Semantic Web
– RDF - Resource Description Framework
∗ Data model w/ identifiers and named relations b/t resource pairs
∗ Represented as directed graphs b/t resources and literal values
· Done w/ collections of triples
· triple: subject, predicate and object
1. Na’im Tyson born in 197x
2. Na’im Tyson works for Knowledgent
3. Knowledgent headquartered in Warren
– SPARQL - SPARQL Protocol And RDF Query Language
∗ Query language of Semantic Web
∗ Queries RDF stores over HTTP
∗ Very similar to SQL
– Capturing Relationships
RDF Schema: Vocabulary (term definitions), Schema (class definitions) and Taxonomies (defining hierarchies)
OWL: Expressive relation definitions (symmetry, transitivity, etc.)
RIF: Rule Interchange Format - representation for exchanging sets of logical and business rules
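The three example statements listed under RDF can be written down as subject-predicate-object triples and queried by pattern matching, which is the core idea behind a SPARQL query. The sketch below is a plain-Python stand-in, not an RDF library or a SPARQL engine: the predicate names (`bornIn`, `worksFor`, `headquarteredIn`) and both helper functions are hypothetical, and real RDF would use IRIs as identifiers.

```python
# The three example statements as (subject, predicate, object) triples.
# Predicate names are hypothetical labels; "197x" is kept elided as in the example.
TRIPLES = [
    ("Na'im Tyson", "bornIn", "197x"),
    ("Na'im Tyson", "worksFor", "Knowledgent"),
    ("Knowledgent", "headquarteredIn", "Warren"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard (like a query variable)."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def employer_headquarters(triples, person):
    """Two-hop query over the graph: where is the person's employer headquartered?"""
    return [
        city
        for _, _, employer in match(triples, subject=person, predicate="worksFor")
        for _, _, city in match(triples, subject=employer, predicate="headquarteredIn")
    ]

print(match(TRIPLES, subject="Na'im Tyson"))
print(employer_headquarters(TRIPLES, "Na'im Tyson"))  # ['Warren']
```

The two-hop query works because the object of one triple ('Knowledgent') is the subject of another, which is what makes the collection a directed graph.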

Text Mining Basics

• What do people think text mining is?
– Automated discovery of new, previously unknown information by automatically extracting information from a usually large amount of different unstructured textual resources (Wasilewska, 2014)

Text Mining Basics
• What text mining really is

Figure 4: Venn Diagram of Text Mining (Wasilewska, 2014). Labels in the diagram: Data Mining, Information Retrieval, Text Mining, Statistics, Web Mining, Computational Linguistics & Natural Language Processing.

Text Mining Basics
• A General Approach (ignore the cloud!)

Figure 5: General Approaches to Text Mining Process (Wasilewska, 2014). Stages, from input to output: Text, Text Preprocessing, Text Transformation (Attribute Generation), Attribute Selection, Data Mining / Pattern Discovery, Interpretation / Evaluation. Other labels in the figure: Document Clustering, Text Characteristics.

Text Mining Basics

• Application - Document Clustering
Goal: Group large amounts of textual data
Techniques: High Level
– k-means - top down
∗ cluster documents into k groups using vectors and distance metric
– agglomerative hierarchical clustering - bottom up
∗ Start with each document being a single cluster
∗ Eventually all documents belong to the same cluster
∗ Documents represented as a hierarchy (dendrogram)
Reference: Taming Text (see Ingersoll et al., 2013, chap. 6)
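The bottom-up procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation from Taming Text: the document vectors are hypothetical term counts, the distance is Euclidean, and clusters are compared by single linkage (closest pair of members), which is only one of several possible linkage rules.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2, vectors):
    """Distance between clusters = distance of their closest pair of members."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)

def agglomerative(vectors):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(vectors))]  # each document starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(
            ((c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda pair: single_linkage(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b))
    return merges

# Hypothetical term-count vectors for four documents (indices 0..3).
docs = [[5, 0, 1], [4, 1, 0], [0, 6, 5], [1, 5, 5]]
for left, right in agglomerative(docs):
    print(left, "+", right)
```

The merge history, read from first to last, is the dendrogram from leaves to root: the closest documents pair up first, and the final merge joins everything into a single cluster.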
• Final Remarks

References
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and
Jenifer C. Lai. Class-based n-gram models of natural language. Computational
Linguistics, 18:467–479, 1992.
Michael Collins. Lexical Semantics: Similarity Measures and Clustering, November 2007. URL
http://www.cs.columbia.edu/~mcollins/6864/slides/wordsim.4up.pdf.
Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris. Taming Text: How
to Find, Organize, and Manipulate It. Manning Publications Co., January 2013.
Jaganadhg. Wordnet sense similarity with nltk: some basics, October 2010. URL
http://jaganadhg.freeflux.net/blog/archive/tag/WSD/.
George A. Miller. Wordnet: A lexical database for english. Communications of the
ACM, 38(11):39–41, 1995.
Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt
Publishing, 2010.
Anita Wasilewska. CSE 634 - Data Mining: Text Mining, January 2014. URL
http://www.cs.sunysb.edu/~cse634/presentations/TextMining.pdf.
