HalifaxNGGs

N-gram graphs and proximity graphs: bringing
summarization, machine learning and
bioinformatics to the same neighborhood
George Giannakopoulos (1, 2)
(1) NCSR Demokritos, Greece
(2) SciFY NPC, Greece
April 2014

Introductions
NCSR Demokritos and IIT
NCSR “Demokritos”: the biggest science research center in Greece
Visit: http://www.iit.demokritos.gr/

Introductions
Greece
5 institutes covering Biology, Material Science, Nuclear Physics, Nanotechnology, Informatics and Telecommunications, ...
Institute of Informatics and Telecommunications (IIT)
Intelligent Information Systems (IIS) Domain
Telecommunication Systems (TELECOM) Domain
Services and Measurements (S&M) Domain

Introductions
SKEL Lab – Part of IIS Domain
Content Analysis and Knowledge Technologies Group (CAKT)
Complex Event Recognition Group (CER)
Creativity Research Unit (CRU)
Personalisation & Social Network Analysis Group (PerSoNA)
Robotics Activity Group (RoboSKEL)
Visit: http://www.iit.demokritos.gr/skel

Introductions
Why are we here?
Talk about a framework based on neighborhood/proximity
Discuss its strengths and weaknesses
Get a glimpse of its generic applicability
Foster potential collaborations

Introductions
Intuition — Text
People can read even when words are spelled wnorg
But order does play some role: not it does?

Introductions
First thoughts
We can deal with noise.
Exact sequence is useful, but not critical.
Proximity does play a role in many settings.

Introductions
An N-gram Graph
Figure: an example n-gram graph over the vertices _abc, _bcd, _cde and _def; every edge has weight 1.00.
Indicates neighborhood
Edges are important
Vertices are unique
Edge weights can have different semantics (usually: co-occurrence)

Introductions
Extraction Process (Natural Language Processing example)
Given a string...
Extract n-grams...
Determine neighborhood (window size Dwin)...
Assign weights to edges. DONE!
Example
String: abcde
Char. N-grams (Rank 3): abc, bcd, cde
Edges (Window Size 1): abc-bcd, bcd-cde
Weights (Freq.): abc-bcd (1.0), bcd-cde (1.0)

Introductions
N-gram graph parameters
n: describes the length/order of “atoms”
Dwin: describes the limits of a neighborhood of atoms
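
The extraction steps and the two parameters can be sketched in a few lines of Python. This is a minimal illustration, not the JINSECT implementation: the function name `extract_ngram_graph`, the dictionary representation and the choice to link each n-gram only to the n-grams that follow it are assumptions made for this example.

```python
from collections import defaultdict


def extract_ngram_graph(text, n=3, d_win=1):
    """Build a character n-gram graph as a dict {(source, target): weight}.

    Assumption for this sketch: each n-gram is linked to the d_win n-grams
    that follow it, and the weight counts how often the pair co-occurs.
    """
    # Step 1: extract the character n-grams of rank n.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]

    # Steps 2 and 3: determine the neighborhood and accumulate edge weights.
    graph = defaultdict(float)
    for i, current in enumerate(ngrams):
        for neighbour in ngrams[i + 1:i + 1 + d_win]:
            graph[(current, neighbour)] += 1.0
    return dict(graph)


print(extract_ngram_graph("abcde", n=3, d_win=1))
# {('abc', 'bcd'): 1.0, ('bcd', 'cde'): 1.0}
```

With n = 3 and Dwin = 1 the sketch reproduces the edges and weights of the abcde example; a larger Dwin adds edges between n-grams that are further apart.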

Introductions
Window-based Extraction of Neighborhood — Examples
Figure: N-gram Window Types (top to bottom): non-symmetric,
symmetric and gauss-normalized symmetric.

Introductions
N-gram Graph — Representation Examples
Figure: Graphs Representing the String abcdef (from left to right):
non-symmetric, symmetric and gauss-normalized symmetric. N-Grams of
Rank 3. Dwin value 2.

Introductions
What is an N-gram Graph?
Description of
co-existence/proximity/collocation/neighborhood
Statistics over the neighborhood
Allows different analysis levels (character, word, ...)
Allows different distance measures (1D, 2D, tree-based, ...)
Lossy compression (one graph – many “texts”?)
Restrictions upon relative positioning

Introductions
What is Important regarding the N-Gram Graph
Co-occurrence information inherent
Arbitrary fuzziness (parameters)
Generic applicability (domain agnostic)
Paths provide more info on neighborhood

Introductions
N-gram Graph vs ...
Bag-of-words: more information
Probabilistic sequential models: not probabilistic (by default), not strictly sequential
Automata: no explicit transitions
Neural networks: no input/output; just representation

N-Gram Graph Generic Operators
Nice things we would like to do...
Represent sets of items with a single graph (e.g. classification)
Compare things in the graph world (e.g. clustering)
Talk about graphs in the vector space

Merging or Union ∪ and Update U
Similarity function sim

Representing Sets of Graphs
A representative graph for a set
Represents an “average” case (like a centroid!)
Common edges: transferred and weights averaged to the
result graph
Non-common edges: simply copied to the result graph
(non-linearity)
But merging graphs can be tricky...

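Merge vs. Update (1)
Merging (∪) two graphs that share the edge _abc-_bcd, with weights 1.00 and 2.00, averages the weights to 1.50:
$\frac{1 + 2}{2} = 1.5$

Merging (∪) the result (weight 1.50) with a third graph (weight 3.00) gives 2.25:
$\frac{1.5 + 3}{2} = 2.25$, but what if we want $\frac{1 + 2 + 3}{3} = 2$?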

updatedValue = oldValue + l × (newValue − oldValue) (1)
where 0 ≤ l ≤ 1 is the learning factor
Representative (or Centroid) Graph
Use update operator with learning factor $l = \frac{1}{instanceCount}$, where instanceCount is the number of instances that will be described by the graph after the update.

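Example with $l = \frac{1}{3}$: update(graph with _abc-_bcd at 1.50, graph with _abc-_bcd at 3.00, l) = graph with _abc-_bcd at 2.00
$1.5 + \frac{1}{3} \times (3 - 1.5) = \frac{3}{2} + \frac{1}{3} \times \frac{3}{2} = \frac{4}{2} = 2$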
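
The update operator with a learning factor of 1/instanceCount can be sketched as follows. This is an illustrative sketch under assumptions: graphs are plain dictionaries mapping an edge to its weight, and the names `update` and `representative_graph` are hypothetical, not taken from JINSECT.

```python
def update(graph, new_graph, learning_factor):
    """Move the edge weights of `graph` towards `new_graph`.

    Common edges follow: updated = old + l * (new - old).
    Edges found only in `new_graph` are copied unchanged.
    """
    result = dict(graph)
    for edge, new_weight in new_graph.items():
        if edge in result:
            old_weight = result[edge]
            result[edge] = old_weight + learning_factor * (new_weight - old_weight)
        else:
            result[edge] = new_weight
    return result


def representative_graph(graphs):
    """Fold a sequence of graphs into one, using l = 1 / instanceCount."""
    centroid = {}
    for instance_count, graph in enumerate(graphs, start=1):
        centroid = update(centroid, graph, 1.0 / instance_count)
    return centroid


edge = ("_abc", "_bcd")
centroid = representative_graph([{edge: 1.0}, {edge: 2.0}, {edge: 3.0}])
print(f"{centroid[edge]:.2f}")  # 2.00
```

Folding the three example graphs one by one gives the plain average of 1, 2 and 3, which repeated merging with ∪ does not.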

Claim
We can describe groups/classes/sets of things as one graph!

N-gram Graph – Similarity (1)
Size Similarity: Number of Edges
Co-occurrence Similarity: Existence of Edges
Value Similarity: Existence and Weight of Edges
Derived Measures: Normalized Value Similarity, Overall
similarity

|G| the size of a graph (edgecount)

Size Similarity of G1, G2: $SS = \frac{\min(|G_1|, |G_2|)}{\max(|G_1|, |G_2|)}$
Containment Similarity: Each common edge adds $\frac{1}{\min(|G_1|, |G_2|)}$ to a sum.
Value Similarity: Using weights, every common edge $e$ adds $\frac{\min(w_e^i, w_e^j)}{\max(w_e^i, w_e^j)} \cdot \frac{1}{\max(|G_1|, |G_2|)}$ to a sum, where $w_e^i$ and $w_e^j$ are the weights of $e$ in the two graphs.
Normalized Value Similarity, SS factored out: $NVS = \frac{VS}{SS}$
Overall Similarity for $n \in [L_{min}, L_{MAX}]$: Weighted sum of rank similarity.

N-gram Graph – Size Similarity
Example
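Figure: G1 has two edges (weights 1.0 and 8.0) over the vertices a, b, c; G2 has four edges (weights 1.0, 4.0, 1.0, 1.0) over the vertices a, b, c, d, e.
$|G_1| = 2$, $|G_2| = 4$
Result: $\frac{2}{4} = 0.5$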

N-gram Graph – Containment Similarity
Example
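Figure: G1 has two edges (weights 1.0 and 8.0) over the vertices a, b, c; G2 has four edges (weights 1.0, 4.0, 1.0, 1.0) over the vertices a, b, c, d, e.
Result: $\frac{1}{2} + \frac{1}{2} = 1.0$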

N-gram Graph – Value Similarity
Example
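Figure: G1 has two edges (weights 1.0 and 8.0) over the vertices a, b, c; G2 has four edges (weights 1.0, 4.0, 1.0, 1.0) over the vertices a, b, c, d, e.
Result: $\frac{1.0 / 1.0}{4} + \frac{4.0 / 8.0}{4} = \frac{1}{4} + \frac{1}{8} = 0.375$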

N-gram Graph – Normalized Value Similarity
Example
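Figure: G1 has two edges (weights 1.0 and 8.0) over the vertices a, b, c; G2 has four edges (weights 1.0, 4.0, 1.0, 1.0) over the vertices a, b, c, d, e.
Result: $\frac{VS}{SS} = \frac{0.375}{0.5} = 0.75$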
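
The four similarity measures can be checked against the examples with a short Python sketch. The dictionary representation and the function names are assumptions for illustration. The edge endpoints chosen for G1 and G2 are also assumed (the figures give only vertex labels and weights), and both graphs are taken to be non-empty with positive weights.

```python
def size_similarity(g1, g2):
    """SS: ratio of the smaller edge count to the larger one."""
    return min(len(g1), len(g2)) / max(len(g1), len(g2))


def containment_similarity(g1, g2):
    """CS: each common edge adds 1 / min(|G1|, |G2|)."""
    common = g1.keys() & g2.keys()
    return len(common) / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """VS: each common edge adds (min weight / max weight) / max(|G1|, |G2|)."""
    total = 0.0
    for edge in g1.keys() & g2.keys():
        w1, w2 = g1[edge], g2[edge]
        total += min(w1, w2) / max(w1, w2)
    return total / max(len(g1), len(g2))


def normalized_value_similarity(g1, g2):
    """NVS: value similarity with the size similarity factored out."""
    return value_similarity(g1, g2) / size_similarity(g1, g2)


# Assumed edge endpoints; only the weights come from the slides.
g1 = {("a", "b"): 1.0, ("a", "c"): 8.0}
g2 = {("a", "b"): 1.0, ("a", "c"): 4.0, ("a", "d"): 1.0, ("a", "e"): 1.0}

print(size_similarity(g1, g2))              # 0.5
print(containment_similarity(g1, g2))       # 1.0
print(value_similarity(g1, g2))             # 0.375
print(normalized_value_similarity(g1, g2))  # 0.75
```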

Questions (...and catching one’s breath...)

Summary evaluation
Evaluating summaries
Given a set of texts
...and a set of ideal summaries
...can we grade other summaries similarly to humans?

Summary evaluation
Main NGG-based approaches
AutoSummENG: compare summary graph to model summaries individually [Giannakopoulos et al., 2008]
MeMoG: compare summary graph to combined model graph [Giannakopoulos and Karkaletsis, 2011]
N-POWER: use different levels of NGG analysis and train estimation model [Giannakopoulos et al., 2014]

Summary evaluation
AutoSummENG
VS(ModelGraph1, PeerGraph)= 0.5
AutoSummENG Grade: $\frac{0.5 + 0.6 + 0.7}{3} = 0.6$

Summary evaluation
MeMoG
VS(RepresentativeModelGraph, PeerGraph)= 0.45
MeMoG Grade: 0.45
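
The difference between the two grades can be sketched in Python: AutoSummENG averages the similarity of the peer graph to each model graph, while MeMoG first folds the model graphs into one representative graph and compares once. This is a simplified illustration under assumptions: graphs are dictionaries from edge to weight, the toy graphs are invented, and the helper names are hypothetical rather than the authors' code.

```python
def value_similarity(g1, g2):
    """Value Similarity of two non-empty graphs given as {edge: weight}."""
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def merge_models(model_graphs):
    """Representative graph: running average with factor 1 / instanceCount."""
    merged = {}
    for count, graph in enumerate(model_graphs, start=1):
        for edge, weight in graph.items():
            if edge in merged:
                merged[edge] += (weight - merged[edge]) / count
            else:
                merged[edge] = weight  # non-common edges are copied
    return merged


def autosummeng_grade(model_graphs, peer_graph):
    """Mean similarity of the peer graph to each model graph."""
    scores = [value_similarity(model, peer_graph) for model in model_graphs]
    return sum(scores) / len(scores)


def memog_grade(model_graphs, peer_graph):
    """Similarity of the peer graph to the merged model graph."""
    return value_similarity(merge_models(model_graphs), peer_graph)


# Toy graphs, invented for illustration only.
models = [{"abc-bcd": 1.0, "bcd-cde": 2.0}, {"abc-bcd": 1.0}]
peer = {"abc-bcd": 1.0, "bcd-cde": 1.0}
print(autosummeng_grade(models, peer))
print(memog_grade(models, peer))
```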

Summary evaluation
The power of combined n-gram graphs: N-POWER
Features: Base evaluation scores (“n-gram graph”-based)
Target: Human summary grade, Grammaticality,
Responsiveness, ...
Regression: Combines features to estimate the target

Summary evaluation
Datasets
AESOP task data (Text Analysis Conferences)
TAC 2009
352 model summaries
4840 peer summaries
35 AESOP metrics, 2 baselines
TAC 2010
368 model summaries
3956 peer summaries
27 AESOP metrics, 3 baselines

Summary evaluation
Improving
Setting: Set A
Year  Features             Pearson  Spearman  Kendall
2009  Baseline: ROUGE-SU4  0.33     0.30      0.22
2009  Baseline: BE         0.26     0.27      0.20
2009  Baseline comb.       0.34     0.29      0.21
2009  NPowER               0.60     0.42      0.32
2009  All                  0.83     0.80      0.65
2010  Baseline: ROUGE-SU4  0.61     0.61      0.47
2010  Baseline: BE         0.48     0.50      0.38
2010  Baseline comb.       0.61     0.60      0.47
2010  NPowER               0.72     0.68      0.54
2010  All                  0.75     0.72      0.58
Table: Per summary correlation to Responsiveness.

Summary evaluation
Across sets and years
Setting: Target: Responsiveness
Train Year  Test Year  Pearson  Spearman  Kendall
2009        2010       0.72     0.65      0.52
2010        2009       0.61     0.47      0.35
Setting: Target: Responsiveness
Train Set   Test Set   Pearson  Spearman  Kendall
A           B          0.64     0.55      0.42
B           A          0.63     0.55      0.42
Table: Correlations between NPowER grades and Responsiveness across years (top half) and sets (bottom half)

Summarization
Cluster texts: events

Summarization
Cluster sentences: subtopics

Summarization
Union of subtopics: the essence

Summarization
And then?...
Similarity of a sentence to the essence: salience
Similarity between sentences: redundancy
Thus, NewSum [Giannakopoulos et al., 2014] was born
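
The salience and redundancy idea can be sketched as a greedy selection loop. This is not the NewSum implementation: the function names, the redundancy threshold, the use of Value Similarity for both checks and the shortcut of treating every sentence as its own subtopic when building the essence are all assumptions made for this example.

```python
def ngram_graph(text, n=3, d_win=3):
    """Character n-gram graph as {(source, target): weight}."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    graph = {}
    for i, current in enumerate(ngrams):
        for neighbour in ngrams[i + 1:i + 1 + d_win]:
            edge = (current, neighbour)
            graph[edge] = graph.get(edge, 0.0) + 1.0
    return graph


def value_similarity(g1, g2):
    """Value Similarity; defined as 0.0 here when either graph is empty."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def merge(graphs):
    """Running-average merge, standing in for the union of subtopics."""
    merged = {}
    for count, graph in enumerate(graphs, start=1):
        for edge, weight in graph.items():
            if edge in merged:
                merged[edge] += (weight - merged[edge]) / count
            else:
                merged[edge] = weight
    return merged


def select_sentences(sentences, max_sentences=2, redundancy_threshold=0.5):
    """Pick salient sentences, skipping those too similar to earlier picks."""
    graphs = [ngram_graph(s) for s in sentences]
    essence = merge(graphs)
    # Salience: similarity of each sentence graph to the essence graph.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: value_similarity(graphs[i], essence),
                    reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == max_sentences:
            break
        # Redundancy: similarity to any sentence already selected.
        if all(value_similarity(graphs[i], graphs[j]) < redundancy_threshold
               for j in chosen):
            chosen.append(i)
    return [sentences[i] for i in chosen]


sentences = ["the cat sat on the mat",
             "the cat sat on the mat",
             "dogs bark loudly at night"]
print(select_sentences(sentences))
```

The repeated sentence is dropped as redundant, so the selection keeps one copy of it plus the unrelated sentence.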

Summarization
A little more on NewSum
Using n-gram graphs
Android Application
Website version (http://www.newsumontheweb.org)
Applied to English, Greek, German
Highly rated by users on quality of summaries
Born in SciFY (http://www.scify.org)
Awarded (Efarmogiada #3)

Personalization and Classification
N-gram graph information in personalization
User likes things based on
content (e.g. free-strings or images)
meta-data
recency
...and others
How can we create a common vector space, using NGGs?

Combining multi-modal features
Figure: Embedding into vector space. A content string (e.g. abcdef) is represented as a content n-gram graph, which is compared through Normalized Value Similarity to the "Interesting", "Critical" and "Uninteresting" content model graphs; the resulting similarities take their place in a feature vector such as (1,0,1,0,...,0.42,0.27,0.8).
Used for adaptive entity subscription services [Giannakopoulos and Palpanas, 2009]
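
The embedding step can be sketched as follows: the content graph is compared to one model graph per class, and the similarities are appended to whatever other features exist. The function names, the toy graphs and the alphabetical ordering of class labels are assumptions for illustration; graphs are assumed to be non-empty.

```python
def normalized_value_similarity(g1, g2):
    """NVS = VS / SS for two non-empty graphs given as {edge: weight}."""
    common = g1.keys() & g2.keys()
    larger = max(len(g1), len(g2))
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / larger
    ss = min(len(g1), len(g2)) / larger
    return vs / ss


def embed(content_graph, class_graphs, other_features=()):
    """Feature vector: other features, then one similarity per class graph."""
    similarities = [
        normalized_value_similarity(content_graph, class_graphs[label])
        for label in sorted(class_graphs)  # fixed order keeps vectors aligned
    ]
    return list(other_features) + similarities


# Toy graphs, invented for illustration only.
content = {("abc", "bcd"): 1.0, ("bcd", "cde"): 1.0}
class_graphs = {
    "interesting": {("abc", "bcd"): 1.0, ("bcd", "cde"): 1.0},
    "uninteresting": {("xyz", "yzw"): 1.0},
}
print(embed(content, class_graphs, other_features=(1, 0)))
# [1, 0, 1.0, 0.0]
```

Any vector-space classifier can then be trained on these vectors, which is how graph-based and non-graph features end up in a common space.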

Topic classification
Figure: Classification only with NGG features

Results over different document types
Representation    D_reuters  D_blogs  D_twitter
Vector Model      76.67%     49.81%   45.22%
Bigrams           56.02%     56.19%   53.94%
Trigrams          64.33%     61.53%   55.73%
Four-grams        71.19%     63.51%   51.42%
Bigram Graphs     50.81%     49.57%   41.08%
Trigram Graphs    90.71%     62.33%   62.94%
Four-gram Graphs  93.71%     64.94%   57.61%
Table: Precision of Naive Bayes Multinomial Classifier [Giannakopoulos et al., 2012]

Bioinformatics
Classification of Constrained DNA Elements
The setting
DNA sequences
Different classes (based on different criteria)
Question: Can we perform classification using NGGs?

Bioinformatics
Datasets
UltraConserved Noncoding Elements (UCNEs)
EU100 nonexonic CNEs (EU100nx CNEs)
Amniotic and Mammalian CNEs
Worm and Insect UCNEs

Bioinformatics
Results
Task Description                       NGG    GS
worm exons vs. surrogates              57.7   61.05
human exons vs. surrogates             74.3   63.41
insect exons vs. surrogates            52.36  59.45
worm UCNEs vs. surrogates              56.76  55.62
human UCNEs vs. surrogates             82.63  72.00
human EU100nx CNEs vs. surrogates      76.43  63.75
amniotic CNEs vs. surrogates           78.62  63.00
mammalian CNEs vs. surrogates          75.85  55.65
insect UCNEs vs. surrogates            64.15  62.65
Average                                68.76  61.84
Table: Constrained DNA sequences versus background surrogates [Polychronopoulos et al., 2014]

Bioinformatics
Still alive! Questions?
Next: Just the conclusion.

Strengths and weaknesses
Strengths
Language-neutral due to statistical nature
No preprocessing required
More than bag-of-words (subword level, sequence information kept)
More than the vector space representation (graph paths, arbitrary fuzziness, ...)
Different from statistical models (encodes restrictions on neighborhood with arbitrary fuzziness, expresses possibility more than probability in current operators, minimal data needed)
Combinable with vector space
Makes use of uncommon events

Weaknesses
Somewhat demanding in computing power (BUT fully parallelizable)
Difficult to express generalization potential
Optimal parameters non-trivial to determine (but can be done, e.g. [Giannakopoulos et al., 2008])
Can become noisy if a graph represents too many instances

Proximity graphs
Generalization of n-gram graphs
Applicable in any similarity space
Vertices can be anything (e.g. vectors)
They allow nesting/hierarchy (graph of graphs)
Already applied to behavior recognition from video

Where should I use n-gram graphs?
When neighborhood seems to contain information
When we have a measure of proximity
When a whole can be described by the neighborhood of its
parts
When uncommon co-occurrences (rare events) are of
importance
When I want to combine multi-modal concepts that co-occur

Ongoing domains of research
Material science (surface grading)
Video/scene analysis
Personalization
Classification
e-Government (opinion mining, sentiment analysis)
N-gram graph theory

To remember...
Why n-gram graphs?
Generic framework for neighborhood/proximity patterns
Can express complex relations
Proven usefulness in a variety of domains
Already implemented in the JINSECT library
[Giannakopoulos, 2010]

To remember...
Thank you
ggianna@iit.demokritos.gr

To remember...
Giannakopoulos, G. (2010). JINSECT. http://mloss.org/software/view/234/.
Giannakopoulos, G. and Karkaletsis, V. (2011). AutoSummENG and MeMoG in evaluating guided summaries. In TAC 2011 Workshop.
Giannakopoulos, G., Karkaletsis, V., Vouros, G., and Stamatopoulos, P. (2008). Summarization system evaluation revisited: N-gram graphs. ACM Trans. Speech Lang. Process., 5(3):1–39.
Giannakopoulos, G., Kiomourtzis, G., and Karkaletsis, V. (2014). NewSum: “N-Gram Graph”-Based Summarization in the Real World. IGI.
Giannakopoulos, G., Mavridi, P., Paliouras, G., Papadakis, G., and Tserpes, K. (2012). Representation models for text classification: a comparative analysis over three web document types. ICPS. ACM.
Giannakopoulos, G. and Palpanas, T. (2009). Adaptivity in entity subscription services. In Proceedings of ADAPTIVE2009.
Polychronopoulos, D., Krithara, A., Nikolaou, C., Paliouras, G., Almirantis, Y., and Giannakopoulos, G. (2014). Analysis and classification of constrained DNA elements with n-gram graphs and genomic signatures.
