Dynamic modelling of document streams
1. A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams
Lourdes Araujo, JJ Merelo
lurdes@lsi.uned.es, jj@merelo.net
Dpto. Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
Dpto. Arquitectura y Tecnología de Computadores
Universidad de Granada
Spain
A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams– p.1/24
2. Why
• Document metadata, such as arrival time, helps organize document streams.
• Temporal information helps make sense of document streams such as e-mails and news items.
• Its study combines content analysis and time series modelling.
3. Showing interest
• Hypothesis: bursts of interest match points in time where arrival intensity increases sharply.
• In general, arrival times are quite irregular.
[Figure: number of document arrivals (Y axis) against time (X axis).]
4. Regularizing irregularity
• A cost function is introduced that reflects how difficult it is to move from one state to another.
• Intervals of similar frequency should be grouped in a single state, so changes of state are penalized. But we shouldn't overdo it.
5. Kleinberg’s model
• The document stream is modeled as an infinite-state automaton, A, which emits messages at different frequencies.
• Each state has a frequency assigned.
• Bursts are indicated by transitions from a lower to a higher state.
• Frequency changes are controlled by assigning costs to state changes, which suppresses small bursts and makes real bursts easier to identify.
6. Infinite state automaton model
• The time sequence is generated from an exponential distribution.
  • The time interval x between messages i and i + 1 follows the exponential density $f(x) = \alpha e^{-\alpha x}$, for $\alpha > 0$.
  • The expected value of the interval is $\alpha^{-1}$.
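The exponential interval model can be illustrated with a minimal Python sketch. The function names (`exp_density`, `sample_arrival_times`, `mean_interval`) and the choice of seed are hypothetical and only serve the illustration; the slides do not show any code.

```python
import math
import random


def exp_density(x, alpha):
    """Exponential density f(x) = alpha * exp(-alpha * x), for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * math.exp(-alpha * x) if x >= 0 else 0.0


def sample_arrival_times(n_intervals, alpha, seed=0):
    """Return n_intervals + 1 arrival times whose gaps are exponential(alpha)."""
    rng = random.Random(seed)
    t = 0.0
    times = [t]
    for _ in range(n_intervals):
        t += rng.expovariate(alpha)  # gap between message i and i + 1
        times.append(t)
    return times


def mean_interval(times):
    """Average gap between consecutive arrivals; close to 1 / alpha for long streams."""
    return (times[-1] - times[0]) / (len(times) - 1)
```

A higher `alpha` gives shorter gaps on average, which is why a state with a high rate stands for a burst.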
7. First things first: two-state model
• Basic model: a two-state probabilistic automaton A, with states $q_0$ (low emission rate) and $q_1$ (high emission rate).
• Given n + 1 messages and n intervals, a Bayes procedure is used to fit the conditional probability of a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$, which leads to the cost
$$c(q|x) = b \ln\left(\frac{1-p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$
where b is the number of state transitions and p is the probability of changing state. The first term favours sequences with few transitions; the second favours states that fit the interval sequence.
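As a rough sketch of how this two-state cost could be evaluated, assuming each state i has an exponential density with rate `rates[i]` and that b counts the changes between consecutive states of the sequence (the function name and the example rates are hypothetical):

```python
import math


def two_state_cost(states, intervals, rates, p):
    """c(q|x) = b * ln((1 - p) / p) + sum_t -ln f_{i_t}(x_t).

    states[t] is 0 (low rate) or 1 (high rate) for interval t;
    rates[i] is the exponential rate alpha of state i (an assumption
    of this sketch, taken from the exponential model of the gaps).
    """
    if len(states) != len(intervals):
        raise ValueError("one state per interval is required")
    # b: number of state transitions along the sequence
    b = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    transition_cost = b * math.log((1 - p) / p)
    # -ln(alpha * exp(-alpha * x)) = alpha * x - ln(alpha)
    fit_cost = sum(rates[q] * x - math.log(rates[q])
                   for q, x in zip(states, intervals))
    return transition_cost + fit_cost
```

With `rates = (1.0, 5.0)` and `p = 0.1`, labelling two short gaps of 0.1 as the high state costs less than keeping the whole sequence in the low state, even after paying for one transition.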
8. To the infinite and beyond
• Given a sequence of intervals $x = (x_1, x_2, \ldots, x_n)$, a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$ must be found that minimizes
$$c(q|x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$
• f is related to the resolution of the discrete rates within the continuous range of emission rates, and τ to the ease of changing state.
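The slides do not spell out τ or the state rates. The sketch below assumes the definitions from Kleinberg's original formulation: rates $\alpha_i = (n/T)\, s^i$, where T is the total length of the stream, and $\tau(i, j) = (j - i)\,\gamma \ln n$ for j > i, zero otherwise, with the sequence starting in state 0. Function names and default values for `s` and `gamma` are hypothetical.

```python
import math


def make_rates(intervals, k, s=2.0):
    """Rates alpha_i = (n / T) * s**i for i = 0..k-1 (assumed, after Kleinberg)."""
    n = len(intervals)
    total = sum(intervals)
    base = n / total  # inverse of the mean gap
    return [base * s ** i for i in range(k)]


def tau(i, j, n, gamma=1.0):
    """Cost of moving from state i to state j; only moving up is charged."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0


def sequence_cost(states, intervals, rates, gamma=1.0):
    """c(q|x) = sum_t tau(i_t, i_{t+1}) + sum_t -ln f_{i_t}(x_t), with i_0 = 0."""
    n = len(intervals)
    cost = 0.0
    prev = 0  # the automaton is assumed to start in the lowest state
    for q, x in zip(states, intervals):
        cost += tau(prev, q, n, gamma)
        cost += rates[q] * x - math.log(rates[q])  # -ln(alpha * exp(-alpha * x))
        prev = q
    return cost
```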
11. Infinite is a bit too much
• $A^{*}_{s,\gamma}$, which minimizes $c(q|x)$, is restricted to $A^{k}_{s,\gamma}$, with k states.
• We will use an evolutionary algorithm to find $A^{k}_{s,\gamma}$.
• Finally!
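For comparison with the evolutionary approach, once the automaton has k states the minimum-cost sequence can also be computed exactly by dynamic programming, which is the Viterbi baseline used in the time results. The following is a sketch under the same assumed transition cost as in Kleinberg's formulation ($\tau(i, j) = (j - i)\,\gamma \ln n$ when moving up, zero otherwise); it is not the authors' implementation, and the function name is hypothetical.

```python
import math


def viterbi_min_cost(intervals, rates, gamma=1.0):
    """Return (minimum cost, state path) over k = len(rates) states."""
    n, k = len(intervals), len(rates)
    ln_n = math.log(n)

    def tau(i, j):
        return (j - i) * gamma * ln_n if j > i else 0.0

    def fit(i, x):
        return rates[i] * x - math.log(rates[i])  # -ln f_i(x)

    # best[j]: cheapest cost of any prefix ending in state j (start state is 0)
    best = [tau(0, j) + fit(j, intervals[0]) for j in range(k)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for x in intervals[1:]:
        new_best, choice = [], []
        for j in range(k):
            i_min = min(range(k), key=lambda i: best[i] + tau(i, j))
            new_best.append(best[i_min] + tau(i_min, j) + fit(j, x))
            choice.append(i_min)
        best = new_best
        back.append(choice)

    # Walk the back-pointers from the cheapest final state
    state = min(range(k), key=lambda j: best[j])
    total = best[state]
    path = [state]
    for choice in reversed(back):
        state = choice[state]
        path.append(state)
    path.reverse()
    return total, path
```

The run time grows with n·k², which is the cost the evolutionary algorithm tries to avoid on long streams.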
12. Individual representation
• An individual is a sequence of n integers, $1 < q_{i_j} < E$, each representing an automaton state, paired with the id i of the last document in its part of the sequence.
• Document i arrives at time $t_i$, with $0 \le t_i \le T$ (intervals $x_i = t_i - t_{i-1}$).
  Arrival times: $t_1, t_2, \ldots, t_n$
  Chromosome: | $q_{t_1}, t_{k_1}$ | $q_{t_{k_1+1}}, t_{k_2}$ | ··· | $q_{t_f}, t_n$ |
• Fitness function = cost function.
• Initial population: documents chosen at random split the document stream into intervals, which are given random states.
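One way to picture this representation in Python, assuming each gene is a `(state, last_index)` pair and that states are numbered from 0 (the slide uses bounds 1 and E); `random_individual` and `decode` are hypothetical names:

```python
import random


def random_individual(n, n_states, n_cuts, rng):
    """Build a random chromosome over n intervals.

    n_cuts documents are chosen at random as gene boundaries; the last
    gene always ends at document n. Each gene gets a random state.
    """
    cuts = sorted(rng.sample(range(1, n), n_cuts)) + [n]
    return [(rng.randrange(n_states), end) for end in cuts]


def decode(individual):
    """Expand genes into one state per interval, ready for the cost function."""
    states, start = [], 0
    for state, end in individual:
        states.extend([state] * (end - start))
        start = end
    return states
```

Since the fitness is the cost function, evaluating an individual amounts to decoding it and computing the cost of the resulting state sequence. Storing only the boundaries keeps the chromosome short when the stream has few state changes.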
14. Mutation
• Several mutation operators:
  • Increment the state by one.
  • Merge two genes, with the state taken at random.
  • Split a gene in two: one keeps the original state, the other takes that state ±1.
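The three operators might look as follows on a chromosome of `(state, last_index)` genes. Details the slide leaves open are assumptions of this sketch: states are clamped to `0..max_state`, a merge keeps the state of one of the two genes, and in a split the first half keeps the original state.

```python
import random


def increment_state(ind, pos, max_state):
    """Raise the state of gene pos by one, capped at max_state."""
    state, end = ind[pos]
    new = list(ind)
    new[pos] = (min(state + 1, max_state), end)
    return new


def merge_genes(ind, pos, rng):
    """Merge gene pos with its right neighbour; pick one of their states at random."""
    if len(ind) < 2:
        return list(ind)
    pos = min(pos, len(ind) - 2)
    (s1, _), (s2, end2) = ind[pos], ind[pos + 1]
    return ind[:pos] + [(rng.choice((s1, s2)), end2)] + ind[pos + 2:]


def split_gene(ind, pos, rng, max_state):
    """Split gene pos at a random document; the second half gets state +1 or -1."""
    state, end = ind[pos]
    start = ind[pos - 1][1] if pos > 0 else 0
    if end - start < 2:
        return list(ind)  # a gene covering one interval cannot be split
    cut = rng.randrange(start + 1, end)
    other = min(max(state + rng.choice((-1, 1)), 0), max_state)
    return ind[:pos] + [(state, cut), (other, end)] + ind[pos + 1:]
```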
15. Effect of crossover
[Figure: Generation N. against crossover rate (%) for streams a, b and c.]
16. Effect of mutation
[Figure: Generation N. against mutation rate (%) for streams a, b and c.]
17. Effect of population size
[Figure: Generation N. against population size for streams a, b and c.]
18. Effect of number of generations
[Figure: cost function against Generation N. for streams a, b and c.]
19. Time results
          Viterbi              Evo. Alg.
State n.  Ex. time  Cost       Ex. time  Cost (Av. cost, Std. dev.)
15        2319.36   277402     1678.61   277712 (279385.6, 980.11)
20        3117.28   277306     2182.12   277528 (278980.4, 1114.91)
25        3835.37   277260     2033.81   277270 (279472.6, 1116.03)
[Figure: Time comparison. Time (s.) against number of states (15, 20, 25) for the evolutionary algorithm and Viterbi.]
20. Predicting the state of new arrivals
• Main point of this work: to predict whether buzz is going up or down.
• Several possible approaches: using the Viterbi algorithm over the whole sequence, and reusing evolutionary algorithms.
• Easy approach for a single state: assume the current trend continues.
21. Local approximation: results
Previous substream       A. T.  Old s.  New s.  Trend
··· 38 38 39 41 49 49    52     12      0       ↓
··· 41 49 49 52 68 69    69     3       4       ↑
··· 88 89 90 90 91 92    95     0       0       →
22. But it breaks down after a while
date             GA             approx.
0(2004-04-02)    7(0.694669)
···              ···
74(2004-06-15)   14(0.797281)
75(2004-06-16)   24(0.970706)
76(2004-06-17)   19(0.87973)
77(2004-06-18)   19(0.87973)    19(0.87973)
78(2004-06-19)   0(0.605263)    19(0.87973)
79(2004-06-20)   0(0.605263)    19(0.87973)
23. Fast GA for modelling new arrivals
• The results of the previous fitting are reused.
• The chromosome is extended, and the last gene gets a higher mutation probability.
[Figure: frequency against time for the GA fit and the approx. fit.]
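A possible reading of the seeding step, as a hedged sketch: the previously fitted chromosome is stretched so its last gene also covers the new documents, and mutation is biased towards that last gene. How the authors actually extend the chromosome and how much higher the probability is are not stated; `extend_seed`, `pick_gene_to_mutate`, and `last_gene_weight` are hypothetical.

```python
import random


def extend_seed(seed_ind, new_n):
    """Stretch the last (state, last_index) gene of a fitted individual to new_n."""
    state, end = seed_ind[-1]
    if new_n < end:
        raise ValueError("the extended stream cannot be shorter than the seed")
    return seed_ind[:-1] + [(state, new_n)]


def pick_gene_to_mutate(ind, rng, last_gene_weight=5.0):
    """Choose a gene index, favouring the last gene by last_gene_weight."""
    weights = [1.0] * (len(ind) - 1) + [last_gene_weight]
    return rng.choices(range(len(ind)), weights=weights, k=1)[0]
```

Starting from the extended seed means the GA only has to repair the tail of the chromosome, which is consistent with the much shorter seeded run times reported in the results.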
24. Fast GA: Results
Subst. len.  New subs. len.  T. w/out seed  T. w/ seed
219900       100             3895.28        141.45 (79.09)
219000       1000                           144.75 (81.96)
210000       10000                          166.73 (79.32)

Subst. len.  New subs. len.  T. w/out seed  T. w/ seed
3032         100             5048.49        54.6
2632         500                            92.247
2132         1000                           294.97
1132         2000                           570.41
(In each table, T. w/out seed is a single value shared by all rows.)
28. Conclusions
• The presented system dynamically detects changes in the trends of interest in a document stream.
• An EA makes it possible to deal with very large sequences of documents in a reasonable time.
• Extending this EA allows fitting a stream that extends a previously fitted substream in a very short time.
• We plan to study correlations among document streams, to automatically detect the occurrence of new topics composed of multi-word concepts.
30. The end
• Thanks for your attention
• Any questions?