1
Data Mining:
Concepts and Techniques
— Chapter 8 —
8.2 Mining time-series data
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2010 Jiawei Han and Micheline Kamber. All rights reserved.
3
Mining Time-Series Data
 A time series is a sequence of data points, measured typically at
successive times, spaced at (often uniform) time intervals
 Time series analysis: a subfield of statistics comprising methods that attempt
to understand such time series, often either to understand the underlying context
of the data points or to make forecasts (predictions)
 Methods for time series analyses
 Frequency-domain methods: Model-free analyses, well-suited to
exploratory investigations
 spectral analysis vs. wavelet analysis
 Time-domain methods: Auto-correlation and cross-correlation
analysis
 Motif-based time-series analysis
 Applications
 Financial: stock price, inflation
 Industry: power consumption
 Scientific: experiment results
 Meteorological: precipitation
4
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
5
Time-Series Data Analysis: Prediction &
Regression Analysis
 (Numerical) prediction is similar to classification
 construct a model
 use model to predict continuous or ordered value for a given input
 Prediction is different from classification
 Classification predicts a categorical class label
 Prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or
predictor variables and a dependent or response variable
 Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
6
What is Regression?
 Modeling the relationship between one response
variable and one or more predictor variables
 Analyzing the confidence of the model
 E.g., height vs. weight
7
Regression Yields Analytical Model
 Discrete data points → Analytical model
 General relationship
 Easy calculation
 Further analysis
 Application - Prediction
8
Application - Detrending
 Obtain the trend for irregular data series
 Subtract trend
 Reveal oscillations
[Figure: irregular data series with its fitted trend curve]
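As a minimal illustration of this detrending workflow (a Python/NumPy sketch on synthetic data, not from the slides): fit a linear trend by least squares, subtract it, and keep the residual oscillations.

```python
import numpy as np

# Synthetic, hypothetical series: linear trend + oscillation + noise
t = np.arange(200)
rng = np.random.default_rng(0)
series = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 25) + rng.normal(0, 0.3, t.size)

w1, w0 = np.polyfit(t, series, deg=1)   # least-squares linear trend (slope, intercept)
trend = w0 + w1 * t
detrended = series - trend              # residual reveals the oscillations
```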
9
Linear Regression - Single Predictor
 Model is linear
y = w0 + w1 x
where w0 (y-intercept) and w1
(slope) are regression
coefficients
 Method of least squares (y: response variable, x: predictor variable):
w1 = Σi=1..|D| (xi − x̄)(yi − ȳ) / Σi=1..|D| (xi − x̄)²
w0 = ȳ − w1 x̄
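A minimal Python/NumPy sketch of the closed-form least-squares fit above (the height/weight numbers are made up purely for illustration):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least squares for y = w0 + w1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical height (cm) vs. weight (kg) values, for illustration only
height = [150, 160, 165, 170, 180, 185]
weight = [52, 58, 62, 66, 75, 80]
w0, w1 = simple_linear_regression(height, weight)
print(f"weight ~ {w0:.2f} + {w1:.2f} * height")
```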
10
Linear Regression – Multiple Predictors
 Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
 E.g., for 2-D data: y = w0 + w1 x1 + w2 x2
 Solvable by
 Extension of the least-squares method (normal equations): (XᵀX)W = XᵀY → W = (XᵀX)⁻¹XᵀY
 Commercial software (SAS, S-Plus)
[Figure: regression plane fitted to points in (x1, x2, y) space]
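A sketch of the multiple-predictor case via the normal equations, assuming NumPy and synthetic data (neither appears in the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.5, 50)   # synthetic response

# Augment with a column of ones so w0 is the intercept: y ~ X W
X = np.column_stack([np.ones_like(x1), x1, x2])
W = np.linalg.solve(X.T @ X, X.T @ y)    # W = (X^T X)^{-1} X^T y
# np.linalg.lstsq(X, y, rcond=None) is the numerically safer equivalent.
print("w0, w1, w2 =", W)
```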
11
Nonlinear Regression with Linear Method
 Polynomial regression model
 E.g., y = w0 + w1 x + w2 x² + w3 x³
Let x2 = x², x3 = x³; the model becomes y = w0 + w1 x + w2 x2 + w3 x3, which is linear in (x, x2, x3)
 Log-linear regression model
 E.g., y = exp(w0 + w1 x + w2 x2 + w3 x3)
Let y’ = log(y); then y’ = w0 + w1 x + w2 x2 + w3 x3
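The same variable-substitution idea as a hedged NumPy sketch on synthetic data: build the transformed columns and fit an ordinary linear model.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.1, 3, 60)
y = 1.0 + 0.5 * x - 0.7 * x**2 + 0.2 * x**3 + rng.normal(0, 0.05, x.size)  # synthetic

# Polynomial regression by substitution x2 = x^2, x3 = x^3, then a linear fit
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print("w0..w3 =", w)

# Log-linear model y = exp(w0 + w1 x + ...): take y' = log(y) (requires y > 0)
# and fit exactly the same linear model to y'.
```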
12
Generalized Linear Regression
 Response y
 Distribution function in the exponential family
 Variance of y depends on E(y); it is not required to be constant
 E(y) = g⁻¹(w0 + w1 x + w2 x2 + w3 x3), where g is the link function
 Examples
 Logistic regression (binomial regression): probability of
some event occurring
 Poisson regression: number of customers
 …
 References: Nelder and Wedderburn, 1972; McCullagh and
Nelder, 1989
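A brief sketch of the two examples using scikit-learn (an assumed library choice, not mentioned in the slides); the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                                   # two predictors

# Logistic (binomial) regression: probability of some event occurring
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y_event = rng.binomial(1, p)
logit = LogisticRegression().fit(X, y_event)

# Poisson regression: a count response, e.g. number of customers
mu = np.exp(0.3 + 0.6 * X[:, 0])
y_count = rng.poisson(mu)
pois = PoissonRegressor().fit(X, y_count)
print(logit.coef_, pois.coef_)
```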
13
Regression Tree (Breiman et al., 1984)
Figure source: http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
 Partition the domain space
 Leaf: a continuous-valued prediction, namely the average value of the training tuples that reach it
14
Model Tree (Quinlan, 1992)
 Leaf – a linear equation
 More general than regression tree
Figure source: http://datamining.ihe.nl/research/model-trees.htm
15
Regression Trees and Model Trees
 Regression tree: proposed in CART system (Breiman et al. 1984)
 CART: Classification And Regression Trees
 Each leaf stores a continuous-valued prediction
 It is the average value of the predicted attribute for the training tuples
that reach the leaf
 Model tree: proposed by Quinlan (1992)
 Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
 A more general case than regression tree
 Regression and model trees tend to be more accurate than linear
regression when the data cannot be represented well by a simple
linear model
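A small regression-tree sketch with scikit-learn's CART-style DecisionTreeRegressor (library choice assumed; scikit-learn has no built-in model tree, so only the regression tree is shown):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # CART-style regression tree

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)   # synthetic target

# Each leaf predicts the mean target value of the training tuples reaching it;
# a model tree would instead fit a linear equation per leaf.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.0, 5.0]]))
```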
16
Predictive Modeling in
Multidimensional Databases
 Predictive modeling: Predict data values or construct
generalized linear models based on the database data
 One can only predict value ranges or category
distributions
 Method outline
 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
 Determine the major factors which influence the prediction
 Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgment, etc.
 Multi-level prediction: drill-down and roll-up analysis
18
Prediction: Numerical Data
19
References
 Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models.
Journal of the Royal Statistical Society A, 135, 370-384.
 C. Chatfield. The Analysis of Time Series: An Introduction, 3rd ed. Chapman &
Hall, 1984.
 McCullagh, P. and Nelder, J.A. (1989). Generalized linear models, 2nd ed.
Chapman and Hall, London.
 Breiman L, Friedman JH, Olshen RA, Stone CJ. (1984). Classification and
Regression Trees. Chapman &Hall (Wadsworth, Inc.): New York.
 Quinlan, J. R. (1992). Learning with continuous classes. In: Adams and Sterling (Eds.),
Proceedings of AI'92, World Scientific, Singapore, pp. 343-348.
 Acknowledgment
 This presentation integrates Xiaopeng Li’s slides in his CS 512 class
presentation
20
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
21
 A time series can be illustrated as a time-series graph
which describes a point moving with the passage of time
22
Categories of Time-Series Movements
 Categories of Time-Series Movements
 Long-term or trend movements (trend curve): general direction in
which a time series is moving over a long interval of time
 Cyclic movements or cycle variations: long term oscillations about
a trend line or curve
 e.g., business cycles, may or may not be periodic
 Seasonal movements or seasonal variations
 i.e., almost identical patterns that a time series appears to
follow during corresponding months of successive years
 Irregular or random movements
 Time series analysis: decomposition of a time series into these four
basic movements
 Additive model: TS = T + C + S + I
 Multiplicative model: TS = T × C × S × I
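A decomposition sketch using statsmodels (an assumed library, not named in the slides). Note that classical decomposition returns trend, seasonal and residual parts; the cyclic component is not separated out and is absorbed by the trend/residual.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(5)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")         # synthetic monthly data
ts = pd.Series(100 + 0.5 * np.arange(48)
               + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
               + rng.normal(0, 2, 48), index=idx)

additive = seasonal_decompose(ts, model="additive", period=12)              # TS = T + S + I
multiplicative = seasonal_decompose(ts, model="multiplicative", period=12)  # TS = T * S * I
print(additive.seasonal.head(12))
```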
23
Estimation of Trend Curve
 The freehand method
 Fit the curve by looking at the graph
 Costly and barely reliable for large-scale data mining
 The least-square method
 Find the curve minimizing the sum of the squares of
the deviation of points on the curve from the
corresponding data points
 The moving-average method
24
Moving Average
 Moving average of order n
 Smoothes the data
 Eliminates cyclic, seasonal and irregular movements
 Loses the data at the beginning or end of a series
 Sensitive to outliers (can be reduced by weighted
moving average)
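A minimal NumPy sketch of simple and weighted moving averages (the order and weights are illustrative choices):

```python
import numpy as np

def moving_average(x, n):
    """Simple moving average of order n (loses n-1 points at the ends)."""
    return np.convolve(x, np.ones(n) / n, mode="valid")

def weighted_moving_average(x, weights):
    """Weighted moving average; heavier central weights reduce outlier impact."""
    w = np.asarray(weights, float)
    return np.convolve(x, w / w.sum(), mode="valid")

x = np.array([3, 7, 2, 0, 4, 5, 9, 7, 2, 1], dtype=float)
print(moving_average(x, 3))
print(weighted_moving_average(x, [1, 4, 1]))   # symmetric weights, so kernel flipping in convolve is harmless
```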
25
Trend Discovery in Time-Series (1):
Estimation of Seasonal Variations
 Seasonal index
 Set of numbers showing the relative values of a variable during
the months of the year
 E.g., if the sales during October, November, and December are
80%, 120%, and 140% of the average monthly sales for the
whole year, respectively, then 80, 120, and 140 are seasonal
index numbers for these months
 Deseasonalized data
 Data adjusted for seasonal variations for better trend and cyclic
analysis
 Divide the original monthly data by the seasonal index numbers
for the corresponding months
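A sketch of the seasonal-index calculation and deseasonalization described above (NumPy, synthetic monthly data):

```python
import numpy as np

rng = np.random.default_rng(6)
base = 100 + 10 * np.sin(2 * np.pi * np.arange(12) / 12)
sales = np.concatenate([base + rng.normal(0, 3, 12) for _ in range(3)])  # 3 synthetic years

# Seasonal index: each month's mean as a percentage of the overall monthly mean
monthly = sales.reshape(-1, 12)
seasonal_index = 100 * monthly.mean(axis=0) / monthly.mean()

# Deseasonalize: divide each observation by its month's index (as a fraction)
deseasonalized = sales / np.tile(seasonal_index / 100, monthly.shape[0])
print(np.round(seasonal_index, 1))
```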
26
Seasonal Index
[Figure: monthly seasonal index, months 1–12; raw data from
http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf]
27
Trend Discovery in Time-Series (2)
 Estimation of cyclic variations
 If (approximate) periodicity of cycles occurs, cyclic
index can be constructed in much the same manner as
seasonal indexes
 Estimation of irregular variations
 By adjusting the data for trend, seasonal and cyclic
variations
 With the systematic analysis of the trend, cyclic, seasonal,
and irregular components, it is possible to make long- or
short-term predictions with reasonable quality
28
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
29
Similarity Search in Time-Series Analysis
 Normal database query finds exact match
 Similarity search finds data sequences that differ only
slightly from the given query sequence
 Two categories of similarity queries
 Whole matching: find a sequence that is similar to the
query sequence
 Subsequence matching: find all pairs of similar
sequences
 Typical Applications
 Financial market
 Market basket data analysis
 Scientific databases
 Medical diagnosis
30
Data Transformation
 Many techniques for signal analysis require the data to
be in the frequency domain
 Usually data-independent transformations are used
 The transformation matrix is determined a priori
 discrete Fourier transform (DFT)
 discrete wavelet transform (DWT)
 The Euclidean distance between two signals in the time domain is the same as
their Euclidean distance in the frequency domain (for orthonormal transforms
such as the DFT)
31
Discrete Fourier Transform
 DFT does a good job of concentrating energy in the first
few coefficients
 If we keep only the first few DFT coefficients, we can compute a lower bound
on the actual distance
 Feature extraction: keep the first few coefficients (F-index)
as representative of the sequence
32
DFT (continued)
 Parseval’s Theorem
 The Euclidean distance between two signals in the time
domain is the same as their distance in the frequency
domain
 Keeping only the first few (say, 3) coefficients underestimates the distance, so
there will be no false dismissals:
Σt=0..n−1 |xt|² = Σf=0..n−1 |Xf|² (Parseval)
Σf=0..k−1 |F(S)[f] − F(Q)[f]|² ≤ Σt=0..n−1 |S[t] − Q[t]|², when only the first k coefficients are kept
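A quick numerical check of this lower-bounding property with NumPy's orthonormal FFT (synthetic sequences; keeping 3 coefficients is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
S = rng.normal(size=128)
Q = S + rng.normal(0, 0.2, 128)                      # a similar synthetic sequence

# Orthonormal DFT so Parseval holds: ||s||^2 == ||F(s)||^2
FS, FQ = np.fft.fft(S, norm="ortho"), np.fft.fft(Q, norm="ortho")
true_dist = np.linalg.norm(S - Q)
freq_dist = np.linalg.norm(FS - FQ)                  # equals true_dist
lower_bound = np.linalg.norm((FS - FQ)[:3])          # keep only the first 3 coefficients
assert lower_bound <= true_dist + 1e-9               # hence no false dismissals
print(true_dist, freq_dist, lower_bound)
```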
33
Multidimensional Indexing in Time-Series
 Multidimensional index construction
 Constructed for efficient accessing using the first few
Fourier coefficients
 Similarity search
 Use the index to retrieve the sequences that are at
most a certain small distance away from the query
sequence
 Perform post-processing by computing the actual
distance between sequences in the time domain and
discard any false matches
34
Subsequence Matching
 Break each sequence into pieces using a sliding window of length w
 Extract the features of the
subsequence inside the window
 Map each sequence to a “trail” in
the feature space
 Divide the trail of each sequence
into “subtrails” and represent each
of them with minimum bounding
rectangle
 Use a multi-piece assembly
algorithm to search for longer
sequence matches
35
Analysis of Similar Time Series
36
Enhanced Similarity Search Methods
 Allow for gaps within a sequence or differences in offsets
or amplitudes
 Normalize sequences with amplitude scaling and offset
translation
 Two subsequences are considered similar if one lies within
an envelope of ε width around the other, ignoring outliers
 Two sequences are said to be similar if they have enough
non-overlapping time-ordered pairs of similar
subsequences
 Parameters specified by a user or expert: sliding window
size, width of an envelope for similarity, maximum gap,
and matching fraction
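Offset translation plus amplitude scaling is essentially z-normalization; a minimal NumPy sketch:

```python
import numpy as np

def normalize(seq):
    """Offset translation (subtract mean) + amplitude scaling (divide by std)."""
    seq = np.asarray(seq, float)
    return (seq - seq.mean()) / seq.std()

# Two sequences with different offsets and amplitudes become directly comparable
a = np.array([10, 12, 11, 13, 12, 14], float)
b = 5 * a + 100                                   # same shape, scaled and shifted
print(np.allclose(normalize(a), normalize(b)))    # True
```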
37
Steps for Performing a Similarity Search
 Atomic matching
 Find all pairs of gap-free windows of a small length that
are similar
 Window stitching
 Stitch similar windows to form pairs of large similar
subsequences allowing gaps between atomic matches
 Subsequence Ordering
 Linearly order the subsequence matches to determine
whether enough similar pieces exist
38
Similar Time Series Analysis
VanEck International Fund Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups
39
Query Languages for Time Sequences
 Time-sequence query language
 Should be able to specify sophisticated queries like
Find all of the sequences that are similar to some sequence in class
A, but not similar to any sequence in class B
 Should be able to support various kinds of queries: range queries,
all-pair queries, and nearest neighbor queries
 Shape definition language
 Allows users to define and query the overall shape of time
sequences
 Uses human readable series of sequence transitions or macros
 Ignores the specific details
 E.g., the pattern up, Up, UP can be used to describe
increasing degrees of rising slopes
 Macros: spike, valley, etc.
40
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
41
Sequence Distance
 A function that measures the dissimilarity of two
sequences (of possibly unequal length)
 Example: Euclidean Distance between TS Q,C
D(Q, C) = sqrt( Σi=1..n (qi − ci)² )
42
Motif: Basic Concepts
 What is a motif? A previously unknown, frequently
occurring sequential pattern
 Match: Given subsequences Q, C ⊆ T and a range R, C is a match for Q iff D(Q, C) ≤ R
 Non-Trivial Match: Let C = T[p..*] and Q = T[q..*] with C a match for Q. The match is
trivial if p = q or if there is no non-matching subsequence N = T[s..*] with s between
p and q; otherwise it is non-trivial
(i.e., C and Q must be separated by a non-match)
 1-Motif: the subsequence with the most non-trivial matches (least variance decides ties)
 k-Motif: Ck such that D(Ck, Ci) > 2R ∀i ∈ [1, k)
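To make the definitions concrete, here is a naive quadratic-time 1-motif search that follows them directly (a Python/NumPy sketch, not the EMMA algorithm presented later):

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def count_nontrivial_matches(T, q, n, R):
    """Count non-trivial matches of Q = T[q:q+n] among all length-n subsequences."""
    Q = T[q:q + n]
    matches = [euclidean(Q, T[p:p + n]) <= R for p in range(len(T) - n + 1)]
    count = 0
    for p, is_match in enumerate(matches):
        if not is_match or p == q:
            continue
        lo, hi = sorted((p, q))
        # non-trivial iff some subsequence strictly between p and q does NOT match Q
        if any(not matches[s] for s in range(lo + 1, hi)):
            count += 1
    return count

def one_motif_brute_force(T, n, R):
    counts = [count_nontrivial_matches(T, q, n, R) for q in range(len(T) - n + 1)]
    best = int(np.argmax(counts))
    return best, counts[best]

T = np.sin(np.linspace(0, 8 * np.pi, 200)) + np.random.default_rng(8).normal(0, 0.1, 200)
print(one_motif_brute_force(T, n=25, R=1.0))      # (best start index, match count)
```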
43
SAX: Symbolic Aggregate approXimation
Dim. Reduction/Compression
 “Symbolic Aggregate approXimation”: SAX maps a real-valued time series to a word
over a finite alphabet Σ, e.g., SAX(series) = ccbaabbbabcbcb
 Essentially an alphabet over the Piecewise Aggregate
Approximation (PAA) rank
 Faster, simpler, more compression, yet on par with DFT,
DWT and other dim. reductions
44
SAX Illustration
45
SAX Algorithm
Parameters: alphabet size, word (segment) length (or output
rate)
1. Select probability distribution for TS
2. z-Normalize TS
3. PAA: Within each time interval, calculate aggregated value
(mean) of the segment
4. Partition TS range by equal-area partitioning the PDF into
n partitions (eq. freq. binning)
 Label each segment with the symbol a_rank ∈ Σ corresponding to the partition rank
of its aggregate value
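A compact SAX sketch following these steps (assumes NumPy and SciPy for the Gaussian breakpoints; parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def sax(ts, n_segments, alphabet_size):
    """z-normalize, PAA, then label each segment by its equal-area Gaussian bin."""
    ts = np.asarray(ts, float)
    z = (ts - ts.mean()) / ts.std()                               # z-normalize
    usable = len(z) // n_segments * n_segments
    paa = z[:usable].reshape(n_segments, -1).mean(axis=1)         # segment means
    # Equal-area partition of the standard normal PDF into alphabet_size bins
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    ranks = np.searchsorted(breakpoints, paa)                     # bin rank per segment
    return "".join(chr(ord("a") + r) for r in ranks)

series = np.sin(np.linspace(0, 4 * np.pi, 128)) + np.random.default_rng(9).normal(0, 0.1, 128)
print(sax(series, n_segments=14, alphabet_size=3))                # e.g. a word like 'bcba...'
```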
46
Finding Motifs in a Time Series
EMMA Algorithm: Finds 1-(k-)motif of fixed length n
 SAX Compression (Dim. Reduction)
 Possible to store D(i,j) ∀(i,j) ∈ ∑∑
 Allows use of various distance measures (Minkowski, Dynamic Time
Warping)
 Multiple Tiers
 Tier 1: Uses a sliding window to hash length-w SAX subsequences (a^w addresses,
total size O(m)). The bucket B with the most collisions and the buckets with
MINDIST(B) < R form the neighborhood of B.
 Tier 2: The neighborhood is pruned using the more precise ADM algorithm. The
neighborhood N_i with the maximum matches is the 1-motif. Early stop if
|ADM matches| > max_{k>i}(|neighborhood_k|)
47
Hashing
[Figure: length-w SAX words (e.g., cecabbcbac, ccccbbccdc) from a sliding window hashed
into buckets, with per-bucket collision counts]
Classification in Time Series
 Application: Finance, Medicine
 1-Nearest Neighbor
 Pros: accurate, robust, simple
 Cons: time and space complexity (lazy learning);
results are not interpretable
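A bare-bones 1-NN time-series classifier sketch (Euclidean distance on equal-length, synthetic series; an elastic measure such as DTW could be substituted):

```python
import numpy as np

def one_nn_classify(query, train_series, train_labels):
    """Return the label of the nearest training series under Euclidean distance."""
    dists = [np.linalg.norm(query - s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

rng = np.random.default_rng(10)
t = np.linspace(0, 2 * np.pi, 100)
train_series = [np.sin(t) + rng.normal(0, 0.1, 100) for _ in range(5)] + \
               [np.sign(np.sin(t)) + rng.normal(0, 0.1, 100) for _ in range(5)]
train_labels = ["sine"] * 5 + ["square"] * 5       # hypothetical two-class problem
print(one_nn_classify(np.sin(t), train_series, train_labels))
```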
[Figure: shapelet-based classification of leaf contours: a shapelet dictionary and a
one-node decision tree separate stinging nettles from false nettles]
Testing the utility of a candidate shapelet
 Arrange the time series objects based on their distance from the candidate
 Find the optimal split point (maximal information gain)
 Pick the candidate achieving the best utility as the shapelet (see the sketch below)
[Figure: time series objects ordered by distance to the candidate; the split point
with maximal information gain is marked]
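A sketch of the candidate-utility test described above, assuming the usual shapelet distance (minimum Euclidean distance over all same-length subsequences) and an information-gain split; this is an illustration, not the full shapelet search of Ye and Keogh.

```python
import numpy as np

def dist_to_candidate(series, candidate):
    """Minimum Euclidean distance from the candidate to any same-length subsequence."""
    m = len(candidate)
    return min(np.linalg.norm(series[i:i + m] - candidate)
               for i in range(len(series) - m + 1))

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(candidate, dataset, labels):
    """Order objects by distance to the candidate; return the split of maximal gain."""
    labels = np.asarray(labels)
    d = np.array([dist_to_candidate(s, candidate) for s in dataset])
    order = np.argsort(d)
    d, labels = d[order], labels[order]
    base = entropy(labels)
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(d)):
        gain = base - (i * entropy(labels[:i]) + (len(d) - i) * entropy(labels[i:])) / len(d)
        if gain > best_gain:
            best_gain, best_threshold = gain, (d[i - 1] + d[i]) / 2
    return best_gain, best_threshold
```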
Classification
[Figure: the learned shapelet dictionary and decision tree classify leaf series as
false nettles or stinging nettles]
52
Mining Time-Series Data
Regression Analysis
Trend Analysis
Similarity Search in Time Series Data
Motif-Based Search and Mining in Time
Series Data
Summary
53
Summary
 Time series analysis is an important research field
in data mining
 Regression Analysis
 Trend Analysis
 Similarity Search in Time Series Data
 Motif-Based Search and Mining in Time Series
Data
54
References on Time-Series Similarity Search
 R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. FODO’93
(Foundations of Data Organization and Algorithms).
 R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise,
scaling, and translation in time-series databases. VLDB'95.
 R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95.
 C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series
databases. SIGMOD'94.
 J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A Symbolic Representation of Time Series, with Implications
for Streaming Algorithms”, Data Mining and Knowledge discovery, 2003
 P. Patel, E. Keogh, J. Lin, and S. Lonardi, “Mining Motifs in Massive Time Series Databases”, ICDM’02
 D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97.
 Y. Moon, K. Whang, W. Loh. Duality Based Subsequence Matching in Time-Series Databases, ICDE’02
 B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time
warping. ICDE'98.
 B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for
co-evolving time sequences. ICDE'00.
 D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies,
SPRINGER, 2004
 L. Ye and E. Keogh, “Time Series Shapelets: A New Primitive for Data Mining”, KDD’09