The presentation will be delivered by Thanh-Nguyen Ngo at the 14th Asia-Pacific Web Conference (APWeb) on April 12th, 2012 in Kunming, China.
Publication: http://bit.ly/yD18Vj
Abstract:
This paper presents a new approach to measuring similarity over massive time-series data. Our approach is built on two principles. The first is to parallelize the large amount of computation using a scalable cloud serving system called TimeCloud. The second is to benefit from the filter-and-refinement approach for query processing: similarity computation is performed efficiently over approximated data at the filter step, and the subsequent refinement step measures precise similarities for only the small number of candidates that result from the filtering. To this end, we establish a set of firm theoretical foundations, as well as techniques for processing kNN queries. Our experimental results suggest that the proposed approach is efficient and scalable.
3. Motivation
The demand for storing and processing massive time-series in the cloud is growing rapidly
Measuring similarity is a fundamental operation in a wide range of applications that process temporally ordered data
Finding similar time-series in a large volume of data remains a difficult problem
4. Model-Based Time-Series
Definition (Time-Series)
A time-series $t$ of length $n$ is a temporally ordered sequence $t = [t_1, \ldots, t_n]$ where each point in time $i$ is mapped to a $d$-dimensional attribute vector $t_i = (t_{i1}, \ldots, t_{id})$ of values $t_{ij}$ with $j \in \{1, \ldots, d\}$.
A time-series is called univariate for $d = 1$ and multivariate for $d > 1$.
5. Model-Based Time-Series
Definition (Common Points)
Two points of two time-series are called common if they occur at
the same time.
Definition (Common Interval)
The common interval of two segments or two time-series is the
greatest interval [a, b] such that the times a and b belong to both
segments or both time-series. Two segments limited by the common
interval are called common segments.
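As a minimal sketch of the two definitions above, assume each time-series is stored as a Python dict mapping a timestamp to a value. This storage layout and the function name `common_segments` are illustrative assumptions, not taken from the paper. Common points are then the shared timestamps, and the common interval runs from the earliest to the latest of them.

```python
def common_segments(s, t):
    """Return ((a, b), seg_s, seg_t) for two series given as {timestamp: value}.

    (a, b) is the common interval; seg_s and seg_t are the parts of s and t
    that fall inside it. Returns None if the series share no timestamp.
    """
    common = sorted(set(s) & set(t))      # timestamps of the common points
    if not common:
        return None
    a, b = common[0], common[-1]          # greatest interval whose ends lie in both
    seg_s = [(ts, s[ts]) for ts in sorted(s) if a <= ts <= b]
    seg_t = [(ts, t[ts]) for ts in sorted(t) if a <= ts <= b]
    return (a, b), seg_s, seg_t


# Hypothetical example data: the series overlap on timestamps 3 and 4.
s = {1: 10.0, 2: 11.0, 3: 12.0, 4: 13.0}
t = {3: 7.0, 4: 8.0, 5: 9.0}
interval, seg_s, seg_t = common_segments(s, t)
```

If the two series are sampled on the same time grid, the two common segments have the same length, which is what the distance definition on the next slide relies on.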
6. Model-Based Similarity Measure
Definition (Euclidean Distance)
The Euclidean distance between two time-series is the
Euclidean distance between their common segments $s = [s_1, \ldots, s_n]$ and
$t = [t_1, \ldots, t_n]$ of length $n$, and it is defined as:
$$\mathrm{Eucl}(s, t) = \sqrt{\sum_{i=1}^{n} (s_i - t_i)^2}$$
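The definition can be sketched in a few lines of Python. This is an illustration only: it assumes univariate series (d = 1) already restricted to their common segments, and the usual square-root-of-summed-squares form of the Euclidean distance. The function name `eucl` simply mirrors the slide's notation.

```python
import math


def eucl(s, t):
    """Euclidean distance between two equal-length univariate segments."""
    if len(s) != len(t):
        raise ValueError("common segments must have the same length")
    return math.sqrt(sum((si - ti) ** 2 for si, ti in zip(s, t)))


distance = eucl([0.0, 0.0], [3.0, 4.0])   # a 3-4-5 right triangle
```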
7. Model-Based Similarity Measure
Definition (Maximum Error Bound of Time-Series)
Given a time-series $t = [t_1, \ldots, t_n]$ and its representation
$t' = [t'_1, \ldots, t'_n]$ in its model, the maximum error bound of $t$ over
its model is a value $\mathrm{meb}(t)$ such that:
$$|t_i - t'_i| \le \mathrm{meb}(t), \quad \forall i = 1, \ldots, n$$
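Any value that dominates every pointwise deviation satisfies this definition; the smallest such value is the largest absolute deviation between the series and its model. The sketch below computes that tightest bound for a univariate series. The piecewise-constant model values and the name `tightest_meb` are hypothetical, chosen only for illustration.

```python
def tightest_meb(t, t_model):
    """Smallest valid maximum error bound: max over i of |t_i - t_model_i|."""
    if len(t) != len(t_model):
        raise ValueError("series and representation must have the same length")
    return max(abs(x - y) for x, y in zip(t, t_model))


t = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]
t_model = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]   # hypothetical piecewise-constant model
meb_t = tightest_meb(t, t_model)           # about 0.2
```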
8. Model-Based Similarity Measure
Theorem
Given two time-series $s$, $t$ and their representations $s'$, $t'$ in their
models, assume the common segments of $s$ and $t$ have $n$ time-series
points. Then,
$$|\mathrm{Eucl}(s, t) - \mathrm{Eucl}(s', t')| \le \sqrt{n}\,(\mathrm{meb}(s) + \mathrm{meb}(t))$$
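The bound can be checked numerically. The sketch below is an assumption-laden illustration, not the authors' experiment: it draws two random univariate series of length 512 (the default length listed in the experiments), perturbs each point by at most a chosen error bound to stand in for a model representation, and tests the inequality. The error-bound values 0.3 and 0.1 and all function names are hypothetical.

```python
import math
import random


def eucl(s, t):
    """Euclidean distance between two equal-length univariate segments."""
    return math.sqrt(sum((si - ti) ** 2 for si, ti in zip(s, t)))


def bound_holds(s, t, s_model, t_model, meb_s, meb_t):
    """Check |Eucl(s,t) - Eucl(s',t')| <= sqrt(n) * (meb(s) + meb(t))."""
    n = len(s)
    gap = abs(eucl(s, t) - eucl(s_model, t_model))
    # Small tolerance guards against floating-point rounding only.
    return gap <= math.sqrt(n) * (meb_s + meb_t) + 1e-12


random.seed(0)
n, meb_s, meb_t = 512, 0.3, 0.1
s = [random.uniform(-1.0, 1.0) for _ in range(n)]
t = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Stand-ins for model representations: each point is off by at most the bound.
s_model = [x + random.uniform(-meb_s, meb_s) for x in s]
t_model = [x + random.uniform(-meb_t, meb_t) for x in t]
ok = bound_holds(s, t, s_model, t_model, meb_s, meb_t)
```

The inequality follows from the triangle inequality: the two distances differ by at most the norm of (s − s') plus the norm of (t − t'), and each of those is at most √n times the corresponding error bound.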
9. kNN Processing - The Filter Stage
Theorem
Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models,
respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with
maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss
of generality, assume $b_1 \le \ldots \le b_n$. The candidate set
$S = \{t_i \mid a_i \le b_k\}$ contains the $k$ nearest time-series of $q$ and is
minimal.
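The filter rule can be sketched as follows, assuming the approximate distances d_i and their maximum errors e_i have already been computed. The function name `filter_candidates` and the example numbers are hypothetical. A series whose lower bound a_i exceeds the k-th smallest upper bound b_k is farther from the query than at least k other series, so it can be dropped without computing its exact distance.

```python
def filter_candidates(d, e, k):
    """Return indices i with a_i <= b_k, where b_k is the k-th smallest upper bound.

    d[i] is the distance computed on the approximated data and e[i] its
    maximum error, so the true distance lies in [d[i] - e[i], d[i] + e[i]].
    """
    if not 1 <= k <= len(d):
        raise ValueError("k must be between 1 and the number of series")
    a = [di - ei for di, ei in zip(d, e)]   # lower bounds on the true distances
    b = [di + ei for di, ei in zip(d, e)]   # upper bounds on the true distances
    b_k = sorted(b)[k - 1]                  # k-th smallest upper bound
    return [i for i, ai in enumerate(a) if ai <= b_k]


d = [1.0, 2.0, 5.0, 9.0]
cands_tight = filter_candidates(d, [0.5] * 4, k=2)   # small errors -> [0, 1]
cands_loose = filter_candidates(d, [2.0] * 4, k=2)   # larger errors -> [0, 1, 2]
```

Larger error bounds widen the intervals, so more series survive the filter and must be examined in the refinement stage.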
10. kNN Processing - The Refinement Stage
Theorem
Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models,
respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with
maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss
of generality, assume $a_1 \le \ldots \le a_m$. The set
$R = \{t_i \mid b_i \le a_{m-k+1}\}$ is a subset of the result set.
11. Experiments
2.4GHz Intel Core2 Quad CPU
Java implementation, Ubuntu 10.10
Default parameters
length of time series: 512
number of nearest neighbors: 10
error ratio: 3%
number of time series: 1,000
16. Conclusion
Process kNN queries based on model-based similarity measures
Establish a set of theoretical foundations for approximated time-series data processing
Build query processing mechanisms on the filter-and-refine approach
Run more than three times faster than straightforward processing
Facilitate scalability of the computation using the TimeCloud system