Model-Based Similarity Measure in TimeCloud


The presentation will be delivered by Thanh-Nguyen Ngo at the 14th Asia-Pacific Web Conference (APWeb) on April 12th, 2012 in Kunming, China. Publication: http://bit.ly/yD18Vj Abstract: This paper presents a new approach to measuring similarity over massive time-series data. Our approach is built on two principles: one is to parallelize the large amount of computation using a scalable cloud serving system called TimeCloud. The other is to benefit from the filter-and-refinement approach for query processing, such that similarity computation is efficiently performed over approximated data at the filter step, and the following refinement step measures precise similarities for only the small number of candidates resulting from the filtering. To this end, we establish a set of firm theoretical foundations, as well as techniques for processing kNN queries. Our experimental results suggest that the proposed approach is efficient and scalable.


Model-Based Similarity
Measure in TimeCloud

Thanh-Nguyen Ngo
Hoyoung Jeung
Karl Aberer

LSIR – IC – EPFL

February 2012

Outline

Motivation
Model-Based Time-Series
Model-Based Similarity Measure
kNN Processing
Experiments
Conclusion

Motivation

The demand for storing and processing massive time-series in the cloud is growing rapidly
Measuring similarity is a fundamental operation in a wide range of applications that process temporally ordered data
Finding similar time-series in a large volume of data remains a difficult problem

Model-Based Time-Series

Definition (Time-Series)
A time-series $t$ of length $n$ is a temporally ordered sequence
$t = [t_1, \ldots, t_n]$, where each point in time $i$ is mapped to a $d$-dimensional
attribute vector $t_i = (t_{i1}, \ldots, t_{id})$ of values $t_{ij}$ with $j \in \{1, \ldots, d\}$.
A time-series is called univariate for $d = 1$ and multivariate for
$d > 1$.

Model-Based Time-Series

Definition (Common Points)
Two points of two time-series are called common if they occur at
the same time.

Definition (Common Interval)
The common interval of two segments or two time-series is the
greatest interval [a, b] such that the times a and b belong to both
segments or time-series. The two segments limited by the common
interval are called common segments.
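
The two definitions above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes (hypothetically) that a univariate time-series is stored as a dict from integer timestamp to value, and that the common segments keep only the common points inside the common interval. The names `Series`, `common_interval`, and `common_segments` are made up for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: timestamp -> value (univariate series).
Series = Dict[int, float]


def common_interval(s: Series, t: Series) -> Optional[Tuple[int, int]]:
    """Greatest interval [a, b] whose endpoints occur in both series, or None."""
    shared = sorted(set(s) & set(t))  # timestamps of the common points
    if not shared:
        return None
    return shared[0], shared[-1]


def common_segments(s: Series, t: Series) -> Tuple[List[float], List[float]]:
    """Values of both series at their common points, in temporal order.

    Assumption of this sketch: points that exist in only one series are dropped,
    so both returned segments have the same length n.
    """
    shared = sorted(set(s) & set(t))
    return [s[x] for x in shared], [t[x] for x in shared]
```

Restricting both series to their common points gives two equal-length segments, which is the form the distance definition on the next slide expects.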

Model-Based Similarity Measure

Definition (Euclidean Distance)
The Euclidean distance between two time-series is the
Euclidean distance of their common segments $s = [s_1, \ldots, s_n]$ and
$t = [t_1, \ldots, t_n]$ of length $n$, defined as:

$$\mathrm{Eucl}(s, t) = \sqrt{\sum_{i=1}^{n} (s_i - t_i)^2}$$
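
As a small illustration of this definition, the following Python sketch computes the distance for two univariate common segments given as equal-length sequences. The function name `eucl` mirrors the slide's notation, and the square root follows the usual definition of Euclidean distance; the length check is an addition of this sketch.

```python
import math
from typing import Sequence


def eucl(s: Sequence[float], t: Sequence[float]) -> float:
    """Euclidean distance between two common segments of equal length n."""
    if len(s) != len(t):
        raise ValueError("common segments must have the same length")
    # Sum of squared point-wise differences, then the square root.
    return math.sqrt(sum((si - ti) ** 2 for si, ti in zip(s, t)))
```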

Model-Based Similarity Measure

Definition (Maximum Error Bound of Time-Series)
Given a time-series $t = [t_1, \ldots, t_n]$ and its representation
$t' = [t'_1, \ldots, t'_n]$ in its model, the maximum error bound of $t$ over
its model is a value $\mathrm{meb}(t)$ such that:

$$|t_i - t'_i| \le \mathrm{meb}(t), \quad \forall i = 1, \ldots, n$$
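
To make the definition concrete, here is a hedged Python sketch. Any value at least as large as the largest point-wise deviation between the raw series and its model representation satisfies the definition, so the smallest valid bound is that maximum. The helper names `tightest_meb` and `is_valid_meb` are hypothetical and do not come from the slides.

```python
from typing import Sequence


def tightest_meb(raw: Sequence[float], model: Sequence[float]) -> float:
    """Smallest value that satisfies the maximum-error-bound definition."""
    if len(raw) != len(model):
        raise ValueError("raw series and model representation must align")
    # Largest absolute deviation between a raw point and its modelled value.
    return max((abs(r - m) for r, m in zip(raw, model)), default=0.0)


def is_valid_meb(raw: Sequence[float], model: Sequence[float], bound: float) -> bool:
    """True if `bound` dominates every point-wise deviation."""
    return tightest_meb(raw, model) <= bound
```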

Model-Based Similarity Measure

Theorem
Given two time-series $s$, $t$ and their representations $s'$, $t'$ in their
models, assume the common segments of $s$ and $t$ have $n$ time-series
points. Then,

$$|\mathrm{Eucl}(s, t) - \mathrm{Eucl}(s', t')| \le \sqrt{n}\,(\mathrm{meb}(s) + \mathrm{meb}(t))$$
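
One way to see why a bound of this form holds is the triangle inequality for the Euclidean norm. The derivation below is a sketch under the notation assumed here ($s'$, $t'$ for the model representations, all segments of length $n$); it is not taken from the slides and may differ from the proof in the paper.

```latex
\begin{align*}
\bigl|\mathrm{Eucl}(s,t) - \mathrm{Eucl}(s',t')\bigr|
  &= \bigl|\,\lVert s - t\rVert_2 - \lVert s' - t'\rVert_2\,\bigr| \\
  &\le \lVert (s - s') - (t - t') \rVert_2
     && \text{(reverse triangle inequality)} \\
  &\le \lVert s - s' \rVert_2 + \lVert t - t' \rVert_2
     && \text{(triangle inequality)} \\
  &\le \sqrt{n \cdot \mathrm{meb}(s)^2} + \sqrt{n \cdot \mathrm{meb}(t)^2}
     && \text{(each } |s_i - s'_i| \le \mathrm{meb}(s)\text{, likewise for } t\text{)} \\
  &= \sqrt{n}\,\bigl(\mathrm{meb}(s) + \mathrm{meb}(t)\bigr).
\end{align*}
```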

kNN Processing - The Filter Stage

Theorem
Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models,
respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with
maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss
of generality, assume $b_1 \le \ldots \le b_n$. The candidate set
$S = \{t_i \mid a_i \le b_k\}$ contains the $k$ nearest time-series of $q$ and is
minimal.
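
The filter rule can be sketched as follows. The function takes the approximate distances and their maximum errors as inputs and returns the indices of the candidate set; how each error is obtained (for example from the bound on the previous slide) is left to the caller. The name `filter_candidates` and the list-based interface are assumptions of this sketch, not the TimeCloud API.

```python
from typing import List, Sequence


def filter_candidates(d: Sequence[float], e: Sequence[float], k: int) -> List[int]:
    """Indices i with a_i <= b_k, where b_k is the k-th smallest upper bound."""
    if len(d) != len(e):
        raise ValueError("need one maximum error per distance")
    if not 1 <= k <= len(d):
        raise ValueError("k must be between 1 and the number of series")
    lower = [di - ei for di, ei in zip(d, e)]  # a_i = d_i - e_i
    upper = [di + ei for di, ei in zip(d, e)]  # b_i = d_i + e_i
    b_k = sorted(upper)[k - 1]                 # k-th smallest upper bound
    # A series whose lower bound exceeds b_k has at least k series that are
    # certainly closer to the query, so it cannot be one of the k nearest.
    return [i for i, a in enumerate(lower) if a <= b_k]
```

For instance, with distances `[1.0, 2.0, 5.0, 9.0]`, a uniform error of `0.5`, and `k = 2`, only the first two series survive the filter; widening the error of the third series to `3.0` pulls it into the candidate set as well.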

kNN Processing - The Refinement Stage

Theorem
Let $t'_i$ and $q'$ be the representations of $t_i$ and $q$ in their models,
respectively. Let $d_i$ be the distance between $t'_i$ and $q'$, with
maximum error $e_i$. Let $a_i = d_i - e_i$ and $b_i = d_i + e_i$. Without loss
of generality, assume $a_1 \le \ldots \le a_m$. The set
$R = \{t_i \mid b_i \le a_{m-k+1}\}$ is a subset of the result set.
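
The abstract describes the refinement step as measuring precise similarities only for the candidates that survive filtering. The sketch below shows that baseline step in Python. It deliberately does not use the shortcut set R from the theorem above: it computes the exact distance for every candidate, so it illustrates the idea rather than the authors' optimized procedure. The name `refine` and the in-memory list of raw series are assumptions of this sketch.

```python
import heapq
import math
from typing import List, Sequence


def refine(candidates: Sequence[int],
           raw_series: Sequence[Sequence[float]],
           raw_query: Sequence[float],
           k: int) -> List[int]:
    """Return the k candidate indices with the smallest exact distance to the query."""

    def exact_distance(i: int) -> float:
        # Exact Euclidean distance on the raw (non-approximated) data.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(raw_series[i], raw_query)))

    # Only the filtered candidates are touched; all other series are skipped.
    return heapq.nsmallest(k, candidates, key=exact_distance)
```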

Experiments

2.4GHz Intel Core2 Quad CPU
Java implementation, Ubuntu 10.10
Default parameters
length of time series: 512
number of nearest neighbors: 10
error ratio: 3%
number of time series: 1,000

Conclusion

Process kNN queries based on model-based similarity measures
Establish a set of theoretical foundations for approximated time-series data processing
Build query processing mechanisms on the filter-and-refine approach
Run more than three times faster than straightforward processing
Facilitate scalability of the computation using the TimeCloud system


Figure slides (images not recoverable from the transcript):
Slide 12: Model-Based View Construction
Slide 13: Effect of Maximum Error Ratios
Slide 14: Effect of Number of Nearest Neighbors
Slide 15: Effect of Number of Time Series
