Grouping techniques for facing
Volume and Velocity in Big Data
How to do it using HistDAWass package
for clustering Histogram-valued data
Antonio Irpino, PhD
University of Campania ”L. Vanvitelli”
Dept. of Mathematics and Physics
Caserta, Italy
antonio.irpino@unicampania.it
June 4th, 2018
Antonio Irpino, PhD, University of Campania "L. Vanvitelli", Dept. of Mathematics and Physics, Caserta, Italy, antonio.irpino@unicampania.it. Grouping techniques for facing Volume and Velocity in Big Data, June 4th, 2018.
1 A very short introduction on some aspects of Big Data
2 A very short intro to clustering
3 Hard-partitive algorithms
4 Hierarchical clustering
5 Other implemented methods
6 Open research issues and main references
A very short introduction on some aspects of Big Data
Some Big data properties
From Wikipedia:
“Big data is data sets that are so voluminous and complex that traditional data-processing
application software are inadequate to deal with them.”
Big data can be described by the following characteristics:
Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data at all.
Variety: The type and nature of the data. This helps the people who analyze it to use the resulting insight effectively. Big data draws from text, images, audio and video; it can also complete missing pieces through data fusion.
Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
Veracity: The quality of the captured data can vary greatly, affecting the accuracy of the analysis.
Facing Volume and Velocity
Example 1: a network of wireless sensors
collecting and sharing data.
Example 2: features extracted from an image database.
A suggestion for analysing big data
Mizuta (2016) suggests using Mini Data for the analysis of Big Data.
Mini Data
Mini data of big data are defined as a data set that contains the important information of the big data, but whose size and/or structure is realistic to deal with. For building mini data, several tools can be used: Sampling, Variable Selection, Dimension Reduction, Feature Extraction and . . .
Symbolization: Symbolic Data Analysis (SDA) was proposed. Symbolic data are described by interval-valued data, distribution-valued data, combinations of them, or other complex structured values. The target objects being analyzed are called concepts. Concepts are typical examples of mini data.
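As a hedged illustration (not from the talk), the symbolization step can be sketched as follows: the many raw measurements belonging to one concept are compressed into a single histogram-valued description. The function name and the binning choice here are illustrative assumptions.

```python
import numpy as np

def symbolize(raw_values, bins=8):
    # Compress a vector of raw measurements into a histogram description:
    # a list of (bin_left, bin_right, relative_frequency) triples.
    counts, edges = np.histogram(raw_values, bins=bins)
    freqs = counts / counts.sum()
    return [(edges[i], edges[i + 1], freqs[i]) for i in range(bins)]

rng = np.random.default_rng(0)
readings = rng.normal(loc=20.0, scale=3.0, size=10_000)  # one concept's raw data
hist = symbolize(readings)   # the histogram-valued "mini data" kept for analysis
```

However the binning is chosen, the histogram is orders of magnitude smaller than the raw data while preserving the shape of the distribution.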
A proposal for describing such new objects:
Symbolic Data Analysis and distributional data (Bock and Diday 2000)
The measurement taken on an object for a variable may have several values: namely, data are, or might be, multi-valued.
This is especially the case if an object is a higher-order statistical unit, namely, one that generalizes a set of individual measurements (a region, a city, a market segment, a typology, . . . ). But it is not only this!
Concurrent approaches
Functional data analysis (Data are functions!)
Compositional data analysis (Compositions obey the Aitchison geometry!)
Object oriented data analysis (Data live in particular spaces, which are not always
Euclidean!)
A very short intro to clustering
What is clustering?
A Clustering method is an exploratory tool that looks for groups in data!
Clustering is widely used (Hennig 2015) for
delimitation of species of plants or animals in biology,
medical classification of diseases,
discovery and segmentation of settlements and periods in archaeology,
image segmentation and object recognition,
social stratification,
market segmentation,
efficient organization of databases for search queries.
There are also quite general tasks for which clustering is applied in many subject areas:
exploratory data analysis looking for “interesting patterns” without prescribing any
specific interpretation, potentially creating new research questions and hypotheses,
information reduction and structuring of sets of entities from any subject area for
simplification, effective communication, or effective access/action such as complexity
reduction for further data analysis, or classification systems,
investigating the correspondence of a clustering in specific data with other groupings
or characteristics, either hypothesized or derived from other data
WOW! but. . . what is a cluster?
What are “true clusters”?
Hennig (2015) lists a set of ideal properties while doing (or validating) clustering:
1 Within-cluster dissimilarities should be small.
2 Between-cluster dissimilarities should be large.
3 Clusters should be fitted well by certain homogeneous probability models such as the
Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial
process models.
4 Members of a cluster should be well represented by its centroid.
5 The dissimilarity matrix of the data should be well represented by the clustering (i.e.,
by the ultrametric induced by a dendrogram, or by defining a binary metric “in same
cluster/in different clusters”).
6 Clusters should be stable.
7 Clusters should correspond to connected areas in data space with high density.
8 The areas in data space corresponding to clusters should have certain characteristics
(such as being convex or linear).
9 It should be possible to characterize the clusters using a small number of variables.
10 Clusters should correspond well to an externally given partition or values of one or
more variables that were not used for computing the clustering.
11 Features should be approximately independent within clusters.
12 All clusters should have roughly the same size.
13 The number of clusters should be low.
Types of clusterings
Considering the obtained partition:
1 Hard clustering (an object must belong to a single group)
2 Fuzzy or possibilistic clustering (an object belongs to a cluster according to a
membership degree)
Considering how data are aggregated
1 Partitive clustering
1 K-means, K medoids, Dynamic clustering
2 Density based clustering
3 Model based clustering (Latent class modeling: e.g. Gaussian Mixtures Models)
2 Hierarchical clustering
1 bottom-up (aggregating recursively objects)
2 top-down (dividing the whole set recursively)
Most algorithms are based on the choice of a
similarity/dissimilarity/distance measure between data
Distances for distributions
Abbreviation  Metric
D             Discrepancy
H             Hellinger distance
I             Relative entropy (or Kullback-Leibler divergence)
K             Kolmogorov (or Uniform) metric
L             Lévy metric
P             Prokhorov metric
S             Separation distance
W             Wasserstein (or Kantorovich) metric
χ2            χ2 distance
The squared L2 Wasserstein distance is:

d²_W(Yi, Yj) = ∫₀¹ [Qi(t) − Qj(t)]² dt

where Qi(t) is a quantile function (namely, the inverse of the cumulative distribution
function). It has some nice properties in clustering (R. Verde and Irpino 2007), and basic
statistics have been developed (Irpino and Verde 2015). These methods have been implemented
in R in a package called HistDAWass (Histogram Data Analysis with Wasserstein distance).
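As a sketch (assuming raw samples rather than the package's pre-binned histograms), the integral can be approximated by averaging squared differences of empirical quantiles over a grid of probability levels:

```python
import numpy as np

def wasserstein2_sq(sample_i, sample_j, n_levels=1000):
    # Squared L2 Wasserstein distance between two one-dimensional
    # distributions, approximated by discretizing the integral of the
    # squared quantile-function differences over (0, 1).
    t = (np.arange(n_levels) + 0.5) / n_levels   # midpoints of (0, 1)
    q_i = np.quantile(sample_i, t)
    q_j = np.quantile(sample_j, t)
    return float(np.mean((q_i - q_j) ** 2))

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 50_000)
b = rng.normal(3.0, 1.0, 50_000)
# for two Gaussians with equal scale the distance reduces to the squared
# mean shift, so the value should be close to (0 - 3)^2 = 9
d2 = wasserstein2_sq(a, b)
```

The same discretization over quantiles is essentially what histogram-valued encodings work with.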
Wasserstein distance: a nice property
Total distance = Position + Internal variability
Internal variability = Size + Shape

d²_W(Yi, Yj) = ∫₀¹ [Qi(t) − Qj(t)]² dt
             = (µi − µj)²                          (Position)
             + (σi − σj)²                          (Size)
             + 2σiσj [1 − CorrQQ(Yi, Yj)]          (Shape)
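This decomposition can be verified numerically, a sketch under the same quantile discretization as above (for the discretized quantile vectors the identity is exact up to floating-point error):

```python
import numpy as np

def decompose(sample_i, sample_j, n_levels=2000):
    # Return (total, position, size, shape) components of the discretized
    # L2 Wasserstein distance between two one-dimensional samples.
    t = (np.arange(n_levels) + 0.5) / n_levels
    qi, qj = np.quantile(sample_i, t), np.quantile(sample_j, t)
    total = float(np.mean((qi - qj) ** 2))
    position = (qi.mean() - qj.mean()) ** 2       # (mu_i - mu_j)^2
    si, sj = qi.std(), qj.std()
    size = (si - sj) ** 2                         # (sigma_i - sigma_j)^2
    corr_qq = np.corrcoef(qi, qj)[0, 1]           # quantile-quantile correlation
    shape = 2 * si * sj * (1 - corr_qq)
    return total, position, size, shape

rng = np.random.default_rng(2)
total, pos, size, shape = decompose(rng.normal(0, 1, 20_000),
                                    rng.exponential(2.0, 20_000))
# the three components add up to the total squared distance
```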
Hard-partitive algorithms
Dynamic clustering (a generalization of k-means algorithm)
The dynamic clustering algorithm: after initialization, a two-step algorithm looks for the
best partition into k classes and the best representation of clusters.
We assume that the prototype of cluster Ck (k = 1, . . . , K) is also represented by a
vector gk = (gk1, . . . , gkp), where each gkj is a histogram. DCA looks for the partition
P = (C1, . . . , CK) of E into K clusters and the corresponding set of K prototypes
G = (g1, . . . , gK) such that the following adequacy criterion of best fitting between the
clusters and their prototypes is locally minimized:

Δ(G, P) = Σ_{k=1..K} Σ_{i∈Ck} d²_W(yi, gk).   (1)
The Algorithm
The DCA algorithm
1 Initialize the algorithm
1 Set a number k of clusters
2 Set T = 0
3 Generate a random partition of the objects P(0)
4 Compute the criterion (the Within-cluster sum of Squares), CRIT(0)
2 Representation step
1 Set T = T + 1
2 Compute the prototypes of each cluster using P(T − 1)
3 Allocation step
1 Allocate objects to the nearest prototype obtaining the partition P(T)
2 Compute CRIT(T)
4 STOP CONDITION
If CRIT(T) < CRIT(T − 1) goto step 2, else return results.
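The steps above can be sketched in a few lines, under one simplifying assumption: each object is pre-encoded as a vector of quantiles, so that the squared L2 Wasserstein distance becomes a mean squared Euclidean distance and the prototype of a cluster is the component-wise mean of its members' quantile vectors (the function and variable names are illustrative, not the package's API):

```python
import numpy as np

def dca(Q, k, seed=0, max_iter=100):
    # Q: (n_objects, n_quantiles) matrix, one quantile vector per object.
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    labels = rng.integers(0, k, n)                    # random initial partition
    crit_prev = np.inf
    for _ in range(max_iter):
        # Representation step: prototype = mean quantile vector of the cluster
        protos = np.stack([Q[labels == c].mean(axis=0) if np.any(labels == c)
                           else Q[rng.integers(n)] for c in range(k)])
        # Allocation step: assign every object to its nearest prototype
        d2 = ((Q[:, None, :] - protos[None, :, :]) ** 2).mean(axis=2)
        labels = d2.argmin(axis=1)
        crit = float(d2[np.arange(n), labels].sum())  # within-cluster SS
        if crit >= crit_prev:                         # stop condition
            break
        crit_prev = crit
    return labels, protos, crit

# two well-separated groups of Gaussian-shaped quantile vectors
rng = np.random.default_rng(3)
z = np.linspace(-1.5, 1.5, 20)                        # fixed standardized shape
mus = np.concatenate([rng.normal(0.0, 0.2, 15), rng.normal(5.0, 0.2, 15)])
Q = np.stack([m + z for m in mus])
labels, protos, crit = dca(Q, k=2)
```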
The WH_kmeans function
The function uses L2 Wasserstein-based statistics
results = WH_kmeans(x,                   # A MatH object
                    k,                   # The number of required clusters
                    rep = 5,             # How many times it is initialized
                    simplify = FALSE,    # A flag for speeding up,
                                         # approximating data
                    qua = 10,            # If simplify=TRUE, how many quantiles
                                         # are used for approximating the
                                         # distributions
                    standardize = FALSE) # Do you need to standardize variables?
The output of WH_kmeans function
results A list. It contains the best solution among the repetitions, i.e. the one
having the minimum criterion.
results$IDX A vector. The clusters at which the objects are assigned.
results$cardinality A vector. The size of each final cluster.
results$centers A MatH object with the description of centers.
results$Crit A number. The criterion (Within-cluster Sum of squared distances from
the centers).
results$quality A number. The percentage of Total SS explained by the model. (The
higher the better)
Adaptive distances-based dynamic clustering (A. Irpino, Verde, and
De Carvalho 2014)
A system of weights is calculated for the variables, for their components, cluster-wise or
globally. The system of weights is useful if data are clustered into non-spherical classes.
We assume that the prototype of cluster Ck (k = 1, . . . , K) is also represented by a
vector gk = (gk1, . . . , gkp), where each gkj is a histogram. As in the standard adaptive
DCA, the proposed methods look for the partition P = (C1, . . . , CK) of E into K clusters,
the corresponding set of K prototypes G = (g1, . . . , gK), and a set of K different adaptive
distances d = (d1, . . . , dK) depending on a set Λ of positive weights associated with the
clusters, such that the following adequacy criterion of best fitting between the clusters
and their prototypes is locally minimized:

Δ(G, Λ, P) = Σ_{k=1..K} Σ_{i∈Ck} d(yi, gk | Λ).   (2)
The adaptive distances
One weight for each variable:

d(yi, gk | Λ) = Σ_{j=1..p} λj d²_W(yij, gkj)   (3)

Two weights for each variable (one for each component of the distance):

d(yi, gk | Λ) = Σ_{j=1..p} λj,ȳ (ȳij − ȳgkj)² + Σ_{j=1..p} λj,Disp d²_W(yᶜij, gᶜkj)   (4)

One weight for each variable and each cluster:

d(yi, gk | Λ) = Σ_{j=1..p} λᵏj d²_W(yij, gkj)   (5)

Two weights for each variable and each cluster:

d(yi, gk | Λ) = Σ_{j=1..p} λᵏj,ȳ (ȳij − ȳgkj)² + Σ_{j=1..p} λᵏj,Disp d²_W(yᶜij, gᶜkj)   (6)

where ȳij denotes the mean of the distribution yij and yᶜij the corresponding centered distribution.
Two possible functions for computing the weights and four possible
combinations of weights
The system of weights may be
Multiplicative: the product of weights is fixed (generally equal to one)
Additive: the sum of weights is fixed (generally equal to one)
Ways for assigning weights.
1 A weight for each variable
2 A weight for each variable and each cluster
3 A weight for each component of a distributional variable (we mean the position and
the variability component related to the decomposition of the L2 Wasserstein distance)
4 A weight for each component and each cluster
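For the multiplicative case, the standard product-constraint solution gives each variable a weight inversely proportional to its within-cluster dispersion, normalized so the weights multiply to one. A hedged sketch (the function name and the dispersion inputs are illustrative, not the package's internals):

```python
import numpy as np

def multiplicative_weights(dispersions):
    # One positive weight per variable, with the product constrained to 1:
    # w_j = (prod_h Delta_h)^(1/p) / Delta_j, so variables with smaller
    # within-cluster dispersion receive larger weights.
    d = np.asarray(dispersions, dtype=float)
    return np.prod(d) ** (1.0 / len(d)) / d

w = multiplicative_weights([4.0, 1.0])     # variable 2 is tighter -> heavier
# schema-1 style weighted distance, given per-variable squared
# Wasserstein distances between an object and a prototype:
d2_per_variable = np.array([0.5, 2.0])
weighted_distance = float(np.sum(w * d2_per_variable))
```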
The algorithm
The Adaptive DCA algorithm
1 Initialize the algorithm
1 Set T = 0, a number k of clusters, initialize weights W (0).
2 Generate a random partition of the objects P(0)
3 Compute the criterion (the Within-cluster sum of Squares), CRIT(0)
2 Representation step (Fix the Partition and the Weights)
1 Set T = T + 1. Compute the prototypes G(T) of each cluster using P(T − 1) and
W (T − 1).
3 Weighting step (Fix the Prototypes and the Partition)
1 Compute the weight system W (T) using G(T) and P(T − 1)
4 Allocation step (Fix the Weights and Prototypes)
1 Assign objects to the nearest prototype in G(T) using W (T), obtaining the partition
P(T)
2 Compute CRIT(T)
5 STOP CONDITION If CRIT(T) < CRIT(T − 1) goto step 2, else return results.
The WH_adaptive.kmeans function
results= WH_adaptive.kmeans(x, k,
schema = 1, init, rep,
simplify = FALSE, qua = 10, standardize = FALSE,
weight.sys = "PROD", theta = 2, init.weights = "EQUAL")
Parameter Description
x A MatH object (a matrix of distributionH).
k An integer, the number of groups.
schema a number from 1 to 4:
1 A weight for each variable (default)
2 A weight for the average and the dispersion component of each variable
3 Same as 1 but a different set of weights for each cluster
4 Same as 2 but a different set of weights for each cluster
init (optional, do not use) initialization for partitioning the data default is ’RPART’
rep An integer, maximum number of repetitions of the algorithm (default rep=5).
simplify A logic value (default is FALSE), if TRUE histograms are recomputed in order to speed-up
the algorithm.
qua An integer, if simplify=TRUE is the number of quantiles used for re-coding the histograms.
standardize A logic value (default is FALSE). If TRUE, histogram-valued data are standardized, variable
by variable, using the Wasserstein-based standard deviation.
weight.sys a string. Weights may add to one (’SUM’) or their product is equal to 1 (’PROD’, default).
theta a number. A parameter if weight.sys=’SUM’, default is 2.
init.weights a string how to initialize weights: ’EQUAL’ (default), all weights are the same, ’RANDOM’,
weights are initialized at random.
The output
Name description
results A list. Returns the best solution among the repetitions, i.e.
the one having the minimum sum of squares criterion.
results$IDX A vector. The final clusters labels of the objects.
results$cardinality A vector. The cardinality of each final cluster.
results$proto A MatH object with the description of centers.
results$weights A matrix of weights for each component of each variable
and each cluster.
results$Crit A number. The criterion (Weighted Within-cluster SS) value
at the end of the run.
results$TOTSSQ The total SSQ computed with the system of weights.
results$BSQ The Between-clusters SSQ computed with the system of
weights.
results$WSQ The Within-clusters SSQ computed with the system of
weights.
results$quality A number. The proportion of TSS explained by the model.
(The higher the better)
An application on a temperature dataset of USA
In this example, we use data on mean monthly temperatures observed in 48 states of the US.
Raw data are freely available at the National Climatic Data Center website of the US
(http://www1.ncdc.noaa.gov/pub/data/cirs/). The original dataset drd964x.tmpst.txt
contains the sequential Time Bias Corrected state climatic division monthly Average
Temperatures recorded in the 48 states of the US (Hawaii and Alaska are not present in the
dataset) from 1895 to 2014.
R code for this example is available online.
First of all, you can access the data and R scripts from this link: USA_GIS (https:
//www.dropbox.com/sh/c21stseobdroub7/AAABVZzDR0k2ZPvT2eSleNova?dl=0)
A sketch of the data
load("USA_TMP.RData")
plot(USA_TMP_MAT)
[Figure: a matrix of histograms, one row per state (Alabama . . . Wyoming) and one column per month (Jan-Dec), showing the distributions of mean monthly temperatures.]
Performing a DCA (Partitive model)
DCA.5k.resu = WH_kmeans(USA_TMP_MAT, k = 5, rep = 20)
# we consider the best result among 20 runs
$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3 Cl.4 Cl.5
6 11 12 7 12
$solution$centers
a matrix of distributions
12 variables 5 rows
....
$solution$Crit
[1] 3860.546
$quality
[1] 0.9015709
DCA on USA_TMP_MAT: the prototypes
[Figure: the prototypes of the five clusters, one histogram per month.]
Cardinalities: Cl.1 = 6, Cl.2 = 11, Cl.3 = 12, Cl.4 = 7, Cl.5 = 12
$solution$Crit = 3860.546, $quality = 0.9015709
DCA on USA k=5: the map
What is the best (or a suitable) number of clusters
The Caliński-Harabasz index (pseudo-F score)
n number of objects
k number of clusters

CH(k) = [BSS/(k − 1)] / [WSS/(n − k)]

The higher, the better!
No_of_k Crit Qual CH_Index
2 13368.39 0.66 90.78
3 6787.43 0.83 109.27
4 5046.14 0.87 100.87
5 3860.55 0.90 99.94
6 3371.69 0.92 90.63
7 3048.76 0.92 82.26
8 2577.18 0.94 82.42
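The index is straightforward to compute from the sums of squares. A sketch follows; plugging in the table's k = 3 row with n = 48 and BSS = TSS − WSS gives a value close to the tabulated 109.27 (small differences presumably come from rounding in the printed sums):

```python
def calinski_harabasz(bss, wss, n, k):
    # Between-cluster dispersion per degree of freedom divided by
    # within-cluster dispersion per degree of freedom (pseudo-F score).
    return (bss / (k - 1)) / (wss / (n - k))

# k = 3 row of the table: n = 48 states, TSS = 39221.58, WSS (Crit) = 6787.43
tss, wss, n, k = 39221.58, 6787.43, 48, 3
ch = calinski_harabasz(tss - wss, wss, n, k)   # close to the table's maximum
```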
DCA_results for k=3 prototypes
$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3
  13   14   21
$solution$centers
a matrix of distributions, 12 variables, 3 rows
$solution$Crit = 6787.425, $quality = 0.8269466
DCA on USA k=3: the map
DCA interpretation using QPI
          TSS       WSS       BSS   TSS(i)/TSS (%)   BSS(i)/TSS(i) (%)
Jan 6140.30 966.83 5173.47 15.66 84.25
Feb 5989.12 922.01 5067.11 15.27 84.61
Mar 4576.23 617.61 3958.62 11.67 86.50
Apr 3029.37 469.43 2559.94 7.72 84.50
May 2232.27 491.71 1740.56 5.69 77.97
Jun 1861.36 537.50 1323.87 4.75 71.12
Jul 1200.60 340.22 860.38 3.06 71.66
Aug 1451.14 347.46 1103.68 3.70 76.06
Sep 2128.07 383.84 1744.23 5.43 81.96
Oct 2401.76 397.58 2004.17 6.12 83.45
Nov 3379.04 561.03 2818.02 8.62 83.40
Dec 4832.32 752.21 4080.10 12.32 84.43
Total 39221.58 6787.43 32434.15 100.00 82.69
DCA position and variability components
          TSS       WSS       BSS      BSSc     BSSv   BSS(i)/TSS(i) (%)   pos. comp. (%)   var. comp. (%)
Jan 6140.30 966.83 5173.47 5161.36 12.11 84.25 99.77 0.23
Feb 5989.12 922.01 5067.11 5054.33 12.78 84.61 99.75 0.25
Mar 4576.23 617.61 3958.62 3951.77 6.84 86.50 99.83 0.17
Apr 3029.37 469.43 2559.94 2555.28 4.66 84.50 99.82 0.18
May 2232.27 491.71 1740.56 1735.77 4.79 77.97 99.73 0.27
Jun 1861.36 537.50 1323.87 1321.33 2.53 71.12 99.81 0.19
Jul 1200.60 340.22 860.38 855.84 4.54 71.66 99.47 0.53
Aug 1451.14 347.46 1103.68 1100.55 3.13 76.06 99.72 0.28
Sep 2128.07 383.84 1744.23 1742.92 1.31 81.96 99.92 0.08
Oct 2401.76 397.58 2004.17 2001.44 2.73 83.45 99.86 0.14
Nov 3379.04 561.03 2818.02 2811.37 6.65 83.40 99.76 0.24
Dec 4832.32 752.21 4080.10 4064.92 15.18 84.43 99.63 0.37
Total 39221.58 6787.43 32434.15 32356.89 77.26 82.69 99.76 0.24
An example on population pyramids
We consider population age-sex pyramid data collected by the Census Bureau of the USA in
2014. A population pyramid is a common way to represent jointly the distribution of sex
and age of the people living in a given administrative unit (a city, region or country,
for instance).
In this dataset (available in the HistDAWass package under the name Age_Pyramids_2014),
each of the 228 countries is represented by two histograms describing the age distribution
of the male and the female population. Both distributions are drawn with vertically
juxtaposed bars, and the resulting plot looks like a pyramid. The shape of a pyramid
varies according to the distribution of age in the population and is related to the
development of the country.
World population pyramid 2014
[Figure: male (left) and female (right) population percentages by five-year age class, from 0-4 up to 100+.]
DCA with adaptive distances
In this example, we use the Age_Pyramids_2014 dataset. We fix k = 4.
Adaptive DCA weights for each schema
A weight for each variable
(Schema 1)
Weights
Male.pop 0.99957
Fem.pop 1.00043
A weight for each
component of each
variable (Schema=2)
Weights
Male.pop P 1.04032
Male.pop V 0.92821
Fem.pop P 0.96125
Fem.pop V 1.07734
A weight for each variable and each cluster (Schema 3)
Cl.1 Cl.2 Cl.3 Cl.4
Male.pop 1.0486 0.9093 0.9838 1.0846
Fem.pop 0.9536 1.0998 1.0165 0.9220
A weight for each component of each variable and
each cluster (Schema 4)
Cl.1 Cl.2 Cl.3 Cl.4
Male.pop P 1.0799 0.9958 1.0797 1.0106
Male.pop V 1.0926 0.9445 0.9764 0.8157
Fem.pop P 0.9260 1.0042 0.9261 0.9895
Fem.pop V 0.9153 1.0587 1.0241 1.2259
Silhouette index an internal validity index
An internal validity index takes into account one or both of the following pieces of information:
Cluster cohesion: measures how closely related the objects in a cluster are (example:
the within-cluster sum of squares).
Cluster separation: measures how distinct or well-separated a cluster is from the other
clusters (example: the between-cluster sum of squares).
An internal validity index that accounts for both cohesion and separation is the Silhouette
index.
It is the average of the silhouette value assigned to each observation, s(i) = [b(i) − a(i)] / max(a(i), b(i)),
where a(i) is the average distance of object i from the other members of its own cluster and
b(i) is the average distance from the members of the second-best (nearest other) cluster.
Method Silhouette Index
Base 0.775862
Schema 1 0.775867
Schema 2 0.887746
Schema 3 0.899954
Schema 4 0.839478
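A sketch of the silhouette computation from a precomputed distance matrix (any distance, including the L2 Wasserstein one, can be plugged in; the tiny example data are illustrative):

```python
import numpy as np

def silhouette_values(dist, labels):
    # s(i) = (b(i) - a(i)) / max(a(i), b(i)), with a(i) the mean distance
    # to the other members of i's own cluster and b(i) the smallest mean
    # distance to the members of any other cluster.
    n = len(labels)
    s = np.empty(n)
    for i in range(n):
        own = labels == labels[i]
        a = dist[i, own & (np.arange(n) != i)].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s

# tiny worked example: two tight, well-separated clusters on a line
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
labels = np.array([0, 0, 0, 1, 1, 1])
dist = np.abs(x[:, None] - x[None, :])
s = silhouette_values(dist, labels)
# an average silhouette close to 1 indicates cohesive, well-separated clusters
```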
An application on an Activity Recognition dataset: 8 people walking
Altun, Barshan, and Tunçel (2010) created an extensive dataset for activity recognition,
available at the UCI Machine Learning Repository.
The data set consists of:
19 activities: sitting, standing, ascending/descending stairs, walking at 4 km/h on a
flat treadmill, running, rowing, jumping, playing basketball, and more.
8 people between 20 and 30 years old (4 male, 4 female).
Each person freely performed each activity for 5 minutes.
45 signals (triaxial gyroscope, accelerometer and magnetometer on 5 body units) are
recorded at a rate of 25 Hz.
We do not use the magnetometer signals in this application.
An application on the AR dataset: 8 people walking at 4 km/h (flat).
(Each row is a 5 sec. window of measurements)
[Figure: a matrix of histograms; columns are the sensor signals (TO_xacc, TO_yacc, TO_zacc, TO_xgyr, TO_ygyr, TO_zgyr, RA_xacc, RA_yacc, RA_zacc, RA_xgyr, . . . ), rows are the 5-second windows p1-a10-s01 through p8-a10-s10 of the 8 people.]
A PCA on histograms: the individual plots
[Figure: "PCA - Walking on a treadmill at 4 km/h in flat position". First factorial plane (40.50% of explained inertia): Comp. 1 (27.32%) vs Comp. 2 (13.17%). Points are the windows of the 8 people, labeled 1-F, 2-F, 3-M, 4-M, 5-M, 6-F, 7-F, 8-M.]
Dynamic clustering external validity indexes
Indexes:
ARI = adjusted Rand index (accuracy), PUR = purity, FM = Fowlkes-Mallows index, NMI = normalized mutual information
Methods:
KM = dynamic clustering (aka k-means), KMst = dynamic clustering with standardized variables,
ADA_1 = one weight for each variable, ADA_2 = two weights for each variable (one for each component),
ADA_3 = one weight for each variable and each cluster, ADA_4 = two weights for each variable (one for each component) and each cluster
k = 4
Mets    ARI     PUR     FM      NMI
KM      0.4652  0.5000  0.6051  0.6980
KMst    0.4489  0.5000  0.5995  0.7088
ADA_1   0.4448  0.5000  0.5962  0.7038
ADA_2   0.4126  0.5000  0.5744  0.6713
ADA_3   0.4225  0.5000  0.5838  0.6938
ADA_4   0.4090  0.5000  0.5679  0.6557

k = 5
Mets    ARI     PUR     FM      NMI
KM      0.5555  0.6250  0.6608  0.7716
KMst    0.6330  0.6250  0.7141  0.8077
ADA_1   0.5043  0.6229  0.6330  0.7714
ADA_2   0.5059  0.6229  0.6331  0.7672
ADA_3   0.5077  0.6250  0.6334  0.7646
ADA_4   0.5024  0.6250  0.6258  0.7446

k = 6
Mets    ARI     PUR     FM      NMI
KM      0.5627  0.6417  0.6626  0.7821
KMst    0.6811  0.7479  0.7468  0.8505
ADA_1   0.6766  0.7458  0.7434  0.8505
ADA_2   0.6884  0.7500  0.7530  0.8613
ADA_3   0.6884  0.7500  0.7530  0.8613
ADA_4   0.6674  0.7500  0.7454  0.8676

k = 7
Mets    ARI     PUR     FM      NMI
KM      0.5996  0.6417  0.6880  0.7987
KMst    0.7371  0.8208  0.7865  0.8931
ADA_1   0.7371  0.8208  0.7865  0.8931
ADA_2   0.7435  0.8229  0.7917  0.8990
ADA_3   0.7761  0.8208  0.8134  0.8914
ADA_4   0.8446  0.8646  0.8706  0.9241

k = 8
Mets    ARI     PUR     FM      NMI
KM      0.6377  0.7479  0.7079  0.8338
KMst    0.8311  0.8750  0.8569  0.9171
ADA_1   0.8156  0.8667  0.8433  0.9015
ADA_2   0.8336  0.8750  0.8592  0.9366
ADA_3   0.8839  0.9396  0.8988  0.9356
ADA_4   0.7088  0.8042  0.7583  0.8499
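These external indexes compare a computed partition with the ground-truth labels (here, the people). Purity and the Fowlkes-Mallows index, for instance, can be computed from cluster counts and label pairs; a minimal pure-Python sketch with hypothetical labelings:

```python
from math import comb, sqrt
from collections import Counter

def purity(true_labels, pred_labels):
    """Fraction of objects assigned to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    return sum(max(Counter(ts).values()) for ts in clusters.values()) / len(true_labels)

def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    both = Counter(zip(true_labels, pred_labels))
    tp = sum(comb(c, 2) for c in both.values())  # pairs together in both partitions
    return tp / sqrt(pairs_pred * pairs_true)

y_true = [0, 0, 0, 1, 1, 1]      # hypothetical ground truth
y_pred = [0, 0, 1, 1, 1, 1]      # hypothetical clustering result
print(round(purity(y_true, y_pred), 4), round(fowlkes_mallows(y_true, y_pred), 4))
# 0.8333 0.6172
```

Both indexes equal 1 for a perfect recovery of the true partition, which is why the higher values for KMst and the ADA variants at large k indicate better agreement with the true groups.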
Hierarchical clustering
results = WH_hclust(x, simplify = FALSE, qua = 10, standardize = FALSE,
                    distance = "WDIST", method = "complete")
Input
Input param.   Description
x              A MatH object (a matrix of distributionH objects).
simplify       A logical value (default FALSE); if TRUE, histograms are recomputed in order to speed up the algorithm.
qua            An integer; if simplify=TRUE, the number of quantiles used to recode the histograms.
standardize    A logical value (default FALSE); if TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it to give all variables a standard deviation of one.
distance       A string, default "WDIST", the L2 Wasserstein distance (other distances will be implemented).
method         A string, default "complete": the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of ward.D, ward.D2, single, complete, average (= UPGMA), mcquitty (= WPGMA), median (= WPGMC) or centroid (= UPGMC).
Output
An object of class hclust describing the tree produced by the method.
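The default "WDIST" is the L2 Wasserstein distance; for univariate distributions it reduces to the L2 distance between quantile functions, d_W^2(F, G) = ∫_0^1 (F^{-1}(t) - G^{-1}(t))^2 dt (Irpino and Verde 2015). A minimal numeric sketch (pure Python, midpoint-rule approximation on a probability grid; the two quantile functions below are illustrative, not package code):

```python
def wasserstein_l2(qf, qg, n=1000):
    """Approximate the L2 Wasserstein distance between two univariate
    distributions given as quantile functions qf, qg (midpoint rule)."""
    acc = 0.0
    for k in range(n):
        t = (k + 0.5) / n            # midpoint of the k-th probability cell
        acc += (qf(t) - qg(t)) ** 2
    return (acc / n) ** 0.5

# Uniform(0, 1) vs Uniform(2, 3): quantile functions t and 2 + t, so the
# distance equals the location shift of 2.
d = wasserstein_l2(lambda t: t, lambda t: 2.0 + t)
print(d)
```

Working on quantile functions is also why the simplify/qua recoding above is harmless: a histogram recoded by its quantiles keeps the distance computation fast and accurate.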
Application on the Age_pyramids dataset: the script
Work_data = Age_Pyramids_2014[2:229, 2:3]          # take a part of the data
Hward = WH_hclust(Work_data, method = "ward.D2")   # Ward clustering of the histogram data
# cut the dendrogram into 4 clusters
hc = Hward
hcd = as.dendrogram(hc)
clusMember = cutree(hc, 4)
labelColors = c("red", "yellow", "green", "purple")
# function to get color labels
...
# using dendrapply
clusDendro = dendrapply(hcd, colLab)
# make plot (assign_values_to_leaves_nodePar comes from the dendextend package)
clusDendro <- assign_values_to_leaves_nodePar(clusDendro, 0.5, "lab.cex")
plot(clusDendro)
Show tree
Show map
Show barycenters
Other implemented methods
Kohonen self-organizing maps
Fuzzy c-means
Adaptive-distance-based fuzzy c-means
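The fuzzy variants replace hard assignments with memberships: in standard fuzzy c-means, the membership of an object in cluster k is u_k = 1 / Σ_j (d_k / d_j)^(2/(m-1)), where d_j is the object's distance to the j-th centre and m > 1 is the fuzzifier. A minimal sketch of this update for a single object (pure Python illustration, not the package's implementation):

```python
def fcm_memberships(dists, m=2.0):
    """Standard fuzzy c-means membership of one object, given its distances
    to the cluster centres and the fuzzifier m > 1."""
    exp = 2.0 / (m - 1.0)
    if 0.0 in dists:                  # the object sits exactly on a centre
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    return [1.0 / sum((dk / dj) ** exp for dj in dists) for dk in dists]

u = fcm_memberships([1.0, 3.0])       # the object is closer to the first centre
print(round(u[0], 6), round(u[1], 6))  # 0.9 0.1
```

Memberships always sum to one, and in the histogram-valued setting the distances d_j are L2 Wasserstein distances to the cluster prototypes.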
Open research issues and main references
Some open research issues
Considering the problem of finding “true clusters”
How to combine qualitative and quantitative data (heterogeneity)
How to use clustering as a predictive method (a great challenge!)
Imagine we have a set of variables defining the clusters and a set of explanatory variables (for validating the clusters). Is it possible to define a general strategy for predictive clustering? This may be relevant in several application fields: marketing, time-series forecasting, . . .
References
Altun, Kerem, Billur Barshan, and Orkun Tunçel. 2010. “Comparative Study on Classifying Human Activities with Miniature Inertial and Magnetic Sensors.” Pattern Recognition 43 (10): 3605–20. doi:10.1016/j.patcog.2010.04.019.
Bock, H.H., and E. Diday. 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag.
Hennig, Christian. 2015. “What Are the True Clusters?” Pattern Recognition Letters 64: 53–62. doi:10.1016/j.patrec.2015.04.009.
Irpino, A., R. Verde, and F.A.T. De Carvalho. 2014. “Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances.” Expert Systems with Applications 41 (7): 3351–66. doi:10.1016/j.eswa.2013.12.001.
Irpino, Antonio, and Rosanna Verde. 2015. “Basic Statistics for Distributional Symbolic Variables: A New Metric-Based Approach.” Advances in Data Analysis and Classification 9 (2): 143–75. doi:10.1007/s11634-014-0176-4.
Mizuta, Masahiro. 2016. “Mini Data Approach to Big Data.” Medical Imaging and Information Sciences 33 (1): 1–3. doi:10.11318/mii.33.1.
Verde, Rosanna, and Antonio Irpino. 2007. “Dynamic Clustering of Histogram Data: Using the Right Metric.” In Selected Contributions in Data Analysis and Classification, edited by Paula Brito, Guy Cucumel, Patrice Bertrand, and Francisco Carvalho, 123–34. Studies in Classification, Data Analysis, and Knowledge Organization. Berlin, Heidelberg: Springer.
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 

Recently uploaded (20)

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 

Grouping techniques for facing Volume and Velocity in Big Data

  • 1. Grouping techniques for facing Volume and Velocity in Big Data. How to do it using the HistDAWass package for clustering histogram-valued data. Antonio Irpino, PhD, University of Campania ”L. Vanvitelli”, Dept. of Mathematics and Physics, Caserta, Italy. antonio.irpino@unicampania.it. June 4th, 2018.
  • 2. Outline
    1 A very short introduction on some aspects of Big Data
    2 A very short intro to clustering
    3 Hard-partitive algorithms
    4 Hierarchical clustering
    5 Other implemented methods
    6 Open research issues and main references
  • 3. A very short introduction on some aspects of Big Data
  • 4. Some Big Data properties. From Wikipedia: “Big data is data sets that are so voluminous and complex that traditional data-processing application software are inadequate to deal with them.” Big data can be described by the following characteristics:
    Volume: the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
    Variety: the type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, and video; plus it completes missing pieces through data fusion.
    Velocity: in this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
    Veracity: the quality of captured data can vary greatly, affecting accurate analysis.
  • 5. Facing Volume and Velocity. Example 1: a network of wireless sensors collecting and sharing data. Example 2: features extracted from an image database.
  • 6. A suggestion for analysing big data. Mizuta (2016) suggests using Mini Data for the analysis of Big Data.
    Mini Data: mini data of big data are defined as a data set that contains important information about the big data, but whose size and/or structure are realistic to deal with. For building mini data some tools can be used: sampling, variable selection, dimension reduction, feature extraction and . . . symbolization.
    Symbolic Data Analysis (SDA) was proposed for this purpose. Symbolic data are described with interval values, distribution values, combinations of them, or other complex structured values. The target objects that are analyzed are called concepts. The concepts are typical examples of mini data.
  • 7. A proposal for describing such new objects: Symbolic Data Analysis and distributional data (Bock and Diday 2000). The measurement done on an object for a variable may have several values: namely, data are, or might be, multi-valued. This holds especially if an object is a higher-order statistical unit, namely, one that generalizes a set of individual measurements (a region, a city, a market segment, a typology, . . . ). But it is not only this!
    Concurrent approaches:
    Functional data analysis (data are functions!)
    Compositional data analysis (compositions obey the Aitchison geometry!)
    Object oriented data analysis (data live in particular spaces, which are not always Euclidean!)
  • 8. A very short intro to clustering
  • 9. What is clustering? A clustering method is an exploratory tool that looks for groups in data! Clustering is widely used (Hennig 2015) for: delimitation of species of plants or animals in biology; medical classification of diseases; discovery and segmentation of settlements and periods in archaeology; image segmentation and object recognition; social stratification; market segmentation; efficient organization of databases for search queries. There are also quite general tasks for which clustering is applied in many subject areas:
    exploratory data analysis looking for “interesting patterns” without prescribing any specific interpretation, potentially creating new research questions and hypotheses;
    information reduction and structuring of sets of entities from any subject area for simplification, effective communication, or effective access/action, such as complexity reduction for further data analysis, or classification systems;
    investigating the correspondence of a clustering in specific data with other groupings or characteristics, either hypothesized or derived from other data.
    WOW! But. . . what is a cluster?
  • 10. What are “true clusters”? Hennig (2015) lists a set of ideal properties for doing (or validating) clustering:
    1 Within-cluster dissimilarities should be small.
    2 Between-cluster dissimilarities should be large.
    3 Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
    4 Members of a cluster should be well represented by its centroid.
    5 The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric “in same cluster/in different clusters”).
    6 Clusters should be stable.
    7 Clusters should correspond to connected areas in data space with high density.
    8 The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
    9 It should be possible to characterize the clusters using a small number of variables.
    10 Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
    11 Features should be approximately independent within clusters.
    12 All clusters should have roughly the same size.
    13 The number of clusters should be low.
  • 11. Types of clusterings. Considering the obtained partition:
    1 Hard clustering (an object must belong to a single group)
    2 Fuzzy or possibilistic clustering (an object belongs to a cluster according to a membership degree)
    Considering how data are aggregated:
    1 Partitive clustering: k-means, k-medoids, dynamic clustering; density-based clustering; model-based clustering (latent class modeling, e.g. Gaussian mixture models)
    2 Hierarchical clustering: bottom-up (aggregating objects recursively); top-down (dividing the whole set recursively)
    Most algorithms are based on the choice of a similarity/dissimilarity/distance between data.
  • 12. Distances for distributions.
    Abbreviation  Metric
    D             Discrepancy
    H             Hellinger distance
    I             Relative entropy (or Kullback-Leibler divergence)
    K             Kolmogorov (or Uniform) metric
    L             Lévy metric
    P             Prokhorov metric
    S             Separation distance
    W             Wasserstein (or Kantorovich) metric
    χ2            χ2 distance
    The squared L2 Wasserstein distance is d_W^2(Y_i, Y_j) = \int_0^1 [Q_i(t) - Q_j(t)]^2 \, dt, where Q_i(t) is a quantile function (namely, the inverse of the cumulative distribution function). It has some nice properties in clustering (R. Verde and Irpino 2007) and basic statistics have been developed (Irpino and Verde 2015). Methods have been implemented in R in a package called HistDAWass (Histogram Data Analysis with Wasserstein distance).
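Since the distance depends only on quantile functions, it is straightforward to approximate numerically. HistDAWass itself is an R package; the snippet below is a minimal Python sketch (the function name `w2_squared` and the m-level quantile grid are illustrative assumptions, not part of the package) that discretizes the integral over m quantile levels:

```python
import numpy as np

def w2_squared(x, y, m=1000):
    """Approximate the squared L2 Wasserstein distance between the
    empirical distributions of samples x and y by discretizing the
    integral of [Qx(t) - Qy(t)]^2 over m quantile levels."""
    t = (np.arange(m) + 0.5) / m            # quantile levels in (0, 1)
    qx = np.quantile(np.asarray(x, float), t)
    qy = np.quantile(np.asarray(y, float), t)
    return float(np.mean((qx - qy) ** 2))   # mean of squared quantile gaps
```

For instance, shifting a sample by a constant c gives a squared distance of exactly c², since the two quantile functions differ by c at every level.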
  • 13. Wasserstein distance: a nice property. Total distance = Position + Internal variability, and Internal variability = Size + Shape:
    d_W^2(Y_i, Y_j) = \int_0^1 [Q_i(t) - Q_j(t)]^2 \, dt = (\mu_i - \mu_j)^2 + (\sigma_i - \sigma_j)^2 + 2\sigma_i\sigma_j [1 - \mathrm{Corr}_{QQ}(Y_i, Y_j)]
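The decomposition can be checked numerically: with the same quantile discretization as before, the position, size and shape terms add up to the total squared distance. A hedged Python sketch (`w2_components` is an illustrative name, not a HistDAWass function):

```python
import numpy as np

def w2_components(x, y, m=2000):
    """Split the squared L2 Wasserstein distance between two samples
    into position, size and shape components, using m quantile levels."""
    t = (np.arange(m) + 0.5) / m
    qx = np.quantile(np.asarray(x, float), t)
    qy = np.quantile(np.asarray(y, float), t)
    mx, my = qx.mean(), qy.mean()           # means
    sx, sy = qx.std(), qy.std()             # standard deviations
    rho = np.corrcoef(qx, qy)[0, 1]         # quantile-quantile correlation
    position = (mx - my) ** 2               # squared difference of means
    size = (sx - sy) ** 2                   # squared difference of std devs
    shape = 2.0 * sx * sy * (1.0 - rho)     # residual shape term
    total = float(np.mean((qx - qy) ** 2))  # total squared W2 distance
    return position, size, shape, total
```

On any pair of samples, position + size + shape reproduces the total up to floating-point error, which is exactly the identity on the slide.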
  • 14. Hard-partitive algorithms
  • 15. Dynamic clustering (a generalization of the k-means algorithm). The dynamic clustering algorithm: after initialization, a two-step algorithm looks for the best partition into k classes and the best representation of the clusters. We assume that the prototype of the cluster C_k (k = 1, . . . , K) is represented by a vector g_k = (g_k1, . . . , g_kp), where g_kj is a histogram. DCA looks for the partition P = (C_1, . . . , C_K) of E in K clusters and the corresponding set of K prototypes G = (g_1, . . . , g_K) such that the following adequacy criterion of best fitting between the clusters and their prototypes is locally minimized:
    \Delta(G, P) = \sum_{k=1}^{K} \sum_{i \in C_k} d_W^2(y_i, g_k).   (1)
  • 16. The Algorithm. The DCA algorithm:
    1 Initialize the algorithm: set a number k of clusters; set T = 0; generate a random partition of the objects P(0); compute the criterion (the within-cluster sum of squares) CRIT(0).
    2 Representation step: set T = T + 1; compute the prototypes of each cluster using P(T − 1).
    3 Allocation step: allocate objects to the nearest prototype, obtaining the partition P(T); compute CRIT(T).
    4 Stop condition: if CRIT(T) < CRIT(T − 1) go to step 2, else return the results.
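The two-step loop is easy to prototype once every histogram is encoded as a fixed vector of quantiles: under the L2 Wasserstein distance the prototype minimizing the within-cluster criterion is simply the component-wise mean of the member quantile vectors. The sketch below is an illustrative Python version, not the HistDAWass implementation; for reproducibility it uses a deterministic initial partition where the slides use a random one:

```python
import numpy as np

def dca(Q, k, iters=100):
    """Minimal dynamic-clustering sketch for n distributions encoded
    as rows of an (n, m) matrix Q of quantiles."""
    n = Q.shape[0]
    labels = np.arange(n) % k                     # deterministic initial partition
    for _ in range(iters):
        # representation step: prototype = mean quantile vector per cluster
        protos = np.vstack([Q[labels == j].mean(axis=0)
                            if np.any(labels == j) else Q[j]
                            for j in range(k)])
        # allocation step: reassign each object to its nearest prototype
        d2 = ((Q[:, None, :] - protos[None, :, :]) ** 2).mean(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # partition stable: stop
            break
        labels = new_labels
    crit = float(d2[np.arange(n), labels].sum())  # within-cluster criterion
    return labels, protos, crit
```

On two well-separated groups of quantile vectors the loop recovers the groups in a couple of iterations and the criterion drops to zero.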
  • 17. The WH_kmeans function. The function uses L2 Wasserstein-based statistics.
    results = WH_kmeans(x,             # A MatH object
                k,                     # The number of required clusters
                rep = 5,               # How many times it is initialized
                simplify = FALSE,      # A flag for speeding up, approximating data
                qua = 10,              # If simplify=TRUE, how many quantiles are used
                                       # for approximating the distributions
                standardize = FALSE)   # Do you need to standardize variables?
  • 18. The output of the WH_kmeans function.
    results: a list. It contains the best solution among the repetitions, i.e. the one having the minimum criterion.
    results$IDX: a vector. The clusters to which the objects are assigned.
    results$cardinality: a vector. The size of each final cluster.
    results$centers: a MatH object with the description of the centers.
    results$Crit: a number. The criterion (within-cluster sum of squared distances from the centers).
    results$quality: a number. The percentage of the total SS explained by the model (the higher the better).
  • 19. Adaptive distances-based dynamic clustering (A. Irpino, Verde, and De Carvalho 2014). A system of weights is calculated for the variables and for their components, cluster-wise or globally. The system of weights is useful if data are clustered into non-spherical classes. We assume that the prototype of the cluster C_k (k = 1, . . . , K) is also represented by a vector g_k = (g_k1, . . . , g_kp), where g_kj is a histogram. As in the standard DCA, the proposed methods look for the partition P = (C_1, . . . , C_K) of E in K clusters, the corresponding set of K prototypes G = (g_1, . . . , g_K), and a set of K different adaptive distances d = (d_1, . . . , d_K) depending on a set Λ of positive weights associated with the clusters, such that the following adequacy criterion of best fitting between the clusters and their prototypes is locally minimized:
    \Delta(G, \Lambda, P) = \sum_{k=1}^{K} \sum_{i \in C_k} d(y_i, g_k \mid \Lambda).   (2)
  • 20. The adaptive distances.
    One weight for each variable:
    d(y_i, g_k \mid \Lambda) = \sum_{j=1}^{p} \lambda_j \, d_W^2(y_{ij}, g_{kj})   (3)
    Two weights for each variable (one for each component of the distance):
    d(y_i, g_k \mid \Lambda) = \sum_{j=1}^{p} \lambda_{j,\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{j=1}^{p} \lambda_{j,Disp} \, d_W^2(y^c_{ij}, g^c_{kj})   (4)
    One weight for each variable and each cluster:
    d(y_i, g_k \mid \Lambda) = \sum_{j=1}^{p} \lambda^k_j \, d_W^2(y_{ij}, g_{kj})   (5)
    Two weights for each variable and each cluster:
    d(y_i, g_k \mid \Lambda) = \sum_{j=1}^{p} \lambda^k_{j,\bar{y}} (\bar{y}_{ij} - \bar{y}_{g_{kj}})^2 + \sum_{j=1}^{p} \lambda^k_{j,Disp} \, d_W^2(y^c_{ij}, g^c_{kj})   (6)
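For scheme (3), a common multiplicative weight update sets each weight to the geometric mean of the per-variable within-cluster dispersions divided by that variable's dispersion, so that the product of the weights equals 1. The following Python sketch follows that scheme under stated assumptions (the function names, and the encoding of each object as a (p, m) matrix of quantiles, are illustrative, not the HistDAWass API):

```python
import numpy as np

def schema1_weights(disp):
    """Multiplicative weights for one-weight-per-variable (schema 1):
    lambda_j = (prod_h disp_h)^(1/p) / disp_j, so prod(lambda) = 1.
    disp: length-p vector of per-variable within-cluster dispersions."""
    disp = np.asarray(disp, float)
    return (np.prod(disp) ** (1.0 / disp.size)) / disp

def adaptive_d2(Yi, Gk, lam):
    """Weighted squared distance between an object and a prototype,
    both given as (p, m) matrices of quantiles (one row per variable)."""
    per_var = ((Yi - Gk) ** 2).mean(axis=1)   # squared W2 per variable
    return float(np.dot(lam, per_var))        # weighted sum over variables
```

A variable with large within-cluster dispersion thus receives a small weight, which is what makes the criterion favor non-spherical classes.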
  • 21. Two possible functions for computing the weights and four possible combinations of weights. The system of weights may be:
    Multiplicative: the product of the weights is fixed (generally equal to one).
    Additive: the sum of the weights is fixed (generally equal to one).
    Ways of assigning weights:
    1 A weight for each variable
    2 A weight for each variable and each cluster
    3 A weight for each component of a distributional variable (i.e. the position and the variability component related to the decomposition of the L2 Wasserstein distance)
    4 A weight for each component and each cluster
  • 22. The algorithm. The adaptive DCA algorithm:
    1 Initialize the algorithm: set T = 0 and a number k of clusters, and initialize the weights W(0); generate a random partition of the objects P(0); compute the criterion (the within-cluster sum of squares) CRIT(0).
    2 Representation step (fix the partition and the weights): set T = T + 1; compute the prototypes G(T) of each cluster using P(T − 1) and W(T − 1).
    3 Weighting step (fix the partition and the prototypes): compute the weight system W(T) using G(T) and P(T − 1).
    4 Allocation step (fix the weights and the prototypes): assign objects to the nearest prototype in G(T) using W(T), obtaining the partition P(T); compute CRIT(T).
    5 Stop condition: if CRIT(T) < CRIT(T − 1) go to step 2, else return the results.
  • 23. The WH_adaptive.kmeans function.
    results = WH_adaptive.kmeans(x, k, schema = 1, init, rep, simplify = FALSE,
                qua = 10, standardize = FALSE, weight.sys = "PROD",
                theta = 2, init.weights = "EQUAL")
    Parameters:
    x: a MatH object (a matrix of distributionH).
    k: an integer, the number of groups.
    schema: a number from 1 to 4. 1: a weight for each variable (default); 2: a weight for the average and the dispersion component of each variable; 3: same as 1 but a different set of weights for each cluster; 4: same as 2 but a different set of weights for each cluster.
    init: (optional, do not use) initialization for partitioning the data; default is 'RPART'.
    rep: an integer, the maximum number of repetitions of the algorithm (default rep = 5).
    simplify: a logical value (default is FALSE); if TRUE, histograms are recomputed in order to speed up the algorithm.
    qua: an integer; if simplify = TRUE, the number of quantiles used for re-coding the histograms.
    standardize: a logical value (default is FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation.
    weight.sys: a string. Weights may add to one ('SUM') or their product is equal to 1 ('PROD', default).
    theta: a number. A parameter used if weight.sys = 'SUM'; default is 2.
    init.weights: a string, how to initialize the weights: 'EQUAL' (default), all weights are the same; 'RANDOM', weights are initialized at random.
  • 24. The output.
    results: a list. Returns the best solution among the repetitions, i.e. the one having the minimum sum-of-squares criterion.
    results$IDX: a vector. The final cluster labels of the objects.
    results$cardinality: a vector. The cardinality of each final cluster.
    results$proto: a MatH object with the description of the centers.
    results$weights: a matrix of weights for each component of each variable and each cluster.
    results$Crit: a number. The criterion (weighted within-cluster SS) value at the end of the run.
    results$TOTSSQ: the total SSQ computed with the system of weights.
    results$BSQ: the between-clusters SSQ computed with the system of weights.
    results$WSQ: the within-clusters SSQ computed with the system of weights.
    results$quality: a number. The proportion of the TSS explained by the model (the higher the better).
  • 25. An application on a temperature dataset of the USA. In this example, we use data on mean monthly temperatures observed in 48 states of the US. Raw data are freely available at the National Climatic Data Center website of the US (http://www1.ncdc.noaa.gov/pub/data/cirs/). The original dataset drd964x.tmpst.txt contains the sequential time-bias-corrected state climatic division monthly average temperatures recorded in the 48 states of the US (Hawaii and Alaska are not present in the dataset) from 1895 to 2014. R code for this example is available here. First of all, you can access the data and R scripts from this link: USA_GIS (https://www.dropbox.com/sh/c21stseobdroub7/AAABVZzDR0k2ZPvT2eSleNova?dl=0)
• 26. Hard-partitive algorithms A sketch of the data
load("USA_TMP.RData")
plot(USA_TMP_MAT)
[Figure: matrix plot of the monthly temperature histograms (Jan-Dec) for the 48 states, from Alabama to Wyoming]
• 27. Hard-partitive algorithms Performing a DCA (Partitive model)
DCA.5k.resu=WH_kmeans(USA_TMP_MAT, k = 5, rep = 20) # we keep the best result among 20 runs
$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3 Cl.4 Cl.5
   6   11   12    7   12
$solution$centers
a matrix of distributions
12 variables 5 rows ....
$solution$Crit
[1] 3860.546
$quality
[1] 0.9015709
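The reported quality is the proportion of the total sum of squares explained by the partition, quality = 1 - Crit/TSS. A quick Python check against the numbers above (the TSS value, 39221.58, is taken from the QPI table shown later; the variable names here are ours):

```python
# Hedged sketch: verify quality = 1 - WSS/TSS using the slide's numbers.
crit = 3860.546      # $solution$Crit: within-cluster sum of squares for k = 5
tss = 39221.58       # total sum of squares (from the QPI table)

quality = 1 - crit / tss
print(round(quality, 7))  # close to the reported 0.9015709
```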
• 28. Hard-partitive algorithms DCA on USA_TMP_MAT: the prototypes
[Figure: the histogram prototypes of the five clusters]
Cardinalities: Cl.1 = 6, Cl.2 = 11, Cl.3 = 12, Cl.4 = 7, Cl.5 = 12; $solution$Crit = 3860.546; $quality = 0.9015709
• 29. Hard-partitive algorithms DCA on USA k=5: the map
• 30. Hard-partitive algorithms What is the best (or a suitable) number of clusters?
The Caliński-Harabasz index (pseudo-F score), with n the number of objects and k the number of clusters:

CH(k) = (BSS/(k - 1)) / (WSS/(n - k))

The higher, the better!

No_of_k   Crit      Qual  CH_Index
2        13368.39   0.66    90.78
3         6787.43   0.83   109.27
4         5046.14   0.87   100.87
5         3860.55   0.90    99.94
6         3371.69   0.92    90.63
7         3048.76   0.92    82.26
8         2577.18   0.94    82.42
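A minimal Python sketch of the index (the function name is ours):

```python
def calinski_harabasz(bss, wss, n, k):
    """Pseudo-F score: between-cluster dispersion over within-cluster
    dispersion, each scaled by its degrees of freedom."""
    return (bss / (k - 1)) / (wss / (n - k))

# Toy numbers: 10 objects, 2 clusters, BSS = 80, WSS = 20.
print(calinski_harabasz(80, 20, 10, 2))  # -> 32.0
```

On the table above, CH is maximized at k = 3.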
• 31. Hard-partitive algorithms DCA results for k=3: prototypes
$solution$cardinality
IDX
Cl.1 Cl.2 Cl.3
  13   14   21
$solution$centers
a matrix of distributions
12 variables 3 rows
$Crit
[1] 6787.425
$quality
[1] 0.8269466
• 32. Hard-partitive algorithms DCA on USA k=3: the map
• 33. Hard-partitive algorithms DCA interpretation using QPI

          TSS       WSS       BSS    % of TSS       % of quality
                                     (TSS(i)/TSS)   (BSS(i)/TSS(i))
Jan     6140.30    966.83   5173.47    15.66          84.25
Feb     5989.12    922.01   5067.11    15.27          84.61
Mar     4576.23    617.61   3958.62    11.67          86.50
Apr     3029.37    469.43   2559.94     7.72          84.50
May     2232.27    491.71   1740.56     5.69          77.97
Jun     1861.36    537.50   1323.87     4.75          71.12
Jul     1200.60    340.22    860.38     3.06          71.66
Aug     1451.14    347.46   1103.68     3.70          76.06
Sep     2128.07    383.84   1744.23     5.43          81.96
Oct     2401.76    397.58   2004.17     6.12          83.45
Nov     3379.04    561.03   2818.02     8.62          83.40
Dec     4832.32    752.21   4080.10    12.32          84.43
Total  39221.58   6787.43  32434.15   100.00          82.69
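The two percentage columns can be reproduced directly from the sums of squares; a quick Python check on the Jan row (variable names ours):

```python
# Hedged sketch: reproduce the QPI percentage columns from the Jan row.
tss_jan, wss_jan, bss_jan = 6140.30, 966.83, 5173.47
tss_tot = 39221.58

assert abs(tss_jan - (wss_jan + bss_jan)) < 1e-6   # TSS(i) = WSS(i) + BSS(i)
share = 100 * tss_jan / tss_tot                    # "% of TSS" column
quality_i = 100 * bss_jan / tss_jan                # "% of quality" column
print(round(share, 2), round(quality_i, 2))        # -> 15.66 84.25
```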
• 34. Hard-partitive algorithms DCA position and variability components

          TSS       WSS       BSS      BSSc      BSSv   % of q.        pos.     var.
                                                        (BSS(i)/TSS(i)) comp.%  comp.%
Jan     6140.30    966.83   5173.47   5161.36    12.11    84.25         99.77    0.23
Feb     5989.12    922.01   5067.11   5054.33    12.78    84.61         99.75    0.25
Mar     4576.23    617.61   3958.62   3951.77     6.84    86.50         99.83    0.17
Apr     3029.37    469.43   2559.94   2555.28     4.66    84.50         99.82    0.18
May     2232.27    491.71   1740.56   1735.77     4.79    77.97         99.73    0.27
Jun     1861.36    537.50   1323.87   1321.33     2.53    71.12         99.81    0.19
Jul     1200.60    340.22    860.38    855.84     4.54    71.66         99.47    0.53
Aug     1451.14    347.46   1103.68   1100.55     3.13    76.06         99.72    0.28
Sep     2128.07    383.84   1744.23   1742.92     1.31    81.96         99.92    0.08
Oct     2401.76    397.58   2004.17   2001.44     2.73    83.45         99.86    0.14
Nov     3379.04    561.03   2818.02   2811.37     6.65    83.40         99.76    0.24
Dec     4832.32    752.21   4080.10   4064.92    15.18    84.43         99.63    0.37
Total  39221.58   6787.43  32434.15  32356.89    77.26    82.69         99.76    0.24
• 35. Hard-partitive algorithms An example on population pyramids
We consider population age-sex pyramid data collected by the US Census Bureau in 2014. A population pyramid is a common way to represent jointly the distribution of sex and age of the people living in a given administrative unit (a city, region or country, for instance). In this dataset (available in the HistDAWass package under the name Age_Pyramids_2014), each of the 228 countries is described by two histograms: the age distribution of the male population and that of the female population. The two distributions are juxtaposed vertically, yielding a pyramid-like representation. The shape of a pyramid varies with the age distribution of the population and is related to the development of the country.
[Figure: world population pyramid 2014, male and female percentages by 5-year age class (0-4 to 100+)]
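The clustering methods in HistDAWass compare such histograms through the L2 Wasserstein distance, i.e. the L2 distance between quantile functions, d_W²(f, g) = ∫₀¹ (F⁻¹(t) − G⁻¹(t))² dt (Irpino and Verde 2015). A hedged Python sketch on empirical samples (the uniform-grid quantile approximation and function names are ours, not the package's implementation):

```python
def l2_wasserstein(x, y, m=200):
    """Approximate the L2 Wasserstein distance between two empirical
    distributions by comparing their quantile functions on a grid of m
    points in (0, 1)."""
    xs, ys = sorted(x), sorted(y)

    def quantile(s, t):
        # simple empirical quantile: the sorted value at position t in [0, 1)
        return s[min(int(t * len(s)), len(s) - 1)]

    sq = sum((quantile(xs, (i + 0.5) / m) - quantile(ys, (i + 0.5) / m)) ** 2
             for i in range(m)) / m
    return sq ** 0.5

# Two samples that differ only by a shift of 5: the distance equals the shift.
print(l2_wasserstein([0, 1, 2, 3], [5, 6, 7, 8]))  # -> 5.0
```

The shift example illustrates why this metric suits pyramids: it separates a location (position) component from a shape (variability) component, as in the BSSc/BSSv decomposition of the previous slide.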
• 36. Hard-partitive algorithms DCA with adaptive distances
In this example, we use the Age_Pyramids_2014 dataset and fix k = 4.
Adaptive DCA weights for each schema:

One weight for each variable (Schema 1)
           Weights
Male.pop   0.99957
Fem.pop    1.00043

One weight for each component of each variable (Schema 2)
             Weights
Male.pop P   1.04032
Male.pop V   0.92821
Fem.pop P    0.96125
Fem.pop V    1.07734

One weight for each variable and each cluster (Schema 3)
           Cl.1    Cl.2    Cl.3    Cl.4
Male.pop   1.0486  0.9093  0.9838  1.0846
Fem.pop    0.9536  1.0998  1.0165  0.9220

One weight for each component of each variable and each cluster (Schema 4)
             Cl.1    Cl.2    Cl.3    Cl.4
Male.pop P   1.0799  0.9958  1.0797  1.0106
Male.pop V   1.0926  0.9445  0.9764  0.8157
Fem.pop P    0.9260  1.0042  0.9261  0.9895
Fem.pop V    0.9153  1.0587  1.0241  1.2259
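In adaptive-distance dynamic clustering (Irpino, Verde, and De Carvalho 2014) the weights are constrained to have product one, and the optimal weight of a variable is inversely proportional to its dispersion around the prototypes; note that each block of weights above indeed multiplies to roughly 1. A hedged Python sketch of this update rule (an assumption drawn from the cited paper; variable names ours):

```python
def adaptive_weights(dispersions):
    """Weights with product one: each weight is the geometric mean of all
    dispersions divided by the variable's own dispersion, so variables
    that are more compact within clusters get larger weights."""
    p = len(dispersions)
    prod = 1.0
    for d in dispersions:
        prod *= d
    gmean = prod ** (1.0 / p)
    return [gmean / d for d in dispersions]

w = adaptive_weights([1.0, 4.0])
print(w)  # -> [2.0, 0.5]; the product of the weights is 1
```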
• 37. Hard-partitive algorithms Silhouette index: an internal validity index
An internal validity index takes into account one or both of the following aspects:
Cluster cohesion: measures how closely related the objects in a cluster are (example: the within-cluster sum of squares).
Cluster separation: measures how distinct, or well separated, a cluster is from the other clusters (example: the between-cluster sum of squares).
An internal validity index that accounts for both cohesion and separation is the Silhouette index. It is the average of the silhouette values assigned to the observations:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the average distance of observation i from the members of its own cluster and b(i) is its average distance from the members of the nearest other cluster.

Method     Silhouette Index
Base       0.775862
Schema 1   0.775867
Schema 2   0.887746
Schema 3   0.899954
Schema 4   0.839478
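A hedged Python sketch of the index on plain 1-D points (function names ours; for histogram data the distance would be the L2 Wasserstein one):

```python
def mean_silhouette(points, labels, dist=lambda a, b: abs(a - b)):
    """Average silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)), where
    a(i) is the mean distance to i's own cluster and b(i) the mean
    distance to the nearest other cluster."""
    clusters = set(labels)
    total = 0.0
    for i, (p, c) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in clusters if other != c
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well separated 1-D clusters: the average silhouette is close to 1.
print(round(mean_silhouette([0, 1, 10, 11], [0, 0, 1, 1]), 4))  # -> 0.8997
```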
• 38. Hard-partitive algorithms An application to an Activity Recognition dataset: 8 people walking
(Altun, Barshan, and Tunçel 2010) created an extensive dataset for activity recognition, available at the UCI Machine Learning Repository. The dataset consists of:
19 activities: sitting, standing, ascending/descending stairs, walking at 4 km/h on a flat treadmill, running, rowing, jumping, playing basketball, and more.
8 people between 20 and 30 years old (4 male, 4 female).
Each person freely performed each activity for 5 minutes.
45 measurements (triaxial gyroscope, accelerometer and magnetometer on 5 body units) recorded at a rate of 25 Hz.
We do not use the magnetometer in this application.
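Each 5-minute recording is then split into 5-second windows (as shown in the next slide), and within each window the readings of a sensor axis are summarized as a histogram. A hedged Python sketch of this windowing step (the helper name is ours):

```python
def windows(signal, rate_hz=25, window_s=5):
    """Split a signal sampled at rate_hz into non-overlapping windows of
    window_s seconds; each window is later summarised as a histogram."""
    size = rate_hz * window_s          # 125 samples per 5 s window at 25 Hz
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

# 5 minutes at 25 Hz -> 7500 samples -> 60 windows of 125 samples each.
signal = [0.0] * (5 * 60 * 25)
w = windows(signal)
print(len(w), len(w[0]))  # -> 60 125
```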
• 39. Hard-partitive algorithms An application on the AR dataset: 8 people walking at 4 km/h on flat ground (each row is a 5 sec. window of measurements)
[Figure: matrix plot of the histogram-valued sensor variables (TO_xacc, TO_yacc, TO_zacc, TO_xgyr, TO_ygyr, TO_zgyr, RA_xacc, RA_yacc, RA_zacc, RA_xgyr) for the 80 five-second windows p1-a10-s01 through p8-a10-s10]
• 40. Hard-partitive algorithms A PCA on histograms: the individual plots
[Figure: PCA of walking on a treadmill at 4 km/h in flat position; first factorial plane, Comp. 1 (27.32%) vs Comp. 2 (13.17%), 40.50% of explained inertia; points labelled by person and sex, 1-F through 8-M]
• 41. Hard-partitive algorithms Dynamic clustering: external validity indexes
Indexes: ARI = adjusted Rand index (accuracy), PUR = purity, FM = Fowlkes-Mallows index, NMI = normalized mutual information.
Methods: KM = dynamic clustering (aka k-means); KMst = dynamic clustering with standardized variables; ADA_1 = 1 weight for each variable; ADA_2 = 2 weights for each variable (one for each component); ADA_3 = 1 weight for each variable and each cluster; ADA_4 = 2 weights for each variable (one for each component) and each cluster.

k=4
Mets    ARI     PUR     FM      NMI
KM      0.4652  0.5000  0.6051  0.6980
KMst    0.4489  0.5000  0.5995  0.7088
ADA_1   0.4448  0.5000  0.5962  0.7038
ADA_2   0.4126  0.5000  0.5744  0.6713
ADA_3   0.4225  0.5000  0.5838  0.6938
ADA_4   0.4090  0.5000  0.5679  0.6557

k=5
Mets    ARI     PUR     FM      NMI
KM      0.5555  0.6250  0.6608  0.7716
KMst    0.6330  0.6250  0.7141  0.8077
ADA_1   0.5043  0.6229  0.6330  0.7714
ADA_2   0.5059  0.6229  0.6331  0.7672
ADA_3   0.5077  0.6250  0.6334  0.7646
ADA_4   0.5024  0.6250  0.6258  0.7446

k=6
Mets    ARI     PUR     FM      NMI
KM      0.5627  0.6417  0.6626  0.7821
KMst    0.6811  0.7479  0.7468  0.8505
ADA_1   0.6766  0.7458  0.7434  0.8505
ADA_2   0.6884  0.7500  0.7530  0.8613
ADA_3   0.6884  0.7500  0.7530  0.8613
ADA_4   0.6674  0.7500  0.7454  0.8676

k=7
Mets    ARI     PUR     FM      NMI
KM      0.5996  0.6417  0.6880  0.7987
KMst    0.7371  0.8208  0.7865  0.8931
ADA_1   0.7371  0.8208  0.7865  0.8931
ADA_2   0.7435  0.8229  0.7917  0.8990
ADA_3   0.7761  0.8208  0.8134  0.8914
ADA_4   0.8446  0.8646  0.8706  0.9241

k=8
Mets    ARI     PUR     FM      NMI
KM      0.6377  0.7479  0.7079  0.8338
KMst    0.8311  0.8750  0.8569  0.9171
ADA_1   0.8156  0.8667  0.8433  0.9015
ADA_2   0.8336  0.8750  0.8592  0.9366
ADA_3   0.8839  0.9396  0.8988  0.9356
ADA_4   0.7088  0.8042  0.7583  0.8499
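Purity, the simplest of these external indexes, assigns each cluster to its majority true class and measures the resulting accuracy. A hedged Python sketch (function name ours):

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of objects that belong to the majority true class of
    their assigned cluster."""
    n = len(true_labels)
    hits = 0
    for c in set(cluster_labels):
        members = [t for t, g in zip(true_labels, cluster_labels) if g == c]
        hits += Counter(members).most_common(1)[0][1]
    return hits / n

# 6 objects, 2 clusters; each cluster captures its majority class with one error.
print(purity(["A", "A", "B", "B", "B", "A"], [1, 1, 1, 2, 2, 2]))  # -> 4/6
```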
• 42. Hierarchical clustering Hierarchical clustering
• 43. Hierarchical clustering Hierarchical clustering
results = WH_hclust(x, simplify=FALSE, qua=10, standardize=FALSE, distance="WDIST", method="complete")
Input
Input param.  Description
x             A MatH object (a matrix of distributionH).
simplify      A logical value (default FALSE); if TRUE, histograms are recomputed in order to speed up the algorithm.
qua           An integer; if simplify=TRUE, it is the number of quantiles used to recode the histograms.
standardize   A logical value (default FALSE). If TRUE, histogram-valued data are standardized, variable by variable, using the Wasserstein-based standard deviation. Use it if one wants variables with standard deviation equal to one.
distance      A string, default "WDIST", the L2 Wasserstein distance (other distances will be implemented).
method        A string, default "complete": the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of ward.D, ward.D2, single, complete, average (= UPGMA), mcquitty (= WPGMA), median (= WPGMC) or centroid (= UPGMC).
Output
An object of the class hclust which describes the tree produced by the method.
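The agglomeration itself is the standard one: start from singletons and repeatedly merge the two closest clusters, where, under complete linkage, the distance between two clusters is the maximum pairwise distance between their members. A hedged Python sketch on 1-D points (the naive O(n³) loop is ours; hclust implements this far more efficiently):

```python
def complete_linkage(points, dist=lambda a, b: abs(a - b)):
    """Naive agglomerative clustering with complete linkage; returns the
    merge heights, i.e. the cluster distance at each agglomeration step."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: max pairwise distance between the clusters
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters.pop(j)
    return heights

print(complete_linkage([0, 1, 5, 20]))  # -> [1, 5, 20]
```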
• 44. Hierarchical clustering Application on the Age_Pyramids dataset: the script
Work_data=Age_Pyramids_2014[2:229,2:3] # take a part of the data
Hward=WH_hclust(Work_data,method = "ward.D2") # do the dirty work
# cut dendrogram in 4 clusters
hc=Hward
hcd=as.dendrogram(hc)
clusMember = cutree(hc, 4)
labelColors = c("red", "yellow", "green", "purple")
# function to get color labels
...
# using dendrapply
clusDendro = dendrapply(hcd, colLab)
# make plot
clusDendro<-assign_values_to_leaves_nodePar(clusDendro, 0.5, "lab.cex")
plot(clusDendro)
• 45. Hierarchical clustering Show tree
• 46. Hierarchical clustering Show map
• 47. Hierarchical clustering Show barycenters
• 48. Other implemented methods Other implemented methods
• 49. Other implemented methods Other methods
Kohonen Self-Organizing Maps
Fuzzy c-means
Adaptive distances-based fuzzy c-means
• 50. Open research issues and main references Open research issues and main references
• 51. Open research issues and main references Some open research issues
Considering the problem of finding "true clusters".
How to combine qualitative and quantitative data (heterogeneity).
How to consider clustering as a predictive method (a great challenge!): imagine that we have a set of variables defining the clusters and a set of explanatory variables (for validating the clusters). Is it possible to define a general strategy for predictive clustering? This may be relevant in several applicative fields: marketing, time series forecasting, and more.
• 52. Open research issues and main references References
Altun, Kerem, Billur Barshan, and Orkun Tunçel. 2010. "Comparative Study on Classifying Human Activities with Miniature Inertial and Magnetic Sensors." Pattern Recognition 43 (10): 3605-20. doi:10.1016/j.patcog.2010.04.019.
Bock, H.H., and E. Diday. 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag.
Hennig, Christian. 2015. "What Are the True Clusters?" Pattern Recognition Letters 64: 53-62. doi:10.1016/j.patrec.2015.04.009.
Irpino, A., R. Verde, and F.A.T. De Carvalho. 2014. "Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances." Expert Systems with Applications 41 (7): 3351-66. doi:10.1016/j.eswa.2013.12.001.
Irpino, Antonio, and Rosanna Verde. 2015. "Basic Statistics for Distributional Symbolic Variables: A New Metric-Based Approach." Advances in Data Analysis and Classification 9 (2): 143-75. doi:10.1007/s11634-014-0176-4.
Mizuta, Masahiro. 2016. "Mini Data Approach to Big Data." Medical Imaging and Information Sciences 33 (1): 1-3. doi:10.11318/mii.33.1.
Verde, Rosanna, and Antonio Irpino. 2007. "Dynamic Clustering of Histogram Data: Using the Right Metric." In Selected Contributions in Data Analysis and Classification, edited by Paula Brito, Guy Cucumel, Patrice Bertrand, and Francisco Carvalho, 123-34. Studies in Classification, Data Analysis, and Knowledge Organization. Berlin, Heidelberg: Springer.