# American Institute of Mathematical Sciences

August  2019, 2(3): 215-250. doi: 10.3934/mfc.2019015

## Using distribution analysis for parameter selection in repstream

 Curtin University, Kent Street, Perth, Western Australia 6102, Australia

* Corresponding author: Ross Callister

Received  April 2019 Published  September 2019

One of the most significant challenges in data clustering is the evolution of the data distributions over time. Many clustering algorithms have been introduced to deal specifically with streaming data, but common amongst them is that they require users to set input parameters. These inform the algorithm about the criteria under which data points may be clustered together. Setting the initial parameters for a clustering algorithm is itself a non-trivial task, but the evolution of the data distribution over time could mean even optimally set parameters could become non-optimal as the stream evolves. In this paper we extend the RepStream algorithm, a combination graph and density-based clustering algorithm, in a way which allows the primary input parameter, the $K$ value, to be automatically adjusted over time. We introduce a feature called the edge distribution score which we compute for data in memory, as well as introducing an incremental method for adjusting the $K$ parameter over time based on this score. We evaluate our methods against RepStream itself, and other contemporary stream clustering algorithms, and show how our method of automatically adjusting the $K$ value over time leads to higher quality clustering output even when the initial parameters are set poorly.

Citation: Ross Callister, Duc-Son Pham, Mihai Lazarescu. Using distribution analysis for parameter selection in repstream. Mathematical Foundations of Computing, 2019, 2 (3) : 215-250. doi: 10.3934/mfc.2019015
##### References:

show all references

##### References:
An illustration of the representative and point-level sparse graphs in the RepStream algorithm
The representative vertex $R_1$ and $R_2$ share a reciprocal connection at the representative level, and are also density related. The density radius of the vertices is shown as $DR_1$ and $DR_2$ respectively, which is the average distance to neighbours at the point level, multiplied by the alpha scaling factor
Intra and Inter-class edges. The Edge $E_1$ is considered an inter-class edge as it connects two vertices $R_1$ and $R_2$ that belong to different ground-truth classes. Edge $E_2$ connects two vertices belonging to the same class and thus is considered an intra-class edge
An illustration of relative edge lengths of nearest neighbours in the middle of a cluster versus near the edge of a cluster
Different cases for the distribution of edge lengths
Visualisation of the DS1 dataset
Visualisation of the DS2 dataset
Two dimensional representations of the 5 classes in the SynTest dataset. The main class is always present and steadily changes shape, the smaller classes appear at various points through the dataset, as shown in Figure 9
The class presence of the classes in the SynTest dataset. A marker indicates that the class is present in the dataset during the given time window
The evolution of the Closer dataset, showing slices of its 3 sections
Comparative purity for Tree Cover dataset
Comparative purity for Tree Cover dataset
Comparative purity for KDD 99' Cup dataset
Comparative purity for KDD 99' Cup dataset
Comparative purity for DS1 dataset
The $K$ value selected by our dynamic $K$ method on the DS1 dataset
Comparative purity for DS2 dataset
Comparative purity for SynTest dataset
Comparative purity for Closer dataset
Comparative purity for Tree Cover dataset
Comparative purity for KDD 99' Cup dataset
The $K$ value selected by our dynamic $K$ method on the KDD Dataset
F-Measure comparison vs RepStream using optimal parameters on DS1 dataset
F-Measure comparison vs RepStream using optimal parameters on DS2 dataset
F-Measure comparison vs RepStream using optimal parameters on SynTest dataset
F-Measure comparison vs RepStream using optimal parameters on Closer dataset
F-Measure comparison vs RepStream using optimal parameters on Tree Cover dataset
F-Measure comparison vs RepStream using optimal parameters on KDD 99' Cup dataset
Table showing the input parameters used for each algorithm
 Algorithm name Density Parameters Grid Granularity Reclustering Method Decay Parameter Distance Parameter Other Parameters CluStream $k$-means $\delta$ $\alpha$, $l$ Snapshot parameters SWClustering $\beta$ $k$-means $\epsilon$, $N$ Window parameters DenStream $\mu$, $\beta$ $\epsilon$ HPStream $\lambda$ $r$ $l$ Projected Dimensions D-Stream $C_m$, $C_l$, $\beta$ $len$ $\lambda$ ExCC Grid Granularity in all dimensions MR-Stream $C_H$, $C_L$, $\beta$, $\mu$ Hierarchical $\lambda$ $\epsilon$ $H$ Heirarchical limit FlockStream $\beta$, $\mu$ $\lambda$ $\epsilon$ DeBaRa $dPts$ $f$ DBStream $\alpha$, $w\_min$ $\lambda$, $t\_gap$ $r$ SNCStream $\psi$, $\beta$ $\lambda$ $\epsilon$ $K$ Graph connectivity PASCAL Evolutionary Evolutionary EDA hyper-parameters HASTREAM $minPts$ Hierarchical $\lambda$ BEStream $\Delta$, $\tau$ $\lambda$ $\xi$ $\theta$ Direction parameter ADStream $\xi$, $\epsilon$ $\lambda$ PatchWork $ratio$, $minPoints$, $minCells$ $\epsilon$ RepStream $\lambda$ $\alpha$ $K$ Graph connectivity
 Algorithm name Density Parameters Grid Granularity Reclustering Method Decay Parameter Distance Parameter Other Parameters CluStream $k$-means $\delta$ $\alpha$, $l$ Snapshot parameters SWClustering $\beta$ $k$-means $\epsilon$, $N$ Window parameters DenStream $\mu$, $\beta$ $\epsilon$ HPStream $\lambda$ $r$ $l$ Projected Dimensions D-Stream $C_m$, $C_l$, $\beta$ $len$ $\lambda$ ExCC Grid Granularity in all dimensions MR-Stream $C_H$, $C_L$, $\beta$, $\mu$ Hierarchical $\lambda$ $\epsilon$ $H$ Heirarchical limit FlockStream $\beta$, $\mu$ $\lambda$ $\epsilon$ DeBaRa $dPts$ $f$ DBStream $\alpha$, $w\_min$ $\lambda$, $t\_gap$ $r$ SNCStream $\psi$, $\beta$ $\lambda$ $\epsilon$ $K$ Graph connectivity PASCAL Evolutionary Evolutionary EDA hyper-parameters HASTREAM $minPts$ Hierarchical $\lambda$ BEStream $\Delta$, $\tau$ $\lambda$ $\xi$ $\theta$ Direction parameter ADStream $\xi$, $\epsilon$ $\lambda$ PatchWork $ratio$, $minPoints$, $minCells$ $\epsilon$ RepStream $\lambda$ $\alpha$ $K$ Graph connectivity
Best and Worst $K$ values for RepStream, according to $F$ score
 Dataset Best K Best $F$ score Worst K Worst $F$ score DS1 7 0.7208 18 0.2767 DS2 7 0.6371 21 0.2594 SynTest 9 0.8614 5 0.5435 Closer 9 0.7989 5 0.4345 TreeCov 29 0.6108 5 0.2978 KDD99 30 0.7898 5 0.2636
 Dataset Best K Best $F$ score Worst K Worst $F$ score DS1 7 0.7208 18 0.2767 DS2 7 0.6371 21 0.2594 SynTest 9 0.8614 5 0.5435 Closer 9 0.7989 5 0.4345 TreeCov 29 0.6108 5 0.2978 KDD99 30 0.7898 5 0.2636
Comparison of our dynamic $K$ method versus RepStream with optimal and worst $K$ values
 Dataset Best F-Measure Worst F-Measure Dynamic $K$ DS1 0.7208 0.2767 0.5153 DS2 0.6371 0.2594 0.4137 SynTest 0.7989 0.5435 0.7091 Closer 0.8614 0.4345 0.8410 TreeCov 0.6108 0.2978 0.5920 KDD99 0.7898 0.2636 0.7882
 Dataset Best F-Measure Worst F-Measure Dynamic $K$ DS1 0.7208 0.2767 0.5153 DS2 0.6371 0.2594 0.4137 SynTest 0.7989 0.5435 0.7091 Closer 0.8614 0.4345 0.8410 TreeCov 0.6108 0.2978 0.5920 KDD99 0.7898 0.2636 0.7882
 [1] Daniel Roggen, Martin Wirz, Gerhard Tröster, Dirk Helbing. Recognition of crowd behavior from mobile sensors with pattern analysis and graph clustering methods. Networks & Heterogeneous Media, 2011, 6 (3) : 521-544. doi: 10.3934/nhm.2011.6.521 [2] Baolan Yuan, Wanjun Zhang, Yubo Yuan. A Max-Min clustering method for $k$-means algorithm of data clustering. Journal of Industrial & Management Optimization, 2012, 8 (3) : 565-575. doi: 10.3934/jimo.2012.8.565 [3] Daniel Mckenzie, Steven Damelin. Power weighted shortest paths for clustering Euclidean data. Foundations of Data Science, 2019, 1 (3) : 307-327. doi: 10.3934/fods.2019014 [4] Jinyuan Zhang, Aimin Zhou, Guixu Zhang, Hu Zhang. A clustering based mate selection for evolutionary optimization. Big Data & Information Analytics, 2017, 2 (1) : 77-85. doi: 10.3934/bdia.2017010 [5] Gurkan Ozturk, Mehmet Tahir Ciftci. Clustering based polyhedral conic functions algorithm in classification. Journal of Industrial & Management Optimization, 2015, 11 (3) : 921-932. doi: 10.3934/jimo.2015.11.921 [6] Jiangchuan Fan, Xinyu Guo, Jianjun Du, Weiliang Wen, Xianju Lu, Brahmani Louiza. Analysis of the clustering fusion algorithm for multi-band color image. Discrete & Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1233-1249. doi: 10.3934/dcdss.2019085 [7] Lars Eirik Danielsen. Graph-based classification of self-dual additive codes over finite fields. Advances in Mathematics of Communications, 2009, 3 (4) : 329-348. doi: 10.3934/amc.2009.3.329 [8] Pengyu Yan, Shi Qiang Liu, Cheng-Hu Yang, Mahmoud Masoud. A comparative study on three graph-based constructive algorithms for multi-stage scheduling with blocking. Journal of Industrial & Management Optimization, 2019, 15 (1) : 221-233. doi: 10.3934/jimo.2018040 [9] Yongbin Ou, Cun-Quan Zhang. A new multimembership clustering method. Journal of Industrial & Management Optimization, 2007, 3 (4) : 619-624. doi: 10.3934/jimo.2007.3.619 [10] Baoli Shi, Zhi-Feng Pang, Jing Xu. Image segmentation based on the hybrid total variation model and the K-means clustering strategy. Inverse Problems & Imaging, 2016, 10 (3) : 807-828. doi: 10.3934/ipi.2016022 [11] Elissar Nasreddine. Two-dimensional individual clustering model. Discrete & Continuous Dynamical Systems - S, 2014, 7 (2) : 307-316. doi: 10.3934/dcdss.2014.7.307 [12] Elissar Nasreddine. Well-posedness for a model of individual clustering. Discrete & Continuous Dynamical Systems - B, 2013, 18 (10) : 2647-2668. doi: 10.3934/dcdsb.2013.18.2647 [13] Haixia Liu, Jian-Feng Cai, Yang Wang. Subspace clustering by (k,k)-sparse matrix factorization. Inverse Problems & Imaging, 2017, 11 (3) : 539-551. doi: 10.3934/ipi.2017025 [14] Guojun Gan, Qiujun Lan, Shiyang Sima. Scalable clustering by truncated fuzzy $c$-means. Big Data & Information Analytics, 2016, 1 (2&3) : 247-259. doi: 10.3934/bdia.2016007 [15] Pawan Lingras, Farhana Haider, Matt Triff. Fuzzy temporal meta-clustering of financial trading volatility patterns. Big Data & Information Analytics, 2017, 2 (5) : 1-20. doi: 10.3934/bdia.2017018 [16] Guojun Gan, Kun Chen. A soft subspace clustering algorithm with log-transformed distances. Big Data & Information Analytics, 2016, 1 (1) : 93-109. doi: 10.3934/bdia.2016.1.93 [17] Richard L Buckalew. Cell cycle clustering and quorum sensing in a response / signaling mediated feedback model. Discrete & Continuous Dynamical Systems - B, 2014, 19 (4) : 867-881. doi: 10.3934/dcdsb.2014.19.867 [18] Ruiqi Yang, Dachuan Xu, Yicheng Xu, Dongmei Zhang. An adaptive probabilistic algorithm for online k-center clustering. Journal of Industrial & Management Optimization, 2019, 15 (2) : 565-576. doi: 10.3934/jimo.2018057 [19] Jinying Ma, Honglei Xu. Empirical analysis and optimization of capital structure adjustment. Journal of Industrial & Management Optimization, 2017, 13 (5) : 1-11. doi: 10.3934/jimo.2018191 [20] A. Procacci, Benedetto Scoppola. Convergent expansions for random cluster model with $q>0$ on infinite graphs. Communications on Pure & Applied Analysis, 2008, 7 (5) : 1145-1178. doi: 10.3934/cpaa.2008.7.1145

Impact Factor: