# American Institute of Mathematical Sciences

January  2016, 1(1): 93-109. doi: 10.3934/bdia.2016.1.93

## A soft subspace clustering algorithm with log-transformed distances

 1 Department of Mathematics, University of Connecticut, 196 Auditorium Rd, Storrs, CT 06269-3009, United States 2 Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT 06269, United States

Received  May 2015 Revised  August 2015 Published  September 2015

Entropy weighting used in some soft subspace clustering algorithms is sensitive to the scaling parameter. In this paper, we propose a novel soft subspace clustering algorithm by using log-transformed distances in the objective function. The proposed algorithm allows users to choose a value of the scaling parameter easily because the entropy weighting in the proposed algorithm is less sensitive to the scaling parameter. In addition, the proposed algorithm is less sensitive to noises because a point far away from its cluster center is given a small weight in the cluster center calculation. Experiments on both synthetic datasets and real datasets are used to demonstrate the performance of the proposed algorithm.
Citation: Guojun Gan, Kun Chen. A soft subspace clustering algorithm with log-transformed distances. Big Data & Information Analytics, 2016, 1 (1) : 93-109. doi: 10.3934/bdia.2016.1.93
##### References:
 [1] C. C. Aggarwal and C. K. Reddy (eds.), Data Clustering: Algorithms and Applications,, CRC Press, (2014). Google Scholar [2] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications,, in SIGMOD Record ACM Special Interest Group on Management of Data, 27 (1998), 94. doi: 10.1145/276304.276314. Google Scholar [3] S. Boutemedjet, D. Ziou and N. Bouguila, Model-based subspace clustering of non-gaussian data,, Neurocomputing, 73 (2010), 1730. doi: 10.1016/j.neucom.2009.11.044. Google Scholar [4] A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii and S. Venkatesan, Scalable k-means by ranked retrieval,, in Proceedings of the 7th ACM International Conference on Web Search and Data Mining, (2014), 233. doi: 10.1145/2556195.2556260. Google Scholar [5] F. Cao, J. Liang and G. Jiang, An initialization method for the $k$-means algorithm using neighborhood model,, Computers & Mathematics with Applications, 58 (2009), 474. doi: 10.1016/j.camwa.2009.04.017. Google Scholar [6] M. E. Celebi, H. A. Kingravi and P. A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm,, Expert Systems with Applications, 40 (2013), 200. doi: 10.1016/j.eswa.2012.07.021. Google Scholar [7] X. Chen, Y. Ye, X. Xu and J. Z. Huang, A feature group weighting method for subspace clustering of high-dimensional data,, Pattern Recognition, 45 (2012), 434. doi: 10.1016/j.patcog.2011.06.004. Google Scholar [8] M. de Souto, I. Costa, D. de Araujo, T. Ludermir and A. Schliep, Clustering cancer gene expression data: A comparative study,, BMC Bioinformatics, 9 (2008), 497. doi: 10.1186/1471-2105-9-497. Google Scholar [9] C. Domeniconi, D. Gunopulos, S. Ma, B. Yan, M. Al-Razgan and D. Papadopoulos, Locally adaptive metrics for clustering high dimensional data,, Data Mining and Knowledge Discovery, 14 (2007), 63. doi: 10.1007/s10618-006-0060-8. Google Scholar [10] B. Duran and P. Odell, Cluster Analysis - A survey, vol. 100 of Lecture Notes in Economics and Mathematical Systems,, Springer-Verlage, (1974). Google Scholar [11] E. Elhamifar and R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications,, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (2013), 2765. doi: 10.1109/TPAMI.2013.57. Google Scholar [12] G. Gan, Data Clustering in C++: An Object-Oriented Approach,, Data Mining and Knowledge Discovery Series, (2011). doi: 10.1201/b10814. Google Scholar [13] G. Gan and M. K.-P. Ng, Subspace clustering using affinity propagation,, Pattern Recognition, 48 (2015), 1455. doi: 10.1016/j.patcog.2014.11.003. Google Scholar [14] G. Gan and M. K.-P. Ng, Subspace clustering with automatic feature grouping,, Pattern Recognition, 48 (2015), 3703. doi: 10.1016/j.patcog.2015.05.016. Google Scholar [15] G. Gan and J. Wu, Subspace clustering for high dimensional categorical data,, ACM SIGKDD Explorations Newsletter, 6 (2004), 87. doi: 10.1145/1046456.1046468. Google Scholar [16] G. Gan and J. Wu, A convergence theorem for the fuzzy subspace clustering (FSC) algorithm,, Pattern Recognition, 41 (2008), 1939. doi: 10.1016/j.patcog.2007.11.011. Google Scholar [17] G. Gan, J. Wu and Z. Yang, A fuzzy subspace algorithm for clustering high dimensional data,, in Lecture Notes in Artificial Intelligence (eds. X. Li, (4093), 271. doi: 10.1007/11811305_30. Google Scholar [18] J. A. Hartigan, Clustering Algorithms,, Wiley, (1975). Google Scholar [19] J. Huang, M. Ng, H. Rong and Z. Li, Automated variable weighting in $k$-means type clustering,, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), 657. doi: 10.1109/TPAMI.2005.95. Google Scholar [20] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data,, Prentice-Hall, (1988). Google Scholar [21] L. Jing, M. Ng and J. Huang, An entropy weighting $k$-means algorithm for subspace clustering of high-dimensional sparse data,, IEEE Transactions on Knowledge and Data Engineering, 19 (2007), 1026. doi: 10.1109/TKDE.2007.1048. Google Scholar [22] H.-P. Kriegel, P. Kröger and A. Zimek, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,, ACM Transactions on Knowledge Discovery from Data, 3 (2009), 1. doi: 10.1145/1497577.1497578. Google Scholar [23] J. Macqueen, Some methods for classification and analysis of multivariate observations,, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics andProbability (eds. L. LeCam and J. Neyman), (1967), 281. Google Scholar [24] J. Peña, J. Lozano and P. Larrañaga, An empirical comparison of four initialization methods for the $k$-means algorithm,, Pattern Recognition Letters, 20 (1999), 1027. Google Scholar [25] L. Peng and J. Zhang, An entropy weighting mixture model for subspace clustering of high-dimensional data,, Pattern Recognition Letters, 32 (2011), 1154. doi: 10.1016/j.patrec.2011.03.003. Google Scholar [26] R. Xu and D. Wunsch, Clustering,, Wiley-IEEE Press, (2008). Google Scholar

show all references

##### References:
 [1] C. C. Aggarwal and C. K. Reddy (eds.), Data Clustering: Algorithms and Applications,, CRC Press, (2014). Google Scholar [2] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications,, in SIGMOD Record ACM Special Interest Group on Management of Data, 27 (1998), 94. doi: 10.1145/276304.276314. Google Scholar [3] S. Boutemedjet, D. Ziou and N. Bouguila, Model-based subspace clustering of non-gaussian data,, Neurocomputing, 73 (2010), 1730. doi: 10.1016/j.neucom.2009.11.044. Google Scholar [4] A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii and S. Venkatesan, Scalable k-means by ranked retrieval,, in Proceedings of the 7th ACM International Conference on Web Search and Data Mining, (2014), 233. doi: 10.1145/2556195.2556260. Google Scholar [5] F. Cao, J. Liang and G. Jiang, An initialization method for the $k$-means algorithm using neighborhood model,, Computers & Mathematics with Applications, 58 (2009), 474. doi: 10.1016/j.camwa.2009.04.017. Google Scholar [6] M. E. Celebi, H. A. Kingravi and P. A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm,, Expert Systems with Applications, 40 (2013), 200. doi: 10.1016/j.eswa.2012.07.021. Google Scholar [7] X. Chen, Y. Ye, X. Xu and J. Z. Huang, A feature group weighting method for subspace clustering of high-dimensional data,, Pattern Recognition, 45 (2012), 434. doi: 10.1016/j.patcog.2011.06.004. Google Scholar [8] M. de Souto, I. Costa, D. de Araujo, T. Ludermir and A. Schliep, Clustering cancer gene expression data: A comparative study,, BMC Bioinformatics, 9 (2008), 497. doi: 10.1186/1471-2105-9-497. Google Scholar [9] C. Domeniconi, D. Gunopulos, S. Ma, B. Yan, M. Al-Razgan and D. Papadopoulos, Locally adaptive metrics for clustering high dimensional data,, Data Mining and Knowledge Discovery, 14 (2007), 63. doi: 10.1007/s10618-006-0060-8. Google Scholar [10] B. Duran and P. Odell, Cluster Analysis - A survey, vol. 100 of Lecture Notes in Economics and Mathematical Systems,, Springer-Verlage, (1974). Google Scholar [11] E. Elhamifar and R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications,, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (2013), 2765. doi: 10.1109/TPAMI.2013.57. Google Scholar [12] G. Gan, Data Clustering in C++: An Object-Oriented Approach,, Data Mining and Knowledge Discovery Series, (2011). doi: 10.1201/b10814. Google Scholar [13] G. Gan and M. K.-P. Ng, Subspace clustering using affinity propagation,, Pattern Recognition, 48 (2015), 1455. doi: 10.1016/j.patcog.2014.11.003. Google Scholar [14] G. Gan and M. K.-P. Ng, Subspace clustering with automatic feature grouping,, Pattern Recognition, 48 (2015), 3703. doi: 10.1016/j.patcog.2015.05.016. Google Scholar [15] G. Gan and J. Wu, Subspace clustering for high dimensional categorical data,, ACM SIGKDD Explorations Newsletter, 6 (2004), 87. doi: 10.1145/1046456.1046468. Google Scholar [16] G. Gan and J. Wu, A convergence theorem for the fuzzy subspace clustering (FSC) algorithm,, Pattern Recognition, 41 (2008), 1939. doi: 10.1016/j.patcog.2007.11.011. Google Scholar [17] G. Gan, J. Wu and Z. Yang, A fuzzy subspace algorithm for clustering high dimensional data,, in Lecture Notes in Artificial Intelligence (eds. X. Li, (4093), 271. doi: 10.1007/11811305_30. Google Scholar [18] J. A. Hartigan, Clustering Algorithms,, Wiley, (1975). Google Scholar [19] J. Huang, M. Ng, H. Rong and Z. Li, Automated variable weighting in $k$-means type clustering,, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (2005), 657. doi: 10.1109/TPAMI.2005.95. Google Scholar [20] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data,, Prentice-Hall, (1988). Google Scholar [21] L. Jing, M. Ng and J. Huang, An entropy weighting $k$-means algorithm for subspace clustering of high-dimensional sparse data,, IEEE Transactions on Knowledge and Data Engineering, 19 (2007), 1026. doi: 10.1109/TKDE.2007.1048. Google Scholar [22] H.-P. Kriegel, P. Kröger and A. Zimek, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,, ACM Transactions on Knowledge Discovery from Data, 3 (2009), 1. doi: 10.1145/1497577.1497578. Google Scholar [23] J. Macqueen, Some methods for classification and analysis of multivariate observations,, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics andProbability (eds. L. LeCam and J. Neyman), (1967), 281. Google Scholar [24] J. Peña, J. Lozano and P. Larrañaga, An empirical comparison of four initialization methods for the $k$-means algorithm,, Pattern Recognition Letters, 20 (1999), 1027. Google Scholar [25] L. Peng and J. Zhang, An entropy weighting mixture model for subspace clustering of high-dimensional data,, Pattern Recognition Letters, 32 (2011), 1154. doi: 10.1016/j.patrec.2011.03.003. Google Scholar [26] R. Xu and D. Wunsch, Clustering,, Wiley-IEEE Press, (2008). Google Scholar
 [1] Baolan Yuan, Wanjun Zhang, Yubo Yuan. A Max-Min clustering method for $k$-means algorithm of data clustering. Journal of Industrial & Management Optimization, 2012, 8 (3) : 565-575. doi: 10.3934/jimo.2012.8.565 [2] Haixia Liu, Jian-Feng Cai, Yang Wang. Subspace clustering by (k,k)-sparse matrix factorization. Inverse Problems & Imaging, 2017, 11 (3) : 539-551. doi: 10.3934/ipi.2017025 [3] Baoli Shi, Zhi-Feng Pang, Jing Xu. Image segmentation based on the hybrid total variation model and the K-means clustering strategy. Inverse Problems & Imaging, 2016, 10 (3) : 807-828. doi: 10.3934/ipi.2016022 [4] Guojun Gan, Qiujun Lan, Shiyang Sima. Scalable clustering by truncated fuzzy $c$-means. Big Data & Information Analytics, 2016, 1 (2&3) : 247-259. doi: 10.3934/bdia.2016007 [5] Ruiqi Yang, Dachuan Xu, Yicheng Xu, Dongmei Zhang. An adaptive probabilistic algorithm for online k-center clustering. Journal of Industrial & Management Optimization, 2019, 15 (2) : 565-576. doi: 10.3934/jimo.2018057 [6] Yongbin Ou, Cun-Quan Zhang. A new multimembership clustering method. Journal of Industrial & Management Optimization, 2007, 3 (4) : 619-624. doi: 10.3934/jimo.2007.3.619 [7] Elissar Nasreddine. Two-dimensional individual clustering model. Discrete & Continuous Dynamical Systems - S, 2014, 7 (2) : 307-316. doi: 10.3934/dcdss.2014.7.307 [8] Elissar Nasreddine. Well-posedness for a model of individual clustering. Discrete & Continuous Dynamical Systems - B, 2013, 18 (10) : 2647-2668. doi: 10.3934/dcdsb.2013.18.2647 [9] Jinyuan Zhang, Aimin Zhou, Guixu Zhang, Hu Zhang. A clustering based mate selection for evolutionary optimization. Big Data & Information Analytics, 2017, 2 (1) : 77-85. doi: 10.3934/bdia.2017010 [10] Gurkan Ozturk, Mehmet Tahir Ciftci. Clustering based polyhedral conic functions algorithm in classification. Journal of Industrial & Management Optimization, 2015, 11 (3) : 921-932. doi: 10.3934/jimo.2015.11.921 [11] Sung Ha Kang, Berta Sandberg, Andy M. Yip. A regularized k-means and multiphase scale segmentation. Inverse Problems & Imaging, 2011, 5 (2) : 407-429. doi: 10.3934/ipi.2011.5.407 [12] Pawan Lingras, Farhana Haider, Matt Triff. Fuzzy temporal meta-clustering of financial trading volatility patterns. Big Data & Information Analytics, 2017, 2 (5) : 1-20. doi: 10.3934/bdia.2017018 [13] Richard L Buckalew. Cell cycle clustering and quorum sensing in a response / signaling mediated feedback model. Discrete & Continuous Dynamical Systems - B, 2014, 19 (4) : 867-881. doi: 10.3934/dcdsb.2014.19.867 [14] Jiangchuan Fan, Xinyu Guo, Jianjun Du, Weiliang Wen, Xianju Lu, Brahmani Louiza. Analysis of the clustering fusion algorithm for multi-band color image. Discrete & Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1233-1249. doi: 10.3934/dcdss.2019085 [15] Daniel Roggen, Martin Wirz, Gerhard Tröster, Dirk Helbing. Recognition of crowd behavior from mobile sensors with pattern analysis and graph clustering methods. Networks & Heterogeneous Media, 2011, 6 (3) : 521-544. doi: 10.3934/nhm.2011.6.521 [16] Willem Mélange, Herwig Bruneel, Bart Steyaert, Dieter Claeys, Joris Walraevens. A continuous-time queueing model with class clustering and global FCFS service discipline. Journal of Industrial & Management Optimization, 2014, 10 (1) : 193-206. doi: 10.3934/jimo.2014.10.193 [17] Juncheng Wei, Jun Yang. Toda system and interior clustering line concentration for a singularly perturbed Neumann problem in two dimensional domain. Discrete & Continuous Dynamical Systems - A, 2008, 22 (3) : 465-508. doi: 10.3934/dcds.2008.22.465 [18] Chao Mi, Jun Wang, Weijian Mi, Youfang Huang, Zhiwei Zhang, Yongsheng Yang, Jun Jiang, Postolache Octavian. Research on regional clustering and two-stage SVM method for container truck recognition. Discrete & Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1117-1133. doi: 10.3934/dcdss.2019077 [19] Anatole Katok, Federico Rodriguez Hertz. Uniqueness of large invariant measures for $\mathbb{Z}^k$ actions with Cartan homotopy data. Journal of Modern Dynamics, 2007, 1 (2) : 287-300. doi: 10.3934/jmd.2007.1.287 [20] Rainer Steinwandt, Adriana Suárez Corona. Attribute-based group key establishment. Advances in Mathematics of Communications, 2010, 4 (3) : 381-398. doi: 10.3934/amc.2010.4.381

Impact Factor: