# American Institute of Mathematical Sciences

doi: 10.3934/bdia.2017020

## A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Published  August 2018

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020

Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 18 | 0.9429 | 0.9693 | 0.4797 |
| 2 | 46 | 0.9782 | 0.9877 | 0.7718 |
| 3 | 108 | 0.9907 | 0.9939 | 0.9076 |
| 4 | 192 | 1 | 1 | 0.9490 |

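
The $\tau$ and $\lambda$ columns are read here as the Goodman-Kruskal association measures of the response $Y$ given the selected features; that reading is an assumption, since this page does not define them. Under that assumption, the following minimal sketch computes both from a contingency table of counts. The function names and the toy counts are hypothetical and are not taken from the paper, and the $EG$ column is not reproduced because it is not defined here.

```python
from typing import Sequence


def gk_lambda(counts: Sequence[Sequence[float]]) -> float:
    """Goodman-Kruskal lambda(Y|X). Rows index X categories, columns index Y."""
    n = sum(sum(row) for row in counts)
    col_totals = [sum(col) for col in zip(*counts)]
    # Probability of a correct guess using only the overall mode of Y.
    best_without_x = max(col_totals) / n
    # Probability of a correct guess using the mode of Y within each X category.
    best_with_x = sum(max(row) for row in counts) / n
    return (best_with_x - best_without_x) / (1.0 - best_without_x)


def gk_tau(counts: Sequence[Sequence[float]]) -> float:
    """Goodman-Kruskal tau(Y|X). Rows index X categories, columns index Y."""
    n = sum(sum(row) for row in counts)
    col_totals = [sum(col) for col in zip(*counts)]
    # Sum over j of p_{.j}^2: proportional prediction without X.
    base = sum((c / n) ** 2 for c in col_totals)
    # Sum over i, j of p_{ij}^2 / p_{i.}: proportional prediction given X.
    cond = sum(
        sum(v * v for v in row) / sum(row) for row in counts if sum(row) > 0
    ) / n
    return (cond - base) / (1.0 - base)


# Hypothetical 2 x 2 table of counts (not data from the paper).
toy_counts = [[30, 10], [10, 50]]
toy_lambda = gk_lambda(toy_counts)
toy_tau = gk_tau(toy_counts)
```

Both functions assume that $Y$ takes at least two values with positive frequency; otherwise the denominators vanish and the measures are undefined.
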
Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 4 | 16 | 0.9445 | 0.9693 | 0.2098 |
| 4 | 24 | 0.9908 | 0.9939 | 0.2143 |
| 5 | 30 | 0.9962 | 0.9979 | 0.4669 |
| 6 | 38 | 1 | 1 | 0.6638 |

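
One way to see why merging categories can shrink the domain without lowering $\lambda$ is a general property of the Goodman-Kruskal lambda (assumed here to be the $\lambda$ reported in the table): pooling categories of an explanatory variable that share the same most likely response leaves $\lambda(Y \mid X)$ unchanged. The sketch below illustrates that property only. It is not the authors' selection procedure, and the helper names and counts are hypothetical.

```python
from typing import List, Sequence


def gk_lambda(counts: Sequence[Sequence[float]]) -> float:
    """Goodman-Kruskal lambda(Y|X). Rows index X categories, columns index Y."""
    n = sum(sum(row) for row in counts)
    col_totals = [sum(col) for col in zip(*counts)]
    best_without_x = max(col_totals) / n
    best_with_x = sum(max(row) for row in counts) / n
    return (best_with_x - best_without_x) / (1.0 - best_without_x)


def merge_rows(
    counts: Sequence[Sequence[float]], groups: Sequence[Sequence[int]]
) -> List[List[float]]:
    """Pool X categories: each group lists the row indices to add together."""
    n_cols = len(counts[0])
    return [
        [sum(counts[i][j] for i in group) for j in range(n_cols)]
        for group in groups
    ]


# Hypothetical counts: X categories 0 and 1 both have Y = 0 as their mode.
counts = [[40, 5], [30, 10], [5, 60]]
merged = merge_rows(counts, [[0, 1], [2]])

lambda_before = gk_lambda(counts)  # 3 x 2 table
lambda_after = gk_lambda(merged)   # 2 x 2 table, same lambda
```

Pooling rows whose modes differ would in general lower $\lambda$, so the choice of which categories to merge matters.
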
Feature selection by the original variables

| Original Features | $\lvert\mbox{Domain}(X_2, Y)\rvert$ | $\tau(Y \mid X_2)$ | $\lambda(Y \mid X_2)$ | $EG$ |
|---|---|---|---|---|
| 1 | 66 | 0.3005 | 0.3444 | 0.8201 |
| 2 | 252 | 0.3948 | 0.4391 | 0.9046 |
| 3 | 1830 | 0.4383 | 0.4648 | 0.9833 |

Feature selection by the dummy variables

| Merged Features | $\lvert\mbox{Domain}(X'_2, Y)\rvert$ | $\tau(Y \mid X'_2)$ | $\lambda(Y \mid X'_2)$ | $EG$ |
|---|---|---|---|---|
| 2 | 24 | 0.3242 | 0.3934 | 0.5491 |
| 2 | 36 | 0.3573 | 0.4165 | 0.6242 |
| 2 | 48 | 0.3751 | 0.4234 | 0.6388 |
| 3 | 96 | 0.3901 | 0.4234 | 0.7035 |
| 4 | 186 | 0.4017 | 0.4269 | 0.7774 |
| 4 | 282 | 0.4121 | 0.4317 | 0.8066 |
| 5 | 558 | 0.4221 | 0.4548 | 0.8782 |
| 6 | 966 | 0.4314 | 0.4768 | 0.8968 |
| 7 | 1716 | 0.4436 | 0.4856 | 0.9135 |