doi: 10.3934/bdia.2017020

A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China

2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

Published August 2018

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020
Table 1. Feature selection by the original variables

Original Features   $|\mbox{Domain}(X_2, Y)|$   $\tau(Y|X_2)$   $\lambda(Y|X_2)$   $EG$
1                   18                          0.9429          0.9693             0.4797
2                   46                          0.9782          0.9877             0.7718
3                   108                         0.9907          0.9939             0.9076
4                   192                         1               1                  0.9490

Table 2. Feature selection by the dummy variables

Merged Features   $|\mbox{Domain}(X'_2, Y)|$   $\tau(Y|X'_2)$   $\lambda(Y|X'_2)$   $EG$
4                 16                           0.9445           0.9693              0.2098
4                 24                           0.9908           0.9939              0.2143
5                 30                           0.9962           0.9979              0.4669
6                 38                           1                1                   0.6638

Table 3. Feature selection by the original variables

Original Features   $|\mbox{Domain}(X_2, Y)|$   $\tau(Y|X_2)$   $\lambda(Y|X_2)$   $EG$
1                   66                          0.3005          0.3444             0.8201
2                   252                         0.3948          0.4391             0.9046
3                   1830                        0.4383          0.4648             0.9833

Table 4. Feature selection by the dummy variables

Merged Features   $|\mbox{Domain}(X'_2, Y)|$   $\tau(Y|X'_2)$   $\lambda(Y|X'_2)$   $EG$
2                 24                           0.3242           0.3934              0.5491
2                 36                           0.3573           0.4165              0.6242
2                 48                           0.3751           0.4234              0.6388
3                 96                           0.3901           0.4234              0.7035
4                 186                          0.4017           0.4269              0.7774
4                 282                          0.4121           0.4317              0.8066
5                 558                          0.4221           0.4548              0.8782
6                 966                          0.4314           0.4768              0.8968
7                 1716                         0.4436           0.4856              0.9135

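The $\tau(Y|X)$ and $\lambda(Y|X)$ columns in the tables are most likely the Goodman-Kruskal association measures, although this page does not define them. Under that assumption, the following minimal sketch shows how both could be computed from paired categorical observations. The function name `gk_measures` and its interface are illustrative, not taken from the paper, and the $EG$ column is not covered because its definition does not appear here.

```python
from collections import Counter


def gk_measures(xs, ys):
    """Return (tau, lam): Goodman-Kruskal tau(Y|X) and lambda(Y|X).

    xs, ys are equal-length sequences of hashable category labels.
    Assumed definitions (not quoted from the paper):
      tau = (sum_{x,y} p(x,y)^2 / p(x) - sum_y p(y)^2) / (1 - sum_y p(y)^2)
      lam = (sum_x max_y p(x,y) - max_y p(y)) / (1 - max_y p(y))
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")

    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    x_marg = Counter(xs)           # counts of each x category
    y_marg = Counter(ys)           # counts of each y category
    if len(y_marg) < 2:
        raise ValueError("the response must take at least two categories")

    # lambda: relative reduction in error when guessing the modal class of Y
    max_py = max(y_marg.values()) / n
    best_per_x = {}
    for (x, _y), c in joint.items():
        best_per_x[x] = max(best_per_x.get(x, 0), c)
    lam = (sum(best_per_x.values()) / n - max_py) / (1 - max_py)

    # tau: relative reduction in error under proportional prediction of Y
    sum_py2 = sum((c / n) ** 2 for c in y_marg.values())
    cond = sum(c * c / (n * x_marg[x]) for (x, _y), c in joint.items())
    tau = (cond - sum_py2) / (1 - sum_py2)
    return tau, lam


if __name__ == "__main__":
    # X determines Y exactly, so both measures equal 1.
    print(gk_measures(["a", "a", "b", "b"], [0, 0, 1, 1]))  # (1.0, 1.0)
```

Both measures lie in $[0, 1]$. A value of 1 means the explanatory variable determines the response exactly, which matches the last rows of Tables 1 and 2.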