doi: 10.3934/jimo.2019052

Improved SVRG for finite sum structure optimization with application to binary classification

1. School of Computer Science and Technology, Anhui University of Technology, Maanshan 243032, China
2. School of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China

* Corresponding author: Wei Xue (E-mail: cswxue@ahut.edu.cn)

Received: May 2018. Revised: November 2018. Published: May 2019.

This paper studies a stochastic variance reduced gradient (SVRG) method for minimizing the sum of a finite number of smooth convex functions, a problem that arises widely in machine learning and data mining. Inspired by the excellent performance of the two-point stepsize gradient method in batch learning, we present an improved SVRG algorithm, named the stochastic two-point stepsize gradient method. Under some mild conditions, the proposed method achieves a linear convergence rate $ O(\rho^k) $ for smooth and strongly convex functions, where $ \rho\in(0.68, 1) $. Simulation experiments on several benchmark data sets are reported to demonstrate the performance of the proposed method.
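To make the setting concrete, the objective has the finite-sum form $ \min_w \frac{1}{n}\sum_{i=1}^{n} f_i(w) $, and the key idea is to replace SVRG's fixed stepsize with a two-point (Barzilai-Borwein) stepsize recomputed at each snapshot. The sketch below is a minimal Python/NumPy illustration of that idea on $\ell_2$-regularized logistic regression; the function names, the stepsize safeguard, and all hyperparameters are illustrative assumptions and do not reproduce the paper's STSG algorithm or its experimental setup.

```python
# Minimal sketch: SVRG with a Barzilai-Borwein (two-point) stepsize on
# l2-regularized logistic regression. Not the authors' code; details are assumptions.
import numpy as np

def logistic_grad(w, X, y, lam):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2."""
    z = y * (X @ w)
    coef = -y / (1.0 + np.exp(z))            # derivative of the logistic loss term
    return X.T @ coef / X.shape[0] + lam * w

def svrg_bb(X, y, lam=1e-4, eta0=0.1, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = n                                     # inner-loop length: one pass per epoch
    w_tilde = np.zeros(d)
    w_prev, g_prev = None, None
    eta = eta0                                # initial stepsize, used until a BB step exists
    for _ in range(n_epochs):
        mu = logistic_grad(w_tilde, X, y, lam)        # full gradient at the snapshot
        if w_prev is not None:
            # Two-point (BB) stepsize from successive snapshots:
            # eta = ||s||^2 / (m * s^T v), s = w_tilde - w_prev, v = mu - g_prev
            s, v = w_tilde - w_prev, mu - g_prev
            denom = m * np.dot(s, v)
            if denom > 0:                     # simple safeguard against non-positive curvature
                eta = np.dot(s, s) / denom
        w_prev, g_prev = w_tilde.copy(), mu.copy()
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            xi, yi = X[i:i + 1], y[i:i + 1]
            # variance-reduced stochastic gradient
            g = logistic_grad(w, xi, yi, lam) - logistic_grad(w_tilde, xi, yi, lam) + mu
            w -= eta * g
        w_tilde = w                           # new snapshot (last iterate)
    return w_tilde
```

With labels $y_i \in \{-1, +1\}$ and a feature matrix `X`, calling `svrg_bb(X, y)` returns a weight vector for binary classification; the choice of snapshot update and safeguard above is one common convention, not necessarily the one used in the paper.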

Citation: Guangmei Shao, Wei Xue, Gaohang Yu, Xiao Zheng. Improved SVRG for finite sum structure optimization with application to binary classification. Journal of Industrial & Management Optimization, doi: 10.3934/jimo.2019052
References:
[1] J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), 141-148. doi: 10.1093/imanum/8.1.141.
[2] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, 2004. doi: 10.1017/CBO9780511804441.
[3] Y. Chen, L. Qi and Q. Wang, Computing extreme eigenvalues of large scale Hankel tensors, J. Sci. Comput., 68 (2016), 716-738. doi: 10.1007/s10915-015-0155-8.
[4] W. Cheng and Y. Dai, Gradient-based method with active set strategy for l1 optimization, Math. Comp., 87 (2018), 1283-1305. doi: 10.1090/mcom/3238.
[5] Y. Dai and R. Fletcher, On the asymptotic behaviour of some new gradient methods, Math. Program., 103 (2005), 541-559. doi: 10.1007/s10107-004-0516-9.
[6] Y. Dai and R. Fletcher, Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming, Numer. Math., 100 (2005), 21-47. doi: 10.1007/s00211-004-0569-y.
[7] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Adv. Neural Inf. Process. Syst., (2013), 315-323.
[8] H. Liu, Z. Liu and X. Dong, A new adaptive Barzilai and Borwein method for unconstrained optimization, Optim. Lett., 12 (2018), 845-873. doi: 10.1007/s11590-017-1150-9.
[9] J. Mairal, Incremental majorization-minimization optimization with application to large-scale machine learning, SIAM J. Optim., 25 (2015), 829-855. doi: 10.1137/140957639.
[10] E. Min, Y. Zhao, J. Long, C. Wu, K. Li and J. Yin, SVRG with adaptive epoch size, in Proc. IEEE Int. Joint Conf. Neural Netw., (2017), 2935-2942. doi: 10.1109/IJCNN.2017.7966219.
[11] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, Boston, 2004. doi: 10.1007/978-1-4419-8853-9.
[12] A. Nitanda, Accelerated stochastic gradient descent for minimizing finite sums, in Proc. Int. Conf. Artif. Intell. Statist., (2016), 195-203.
[13] A. Nitanda, Stochastic proximal gradient descent with acceleration techniques, in Adv. Neural Inf. Process. Syst., (2016), 1574-1582.
[14] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New Jersey, 2007. doi: 10.1002/9780470182963.
[15] M. Raydan, On the Barzilai and Borwein choice of steplength for the gradient method, IMA J. Numer. Anal., 13 (1993), 321-326. doi: 10.1093/imanum/13.3.321.
[16] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400-407. doi: 10.1214/aoms/1177729586.
[17] N. L. Roux, M. Schmidt and F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Adv. Neural Inf. Process. Syst., (2012), 2663-2671.
[18] Z. Shen, H. Qian, T. Zhou and T. Mu, Adaptive variance reducing for stochastic gradient descent, in Proc. Int. Joint Conf. Artif. Intell., (2016), 1990-1996.
[19] C. Tan, S. Ma, Y. Dai and Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, in Adv. Neural Inf. Process. Syst., (2016), 685-693.
[20] Z. Wen, C. Yang, X. Liu and Y. Zhang, Trace-penalty minimization for large-scale eigenspace computation, J. Sci. Comput., 66 (2016), 1175-1203. doi: 10.1007/s10915-015-0061-0.
[21] L. Xiao and T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), 2057-2075. doi: 10.1137/140961791.
[22] W. Xue, W. Zhang and J. Ren, Online learning based on stochastic spectral gradient, Comput. Sci., 43 (2016), 47-51.
[23] G. Yu and S. Niu, Multivariate spectral gradient projection method for nonlinear monotone equations with convex constraints, J. Ind. Manag. Optim., 9 (2013), 117-129. doi: 10.3934/jimo.2013.9.117.
[24] B. Zhou, L. Gao and Y. Dai, Gradient methods with adaptive step-sizes, Comput. Optim. Appl., 35 (2006), 69-86. doi: 10.1007/s10589-006-6446-0.
[25] R. Zhou, X. Shen and L. Niu, A fast algorithm for nonsmooth penalized clustering, Neurocomputing, 273 (2018), 583-592. doi: 10.1016/j.neucom.2017.08.048.


Figure 1.  Objective loss and test classification accuracy versus epochs on data set "a9a" (best viewed in color)
Figure 2.  Objective loss and test classification accuracy versus epochs on data set "w8a"
Figure 3.  Evolution of the objective loss and test classification accuracy versus epochs for Eq. (14) on data set "a9a"
Figure 4.  Evolution of the objective loss and test classification accuracy versus epochs for Eq. (14) on data set "w8a"
Figure 5.  Evolution of the objective loss and test classification accuracy versus epochs for Eq. (14) on data set "a1a"
Figure 6.  Evolution of the objective loss and test classification accuracy versus epochs for Eq. (14) on data set "w1a"
Figure 7.  Evolution of the objective loss and test classification accuracy versus epochs on data set "a9a" with different $ \eta_0 $
Figure 8.  Evolution of the objective loss and test classification accuracy versus epochs on data set "w8a" with different $ \eta_0 $
Figure 9.  Comparison of test classification accuracy between SVM and STSG on four test data sets
Table 1.  Details of data sets
Data set    Training samples    Test samples    Dimension
a1a         1,605               30,956          123
a9a         32,561              16,281          123
w1a         2,477               47,272          300
w8a         49,749              14,951          300
