COMID is an online algorithm that preserves the structure induced by L1 regularization. Its stochastic convergence rate can be obtained directly from the regret bound in the online setting. However, the final solution derived in this way has poor sparsity, because it takes the form of an average over all previous T iterates, whereas the individual iterate naturally retains good sparsity. It is therefore increasingly important to study individual convergence rates in stochastic learning. In this paper, we focus on regularized non-smooth loss problems and prove individual convergence rates of COMID for the L1 and L1+L2 regularizers, respectively. Extensive experiments on large-scale datasets demonstrate that the individual solution consistently improves sparsity while keeping almost the same accuracy. For datasets with poor sparse structure, the sparsity of the solution is improved by up to a factor of four.
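To make the sparsity contrast concrete, the following minimal sketch (not taken from the paper) runs a COMID-style update for L1-regularized hinge loss. It assumes a Euclidean Bregman divergence, a 1/sqrt(t) step size, a regularization weight lam, and synthetic data, all of which are illustrative choices rather than the paper's experimental setup; under these assumptions the composite step reduces to a stochastic subgradient step on the loss followed by soft-thresholding, so the last (individual) iterate stays sparse while the average of all T iterates generally does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, T = 2000, 200, 0.01, 5000   # illustrative sizes and regularization weight

# Synthetic linearly separable data generated from a sparse ground-truth vector.
w_star = np.zeros(d)
w_star[:10] = rng.normal(size=10)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

w = np.zeros(d)        # individual (current) iterate
w_sum = np.zeros(d)    # running sum for the averaged solution
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                              # O(1/sqrt(t)) step size
    i = rng.integers(n)                                  # sample one example
    margin = y[i] * (X[i] @ w)
    g = -y[i] * X[i] if margin < 1 else np.zeros(d)      # hinge-loss subgradient
    # COMID step with r(w) = lam * ||w||_1 and Euclidean Bregman divergence:
    # a gradient step on the loss, then soft-thresholding for the L1 term.
    w = soft_threshold(w - eta * g, eta * lam)
    w_sum += w

w_avg = w_sum / T
print("nonzeros, individual iterate:", np.count_nonzero(np.abs(w) > 1e-8))
print("nonzeros, averaged iterate:  ", np.count_nonzero(np.abs(w_avg) > 1e-8))
```

The averaged solution typically keeps many small but nonzero coordinates, because early dense iterates never fully vanish from the running average; this is the mechanism behind the observation that the individual solution is sparser at comparable accuracy.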
JIANG Ji-yuan, TAO Qing, SHAO Yan-jian, WANG Qun-shan. The analysis of convergence rate of individual COMID iterates[J]. Acta Electronica Sinica, 2015, 43(9): 1850-1858. (in Chinese)