Received December 23, 2019, accepted January 18, 2020, date of publication February 3, 2020, date of current version February 12, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2971386

Maximal Correlation Regression

XIANGXIANG XU 1 (Student Member, IEEE), AND SHAO-LUN HUANG 2 (Member, IEEE)
1 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
2 DSIT Research Center, Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, China
Corresponding author: Shao-Lun Huang

The work of Shao-Lun Huang was supported in part by the Natural Science Foundation of China under Grant 61807021, in part by the Shenzhen Science and Technology Research and Development Funds under Grant JCYJ20170818094022586, and in part by the Innovation and Entrepreneurship Project for Overseas High-Level Talents of Shenzhen under Grant KQJSCX20180327144037831.

ABSTRACT In this paper, we propose a novel regression analysis approach, called maximal correlation regression, by exploiting ideas from the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation. We show that in supervised learning problems, the optimal weights in maximal correlation regression can be expressed analytically in terms of the HGR maximal correlation functions, which reveals theoretical insights for our approach. In addition, we apply the maximal correlation regression to deep learning, in which efficient training algorithms are proposed for learning the weights in hidden layers. Furthermore, we illustrate that the maximal correlation regression is deeply connected to several existing approaches in information theory and machine learning, including the universal feature selection problem, linear discriminant analysis, and the softmax regression. Finally, experiments on real datasets demonstrate that our approach can obtain performance comparable to the widely used softmax regression-based method.

INDEX TERMS Artificial neural networks, HGR maximal correlation, linear discriminant analysis, machine learning algorithms, regression analysis, softmax regression.

I. INTRODUCTION
To apply deep learning to supervised learning problems, the typical routine is to generate nonlinear features of data from the hidden layers in neural networks, and then apply the generated features to a linear classifier such as the softmax regression. As such, the neural networks are designed to generate features that fit the empirical distributions of data through the softmax functions with appropriately designed weights and bias. However, it is generally difficult to rigorously analyze the optimal weights and bias for softmax regression, due to the intrinsic exponential structure of the softmax function. Recent progress in [1] shows that for weakly dependent data and labels, designing the input features and weights to minimize the log loss in softmax regression is equivalent to selecting the optimal normalized functions for the data and labels such that their Pearson correlation coefficient is maximized. In addition, it is further shown that such optimal functions coincide with the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation problem [2]-[4] between the data and labels, which provides an information quantification of features with respect to machine learning tasks. However, to obtain such an information quantification, the weakly dependent assumption between data and labels is required, which often does not hold for general problems. In this paper, we exploit this idea to propose a novel regression analysis approach, called maximal correlation regression, which retains the theoretical properties of the HGR maximal correlation functions, while being applicable to general data sets with performance comparable to softmax regression.

(The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.)

Specifically, let variables X and Y be the data and label, and let $P_{XY}$, $P_X$, and $P_Y$ be the empirical joint and marginal distributions of X, Y from some training data set $(x_i, y_i)$, for $i = 1, \ldots, n$. Then, the main idea of the maximal correlation regression is to approximate the conditional distribution $P_{Y|X}$ by the function

$$P^{(f,g,b)}_{Y|X}(y|x) \triangleq P_Y(y)\left(1 + f^{\mathrm{T}}(x)\, g(y) + b(y)\right), \qquad (1)$$

where $f(x) \in \mathbb{R}^k$ is the input feature,¹ and $g(y)$ and $b(y)$ can be viewed as the weights and bias² associated with the input feature for the different values that Y can take, as in the softmax regression. As such, $P^{(f,g,b)}_{Y|X}$ can be viewed as substituting the softmax function in the softmax regression with the factoring function (1) for approximating the empirical posterior distribution $P_{Y|X}$. Unlike the softmax function, there might exist $x, y$ such that $P^{(f,g,b)}_{Y|X}(y|x) < 0$.

¹ In practice, $f(x)$ could be given by the problem or generated by some parametrized model, such as neural networks, by taking $x$ as the input to the model.
² Though the function $P^{(f,g,b)}_{Y|X}(y|x)$ can be equivalently expressed as $P_Y(y)\left(f^{\mathrm{T}}(x) g(y) + b_0(y)\right)$, with the new bias $b_0$ defined as $b_0(y) \triangleq b(y) + 1$, we adopt the current formulation to simplify the exposition.
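For concreteness, the factoring model (1) can be evaluated directly from the empirical label marginal and the feature, weight, and bias functions. The NumPy sketch below is our own illustration (the function and variable names are not from the paper); it also shows that, unlike a softmax output, the resulting values can be negative and need not sum to one for arbitrary (f, g, b).

```python
import numpy as np

def mcr_posterior(f_x, G, b, p_y):
    """Evaluate Eq. (1): P^(f,g,b)_{Y|X}(y|x) = P_Y(y) * (1 + f(x)^T g(y) + b(y)).

    f_x : (k,)   feature vector f(x) for one input x
    G   : (C, k) matrix whose row y is g(y)
    b   : (C,)   bias vector b(y)
    p_y : (C,)   empirical marginal P_Y
    """
    return p_y * (1.0 + G @ f_x + b)

# toy example with C = 3 classes and k = 2 feature dimensions
rng = np.random.default_rng(0)
p_y = np.array([0.5, 0.3, 0.2])
G, b = rng.normal(size=(3, 2)), rng.normal(size=3)
scores = mcr_posterior(rng.normal(size=2), G, b, p_y)
print(scores)        # entries may be negative, unlike a softmax output
print(scores.sum())  # need not equal 1 for arbitrary (f, g, b)
```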

[...]

Then, both the difference between $f^{\mathrm{T}}(x) g(y)$ and $\hat f^{\mathrm{T}}(x)\, \hat g(y)$, and the difference between the corresponding probability models $P^{(f,g,b)}_{Y|X}(y|x)$ and $P^{(\hat f,\hat g,\hat b)}_{Y|X}(y|x)$, can be expressed as higher-order terms. Therefore, our analyses in MCR can provide useful insights in understanding the much more complex feature extraction schemes in deep neural networks.

IV. EXPERIMENTS
To illustrate the performance of the MCR approach in practice, we design experiments on the MNIST [18] dataset, CIFAR-10 [19], and CIFAR-100 [19], which are among the most commonly used datasets for verifying deep learning algorithms. In particular, the performance of MCR is compared with the classical deep learning method, referred to as the SL (Softmax and Log loss) method, where the feature is generated by the same network f, while the posterior probability $P_{Y|X}$ is modeled by the softmax function and then optimized via minimizing the log loss.
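As a point of reference, the SL baseline just described amounts to the standard classification head on top of the feature network. The PyTorch sketch below is our own illustration (the feature network and the dimensions are placeholders, not the architecture used in the paper).

```python
import torch
import torch.nn as nn

k, num_classes = 10, 10                                     # assumed dimensions
f_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, k))  # placeholder feature network f

head = nn.Linear(k, num_classes)    # weights and bias of the softmax layer
criterion = nn.CrossEntropyLoss()   # log loss applied to the softmax of the logits

x = torch.randn(32, 1, 28, 28)      # dummy mini-batch
y = torch.randint(0, num_classes, (32,))
loss = criterion(head(f_net(x)), y)
loss.backward()                     # trains f and the softmax layer jointly
```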

A. MNIST
The MNIST dataset of handwritten digits is one of the most common toy datasets in computer vision, which has a training set of 60,000 examples and a test set of 10,000 examples. To extract the feature from MNIST, we adopt a simple two-layer convolutional neural network (CNN) as depicted in Figure 2. In particular, we set the feature dimension to k = 10. Then, both the MCR and the classical SL model are trained for 100 epochs using ADADELTA [20], where the learning rate is set to 1.0 with a decay factor of 0.95, and the mini-batch size is set to 32.

[Footnote 5: We consider the ideal case where the previous hidden layers of the neural network have sufficient expressive power, i.e., f can take any desired function.]
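A minimal PyTorch sketch of the MNIST training configuration described above follows. It is our own reconstruction: the CNN layers are a hypothetical stand-in for the network of Figure 2, and the "decay factor of 0.95" is read here as Adadelta's running-average coefficient rho (a per-epoch learning-rate decay of 0.95 would be another plausible reading).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the two-layer CNN of Figure 2 with k = 10 output features.
f_net = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),                      # feature dimension k = 10
)

# Learning rate 1.0 with decay factor 0.95 (interpreted as Adadelta's rho).
optimizer = torch.optim.Adadelta(f_net.parameters(), lr=1.0, rho=0.95)
batch_size, epochs = 32, 100
```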

TABLE 1. The test accuracy (mean ± std) of MCR on MNIST, with different numbers of samples used for training. The corresponding results for a classical SL network are also listed for comparison.

FIGURE 2. The CNN that generates the feature f for the MNIST dataset, where the gray blocks contain all the trainable parameters of the network.

FIGURE 3. The visualization of features f generated by MCR and the classical SL network on the test set of MNIST, with n = 1000 samples used for training.

With the entire 60,000 images of the training set used for training, both the MCR and the SL achieve a test accuracy of 98.9%. We then investigate the performance of the two methods while using only a small part of the samples in the training set for training. Specifically, we train the two models on the first n samples of the training set, where n is set to 200, 400, 800, 1000, 1600, and 2000, respectively. Then, the resulting test accuracies for MCR and SL are shown in Table 1, where for each n, the mean and the standard deviation (std) of the test accuracies over 10 repeated experiments are reported. As shown in the table, compared with the classical SL method, our MCR approach has better performance, especially when the number of training samples is small. To obtain further insights, we consider the feature f extracted by these two methods on the test set. In particular, Figure 3 shows the extracted features with n = 1000 samples used for training, where t-SNE [21] is used to obtain the two-dimensional visualized data from the extracted ten-dimensional features f. The visualization results demonstrate that the features extracted by the MCR approach have better class separability, which verifies our discussion in Section III-B.
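A two-dimensional embedding like the one shown in Figure 3 can be produced with scikit-learn's t-SNE. The sketch below is our own illustration, with random placeholder arrays standing in for the extracted features and labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (n_test, 10) array of f(x) on the test set; labels: (n_test,) digit labels
features = np.random.randn(1000, 10)   # placeholder for the extracted features
labels = np.random.randint(0, 10, 1000)

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of 10-dimensional features f")
plt.show()
```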

TABLE 2. Test accuracies (mean ± std, in percentages) of MCR and SL on CIFAR-10 for different mini-batch sizes.

FIGURE 4. Training and test accuracies of the MCR method on CIFAR-10, compared with the results for the classical SL method.

B. CIFAR-10
We further conduct experiments on the CIFAR-10 dataset [19], which consists of 50,000 training images and 10,000 testing images in 10 classes. In particular, we use ResNet-18 [22] to generate the feature f with dimension k = 512 and follow the training settings in [22], which have been tuned on the classical SL model. We use SGD with a weight decay of 5 × 10⁻⁴ and a momentum of 0.9, and train both the MCR model and the SL model for up to 350 epochs. During the training process, the learning rate starts from 0.1 and is divided by 10 at epochs 150 and 250. In addition, the simple data augmentation of [22] is adopted: 4 pixels are padded on each side, and a 32 × 32 crop is randomly sampled from the padded image or its horizontal flip.
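A sketch of this training configuration in PyTorch/torchvision is given below. It is our own reconstruction: the normalization statistics and the use of the stock torchvision ResNet-18 (as a stand-in for the CIFAR-style variant of [22]) are assumptions.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Augmentation described in the text: pad 4 pixels per side, random 32x32 crop,
# random horizontal flip (normalization constants are our assumption).
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Feature network: ResNet-18 producing the k = 512 dimensional feature f(x).
f_net = resnet18(num_classes=512)

optimizer = torch.optim.SGD(f_net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Learning rate divided by 10 at epochs 150 and 250, training for up to 350 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)
```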

Then, with the mini-batch size set to 64, 128, 256, and 512, respectively, Table 2 summarizes the results for both approaches, where the average and standard deviation of the test accuracies over 10 repeated experiments are reported. In addition, Figure 4 shows the training and test accuracy for both methods during the training process, for an experiment with the mini-batch size 128. These results demonstrate that MCR can obtain comparable performance in label prediction. Also, it is worth emphasizing that in training MCR, we adopt the same hyperparameter settings as those for the classical SL model, which have been carefully tuned to guarantee the performance. In practice, the performance of MCR can be further improved via tuning the hyperparameters.

TABLE 3. Test accuracies (mean ± std, in percentages) of MCR and SL on CIFAR-100 for different mini-batch sizes.

FIGURE 5. Training and test accuracies of MCR on CIFAR-100, compared with the results for the classical SL method.

C. CIFAR-100
With settings similar to CIFAR-10, we conduct studies on the CIFAR-100 dataset [19], which contains 100 classes. Again, we use the ResNet-18 to generate the k = 512 dimensional feature from the input images and adopt the training settings that have been tuned on the SL model. Specifically, both the MCR model and the SL model are trained using SGD with a weight decay of 5 × 10⁻⁴ and a momentum of 0.9 for up to 200 epochs. During the training, the learning rate starts from 0.1 and is divided by 5 at epochs 60, 120, and 160. For data augmentation, in addition to the padding, cropping, and flipping operations as in CIFAR-10, the image is also randomly rotated by an angle ranging from −15° to 15°.
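The CIFAR-100 configuration differs from the CIFAR-10 one only in the rotation augmentation and the learning-rate schedule; a short sketch of those two changes (our own reading, with the same placeholder choices as before) is:

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# CIFAR-100 augmentation: padding/cropping/flipping as for CIFAR-10, plus a random
# rotation within [-15, +15] degrees.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),
    T.ToTensor(),
])

f_net = resnet18(num_classes=512)   # k = 512 dimensional feature, as before
optimizer = torch.optim.SGD(f_net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Learning rate divided by 5 at epochs 60, 120, and 160, training for up to 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)
```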

Then, with the mini-batch size set to 64, 128, 256, and 512, respectively, Table 3 summarizes the results for both approaches, where the mean and standard deviation of the test accuracies over 10 repeated experiments are reported. The training and test accuracies for both methods during the training process in an experiment with the mini-batch size 128 are also plotted in Figure 5. Though the hyperparameters have not been further optimized for the MCR method, the results indicate that MCR can obtain performance comparable with the classical SL method. In particular, from Table 3, MCR has better performance when the mini-batch size is set to 256 or 512. In addition, Figure 5 shows that MCR has a smaller gap between the training accuracy and the test accuracy than the classical SL method, which suggests that MCR can be less prone to over-fitting.

APPENDIX A
PROOF OF PROPOSITION 1

First, we have, for all $y_0 \in \mathcal{Y}$,
$$\frac{\partial}{\partial b(y_0)} P^{(f,g,b)}_{Y|X}(y|x) = P_Y(y_0)\,\delta_{y,y_0}, \qquad \frac{\partial}{\partial g(y_0)} P^{(f,g,b)}_{Y|X}(y|x) = P_Y(y)\, f(x)\,\delta_{y,y_0},$$
where $\delta_{y,y_0}$ is the Kronecker delta: $\delta_{y,y_0} = 1$ if $y = y_0$, and $0$ otherwise. As a result, we obtain (with $\xi_f \triangleq \mathbb{E}[f(X)]$)
$$\begin{aligned}
\frac{\partial}{\partial b(y_0)} L(f,g,b)
&= -2 \sum_{x \in \mathcal{X}} P_X(x) \sum_{y \in \mathcal{Y}} \frac{P_{Y|X}(y|x) - P^{(f,g,b)}_{Y|X}(y|x)}{P_Y(y)} \,\frac{\partial}{\partial b(y_0)} P^{(f,g,b)}_{Y|X}(y|x) \\
&= -2 \sum_{x \in \mathcal{X}} P_X(x) \Big[ P_{Y|X}(y_0|x) - P^{(f,g,b)}_{Y|X}(y_0|x) \Big] \\
&= -2 \sum_{x \in \mathcal{X}} P_X(x) \Big[ P_{Y|X}(y_0|x) - P_Y(y_0)\big(1 + f^{\mathrm{T}}(x)\, g(y_0) + b(y_0)\big) \Big] \\
&= 2\, P_Y(y_0) \big[ \xi_f^{\mathrm{T}} g(y_0) + b(y_0) \big]
\end{aligned}$$
and
$$\begin{aligned}
\frac{\partial}{\partial g(y_0)} L(f,g,b)
&= -2 \sum_{x \in \mathcal{X}} P_X(x) \sum_{y \in \mathcal{Y}} \frac{P_{Y|X}(y|x) - P^{(f,g,b)}_{Y|X}(y|x)}{P_Y(y)} \,\frac{\partial}{\partial g(y_0)} P^{(f,g,b)}_{Y|X}(y|x) \\
&= -2 \sum_{x \in \mathcal{X}} P_X(x) \Big[ P_{Y|X}(y_0|x) - P^{(f,g,b)}_{Y|X}(y_0|x) \Big] f(x) \\
&= -2\, P_Y(y_0) \sum_{x \in \mathcal{X}} \Big[ P_{X|Y}(x|y_0) - P_X(x)\big(1 + f^{\mathrm{T}}(x)\, g(y_0) + b(y_0)\big) \Big] f(x) \\
&= -2\, P_Y(y_0) \Big[ \mathbb{E}\big[f(X) \mid Y = y_0\big] - \xi_f - \mathbb{E}\big[f(X) f^{\mathrm{T}}(X)\big]\, g(y_0) - \xi_f\, b(y_0) \Big].
\end{aligned}$$
Then, since the optimal parameters $g^*$ and $b^*$ satisfy, for each $y \in \mathcal{Y}$,
$$\frac{\partial}{\partial b(y)} L(f, g^*, b^*) = 0 \qquad \text{and} \qquad \frac{\partial}{\partial g(y)} L(f, g^*, b^*) = 0,$$
we have
$$\xi_f^{\mathrm{T}} g^*(y) + b^*(y) = 0 \qquad (23)$$
and
$$\mathbb{E}\big[f(X) f^{\mathrm{T}}(X)\big]\, g^*(y) + \xi_f\, b^*(y) = \mathbb{E}\big[f(X) \mid Y = y\big] - \xi_f. \qquad (24)$$
Substituting (23) into (24), we obtain
$$\mathbb{E}\big[f(X) f^{\mathrm{T}}(X)\big]\, g^*(y) - \xi_f \xi_f^{\mathrm{T}} g^*(y) = \Lambda_f\, g^*(y) = \mathbb{E}\big[f(X) \mid Y = y\big] - \xi_f,$$
where $\Lambda_f$ is the covariance matrix of $f(X)$, which implies (4a). Finally, (4b) is readily obtained from (23).

APPENDIX B
PROOF OF PROPOSITION 2

It is convenient to first establish the following result.

Lemma 1: For all $f$, $g$, and $b$, we have
$$L(f,g,b) = \chi^2(P_{XY}, P_X P_Y) - 2\, H(f,g) + \xi_g^{\mathrm{T}} \Lambda_f\, \xi_g + \sum_{y \in \mathcal{Y}} P_Y(y) \big[ \xi_f^{\mathrm{T}} g(y) + b(y) \big]^2,$$
where $\chi^2(P_{XY}, P_X P_Y)$ is the chi-squared divergence from the joint distribution $P_{XY}$ to the product distribution $P_X P_Y$, i.e.,
$$\chi^2(P_{XY}, P_X P_Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\big[ P_{XY}(x,y) - P_X(x) P_Y(y) \big]^2}{P_X(x) P_Y(y)}.$$

Using Lemma 1, we establish Proposition 2 as follows. Note that we have, for all $\theta$, $g$, and $b$,
$$\begin{aligned}
L(f_\theta, g, b) &= \chi^2(P_{XY}, P_X P_Y) - 2\, H(f_\theta, g) + \xi_g^{\mathrm{T}} \Lambda_{f_\theta} \xi_g + \sum_{y \in \mathcal{Y}} P_Y(y) \big[ \xi_{f_\theta}^{\mathrm{T}} g(y) + b(y) \big]^2 \\
&\geq \chi^2(P_{XY}, P_X P_Y) - 2\, H(f_\theta, g) \qquad (25) \\
&\geq \chi^2(P_{XY}, P_X P_Y) - 2\, H(f_{\theta_{\mathrm{H}}}, g_{\mathrm{H}}) \qquad (26) \\
&= \chi^2(P_{XY}, P_X P_Y) - 2\, H(f_{\theta_{\mathrm{H}}}, \tilde g_{\mathrm{H}}) \qquad (27) \\
&= L(f_{\theta_{\mathrm{H}}}, \tilde g_{\mathrm{H}}, b^*), \qquad (28)
\end{aligned}$$
where to obtain (25) we have used the fact that
$$\xi_g^{\mathrm{T}} \Lambda_{f_\theta} \xi_g + \sum_{y \in \mathcal{Y}} P_Y(y) \big[ \xi_{f_\theta}^{\mathrm{T}} g(y) + b(y) \big]^2 \geq 0. \qquad (29)$$
To obtain (26) we have used the fact that $\theta_{\mathrm{H}}$ and $g_{\mathrm{H}}$ are the parameters that maximize $H(f_\theta, g)$; to obtain (27) we have used that $H(f_{\theta_{\mathrm{H}}}, g_{\mathrm{H}}) = H(f_{\theta_{\mathrm{H}}}, \tilde g_{\mathrm{H}})$; and to obtain (28) we have used the fact that (29) holds with equality for $\theta_{\mathrm{H}}$, $\tilde g_{\mathrm{H}}$, and $b^*(y) = -\tilde g_{\mathrm{H}}^{\mathrm{T}}(y)\, \mathbb{E}[f(X; \theta_{\mathrm{H}})]$. Hence, we have $\theta^* = \theta_{\mathrm{H}}$, $g^* = \tilde g_{\mathrm{H}}$ and, for all $y \in \mathcal{Y}$, $b^*(y) = -\tilde g_{\mathrm{H}}^{\mathrm{T}}(y)\, \mathbb{E}[f(X; \theta_{\mathrm{H}})] = -\xi_f^{\mathrm{T}} g^*(y)$, with $\xi_f \triangleq \mathbb{E}[f(X; \theta^*)] = \mathbb{E}[f(X; \theta_{\mathrm{H}})]$.

It remains only to establish Lemma 1.

Proof of Lemma 1: We have
$$\begin{aligned}
L(f,g,b) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\big[ P_{XY}(x,y) - P_X(x)\, P^{(f,g,b)}_{Y|X}(y|x) \big]^2}{P_X(x) P_Y(y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{1}{P_X(x) P_Y(y)} \Big[ P_{XY}(x,y) - P_X(x) P_Y(y) \big(1 + f^{\mathrm{T}}(x)\, g(y) + b(y)\big) \Big]^2 \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\big[ P_{XY}(x,y) - P_X(x) P_Y(y) \big]^2}{P_X(x) P_Y(y)} - 2 \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \big[ P_{XY}(x,y) - P_X(x) P_Y(y) \big] \big[ f^{\mathrm{T}}(x)\, g(y) + b(y) \big] \\
&\qquad + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y) \big[ f^{\mathrm{T}}(x)\, g(y) + b(y) \big]^2 \\
&= \chi^2(P_{XY}, P_X P_Y) - 2\, \mathbb{E}\big[ \tilde f^{\mathrm{T}}(X)\, \tilde g(Y) \big] + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y) \big[ \tilde f^{\mathrm{T}}(x)\, \tilde g(y) + \tilde f^{\mathrm{T}}(x)\, \xi_g + \xi_f^{\mathrm{T}} g(y) + b(y) \big]^2 \\
&= \chi^2(P_{XY}, P_X P_Y) - 2\, \mathbb{E}\big[ \tilde f^{\mathrm{T}}(X)\, \tilde g(Y) \big] + \operatorname{tr}(\Lambda_f \Lambda_g) + \xi_g^{\mathrm{T}} \Lambda_f\, \xi_g + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y) \big[ \xi_f^{\mathrm{T}} g(y) + b(y) \big]^2 \\
&= \chi^2(P_{XY}, P_X P_Y) - 2\, H(f,g) + \xi_g^{\mathrm{T}} \Lambda_f\, \xi_g + \sum_{y \in \mathcal{Y}} P_Y(y) \big[ \xi_f^{\mathrm{T}} g(y) + b(y) \big]^2,
\end{aligned}$$
where $\tilde f(x) \triangleq f(x) - \xi_f$ and $\tilde g(y) \triangleq g(y) - \xi_g$ denote the centered functions, $\Lambda_f \triangleq \mathbb{E}[\tilde f(X) \tilde f^{\mathrm{T}}(X)]$, $\Lambda_g \triangleq \mathbb{E}[\tilde g(Y) \tilde g^{\mathrm{T}}(Y)]$, the last equality follows from the definition of $H(f,g)$, and to obtain the penultimate equality we have used the facts that
$$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y) \big[ \tilde f^{\mathrm{T}}(x)\, \tilde g(y) \big]^2 = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y)\, \operatorname{tr}\big( \tilde f(x) \tilde f^{\mathrm{T}}(x)\, \tilde g(y) \tilde g^{\mathrm{T}}(y) \big) = \operatorname{tr}\Bigg( \sum_{x \in \mathcal{X}} P_X(x)\, \tilde f(x) \tilde f^{\mathrm{T}}(x) \sum_{y \in \mathcal{Y}} P_Y(y)\, \tilde g(y) \tilde g^{\mathrm{T}}(y) \Bigg) = \operatorname{tr}(\Lambda_f \Lambda_g)$$
and
$$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_Y(y) \big[ \tilde f^{\mathrm{T}}(x)\, \xi_g \big]^2 = \sum_{x \in \mathcal{X}} P_X(x)\, \xi_g^{\mathrm{T}} \tilde f(x) \tilde f^{\mathrm{T}}(x)\, \xi_g = \xi_g^{\mathrm{T}} \Bigg( \sum_{x \in \mathcal{X}} P_X(x)\, \tilde f(x) \tilde f^{\mathrm{T}}(x) \Bigg) \xi_g = \xi_g^{\mathrm{T}} \Lambda_f\, \xi_g.$$
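Lemma 1 can be sanity-checked numerically: draw a small random joint distribution $P_{XY}$, pick arbitrary $f$, $g$, $b$, and compare the loss $L(f,g,b)$ with the decomposition above, with $H(f,g)$ read off from the proof as $\mathbb{E}[\tilde f^{\mathrm{T}}(X)\tilde g(Y)] - \tfrac{1}{2}\operatorname{tr}(\Lambda_f \Lambda_g)$ (an inference from this excerpt, since the formal H-score definition is on omitted pages). The NumPy sketch below is our own verification, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, k = 5, 4, 3

# random joint distribution P_XY and its marginals
P = rng.random((nx, ny)); P /= P.sum()
Px, Py = P.sum(axis=1), P.sum(axis=0)

# arbitrary feature, weight, and bias functions
F = rng.normal(size=(nx, k))   # rows f(x)
G = rng.normal(size=(ny, k))   # rows g(y)
b = rng.normal(size=ny)

# left-hand side: L(f,g,b) = sum_{x,y} [P_XY - P_X * P^(f,g,b)_{Y|X}]^2 / (P_X P_Y)
P_model = Py * (1.0 + F @ G.T + b)                       # P^(f,g,b)_{Y|X}(y|x)
L = np.sum((P - Px[:, None] * P_model) ** 2 / (Px[:, None] * Py[None, :]))

# right-hand side: chi^2 - 2 H(f,g) + xi_g^T Lambda_f xi_g + sum_y P_Y(y)(xi_f^T g(y) + b(y))^2
chi2 = np.sum((P - Px[:, None] * Py[None, :]) ** 2 / (Px[:, None] * Py[None, :]))
xi_f, xi_g = Px @ F, Py @ G
Lam_f = (F - xi_f).T @ (Px[:, None] * (F - xi_f))
Lam_g = (G - xi_g).T @ (Py[:, None] * (G - xi_g))
H = np.sum(P * ((F - xi_f) @ (G - xi_g).T)) - 0.5 * np.trace(Lam_f @ Lam_g)
rhs = chi2 - 2 * H + xi_g @ Lam_f @ xi_g + Py @ (G @ xi_f + b) ** 2

print(np.isclose(L, rhs))   # True: the two sides of Lemma 1 agree
```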

APPENDIX C
NEURAL NETWORK ARCHITECTURE FOR OPTIMIZING THE H-SCORE

FIGURE 6. The network architecture for jointly training θ and g in maximal correlation regression, where NN_f is used to generate the feature f(x; θ) from the input x, with θ denoting the parameters of NN_f, and g is generated using a one-layer fully-connected network.

A neural network architecture designed for jointly optimizing θ and g is shown in Figure 6. In particular, given a single sample pair (x, y), f(x; θ) is generated by the network NN_f, where θ represents all trainable parameters of NN_f. Moreover, since Y is discrete, g(y) can be generated using a one-layer fully-connected network with k output nodes, where the input is the one-hot encoding of y.
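A minimal PyTorch sketch of this joint architecture is given below. It is our own reconstruction: the `MCRHead` module and `neg_h_score` function are hypothetical names, and the H-score expression used as the training loss is the one implied by the proof of Lemma 1 rather than a formula quoted from the omitted main text.

```python
import torch
import torch.nn as nn

class MCRHead(nn.Module):
    """Joint (theta, g) architecture sketched in Figure 6 (our reconstruction).

    NN_f maps x to the k-dimensional feature f(x; theta); g(y) is produced by a
    one-layer fully-connected network on the one-hot encoding of y, which for
    discrete Y is equivalent to an nn.Embedding lookup.
    """
    def __init__(self, f_net, num_classes, k):
        super().__init__()
        self.f_net = f_net
        self.g = nn.Embedding(num_classes, k)   # one-hot input + linear layer = embedding

    def forward(self, x, y):
        return self.f_net(x), self.g(y)

def neg_h_score(f, g):
    """Negative H-score as a training loss (our reading from Lemma 1:
    H(f, g) = E[f~^T g~] - 0.5 * tr(cov(f) cov(g)), with f~, g~ batch-centered)."""
    f0, g0 = f - f.mean(0), g - g.mean(0)
    cov_f = f0.t() @ f0 / f.shape[0]
    cov_g = g0.t() @ g0 / g.shape[0]
    return -((f0 * g0).sum(1).mean() - 0.5 * torch.trace(cov_f @ cov_g))

# usage with a placeholder feature network on MNIST-sized inputs
f_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model = MCRHead(f_net, num_classes=10, k=10)
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
loss = neg_h_score(*model(x, y))
loss.backward()
```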
