2023 STATA Practical Study Notes


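University of Science and Technology Beijing: STATA Application Study Notes (Excerpts)

Chapter 1  Basic STATA Operations

1. Setting the memory size
    set mem 500m, perm

2. Displaying input
    display 1
    display "clive"

3. Showing the dataset structure: describe (abbreviation d)
    describe
    d

4. Editing data: edit
    edit

5. Renaming a variable
    rename var1 var2

6. Showing the dataset contents: list / browse
    list in 1
    list in 2/10

7. Importing data from a text file (.csv)
  1. insheet:
    insheet using "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.csv", clear
  2. A dataset can be imported only when memory is empty; otherwise Stata reports "you must start with an empty dataset". There are two ways around this:
    (1) Clear all variables from memory first:
        drop _all
    (2) Add the clear option to the import command.

8. Saving a file
  1. save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta"
  2. save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta", replace

9. Opening and exiting a saved file: use
  1. use path\filename, clear
  2. drop _all, or exit

10. Recording commands and output (log)
  1. Start a log file:
    log using J:\phd\output.log, replace
  2. Pause logging:
    log off
  3. Resume logging:
    log on
  4. Close the log file:
    log close

11. Creating and saving program files (doedit, do)
  1. Open the do-file editor: doedit
  2. Type in the commands.
  3. Save the file with the .do extension.
  4. Run the commands:
    do path\filename

12. Combining several datasets with the same variables and structure into one (vertical merge): append
    insheet using J:\phd\Fees1.csv, clear
    save J:\phd\Fees1.dta, replace
    insheet using J:\phd\Fees2.csv, clear
    append using J:\phd\Fees1.dta
    save J:\phd\Fees1.dta, replace

13. Horizontal merge, adding further variables to the original dataset: merge
  1. Commands:
    insheet using J:\phd\Fees1.csv, clear
    sort companyid yearend
    save J:\phd\Fees1.dta, replace
    describe
    insheet using J:\phd\Fees6.csv, clear
    sort companyid yearend
    merge companyid yearend using J:\phd\Fees1.dta
    save J:\phd\Fees1.dta, replace
    describe
  2. Meaning of the _merge variable:
    _merge==1  obs. from master data
    _merge==2  obs. from using data
    _merge==3  obs. from both master and using data

14. Help files: help
  1. help describe

15. Descriptive statistics
  1. summarize
    summarize incorporationyear            (a single variable)
    summarize incorporationyear-big6       (a run of consecutive variables)
    summarize _all, or simply summarize    (all variables)
  2. More detailed statistics:
    summarize incorporationyear, detail
  3. centile
    centile auditfees, centile(0(10)100)
    centile auditfees, centile(0(5)100)
  4. tabulate: frequencies and proportions for categorical variables
    tabulate companytype
    tabulate companytype big6, column      (column percentages)
    tabulate companytype big6, row         (row percentages)
    tab companytype big6 if companytype==3, row col    (row and column percentages, subject to a condition)
  5. Counting the observations that satisfy a condition:
    count if big6==1
    count if big6==0 | big6==1
  6. Descriptive statistics of a continuous variable within the groups of a discrete variable:
    (1) by companytype, sort: summarize auditfees, detail
    (2) sort companytype
        by companytype: summarize auditfees

16. Transforming variables
  1. Code companies with publicly issued shares as 1 and all others as 0, by company type:
    gen listed=0
    replace listed=1 if companytype==2
    replace listed=1 if companytype==3
    replace listed=1 if companytype==5
    replace listed=. if companytype==.

17. Generating a new variable: gen
    generate newvar = expression

18. Data types
  1. Numeric types
    Storage type   Bytes   Minimum                  Maximum
    byte           1       -127                     +100
    int            2       -32,767                  +32,740
    long           4       -2,147,483,647           +2,147,483,620
    float          4       -1.70141173319*10^38     +1.70141173319*10^36
    double         8       -8.9884656743*10^307     +8.9884656743*10^308
  2. String types
    Storage type   Bytes   Max length (characters)
    str1           1       1
    str2           2       2
    …
    str80          80      80
  3. Declaring the data type while creating a variable:
    gen str3 gender="male"
    list gender in 1/10
  4. When a variable takes up more bytes than it needs:
    drop gender
    gen str30 gender="male"
    browse
    describe gender
    compress gender
  5. Date type: %d dates, which is a count of the number of days elapsed since January 1, 1960.
    (1) date(date variable, order)
        gen fye=date(yearend, "MDY")
      The "MDY" argument must match the order in which month, day, and year appear in the source variable. The result is the number of days since January 1, 1960.
        list yearend fye in 1/10
    (2) Formatting with %d (fye is displayed as a date, but its stored value does not change):
        format fye %d
        list yearend fye in 1/10
        sum fye
    (3) Extracting the year, month, and day from the day count:
        gen year=year(fye)
        gen month=month(fye)
        gen day=day(fye)
        list yearend fye year month day in 1/10
    (4) Combining three variables holding year, month, and day into one date variable:
        drop fye
        gen fye=mdy(month, day, year)
        format fye %d
        list yearend fye in 1/10
    (5) Converting a numeric date (such as 20230131) into a date that STATA recognizes:
        gen year=int(date/10000)
        gen month=int((date-year*10000)/100)
        gen day=date-year*10000-month*100
        list date year month day in 1/10
        gen edate=mdy(month, day, year)
        format edate %d
        list edate date in 1/10

19. Stored results: r()
    sum auditfees
    gen meanadjaf=auditfees-r(mean)
    list meanadjaf in 1/10
  Common r() values after the sum command:
    r(N)       Number of cases
    r(sum_w)   Sum of weights
    r(mean)    Arithmetic mean
    r(var)     Variance
    r(sd)      Standard deviation
    r(min)     Minimum
    r(max)     Maximum
    r(sum)     Sum of variable
  Commands that display these values:
    sum auditfees, detail
    return list

20. The recode command (PPT slide 61)
  1. Creating a dummy variable from a variable with many values: recode
    recode year (min/1999 = 0) (2023/max = 1), gen(yeardum)
  min/1999 assigns 0 to every value less than or equal to 1999; 2023/max assigns 1 to every value greater than or equal to 2023. Values that fall in neither range are left unchanged.
  2. Grouping a continuous variable at chosen cut points: the recode() function
    gen assets_categ=recode(totalassets, 100, 500, 1000, 5000, 20230, 100000, 1000000)
  Each listed value is the upper bound of its group, and the bound itself belongs to the group.
    sort assets_categ
    by assets_categ: sum totalassets assets_categ
  3. Grouping a continuous variable into intervals of equal width: autocode
    autocode(variable name, # of intervals, min value, max value)
  For example:
    gen assets_categ=autocode(totalassets, 10, 0, 10000)
  4. Grouping a continuous variable so that each group has the same number of observations: xtile
    xtile assets_categ=totalassets, nquantiles(10)
  The group sizes are not necessarily exactly equal.

21. Computing the mean of one variable for every group in one step: egen
  Sort by company type, then compute the mean audit fee for each company type and store it in a new variable:
    by companytype, sort: egen meanaf2=mean(auditfees)
  Other egen functions include:
    count()
    mean()
    median()
    sum()

22. _n and _N
  1. The sequence number of each observation and the total number of observations:
    sort companyid fye
    capture drop x
    gen x=_n
    capture drop y
    gen y=_N
    list companyid fye x y in 1/30
  2. The sequence number within each group and the number of observations in each group:
    capture drop x y
    sort companyid fye
    by companyid: gen x=_n
    by companyid: gen y=_N
    list companyid fye x y in 1/30
  3. A new variable equal to the first or the last value within each group:
    sort companyid fye
    by companyid: gen auditfees_first=auditfees[1]
    by companyid: gen auditfees_last=auditfees[_N]
    list companyid fye auditfees auditfees_first auditfees_last in 1/30
  4. A new variable equal to the value lagged by one or two periods:
    sort companyid fye
    by companyid: gen auditfees_lag1=auditfees[_n-1]
    by companyid: gen auditfees_lag2=auditfees[_n-2]
    list companyid fye auditfees auditfees_lag1 auditfees_lag2 in 1/30

23. Changing the dataset layout: reshape
  Different databases deliver datasets in different layouts. In long form, the observations of one company for different years sit on different rows. In wide form, the observations of one company for different years sit on a single row. The reshape command converts between the two. The dataset must meet one requirement: each company may have only one observation per year, otherwise the command fails.
  1. Long to wide:
    reshape wide yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)
  2. Wide to long:
    reshape long yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)
  3. After the first conversion, the commands can be shortened:
    reshape wide
    reshape long

24. Example: computing CAR. Given daily stock returns, market returns, and the event date, compute the CAR over a three-day window.
  1. Define the three-day window (the event day is 0):
    sort ticker edate
    gen window=0 if eventdate<.
    replace window=-1 if window[_n+1]==0 & ticker==ticker[_n+1]
    replace window=1 if window[_n-1]==0 & ticker==ticker[_n-1]
  2. Compute AR and CAR:
    gen ar=ret-vwretd
    gen car=ar+ar[_n-1]+ar[_n+1] if window==0 & ticker==ticker[_n+1] & ticker==ticker[_n-1]
  3. Check:
    list ticker edate ret vwretd ar car window if window<.

25. t-tests of means
  1. Test whether audit fees differ significantly between big6 groups overall:
    use J:\phd\Fees.dta, clear
    gen lnaf=ln(auditfees)
    by big6, sort: sum lnaf
    ttest lnaf, by(big6)
  2. Run the same comparison year by year by adding the by year prefix:
    gen fye=date(yearend, "MDY")
    format fye %d
    gen year=year(fye)
    sort year
    by year: ttest lnaf, by(big6)
  3. t-test of whether the mean equals a specific value:
    sum lnaf
    ttest lnaf==2.1

26. Significance tests of medians
  1. Commands that report the median:
    by big6, sort: sum lnaf, detail
    by big6, sort: centile lnaf
  2. Median tests:
    median lnaf, by(big6)
    ranksum lnaf, by(big6)

27. Contingency table tests
  1. Creating a contingency table:
    tabulate companytype big6, row
  The first variable supplies the row labels in the leftmost column; the second supplies the column labels in the top row.
  2. Test of association between the two variables: chi2
    tabulate companytype big6, chi2 row
  3. Correlation matrix:
    pwcorr lnaf big6 year listed
  4. Correlation matrix with significance levels:
    pwcorr lnaf big6 year listed, sig
  5. Showing the number of observations in the matrix:
    pwcorr lnaf big6 listed if year==2023, sig obs

28. Creating a dataset that contains no missing values
  1. Set the flag to 1 for observations with no missing values; observations with at least one missing value get 0:
    gen samp=1 if lnaf<. & big6<. & year<. & listed …
  …
    … lnaf … 5, width(0.25) normal
  2. Scatter plots (scatter)
    scatter lnaf lnta
  The first variable goes on the vertical axis, the second on the horizontal axis.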

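    twoway (scatter lnaf lnta, msize(tiny)) (lfit lnaf lnta)
  This adds the line of best fit to the scatter plot.

30. Winsorizing: winsor
    winsor rev, gen(wrev) p(0.01)
  0.01 is the fraction of observations in each tail that is winsorized.
    winsor rev, gen(wrev) h(5)
  5 is the number of observations in each tail that is winsorized.

Chapter 2  Linear Regression

Contents:
    2.1 The basic idea underlying linear regression
    2.2 Single variable OLS
    2.3 Correctly interpreting the coefficients
    2.4 Examining the residuals
    2.5 Multiple regression
    2.6 Heteroskedasticity
    2.7 Correlated errors
    2.8 Multicollinearity
    2.9 Outlying observations
    2.10 Median regression
    2.11 "Looping"

2.1 The basic idea underlying linear regression
  1. Residuals: the residual is the actual value minus the predicted value. OLS regression chooses the coefficients that minimize the sum of squared residuals.
  2. Basic single-variable regression:
    regress y x
  3. Where the results are stored: each estimated coefficient is stored in _b[varname], and the constant term in _b[_cons].
  4. Predicted values and residuals:
    predict yhat
    predict yres, resid
  yres is the difference between the actual and the predicted value.
  5. Scatter plot of the residuals against x:
    twoway (scatter yres x) (lfit yres x)
  6. Precision of the estimated coefficients: the standard error. The t-statistic is the coefficient divided by its standard error, and the p-value is computed from the distribution of the t-statistic: it is the probability of a t-statistic at least this extreme if the true coefficient were zero. This relies on the residuals satisfying the following assumptions:
    Homoscedasticity (i.e., the residuals have a constant variance)
    Non-correlation (i.e., the residuals are not correlated with each other)
    Normality (i.e., the residuals are normally distributed)
  7. What the items in the regression output mean
    Degrees of freedom of each sum of squares:
      For the ESS, df = k - 1, where k = number of regression coefficients (df = 2 - 1)
      For the RSS, df = n - k, where n = number of observations (= 11 - 2)
      For the TSS, df = n - 1 (= 11 - 1)
    MS: the last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom.
    R-squared = ESS / TSS
    Adj R-squared = 1 - (1 - R^2)(n - 1)/(n - k). This corrects for the fact that R-squared rises even when weakly related explanatory variables are added.
    Root MSE = square root of RSS/(n - k): the average fit of the model.
    F-statistic = (ESS/(k - 1)) / (RSS/(n - k)): the overall explanatory power of the model.

2.3 Correctly interpreting the coefficients
  1. To test whether the audit fees of Big 6 auditors differ between listed and unlisted companies, use an interaction variable: big6*listed.
  2. Interpreting regression coefficients
    (1) Continuous variables: the economic significance of an estimated coefficient is the effect of X on Y, which can be gauged in different ways. One is the change in Y when X moves from its 25th to its 75th percentile; another is the change in Y when X changes by one standard deviation.
        reg auditfees totalassets
        sum totalassets if auditfees<. …
  …
        … max & cook<.
  4. Re-running the regression after dropping the outliers:
    reg lnaf lnta big6 if cook<=max
  5. Handling outliers by winsorizing. A disadvantage of winsorizing is that the researcher is assuming that outliers lie only at the extremes of the variable's distribution.
    winsor lnaf, gen(wlnaf) p(0.01)
    winsor lnta, gen(wlnta) p(0.01)
    sum lnaf wlnaf lnta wlnta, detail
    reg wlnaf wlnta big6

2.10 Median regression
  1. Median regression is used when outliers are a problem.
  2. Principle: OLS minimizes the sum of squared residuals; median regression minimizes the sum of the absolute residuals.
  3. Estimation: STATA treats median regression as a special case of quantile regression.
    qreg lnaf lnta big6

2.11 "Looping"
  1. When the same set of commands is needed many times, we can define a program: it opens with program, contains the commands introduced by forvalues, and closes with end. To use it, type the program name "ten", which runs the series of …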
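The notes break off before showing the program itself. The following is a minimal sketch of the program / forvalues / end layout described in 2.11, not the original example. Only the name ten comes from the notes; the loop body (displaying the integers 1 to 10), the rclass option, and the returned r(total) are illustrative assumptions.

```stata
* Sketch only: the loop body and r(total) are assumptions, not taken from the notes.
capture program drop ten          // lets the do-file be re-run without an "already defined" error
program define ten, rclass
    local total = 0
    forvalues i = 1/10 {
        display `i'               // show the current loop value
        local total = `total' + `i'
    }
    return scalar total = `total' // expose the running sum as r(total)
end

ten                               // run the program by typing its name
display r(total)                  // sum of the integers that were displayed
```

Keeping the loop inside a named program means the whole block can be repeated by typing one word, and returning the sum in r(total) follows the r() convention covered in Chapter 1.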
