"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"230 24.01 2.0 Male Yes Sat Dinner 4\n",
"29 19.65 3.0 Female No Sat Dinner 2\n",
"102 44.30 2.5 Female Yes Sat Dinner 3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = sns.load_dataset('tips')\n",
"df.shape\n",
"df.sample(3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install Packages"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
". ssc install reghdfe\n",
"checking reghdfe consistency and verifying not already installed...\n",
"all files already exist and are up to date.\n",
"\n",
". ssc install ftools\n",
"checking ftools consistency and verifying not already installed...\n",
"all files already exist and are up to date.\n",
"\n",
". \n"
]
}
],
"source": [
"%%stata\n",
"ssc install reghdfe\n",
"ssc install ftools"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Browse Manual"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"-------------------------------------------------------------------------------\n",
"help for winsor2 (blog)\n",
"-------------------------------------------------------------------------------\n",
"\n",
"Winsorizing or Trimming variables\n",
"---------------------------------\n",
"\n",
"\n",
"Syntax\n",
"------\n",
"\n",
" winsor2 varlist [if] [in], [ suffix(string) replace trim cuts(# #)\n",
" by(groupvar) label ]\n",
"\n",
"\n",
"Description\n",
"-----------\n",
"\n",
" winsor2 winsorize or trim (if trim option is specified) the variables in \n",
" varlist at particular percentiles specified by option cuts(#1 #2). In\n",
" defult, new variables will be generated with a suffix \"_w\" or \"_tr\",\n",
" which can be changed by specifying suffix() option. The replace option\n",
" replaces the variables with their winsorized or trimmed ones.\n",
"\n",
" +---------------------------------------------+\n",
" ----+ Difference between winsorizing and trimming +----------------------\n",
"\n",
" Winsorizing is not equivalent to simply excluding data, which is a\n",
" simpler procedure, called trimming or truncation. In a trimmed\n",
" estimator, the extreme values are discarded; in a Winsorized estimator,\n",
" the extreme values are instead replaced by certain percentiles, specified\n",
" by option cuts(# #). For details, see winsor (if installed), trimmean\n",
" (if installed).\n",
"\n",
" For example, you type the following commands to get the 1th and 99th\n",
" percentiles of variable wage, 1.930993 and 38.70926, respectively.\n",
"\n",
" . sysuse nlsw88, clear\n",
" . sum wage, detail\n",
"\n",
" In defult, winsor2 winsorize wage at 1th and 99th percentiles,\n",
"\n",
" . winsor2 wage, replace cuts(1 99)\n",
"\n",
" which can be done by hands:\n",
"\n",
" . replace wage=1.930993 if wage<1.930993\n",
" . replace wage=38.70926 if wage>38.70926\n",
"\n",
" Note that, values smaller than the 1th percentile is repalce by the 1th\n",
" percentile, and the similar thing is done with the 99th percentile.\n",
"\n",
" Things change when -trim- option is specified:\n",
"\n",
" . winsor2 wage, replace cuts(1 99) trim\n",
"\n",
" which can also be done by hands:\n",
"\n",
" . replace wage=. if wage<1.930993\n",
" . replace wage=. if wage>38.70926\n",
"\n",
" In this case, we discard values smaller than 1th percentile or greater\n",
" than 99th percentile. This is trimming.\n",
"\n",
"\n",
"Options\n",
"-------\n",
"\n",
" suffix(string) specifies the suffix of the new variables. The defult is\n",
" \"_w\" or \"_tr\" (when trim specified).\n",
"\n",
" replace replaces the variables with their winsorized or trimmed\n",
" counterpart. Can not be specified with suffix(string).\n",
"\n",
" trim trims the variables.\n",
"\n",
" cuts(# #) specifies the percentiles at which the data is winsorized or\n",
" trimmed. cuts(1 99) (the default) means winsor (trim) at 1th and\n",
" 99th percentile. Specify cuts(1 99) or cuts(99 1) makes no\n",
" difference.\n",
"\n",
" by(groupvar) the winsor or trim is done within each group specified by\n",
" groupvar.\n",
"\n",
"\n",
"Examples\n",
"--------\n",
"\n",
" *- winsor at (p1 p99), get new variable \"wage_w\"\n",
" . sysuse nlsw88, clear\n",
" . winsor2 wage\n",
"\n",
" *- winsor 3 variables at 0.5th and 99.5th percentiles, and overwrite\n",
" the old variables\n",
" . winsor2 wage age hours, cuts(0.5 99.5) replace\n",
"\n",
" *- winsor 3 variables at (p1 p99), gen new variables with suffix\n",
" _win, and add variable labels\n",
" . winsor2 wage age hours, suffix(_win) label\n",
"\n",
" *- left-winsorizing only, at 1th percentile\n",
" . winsor2 wage, cuts(1 100)\n",
"\n",
" *- right-trimming only, at 99th percentile\n",
" . winsor2 wage, cuts(0 99) trim\n",
"\n",
" *- winsor variables at (p1 p99) by (industry), overwrite the old\n",
" variables\n",
" . winsor2 wage hours, replace by(industry)\n",
"\n",
"\n",
"References\n",
"----------\n",
"\n",
" Anonymous. 1951. In memoriam: Charles P. Winsor. Biometrics 7: 221.\n",
"\n",
" Barnett, V. and Lewis, T. 1994. Outliers in statistical data.\n",
" Chichester: John Wiley. [Previous editions 1978, 1984.]\n",
"\n",
" Tukey, J.W. 1962. The future of data analysis. Annals of Mathematical\n",
" Statistics 33: 1-67.\n",
"\n",
"\n",
"Acknowledgements\n",
"----------------\n",
"\n",
" Codes from winsor by Nicholas J. Cox and -winsorizeJ.ado- by Judson\n",
" Caskey have been incorporated.\n",
"\n",
"Author\n",
"------\n",
"\n",
" Yujun,Lian (arlionn) Department of Finance, Lingnan College, Sun Yat-Sen\n",
" University.\n",
" E-mail: arlionn@163.com.\n",
" Blog: https://www.lianxh.cn\n",
"\n",
"\n",
"Other Commands I have written\n",
"-----------------------------\n",
"\n",
"\n",
" lianxh (if installed) ssc install lianxh (to install)\n",
" bdiff (if installed) ssc install bdiff (to install)\n",
" hhi5 (if installed) ssc install hhi5 (to install)\n",
" uall (if installed) ssc install uall (to install)\n",
" xtbalance (if installed) ssc install xtbalance (to install)\n",
"\n",
"\n",
"Also see\n",
"--------\n",
"\n",
" Online: summarize, means, winsor (if installed), trimplot (if\n",
" installed), trimmean (if installed), iqr (if installed), robmean\n",
" (if installed)\n",
"\n",
". "
]
}
],
"source": [
"%%stata\n",
"help winsor2"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Macros\n",
"\n",
"For example, define a global macro `X` that holds the explanatory variables used in the regression below."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"%%stata\n",
"global X \"total_bill size\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Regression\n",
"\n",
"Every argument of the `%%stata` magic is optional:\n",
"- Data input: `-d df` passes the pandas DataFrame `df` to Stata.\n",
"- Force update: `-force` forces Stata to reload `df`.\n",
"  - Without a reload, Stata uses the latest data in memory.\n",
"  - If you are sure the data has not changed, reusing the data in memory is faster than forcing an update.\n",
"- Estimation result output: `-eret eret` stores Stata's `e()` results in `eret`.\n",
"- Data output: `-doutd df2` stores the resulting Stata data in `df2`.\n",
"\n",
"For more details, run `%%stata?`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
". reghdfe tip $X, a(sex time day smoker) vce(cluster time)\n",
"(MWFE estimator converged in 5 iterations)\n",
"warning: missing F statistic; dropped variables due to collinearity or too few \n",
"> clusters\n",
"\n",
"HDFE Linear regression Number of obs = 244\n",
"Absorbing 4 HDFE groups F( 2, 1) = .\n",
"Statistics robust to heteroskedasticity Prob > F = .\n",
" R-squared = 0.4701\n",
" Adj R-squared = 0.4497\n",
" Within R-sq. = 0.4551\n",
"Number of clusters (time) = 2 Root MSE = 1.0264\n",
"\n",
" (Std. err. adjusted for 2 clusters in time)\n",
"------------------------------------------------------------------------------\n",
" | Robust\n",
" tip | Coefficient std. err. t P>|t| [95% conf. interval]\n",
"-------------+----------------------------------------------------------------\n",
" total_bill | .094487 .0030743 30.73 0.021 .0554239 .1335501\n",
" size | .175992 .041936 4.20 0.149 -.3568556 .7088396\n",
" _cons | .6765225 .1685903 4.01 0.155 -1.465621 2.818666\n",
"------------------------------------------------------------------------------\n",
"\n",
"Absorbed degrees of freedom:\n",
"-----------------------------------------------------+\n",
" Absorbed FE | Categories - Redundant = Num. Coefs |\n",
"-------------+---------------------------------------|\n",
" sex | 2 0 2 |\n",
" time | 2 2 0 *|\n",
" day | 4 1 3 |\n",
" smoker | 2 1 1 ?|\n",
"-----------------------------------------------------+\n",
"? = number of redundant parameters may be higher\n",
"* = FE nested within cluster; treated as redundant for DoF computation\n",
"\n",
". est store m1\n",
"\n",
". \n"
]
}
],
"source": [
"%%stata -d df -force -eret eret -doutd df2\n",
"reghdfe tip $X, a(sex time day smoker) vce(cluster time)\n",
"est store m1"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what has been returned to Python:\n",
"\n",
"1. The data generated by Stata (`df2`), which is convenient for further processing.\n",
"2. The estimation results (`eret`), which make it easy to work with the model parameters."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"