tag:blogger.com,1999:blog-89604686942792610052024-03-05T04:51:26.036-08:00Adventures in SASCoding findings and challenges in academic research.Unknownnoreply@blogger.comBlogger16125tag:blogger.com,1999:blog-8960468694279261005.post-19891309176140182152016-11-03T10:11:00.000-07:002016-11-03T10:11:46.258-07:00RDPlotI like open source. Specifically, I like github. Mostly because if something annoys me in code, I can fork, change, then pull from my branch in the future (how I handle <a href="https://github.com/gaulinmp/tqdm" target="_blank">TQDM</a>). Or clone, update locally, and pull changes, dealing with conflicts in the future (how I handle <a href="https://github.com/robbyrussell/oh-my-zsh" target="_blank">oh-my-zsh</a>). Either way, I can make my changes persist somehow.<br />
<br />
The reason I bring this up, is because I wanted to test my code against the <a href="https://sites.google.com/site/rdpackages/rdrobust" target="_blank">RDPlot</a> package. I threw my data at it and it worked just fine, until I tried to do area confidence limits. Because the <a href="http://www-personal.umich.edu/~cattaneo/software/rdrobust/stata/rdplot.ado" target="_blank">RDPlot code</a> uses <span style="font-family: "courier new" , "courier" , monospace;">xline</span>, Stata lays that down as part of the axis, which the area plot then covers. So there's no vertical line at the RD cutoff in the middle of the graph. This will just not do. So I opened the code, and added the following (on line 646 if you're interested, changes <b>bolded</b>). Now it looks as I desire, but I can't make this persist across updates. Nor can I submit a bug report or a pull request. If only it were on github :(. C'est la vie, it works for now.<br />
<br />
<span style="font-family: monospace;"><b>quietly sum cir_bin, d<br />
local plot_y_max = r(max)<br />
quietly sum cil_bin, d<br />
local plot_y_min = r(min)</b><br />
<br />
twoway (rarea cil_bin cir_bin meanx_bin, sort color(gs11)) ///<br />
(scatter meany_bin meanx_bin, sort msize(small) mcolor(gs10)) ///<br />
(line y_hat x_sup if x_sup<`c', lcolor(black) sort lwidth(medthin) lpattern(solid)) /// (line y_hat x_sup if x_sup>=`c', lcolor(black) sort lwidth(medthin) lpattern(solid)) ///<br /><b>
(pci `plot_y_max' `c' `plot_y_min' `c', lcolor(black) lwidth(medthin) legend(off)), ///</b><br />
xline(`c', lcolor(black) lwidth(medthin)) xscale(r(`x_min' `x_max')) legend(cols(2) order(2 "Sample average within bin" 3 "Polynomial fit of order `p'" )) `graph_options'<br />
<br />
</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-70069279120623912002016-06-09T18:45:00.000-07:002016-06-09T18:45:29.366-07:00Ipython parallel local vs. engine executionTL;DR: using a lambda in the map on a <a href="https://ipyparallel.readthedocs.io/en/latest/" target="_blank">ipyparallel</a> View will obviate loading the function locally.<br />
<br />
I've always used <span style="font-family: "courier new" , "courier" , monospace;">%%px --local</span> to do parallel processing in Python. But recently I wanted to throw all my code in a python file, then just have a short notebook that essentially just kicked off the processes and wrote the results to disk. So I tried this:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">#In [1]:<br />
from ipyparallel import Client<br />
IP_client = Client()<br />
IP_view = IP_client.load_balanced_view()<br />
<br />
# In [2]:<br />
%%px<br />
import sys<br />
sys.path.append('.../code/')<br />
from myresearch import analyze_multiple_ciks<br />
<br />
#In [3]:<br />
N = len(IP_client.ids) # or larger for load balancing<br />
_gs = [df[(df.cik > (_d.cik.quantile(i/N) if i else 0))<br />
&(df.cik <= df.cik.quantile((i+1)/N))] <br />
for i in range(N)] <br />
<br />
#In [4]:<br />
res = IP_view.map(analyze_multiple_ciks, _gs)</span></blockquote>
However this doesn't work. The reason is the <span style="font-family: "courier new" , "courier" , monospace;">IP_view.map</span>; it's looking for <span style="font-family: "courier new" , "courier" , monospace;">analyze_multiple_ciks</span> locally, which we haven't loaded. So wrapping that function to defer its referencing seems to work:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">#In [4]:<br />
res = IP_view.map(lambda x: analyze_multiple_ciks(x), _gs)</span></blockquote>
Perhaps this was obvious, but I couldn't find much online about it. Also I do the chunking manually in In[3] because I've found using ipython to queue 23,000 tasks is really slow. So I wrap my code in an 'analyze_multiple' function and reduce the queue length considerably. Maybe that's not still a problem in the updated ipyparallel, but it's how I've always done it.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-45923250767782492822016-04-05T14:32:00.006-07:002016-04-05T14:32:59.505-07:00Clustered Standard Errors in Statsmodel OLS<div class="tr_bq">
I am using <a href="http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html" target="_blank">Statsmodel</a> instead of STATA where possible, and wanted to cluster standard errors by firm. The problem I encountered was I use <a href="https://patsy.readthedocs.org/en/latest/" target="_blank">Patsy</a> to create the endog/exog matrices, and statsmodel requires the cluster group Series to match length. (Aside: There's an open <a href="https://github.com/statsmodels/statsmodels/issues/1220" target="_blank">Github issue</a> about this.) I'm sure there are more clever solutions, but mine was to give Patsy a dataframe with no missing data. The statsmodels documentation was a bit unclear, so I figured I'd share the working snippet below. </div>
<br />
<blockquote>
<span style="font-family: monospace;"># Selection criteria<br />
select_df = (df[(df['at']>1) & (df['ff12']!=8)]<br />
.sort_values('cik y_q'.split()))<br />
<br />
# Columns that appear in regressions, as well as group variable<br />
cols = 'cik cp ni_at re_at xrd_at at y_q ff12'.split()<br />
<br />
# Final dataframe with no missing data.<br />
# This gets the patsy arrays and group series to have the same length.<br />
reg_df = select_df.ix[select_df[cols].notnull().all(axis=1), cols]<br />
<br />
mod = sm.OLS.from_formula('cp ~ ni_at + re_at + xrd_at + np.log(at)'<br />
'+ C(y_q) + C(ff12)', reg_df)<br />
<br />
res = mod.fit(cov_type='cluster', cov_kwds={'groups': reg_df['cik']})<br />
<br />
# output results without F.E. dummies<br />
print("\n".join([x for x in str(res.summary()).split('\n')<br />
if 'C(' not in x]))<br />
</span></blockquote>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-16800465521002905402016-02-16T22:19:00.001-08:002016-04-29T18:35:42.843-07:00Fama French IndustriesI'm back in Python and needing to get FF12 from sic codes. So I wrote a little script to download the definitions from French's website and make a Pandas DataFrame that allows for merging. Thought I would share:<br />
<br />
<br />
<b>Edit:</b> An alternative is to use <span style="font-family: Courier New, Courier, monospace; font-size: x-small;">pandas_datareader.famafrench</span><br />
<br />
<br />
<script src="https://gist.github.com/gaulinmp/89bf63daea6add5e013d.js"></script>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-75984966615446932282016-02-01T14:24:00.002-08:002016-02-01T14:24:58.065-08:00SAS on XUbuntuFor a long time I only had SAS running in -nodms mode on the latest XUbuntu, my desktop's OS. Today I finally figured it out, and wanted to share just in case anyone else has had this problem.<br />
<br />
First off, I'm running Xubuntu Wily (15.10), and SAS 9.4. The installation didn't work in graphical mode, because when I <span style="font-family: Courier New, Courier, monospace;">sudo su sas</span>, then <span style="font-family: Courier New, Courier, monospace;">./sasdm.sh</span>, it complains: <span style="font-family: Times, Times New Roman, serif;"><i>Can't connect to X11 window server using ':0' as the value of the DISPLAY variable</i></span>. Whatever, <span style="font-family: Courier New, Courier, monospace;">./sasdm.sh -console</span> works. Anyway, the first problem when launching SAS is that it complained about the SASHELP Portable Registry being corrupted. Turns out it didn't exist at all. So I had to copy <b>regstry.sas7bitm</b> from a working version of SAS 9.3 (yeah, it worked across versions somehow) to my local sascfg directory (<i>/opt/SASHome/SASFoundation/9.4/nls/en/sascfg/</i>). Once that was there, I started getting errors about missing libraries. First <b>libXp.so.6</b>, which doesn't exist on the Wily repo any more, and must be downloaded from the Vivid repo here:<br />
<br />
<a href="http://packages.ubuntu.com/vivid/amd64/libxp6/download">http://packages.ubuntu.com/vivid/amd64/libxp6/download</a><br />
<br />
And secondly libjpeg.so.62, which can be installed with <span style="font-family: Courier New, Courier, monospace;">sudo apt-get install libjpeg62-dev libjpeg62</span>. Finally once that was done, SAS loaded in dms mode. It also now runs in X11 mode forwarded over ssh now too.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-28250432940647917872015-11-12T17:34:00.002-08:002015-11-12T17:35:08.749-08:00SAS on JupyterI strongly prefer to do all my coding from within <a href="https://github.com/jupyter/jupyter" target="_blank">Jupyter</a> notebooks, but that's not really possible when everyone else uses SAS (well, in Accounting). So I threw together a really simple SAS kernel for Jupyter, which is hosted on github (<a href="https://github.com/gaulinmp/sas_kernel" target="_blank">gaulinmp/sas_kernel</a>). It'd definitely a work in progress, right now it doesn't even strip line numbers. But my free time is limited, what with dissertating and all. While I'm at it, I'll also plug my <a href="https://github.com/gaulinmp/pyedgar" target="_blank">SEC EDGAR</a> python library, which I use a lot these days.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-8960468694279261005.post-637396987306020342014-05-20T10:48:00.004-07:002016-02-02T10:55:09.588-08:00Helpful SAS UI/usage tipsTL;DR: <a href="http://support.sas.com/resources/papers/proceedings12/151-2012.pdf" target="_blank">http://support.sas.com/resources/papers/proceedings12/151-2012.pdf</a><br />
<br />
After a long time reading and too little time in Python, I'm back to SAS. My setup involves sshing into a linux server and using SAS over X11 because I like to look at tables. I know there is SAS interactive mode, but I'm a Luddite or don't want to incur the learning costs.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjXGv8CSB6EYl18BIIyEARPv_99oM2twnNIBQwXgtOBmrGIyAStiGpCqHEn5rwKIl_9fp_T8FYoztuLL7f3dGwfmXEBSKCpAOCHbcyFD3r6tu2gpXwgEuGHsgo8L_LjyzroUt64yXCLlL/s1600/keys_file.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjXGv8CSB6EYl18BIIyEARPv_99oM2twnNIBQwXgtOBmrGIyAStiGpCqHEn5rwKIl_9fp_T8FYoztuLL7f3dGwfmXEBSKCpAOCHbcyFD3r6tu2gpXwgEuGHsgo8L_LjyzroUt64yXCLlL/s1600/keys_file.png" width="127" /></a> Right now my keys file looks like this:<br />
<br />
<b>clear;paste;submit;</b> is what I use most often. I program in SublimeText3, and copy sections of code, alt-tab over and hit F1.<br />
<br />
<b>vt &syslast. colheading=name;</b> uses the VIEWTABLE (vt) command to open a table for viewing, and &syslast is an automatic variable that stores the last edited table. This is a 'what did I just make' button.<br />
<br />
<b>gsubmit "QUIT;PROC SQL;"</b> is convenient because I do almost everything in PROC SQL, which I like to just leave running. But when I jump out quickly, this gets me back in so I don't have to copy and paste the proc start command.<br />
<br />
<b>gsubmit "%REMOVE_LABELS(&syslast);"</b> runs a macro that removes the labels from the last file edited. This is not really that important, I just don't like labels in my datasets.<br />
<br />
<br />
The REMOVE_LABELS macro can be found in my <a href="https://gist.github.com/gaulinmp/99bc9c7eb8fe14dd7860#file-macros-sas" target="_blank">MACROS.SAS gist</a>.<br />
<br />
Also to automatically display variable names in the column headings of tables, <a href="http://www.sas.com/offices/europe/uk/support/sas-hints-tips/sas-column-names.html" target="_blank">see here</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-72017497955416450692013-09-03T01:43:00.003-07:002013-09-03T01:43:55.155-07:00STATA Quote MadnessSo I was trying to make pretty LaTeX tables in Stata using esttab, but I happen to be using a program that just outputs scalars. Here's the program in a nutshell:<br />
<br />
<blockquote class="tr_bq" style="font-family: Courier New, Courier, monospace;">
foreach var of varlist at lt invt ppent sale re xrd ta dv {<br />
local gtitle = "Total Assets"<br />
if "`var'" == "lt" {<br />
local gtitle = "Total Liabilities"<br />
}/* List the rest of the titles */<br />
discont `var' port // This is my program that returns scalars.<br />
matrix tmpmat_`var' = r(leftpred), r(rightpred), ///<br />
r(d), r(zstat), r(pstat)<br />
<br />
/* You have to initialize the matrix, of course. */<br />
if "`var'" == "at" {<br />
matrix tmpmat_all = tmpmat_`var'<br />
}<br />
else { /* But then the format is pretty friendly */<br />
matrix tmpmat_all = (tmpmat_all \ tmpmat_`var')<br />
}<br />
<br />
/* Now how to get those title names into the matrix rows? */<br />
local mrownames: display `"`mrownames'"' " " `"`"`gtitle'"'"'<br />
<br />
} /* Done with the foreach*/<br />
<br />
matrix colnames tmpmat_all = E[Left] E[Right] Difference Z-stat P-stat<br />
matrix rownames tmpmat_all = `mrownames'<br />
esttab matrix(tmpmat_all), nomtitles<br />
esttab matrix(tmpmat_all) using table.tex, nomtitles replace</blockquote>
<br />
In case that wasn't clear, the operable line to get those labels working was:<br />
<span style="font-family: Courier New, Courier, monospace;">local mrownames: display `"`mrownames'"' " " `"`"`gtitle'"'"'</span><br />
<br />
Come on STATA... seriously? I think this is the logic:<br />
<br />
1) <span style="font-family: Courier New, Courier, monospace;">local mrownames: display </span><br />
Of course we can't assign it directly, you gotta format it. This is probably my ignorance, I'm sure there's a better way.<br />
<br />
2) <span style="font-family: Courier New, Courier, monospace;"><span style="background-color: yellow;">`"</span>`mrownames'<span style="background-color: yellow;">"'</span> " " `"`"`gtitle'"'"'</span><br />
Make sure the variable between these keeps its quotes.<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
3) <span style="font-family: Courier New, Courier, monospace;">`"<span style="background-color: yellow;">`mrownames'</span>"' " " `"`"`gtitle'"'"'</span><div>
The list so far. In quotes remember.</div>
<div>
<br /><div>
4) <span style="font-family: Courier New, Courier, monospace;">`"`mrownames'"' <span style="background-color: yellow;">" "</span> `"`"`gtitle'"'"'</span></div>
<div>
Make sure to leave a space between your quoted strings.</div>
<div>
<br /></div>
5) <span style="font-family: Courier New, Courier, monospace;">`"`mrownames'"' " " `"`"<span style="background-color: yellow;">`gtitle'</span>"'"'</span></div>
<div>
The new title variable, so good so far.</div>
<div>
<br /></div>
<div>
<div>
6) <span style="font-family: Courier New, Courier, monospace;">`"`mrownames'"' " " `"<span style="background-color: yellow;">`"</span>`gtitle'<span style="background-color: yellow;">"'</span>"'</span></div>
<div>
It's gotta be in quotes, of course.</div>
<div>
<br /></div>
7) <span style="font-family: Courier New, Courier, monospace;">`"`mrownames'"' " " <span style="background-color: yellow;">`"</span>`"`gtitle'"'<span style="background-color: yellow;">"'</span></span></div>
<div>
Because you are up late and deserve to be punished. How long did it take you to figure this one out? Yeah, you could have been sleeping already if this were Python.</div>
<div>
<br /></div>
<div>
I really don't like STATA.</div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-60192846207400817112012-11-18T13:12:00.001-08:002013-03-11T13:14:13.813-07:00Persistant Default LibraryI've been switching back and forth between libraries I define and the work library, mostly because I don't like putting "mylib." in front of every library name. But today I learned that if you define the USER library, it will use that as a permanent work library. So for example the following code:<br />
<blockquote class="tr_bq"><span style="font-family: Courier New, Courier, monospace;">libname USER "D:/SAS/project1";<br />
<br />
DATA example_database;<br />
<br />
SET other_database_in_D_SAS_project1;<br />
<br />
RUN;</span></blockquote>Will create example_database.sas7db in the D:/SAS/project1 folder, and when you reboot SAS and run the libname user command again (or better yet, put it in your autoexec.sas file), all your work files will be there waiting. It's a time saver if you are working on a project and have to shutdown SAS.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-45585204680862773642012-03-16T20:52:00.000-07:002012-03-16T20:53:01.022-07:00Data Step Array (Macro) VariablesI want some thing simple. I have a data file with many fields with sequential names, and I want to reorganize them. It's all dead simple regex logic: var12 becomes var2 in row 1, var22 becomes var2 in row 2. Two minutes in python. In SAS...<br />
<br />
So here's my method of making an Array Macro Variable (nothing native to the best of my knowledge) on which I will then use numbered indexes to massage the table later:<br />
<br />
<blockquote class="code">data _null_;<br />
i = 1;<br />
DO name = "bear","pig","velociraptor";<br />
ii = left(put(i,2.));<br />
call symput('variable_name'||ii,name);<br />
i+1;<br />
put name;<br />
END;</blockquote><br />
The output of that is the following<br />
<br />
<blockquote class="code">bear<br />
pic<br />
velo</blockquote><br />
Oh yeah, that happened. SAS guessed the length of <i>name</i> for the loop at 4 characters, then truncated velociraptor.<br />
<br />
The solution was to use the <i>length name $12</i>:<br />
<br />
<blockquote class="code">data _null_;<br />
i = 1;<br />
length name $12;<br />
DO name = "bear","pig","velociraptor";<br />
ii = left(put(i,2.));<br />
call symput('variable_name'||ii,name);<br />
i+1;<br />
put name;<br />
END;</blockquote>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-19783660654740210812012-03-12T03:00:00.001-07:002012-03-12T03:00:42.022-07:00Cleaning SDC Downloaded DataI wrote some python code to simply clean up SDC downloaded fixed width data. It uses easygui, a python library to give it a little GUI interface, and I'll be the first to admit it's pretty crappy. But it works and made my life easier, so I'll share.<br />
<br />
<a href="https://docs.google.com/document/d/14Y2rm_Z3SfPTfL12U-vOXiLdHAUa0v_QHT0Knm6QFiE/edit">Google Doc Link</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-73289869467026288462012-03-11T22:28:00.000-07:002012-03-11T22:40:20.613-07:00Compustat Codes and Field (Variable) NamesMaybe this is common knowledge, but this page made life infinitely easier to replicate old papers:<br />
<br />
<blockquote>
<a href="http://www.crsp.chicagobooth.edu/documentation/product/ccm/cross/annual_data.html">http://www.crsp.chicagobooth.edu/documentation/product/ccm/cross/annual_data.html</a></blockquote>
<br />
Now all I need is free time to turn that into a tool for automatic SQL query generation.<br />
<br />
Also while I'm at it I may as well plug a program I use constantly. <a href="http://winsplit-revolution.com/">WinSplit Revolution</a> allows you to resize windows with key commands (I use control alt numpad). So setting up my desktop to look like below takes a few keyboard presses. If you use multiple monitors it's a life saver. Because sharing is caring.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB7YpCAajq4wOOAAcnqg0Y8xgD5jRjz5qkz2drRGZj5UrDbTjI5VKA-78J1yH5AuUIrySzHG7zgNyUnVXXmx9phFLINgvwU-FAXN55zqtzxCVdi7FoTcwwOiZW-NVHo1AYoWEi2QK_LriJ/s1600/desktop.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB7YpCAajq4wOOAAcnqg0Y8xgD5jRjz5qkz2drRGZj5UrDbTjI5VKA-78J1yH5AuUIrySzHG7zgNyUnVXXmx9phFLINgvwU-FAXN55zqtzxCVdi7FoTcwwOiZW-NVHo1AYoWEi2QK_LriJ/s400/desktop.jpg" width="400" /></a></div>
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-39007731543100498082012-03-09T16:23:00.000-08:002012-03-09T16:23:15.406-08:00Searching Google ScholarI've been doing non-SAS research, and got tired of messing around with Google Scholar searches, so I made the following chrome search tag:<br />
<blockquote class="code">
http://scholar.google.com/scholar?&num=100&as_subj=bus&as_q=%s</blockquote>
That searches within Business and Finance journals only which is quite convenient in my opinion. To add to Chrome (similar in firefox), go to the URL in the following screen shot and enter the URL quoted above where it says. I put the letter <b>s</b> for the keyword, so now when I want to do a google scholar search I type <i>CNTRL-L</i>, then type <i>s</i> and hit space. Anything typed after the space will be the search term.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGHTuOaFZurM41z0dvDbipLtyxP2ou_hgkcDstAZFBuBiDsJj7IKhcNDoRPN0-4O6CO7gKGwWg8LitAAGbemowTHcrut1C8sF59AGp0KN6pdzGdixdhP0t5deNobi35WiA4kCmK8jQcehC/s1600/scholar_search.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="387" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGHTuOaFZurM41z0dvDbipLtyxP2ou_hgkcDstAZFBuBiDsJj7IKhcNDoRPN0-4O6CO7gKGwWg8LitAAGbemowTHcrut1C8sF59AGp0KN6pdzGdixdhP0t5deNobi35WiA4kCmK8jQcehC/s640/scholar_search.jpg" width="640" /></a></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-32553968519286706552012-02-28T17:24:00.000-08:002012-02-28T17:24:30.775-08:00Remove Overlapping WindowsA little bit of SAS code to remove dates that overlap by some amount (defined in a Macro Variable). Example from an event study with dates from SDC.<br />
<br />
First off sort the data, removing any duplicates. In this case duplicates is determined just by the variables sorted on, so permco, permno, and date. This removes the problem of having more than one event on the same day.<br />
<br />
Create a temporary table with lag and difference fields. The lag field is more for debugging as the difference field is all we care about. If the event is the first event for the company, leave the fields blank.<br />
<br />
Create the output table and only keep events where the event date is either blank (first event) or if the time between events is greater than the window.<br />
<br />
<center>------ Code ------ </center><br />
<b>mydata</b>: the library where my crsp and cleaned sdc files are<br />
<b>sdc_permno</b>: database with SDC data and permno's instead of cusips (already linked SDC and CRSP)<br />
<b>sdc_final</b>: the output sdc data file with the event dates (non overlapping) and permnos<br />
<b>&beta_pre_length</b>: The macro variable with the length of my pre window for calculating the betas in the market return model<br />
<b>&study_start_date</b>: The macro bariable with the start date<br />
<br />
<blockquote class='code'>/* <b>SDC_FINAL</b>: Clean up data and remove all SEOs with another SEO within window_prev before. */<br />
PROC SORT DATA=mydata.sdc_permno OUT=tmp_sdc_permno NODUPKEY;<br />
BY permco permno fdate;<br />
RUN;<br />
DATA tmp_sdc_lag; <br />
SET tmp_sdc_permno; <br />
BY permco permno fdate idate shares shares_p shares_s; <br />
lag_date = LAG(fdate); <br />
dif_date = DIF(fdate);<br />
IF FIRST.permco THEN DO; <br />
lag_date = .; <br />
dif_date = .; <br />
END;<br />
RUN;<br />
PROC SQL;<br />
CREATE TABLE mydata.sdc_final AS<br />
SELECT monotonic() as count, permno, fdate, shares, shares_p, shares_s<br />
FROM tmp_sdc_lag<br />
WHERE fdate > &study_start_date and (dif_date EQ . OR dif_date > &beta_pre_len)<br />
;<br />
DROP TABLE tmp_sdc_lag;<br />
DROP TABLE tmp_sdc_permno;<br />
QUIT; </blockquote>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-46843765991152549952012-02-27T16:13:00.001-08:002012-02-27T16:14:12.358-08:00sas7bndx: Story of SAS IndexesLast night I ran code to extract companies from the CRSP Daily Stock file. It ran all night and was going to take 8 more hours today. This is unacceptable. The fix was relatively simple, here's how it came about.<br />
<br />
First I got the file original CRSP datafile (2 hour download):<br />
<br />
<blockquote class="code">libname mydata = "D:\SAS";<br />
%let wrds=wrds.wharton.upenn.edu 4016;<br />
options comamid=TCP remote=WRDS;<br />
signon username=&username password=&password;<br />
rsubmit;<br />
libname crspa '/wrds/crsp/sasdata/a_stock';<br />
PROC DOWNLOAD DATA= crspa.dsf<br />
OUT=mydata.crsp_dsf;<br />
WHERE date > '01JAN1995'd; <br />
RUN;<br />
endrsubmit;<br />
signoff;</blockquote><br />
But extracting permnos from that file took 1 minute for each PROC SQL statement. Everyone said something was wrong, no one had ideas what. So I poked around a WRDS ssh session and noticed all these .sas7bndx files. <i>NDX sounds like index</i> I cleverly thought to myself, <i>maybe SAS has an index file to make looking up permnos faster!</i> Lo and behold, it does. So then I ran this code:<br />
<br />
<blockquote class="code">PROC DATASETS LIBRARY=mydata;<br />
MODIFY crsp_dsf;<br />
INDEX CREATE permno;<br />
RUN;</blockquote><br />
And now my PROC SQL queries execute in milliseconds. Day seized.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8960468694279261005.post-383122210657918352012-02-27T00:06:00.000-08:002012-02-28T17:29:46.093-08:00PROC SQL Dates and Macro VariablesHere's a little quiz: What is the name of a 'column' in a database?<br />
<blockquote>a) column<br />
b) field<br />
c) anything but<br />
d) variable</blockquote>If you answered D, go to wikipedia and find 'variable' on the SQL page. Go, I'll wait. So after wasting hours searching for variables, I learn that SAS calls columns variables, and the things that represent numbers for flexibility (in algebra you call them variables) are MACRO variables. Good luck googling help for macro variables and avoiding all the pages that discuss both database variables and macros.<br />
<br />
I digress. Here's what I learned. The following code: <br />
<blockquote class="code">PROC SQL NOPRINT;<br />
SELECT permno,fdate format=mmddyy10.<br />
INTO :permnovar, :eventdate<br />
FROM libstore.sdc_events<br />
WHERE count = &i;<br />
<br />
SELECT *<br />
FROM libstore.crsp_dsf<br />
WHERE date = &eventdate;<br />
QUIT;</blockquote>Will return null. Oh the data is in there, if you replace &eventdate; by '01JAN2001'd you will get data. But SAS can't pull a date from one table and use it in the WHERE query in another table. MACRO VARIABLES just don't work that way. Duh.<br />
<br />
So the solution? Everything is text. That's my new mantra. Obviously the solution is to use the following code: <br />
<blockquote class="code">PROC SQL NOPRINT;<br />
SELECT permno,fdate format=<b>date9.</b><br />
INTO :permnovar, :eventdate<br />
FROM libstore.sdc_events<br />
WHERE count = &i;<br />
QUIT;<br />
<br />
%let eventdate_cleaned = "&eventdate"d;<br />
<br />
PROC SQL NOPRINT;<br />
SELECT *<br />
FROM libstore.crsp_dsf<br />
WHERE date = &eventdate_cleaned;<br />
QUIT;</blockquote>The other thing to note is that mmddyy10 is NOT the format to use. Even though it comes out of SDC like that, and CRSP uses something else, PROC SQL uses date9 in its comparison. In fact '01/01/2000'd throws an error in PROC SQL. Don't ask, just accept it.<br />
<br />
So the moral of this story is make everything a string. Always. Also use date9 in PROC SQL. And '&eventdate' doesn't work, only double quotes evaluates the macro variable. Ugh, macro variable angers me.Unknownnoreply@blogger.com2