Tuesday, February 28, 2012

Remove Overlapping Windows

A little bit of SAS code to remove dates that overlap by some amount (defined in a Macro Variable). Example from an event study with dates from SDC.

First off sort the data, removing any duplicates. In this case duplicates is determined just by the variables sorted on, so permco, permno, and date. This removes the problem of having more than one event on the same day.

Create a temporary table with lag and difference fields. The lag field is more for debugging as the difference field is all we care about. If the event is the first event for the company, leave the fields blank.

Create the output table and only keep events where the event date is either blank (first event) or if the time between events is greater than the window.

------ Code ------

mydata: the library where my crsp and cleaned sdc files are
sdc_permno: database with SDC data and permno's instead of cusips (already linked SDC and CRSP)
sdc_final: the output sdc data file with the event dates (non overlapping) and permnos
&beta_pre_length: The macro variable with the length of my pre window for calculating the betas in the market return model
&study_start_date: The macro bariable with the start date

/* SDC_FINAL: Clean up data and remove all SEOs with another SEO within window_prev before. */
PROC SORT DATA=mydata.sdc_permno OUT=tmp_sdc_permno NODUPKEY;
    BY permco permno fdate;
RUN;
DATA tmp_sdc_lag;
    SET tmp_sdc_permno;
    BY permco permno fdate idate shares shares_p shares_s;
    lag_date = LAG(fdate);
    dif_date = DIF(fdate);
    IF FIRST.permco THEN DO;
        lag_date = .;
        dif_date = .;
    END;
RUN;
PROC SQL;
    CREATE TABLE mydata.sdc_final AS
    SELECT monotonic() as count, permno, fdate, shares, shares_p, shares_s
    FROM tmp_sdc_lag
    WHERE fdate > &study_start_date and (dif_date EQ . OR dif_date > &beta_pre_len)
    ;
    DROP TABLE tmp_sdc_lag;
    DROP TABLE tmp_sdc_permno;
QUIT;

Monday, February 27, 2012

sas7bndx: Story of SAS Indexes

Last night I ran code to extract companies from the CRSP Daily Stock file. It ran all night and was going to take 8 more hours today. This is unacceptable. The fix was relatively simple, here's how it came about.

First I got the file original CRSP datafile (2 hour download):

libname mydata = "D:\SAS";
%let wrds=wrds.wharton.upenn.edu 4016;
options comamid=TCP remote=WRDS;
signon username=&username password=&password;
  rsubmit;
    libname crspa '/wrds/crsp/sasdata/a_stock';
    PROC DOWNLOAD DATA= crspa.dsf
      OUT=mydata.crsp_dsf;
      WHERE date > '01JAN1995'd;
    RUN;
  endrsubmit;
signoff;

But extracting permnos from that file took 1 minute for each PROC SQL statement. Everyone said something was wrong, no one had ideas what. So I poked around a WRDS ssh session and noticed all these .sas7bndx files. NDX sounds like index I cleverly thought to myself, maybe SAS has an index file to make looking up permnos faster! Lo and behold, it does. So then I ran this code:

PROC DATASETS LIBRARY=mydata;
  MODIFY crsp_dsf;
  INDEX CREATE permno;
RUN;

And now my PROC SQL queries execute in milliseconds. Day seized.

PROC SQL Dates and Macro Variables

Here's a little quiz: What is the name of a 'column' in a database?
a) column
b) field
c) anything but
d) variable
If you answered D, go to wikipedia and find 'variable' on the SQL page. Go, I'll wait. So after wasting hours searching for variables, I learn that SAS calls columns variables, and the things that represent numbers for flexibility (in algebra you call them variables) are MACRO variables. Good luck googling help for macro variables and avoiding all the pages that discuss both database variables and macros.

I digress. Here's what I learned. The following code:
PROC SQL NOPRINT;
    SELECT permno,fdate format=mmddyy10.
    INTO :permnovar, :eventdate
    FROM libstore.sdc_events
    WHERE count = &i;

    SELECT *
    FROM libstore.crsp_dsf
    WHERE date = &eventdate;
QUIT;
Will return null. Oh the data is in there, if you replace &eventdate; by '01JAN2001'd you will get data. But SAS can't pull a date from one table and use it in the WHERE query in another table. MACRO VARIABLES just don't work that way. Duh.

So the solution? Everything is text. That's my new mantra. Obviously the solution is to use the following code:
PROC SQL NOPRINT;
    SELECT permno,fdate format=date9.
    INTO :permnovar, :eventdate
    FROM libstore.sdc_events
    WHERE count = &i;
QUIT;

%let eventdate_cleaned = "&eventdate"d;

PROC SQL NOPRINT;
    SELECT *
    FROM libstore.crsp_dsf
    WHERE date = &eventdate_cleaned;
QUIT;
The other thing to note is that mmddyy10 is NOT the format to use. Even though it comes out of SDC like that, and CRSP uses something else, PROC SQL uses date9 in its comparison. In fact '01/01/2000'd throws an error in PROC SQL. Don't ask, just accept it.

So the moral of this story is make everything a string. Always. Also use date9 in PROC SQL. And '&eventdate' doesn't work, only double quotes evaluates the macro variable. Ugh, macro variable angers me.