Lutz Hendricks - UNC - Department of
Working With Data
- Read the data documentation.
- You need to know how the data are structured, who was surveyed, limitations of the data, etc.
- Assignment: Browse through the documentation for the Current Population Survey (CPS) here. Also look at the sample design.
- Figure out the variables that you need.
- Make sure you read the codebooks, especially if the sample design is complicated.
- You need to know in which context the questions were asked and who was asked.
- Assignment: Browse through the CPS codebooks.
- Download data, usually from some web site.
- Convert data into the format you want.
- Datasets usually come in odd formats, such as non-delimited ASCII files.
- Stats packages, such as Stata, can often read those files (data sources often provide programs for this).
- Stat/Transfer can convert between data formats. Especially useful if you want to work in a real programming language rather than a stats package.
- Show summary statistics and distributions for all variables.
- Check for implausible values, outliers, odd distributions (e.g., we would expect a wage distribution to look roughly log-Normal).
- Look at time series plots. Investigate any suspicious jumps.
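The conversion and checking steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column layout, the variable names, and the sample lines are all made up, and in practice the column positions come from the codebook. It slices a non-delimited file into fields, prints summary statistics, and flags values outside a plausible range.

```python
import math
import statistics

# Hypothetical column layout (start, end) for a fixed-width file.
# Real positions must be taken from the dataset's codebook.
LAYOUT = {"person_id": (0, 4), "age": (4, 7), "wage": (7, 13)}


def parse_fixed_width(lines, layout):
    """Slice each non-delimited line into named integer fields."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        records.append(
            {name: int(line[start:end]) for name, (start, end) in layout.items()}
        )
    return records


def summarize(values):
    """Basic summary statistics for one variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def flag_implausible(records, var, low, high):
    """Return the records whose value of `var` lies outside [low, high]."""
    return [r for r in records if not (low <= r[var] <= high)]


# Made-up sample lines, not real survey data.
raw = [
    "0001 34 52000",
    "0002 29 41000",
    "0003999 38000",  # age 999 looks like a missing-value code
    "0004 51 67000",
]

records = parse_fixed_width(raw, LAYOUT)
age_summary = summarize([r["age"] for r in records])
bad_ages = flag_implausible(records, "age", 0, 120)

# If wages are roughly log-Normal, log wages should look roughly symmetric,
# so their mean and median should be close.
log_wages = [math.log(r["wage"]) for r in records]
log_wage_summary = summarize(log_wages)

print(age_summary)
print("Implausible ages:", bad_ages)
print(log_wage_summary)
```

The suspicious maximum of the age variable shows up immediately in the summary, which is the point of looking at every variable before doing any analysis.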
What Can Go Wrong?
- Not enough observations to get precise answers.
- Example: The Brazilian census has >1m observations, but only a few thousand immigrants.
- Questions and coding change over time.
- Example: CPS records number of years of schooling until 1980s; then highest degree earned.
- Imputed responses (often not obvious).
- Example: CPS imputes earnings for quite a few observations, but IPUMS does not tell you for which ones.
- Top coding, especially for dollar figures (income, wealth).
- Samples are not representative.
- Example: New Immigrant Survey only contains legal permanent residents.
- Almost all panel datasets omit most of the richest 1% (PSID, NLSY).
- Attrition in panel data (people drop out non-randomly over the years).
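Top coding is often visible in the data itself as a pile of observations at exactly the largest value. A minimal check, assuming a made-up income list and a hypothetical helper `top_code_share`, is to compute the share of observations that sit at the maximum. A large share is only a hint; the documentation remains the authoritative source on top codes.

```python
def top_code_share(values):
    """Return the maximum value and the share of observations equal to it."""
    top = max(values)
    share = sum(1 for v in values if v == top) / len(values)
    return top, share


# Made-up incomes, not real survey data; 99999 plays the role of a top code.
incomes = [18000, 25000, 32000, 47000, 99999, 61000, 99999, 99999, 54000, 29000]

top, share = top_code_share(incomes)
print(f"max = {top}, share at max = {share:.2f}")
```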
Stats packages, such as Stata, are great at data handling. But their programming languages are a disaster. Personally, I refuse to work in poorly designed scripting languages and use Matlab instead. This means that I cannot help you with Stata code (but your advisor likely can).
- Write your code so that you can go from raw data to final results by running a single command without manual intervention. It’s the only way to ensure consistency.
- Read the questionnaires. You need to know exactly what people were asked to understand their answers.
- The questionnaire also reveals who is in the “universe” for a question (who was asked).
- Code and Data for the Social Sciences: A Practitioner’s Guide by Gentzkow and Shapiro.
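One way to get from raw data to final results with a single command is to write each step as a function and chain the steps in one entry point. The sketch below is an assumption about how such a script might be organized, not a prescribed layout: the function names, the `id` and `wage` columns, and the inline sample data are hypothetical, and a real project would read the raw file from disk.

```python
import csv
import io
import statistics


def load_raw(source):
    """Read the raw file into a list of dicts, one per row."""
    return list(csv.DictReader(source))


def clean(rows):
    """Drop rows with a missing wage and convert fields to numbers."""
    cleaned = []
    for row in rows:
        if row["wage"] == "":
            continue  # missing wage: excluded here, by assumption
        cleaned.append({"id": int(row["id"]), "wage": float(row["wage"])})
    return cleaned


def analyze(rows):
    """Compute the final results from the cleaned data."""
    wages = [r["wage"] for r in rows]
    return {"n": len(wages), "mean_wage": statistics.mean(wages)}


def run_all(source):
    """Single entry point: raw data in, final results out."""
    return analyze(clean(load_raw(source)))


# Made-up raw data standing in for a downloaded file.
RAW = "id,wage\n1,20.0\n2,\n3,30.0\n"

results = run_all(io.StringIO(RAW))
print(results)
```

Because nothing is edited by hand between steps, rerunning `run_all` after the raw data or a cleaning rule changes regenerates every result consistently.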