Lutz Hendricks - UNC - Department of
Working With Data
Unlike in previous classes you may have taken, the data for your thesis are raw. You need to understand the structure of the dataset and “clean” the data.
- Read the data documentation.
- You need to know how the data are structured, who was surveyed, limitations of the data, etc.
- Assignment: Browse through the documentation for the Current Population Survey (CPS). Also look at the sample design.
- Figure out the variables that you need.
- Make sure you read the codebooks, especially if the sample design is complicated.
- You need to know in which context the questions were asked and who was asked.
- Assignment: Browse through the CPS codebooks.
- Download data, usually from some web site.
- Convert data into the format you want.
- Datasets usually come in odd formats, such as non-delimited (fixed-width) ASCII files.
- Stats packages, such as Stata, can often read those files (data sources often provide programs for this).
- Stat/Transfer can convert between data formats. This is especially useful if you want to work in a real programming language rather than a stats package.
- Show summary statistics and distributions for all variables.
- Check for implausible values, outliers, odd distributions (e.g., we would expect a wage distribution to look roughly log-Normal).
- Look at time series plots. Investigate any suspicious jumps.
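The conversion and checking steps above can be sketched in a few lines. This is only an illustration in Python, not a recommended toolchain. The column layout, the variable names, the plausible age range, and the records themselves are all made up; a real layout comes from the codebook.

```python
import statistics

# Hypothetical layout: (variable name, start column, end column),
# 0-based with the end column exclusive. Real layouts come from the codebook.
LAYOUT = [("year", 0, 4), ("age", 4, 7), ("wage", 7, 13)]

# Made-up records standing in for a raw non-delimited ASCII extract.
RAW_LINES = [
    "2001034052000",
    "2001051038500",
    "2002029000000",
    "2002999061000",  # age 999 looks like a missing-value code
    "2002045999999",  # wage 999999 looks like a special code
]


def parse_fixed_width(lines, layout):
    """Slice each line by column position and convert the pieces to integers."""
    return [
        {name: int(line[start:end]) for name, start, end in layout}
        for line in lines
    ]


def summarize(records, name):
    """Return basic summary statistics for one variable."""
    values = [r[name] for r in records]
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def flag_out_of_range(records, name, low, high):
    """Return the positions of records whose value falls outside [low, high]."""
    return [i for i, r in enumerate(records) if not (low <= r[name] <= high)]


records = parse_fixed_width(RAW_LINES, LAYOUT)
for name, _, _ in LAYOUT:
    print(name, summarize(records, name))
# The range 0 to 120 is an assumed plausibility bound for age.
print("implausible ages:", flag_out_of_range(records, "age", 0, 120))
```

Printing the minimum and maximum of every variable is often enough to spot special codes such as 999 that would otherwise be treated as real values.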
What Can Go Wrong?
- Not enough observations to get precise answers.
- Example: The Brazilian census has more than 1 million observations, but only a few thousand immigrants.
- Questions and coding change over time.
- Example: The CPS records the number of years of schooling until the early 1990s, then the highest degree earned.
- Imputed responses (often not obvious).
- Example: CPS imputes earnings for quite a few observations, but IPUMS does not tell you for which ones.
- Top coding, especially for dollar figures (income, wealth).
- Samples are not representative.
- Example: The New Immigrant Survey only contains legal permanent residents.
- Almost all panel datasets omit most of the richest 1% (PSID, NLSY).
- Attrition in panel data (people drop out non-randomly over the years).
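Some of these problems can be detected mechanically. As a rough sketch, top coding usually shows up as a pile of observations sitting exactly at the maximum of a dollar variable. The function name, the 1% threshold, and the income figures below are assumptions chosen for illustration; the documentation remains the authoritative source for the actual top codes.

```python
from collections import Counter


def top_code_candidate(values, min_share=0.01):
    """Return (max value, share of observations at the max) if the maximum
    is repeated often enough to suggest top coding, otherwise None.

    min_share is an arbitrary illustrative threshold, not a standard.
    """
    counts = Counter(values)
    top = max(counts)
    share = counts[top] / len(values)
    if counts[top] > 1 and share >= min_share:
        return top, share
    return None


# Made-up incomes: three observations share the maximum value 99999.
incomes = [21000, 35000, 48000, 99999, 52000, 99999, 99999, 18000, 61000, 30500]
print(top_code_candidate(incomes))

# A variable without a pile-up at the maximum is not flagged.
print(top_code_candidate([21000, 35000, 48000, 52000]))
```

The same counting idea applies to small cells: tabulate the number of observations in each subgroup you plan to study before assuming the sample is large enough.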
Stats packages, such as Stata, are great at data handling. But their programming languages are a disaster. Personally, I refuse to work in poorly designed scripting languages and use Matlab instead. This means that I cannot help you with Stata code (but your advisor likely can).
- Write your code so that you can go from raw data to final results by running a single command without manual intervention. It’s the only way to ensure consistency.
- Read the questionnaires. You need to know exactly what people were asked to understand their answers.
- The questionnaire also reveals who is in the “universe” for a question (who was asked).
- Code and Data for the Social Sciences: A Practitioner’s Guide by Gentzkow and Shapiro.
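The "single command" advice above can be sketched as one entry point that calls every step in order. This is a minimal illustration in Python. The step names, the missing-value code -1, and the tiny inline dataset are hypothetical; in a real project each step would read and write files.

```python
import csv
import io

# Made-up stand-in for the raw data file.
RAW_CSV = "id,wage\n1,52000\n2,-1\n3,38500\n"


def load_raw(text):
    """Read the raw data exactly as delivered, without modifying it."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Convert types and drop rows with the hypothetical missing code -1."""
    return [
        {"id": int(r["id"]), "wage": float(r["wage"])}
        for r in rows
        if float(r["wage"]) >= 0
    ]


def analyze(rows):
    """Compute the final results from the cleaned data."""
    wages = [r["wage"] for r in rows]
    return {"n": len(wages), "mean_wage": sum(wages) / len(wages)}


def run_all(raw_text=RAW_CSV):
    """Go from raw data to final results with no manual steps in between."""
    return analyze(clean(load_raw(raw_text)))


if __name__ == "__main__":
    print(run_all())
```

Because every step is a function called from `run_all`, changing a cleaning rule and rerunning one command regenerates all downstream results consistently.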