Tips for Using IPUMS (USA or International) Files
This describes my method for working with IPUMS data. Other methods may work just as well.
Importing Data
- Make a list of the variables you need. Always include household and person number, so that each person is uniquely identified.
- I usually leave filtering observations for later. This way, the data can be reused for other projects.
- Create an extract (see the IPUMS website) and download it. The
extract comes as a fixed-width (undelimited) text file, together with
SAS and SPSS programs to load it.
- Modify the SPSS (or SAS) program so it can load the file
(change the path to the "dat" file in the first line). Run the program
to load the file into SPSS.
- Write a program that breaks the file into variables. For
each variable you need a SAVE TRANSLATE statement. Now you have a set
of files such as "age.txt", one for each variable. Each row is a person.
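To make the target layout concrete, here is a minimal sketch of the same split (one text file per variable, one row per person). It is written in Python rather than with the SPSS SAVE TRANSLATE step described above, so treat it as an illustration only. The column positions in `LAYOUT`, the helper names, and the sample records are all made up; the real positions come from the SPSS or SAS program shipped with the extract.

```python
import os
import tempfile

# Hypothetical layout: variable name -> (start, end) columns, 0-based, end exclusive.
# Real column positions come from the SPSS/SAS program shipped with the extract.
LAYOUT = {
    "serial": (0, 6),   # household number
    "pernum": (6, 8),   # person number within the household
    "age": (8, 11),
}


def split_record(line, layout):
    """Cut one fixed-width record into a dict of stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}


def write_variable_files(lines, layout, out_dir):
    """Write one text file per variable; row i of every file is person i."""
    handles = {
        name: open(os.path.join(out_dir, name + ".txt"), "w") for name in layout
    }
    try:
        for line in lines:
            fields = split_record(line.rstrip("\n"), layout)
            for name, value in fields.items():
                handles[name].write(value + "\n")
    finally:
        for handle in handles.values():
            handle.close()


# Made-up records, for illustration only.
records = ["00000101 34", "00000102 31", "00000201 67"]
out_dir = tempfile.mkdtemp()
write_variable_files(records, LAYOUT, out_dir)

with open(os.path.join(out_dir, "age.txt")) as f:
    ages = f.read().split()
```

Because every file has the same row order, row i of "age.txt" and row i of any other variable file refer to the same person.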
Filter and Import into Matlab
- In Matlab: Loop over the variables and import each into a "mat" file. This is simply a load/save operation.
- Start with variables used for preliminary filtering. Create
a variable that is 1 for persons passing the filter and 0 otherwise.
- Then loop over all variables and, for each one, keep only the persons passing the filter. Save them as numbered variables.
- Write a recode function for each variable. At a minimum, you want consistent missing-value and top-coding codes.
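The filter-then-recode steps above might look roughly like the following. This is a sketch in Python rather than Matlab, and the specifics are assumptions: the missing code `999`, the top code `90`, the age range 18 to 65, and all function names are hypothetical. The real codes must be taken from the IPUMS codebook for each sample.

```python
MISSING = None  # single missing-value marker used after recoding (an assumption)


def recode_age(raw, missing_codes=(999,), topcode=90):
    """Map raw age codes to one scheme: missing -> MISSING, values capped at topcode."""
    if raw in missing_codes:
        return MISSING
    return min(raw, topcode)


def make_filter(ages, lo=18, hi=65):
    """Return 1 for persons whose recoded age lies in [lo, hi], 0 otherwise."""
    flags = []
    for raw in ages:
        age = recode_age(raw)
        flags.append(1 if age is not MISSING and lo <= age <= hi else 0)
    return flags


def apply_filter(values, flags):
    """Keep only the rows whose filter flag is 1."""
    return [value for value, keep in zip(values, flags) if keep == 1]


# Made-up raw data; each position is one person.
raw_age = [34, 999, 17, 95, 65]
raw_sex = [1, 2, 1, 2, 2]

keep = make_filter(raw_age)
age = [recode_age(a) for a in apply_filter(raw_age, keep)]
sex = apply_filter(raw_sex, keep)
```

Building the 0/1 filter once and applying it to every variable keeps all variables aligned row by row.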
Constructing Summary Variables
- I find it helpful to construct a multi-dimensional matrix
that contains everything I need later in finely disaggregated cells.
For example: If I want to compute schooling by country, I might create a
matrix (one for each country/year sample) indexed by [school level,
age, sex, industry] that contains the number of observations, the total
mass (weight) in each cell, and perhaps other characteristics that I
might use, such as mean log wages.
- Once I have that matrix, I never need to touch individual
level data. For example, if I want to compute [fraction of persons with
a college degree outside of agriculture between the ages of 18 and 65],
all I need to do is average over the cells with the right [age, sex,
industry] values. It's fast and easy. Plus it ensures that all summary
stats are computed from the same universe of individuals.
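As a rough illustration of the cell approach, the sketch below aggregates persons into cells and then computes the college share outside agriculture for ages 18 to 65 from the cells alone. It is in Python and uses a dictionary keyed by (school, age, sex, industry) in place of a multi-dimensional matrix. The category labels, field names, function names, and sample persons are all hypothetical.

```python
from collections import defaultdict


def build_cells(persons):
    """Aggregate persons into cells keyed by (school, age, sex, industry)."""
    cells = defaultdict(lambda: {"n": 0, "mass": 0.0})
    for p in persons:
        key = (p["school"], p["age"], p["sex"], p["industry"])
        cells[key]["n"] += 1            # number of observations in the cell
        cells[key]["mass"] += p["weight"]  # total weight in the cell
    return dict(cells)


def college_share(cells, age_lo=18, age_hi=65, exclude_industry="agriculture"):
    """Weighted fraction with a college degree, using only cell totals."""
    total = 0.0
    college = 0.0
    for (school, age, sex, industry), cell in cells.items():
        if industry == exclude_industry or not (age_lo <= age <= age_hi):
            continue
        total += cell["mass"]
        if school == "college":
            college += cell["mass"]
    return college / total if total > 0 else float("nan")


# Made-up persons, for illustration only.
persons = [
    {"school": "college", "age": 30, "sex": 1, "industry": "services", "weight": 2.0},
    {"school": "primary", "age": 40, "sex": 2, "industry": "services", "weight": 1.0},
    {"school": "primary", "age": 40, "sex": 2, "industry": "services", "weight": 1.0},
    {"school": "college", "age": 50, "sex": 1, "industry": "agriculture", "weight": 5.0},
    {"school": "college", "age": 70, "sex": 2, "industry": "services", "weight": 3.0},
]

cells = build_cells(persons)
share = college_share(cells)
```

Any other statistic over a different [age, sex, industry] selection reuses the same `cells`, so every result is drawn from the same universe of individuals.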
Tip: Read a book on good programming practices before you start. I like "Writing Solid Code."