Building and validating an administrative records database for the United States.
The administrative records that make up the Statistical Administrative Records System (STARS) are:
* Internal Revenue Service Individual Master file of tax returns
* Internal Revenue Service Information Returns Master file of reported income and other information
* Medicare enrollment database
* Department of Housing and Urban Development public housing assistance file
* Selective Service System registration file of potential young male draft candidates
* Indian Health Service patient file
These files were selected to maximize coverage of the United States population, and to facilitate the integration of these data. Most people file an income tax return for themselves and their dependents, or have wage or interest income reported to the government. Because there are segments of the population that do not file tax returns or that have unreported income, we attempt to fill in coverage gaps with targeted files. The Medicare file includes data for much of the elderly population. The Selective Service file provides nearly complete coverage of young males, who are required to register should a military draft be necessary. The Indian Health Service file includes many American Indians, and the public housing assistance file targets the poor population.
There are several challenges in integrating these large data sources, which together total about 800 million records. Many people are represented in multiple files. For example, a young American Indian male might file a tax return, be registered for Selective Service, and have a record with the Indian Health Service. To facilitate integration, each person record in the administrative files carries a Social Security number, a unique personal identifier that allows us to link records for the same person and avoid duplication. We compare each Social Security number to the administrative master list of numbers from the Social Security Administration and remove invalid person records, such as those of the deceased, foreigners who filed taxes in the United States, and individuals with falsified Social Security numbers. To preserve privacy and confidentiality, in accordance with Census Bureau policy, the final STARS database does not include Social Security numbers or names.
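The linkage and validation step described above can be illustrated with a minimal sketch. It is not the Census Bureau's implementation: the record fields, the source labels, and the idea that the master list is available as a simple set of valid numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    ssn: str      # identifier used only during integration (hypothetical field)
    source: str   # hypothetical source label, e.g. "IRS", "SSS", "IHS"
    address: str


def link_by_ssn(records, valid_ssns):
    """Group records by SSN, dropping any record whose SSN is not in valid_ssns.

    valid_ssns stands in for the result of checking the master list:
    numbers that are absent, belong to the deceased, and so on are
    assumed to have been excluded from this set already.
    """
    people = {}
    for rec in records:
        if rec.ssn not in valid_ssns:
            continue  # invalid person record, removed
        people.setdefault(rec.ssn, []).append(rec)
    return people


def strip_identifiers(people):
    """Return one entry per person with the SSN replaced by an arbitrary index."""
    return [
        {"person_id": i, "sources": sorted({r.source for r in recs})}
        for i, recs in enumerate(people.values())
    ]


# Made-up example: one person appears in three files, one record is invalid.
records = [
    PersonRecord("000-00-0001", "IRS", "1 Main St"),
    PersonRecord("000-00-0001", "SSS", "1 Main St"),
    PersonRecord("000-00-0001", "IHS", "9 Oak Ave"),
    PersonRecord("000-00-0002", "IRS", "5 Elm Rd"),
]
valid = {"000-00-0001"}

linked = link_by_ssn(records, valid)
deidentified = strip_identifiers(linked)
```

In this sketch the three records for the first person collapse into a single entry, and the output carries no Social Security number, mirroring the privacy rule stated in the text.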
We select only administrative records files that include an address for each person record, which allows us to allocate people to census blocks and expand the applicability of the STARS database. For example, independent statistics based on administrative data can be computed for census blocks or higher levels of geographic aggregation. However, a given person may have varying addresses across files. For example, someone might move after filing his taxes and seek health services under Medicare at his new address. We resolve multiple addresses with a complex algorithm that generally uses address quality and timeliness to determine a single address for that person record in STARS.
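As a rough illustration of resolving multiple addresses, the sketch below picks one address per person by ranking candidates on quality first and recency second. The actual algorithm is described only as complex, so the `quality` score, the `as_of` date, and the ordering of the two criteria are assumptions, not the authors' method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AddressCandidate:
    address: str
    source: str    # hypothetical source label
    quality: int   # hypothetical score; higher means easier to assign to a block
    as_of: date    # hypothetical date the address was reported


def pick_address(candidates):
    """Choose the candidate with the best quality, breaking ties by recency."""
    if not candidates:
        raise ValueError("no address candidates for this person")
    return max(candidates, key=lambda c: (c.quality, c.as_of))


# Made-up example: the person moved after filing taxes.
candidates = [
    AddressCandidate("1 Main St", "IRS", quality=2, as_of=date(1999, 4, 15)),
    AddressCandidate("9 Oak Ave", "Medicare", quality=2, as_of=date(1999, 9, 1)),
    AddressCandidate("PO Box 7", "IRS", quality=1, as_of=date(1999, 12, 1)),
]
chosen = pick_address(candidates)
```

Here the two street addresses tie on quality, so the more recent Medicare address is selected; the newer post office box loses because its assumed quality score is lower.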
The source files that make up the prototype STARS 1999 are generally of a vintage that precedes Census 2000 by about 15 months. This timing gap precludes validating STARS against absolute counts from the census. For example, STARS 1999 has about 257 million person records, compared to about 284 million for Census 2000. It is more meaningful to compare relative distributions of the population by race, Hispanic origin, age, and sex. The tables below show that STARS 1999 does reasonably well at reproducing the demographic distribution of the population.
An updated version of STARS using more recent files along with several other improvements is currently under construction. STARS 2000 includes additional files to increase coverage of the population in public housing and to obtain more complete reporting of mortality. These and myriad other improvements will make STARS 2000 a more complete and accurate representation of the population of the United States.
Age comparisons at the national level (%)

| | 0 - 17 years old | 18 - 29 years old | 30 - 49 years old | 50+ years old |
|---|---|---|---|---|
| Census 2000 | 26.0 | 16.6 | 30.3 | 27.1 |
| StARS 1999 | 22.6 | 16.5 | 31.5 | 29.4 |

Race comparisons at the national level (%)

| | White or Other | Black | American Indian | Asian or Pacific Islander |
|---|---|---|---|---|
| Census 2000 | 81.5 | 12.7 | 1.4 | 4.4 |
| StARS 1999 | 83.1 | 11.9 | 0.9 | 4.1 |

Hispanic origin comparisons at the national level (%)

| | Hispanic | Not Hispanic |
|---|---|---|
| Census 2000 | 12.5 | 87.5 |
| StARS 1999 | 10.9 | 89.1 |

Sex comparisons at the national level (%)

| | Male | Female |
|---|---|---|
| Census 2000 | 49.1 | 50.9 |
| StARS 1999 | 49.3 | 50.7 |
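One way to read the national comparisons above is as percentage-point differences between STARS 1999 and Census 2000. The sketch below computes those differences from the published figures. The index of dissimilarity (half the sum of absolute differences) is a standard summary measure added here for illustration; the authors do not report using it.

```python
# Percentages copied from the national comparison tables.
CENSUS_2000 = {
    "age": {"0-17": 26.0, "18-29": 16.6, "30-49": 30.3, "50+": 27.1},
    "race": {"White or Other": 81.5, "Black": 12.7,
             "American Indian": 1.4, "Asian or Pacific Islander": 4.4},
    "hispanic": {"Hispanic": 12.5, "Not Hispanic": 87.5},
    "sex": {"Male": 49.1, "Female": 50.9},
}
STARS_1999 = {
    "age": {"0-17": 22.6, "18-29": 16.5, "30-49": 31.5, "50+": 29.4},
    "race": {"White or Other": 83.1, "Black": 11.9,
             "American Indian": 0.9, "Asian or Pacific Islander": 4.1},
    "hispanic": {"Hispanic": 10.9, "Not Hispanic": 89.1},
    "sex": {"Male": 49.3, "Female": 50.7},
}


def point_differences(stars, census):
    """STARS minus census, in percentage points, for each category."""
    return {k: round(stars[k] - census[k], 1) for k in census}


def dissimilarity(stars, census):
    """Half the sum of absolute differences, in percentage points."""
    return round(sum(abs(stars[k] - census[k]) for k in census) / 2, 1)


for dimension in CENSUS_2000:
    diffs = point_differences(STARS_1999[dimension], CENSUS_2000[dimension])
    index = dissimilarity(STARS_1999[dimension], CENSUS_2000[dimension])
    print(dimension, diffs, "dissimilarity:", index)
```

On these figures the largest single gap is the 0 - 17 age group, where the STARS 1999 share is 3.4 percentage points below the Census 2000 share, while the sex distribution differs by only 0.2 points.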
James Farber and Charlene Leggieri. The authors are from the U.S. Census Bureau.
Author: Farber, James; Leggieri, Charlene
Publication: New Zealand Economic Papers
Date: Jun 1, 2002