An easier method to extract data from large databases: the Medicare Hospital Cost Report database.
The purpose of this paper is to assist academic researchers to easily extract data from large databases like the Hospitals Medicare Cost Report Database that is published by the Center for Medicare and Medicaid Services. Though the data is inexpensive, the enormous size of the files may require a researcher to use a mainframe computer and hire a technical expert to extract the data in a form that is usable for a research study. Because we faced this dilemma after purchasing the database, we found a much easier means so that a single researcher could use a stand-alone personal computer to extract and manipulate the data. This paper describes the steps needed to accomplish that task.
Many academic researchers across a variety of disciplines need and use information contained in Medicare Hospital Cost Reports. This paper explains the type of information included in the Cost Reports and demonstrates how to extract and manipulate the data. The Medicare Cost Report database is inexpensive and easy to obtain. However, because of the enormous size of the database, the database provider suggests that the researcher load the data into a large database server running Microsoft SQL (Structured Query Language) Server, Oracle or Sybase. This recommendation precludes easy use of the data by a typical academic researcher on a stand-alone personal computer (PC). Thus, this paper outlines the steps needed for any researcher, regardless of database skills, to obtain, extract and use the data without sophisticated technology.
Hospitals that receive Medicare reimbursement are required to submit an annual cost report to the Center for Medicare and Medicaid Services (CMS). The cost report contains hospital information such as facility characteristics, utilization data, cost and charges by cost center (in total and for Medicare), Medicare settlement data, and financial statement data (Cost Reports 2005). The hospital cost report, Form CMS-255296, is a large collection of worksheets that are lettered and numbered according to the type of information contained in each worksheet (HCRIS Specifications 1999). For example, Worksheet G includes balance sheet information, Worksheet G-2 has revenue and expense information and Worksheet S-2 reports hospital characteristics such as whether the hospital is in a rural or urban area.
Worksheets are lettered from A through S with multiple subparts. One hospital cost report contains about two hundred printed pages. The database includes cost reports for about 6,000 hospitals for fiscal years ending on or after September 30, 1996 through March 31, 2005. Hence, the central portions of the database contain 106.6 million rows of data in three separate files. Unzipped, these three files contain approximately 6,286 MB, with one file, the NMRC file (see below) containing a bit over 3,000 MB of information. The next section provides information about obtaining the database, followed by practical information for accessing the data.
OBTAINING AND UNDERSTANDING THE MEDICARE HOSPITAL COST REPORT DATABASE
CMS manages the database through the Healthcare Cost Report Information System (HCRIS) (Hospital Cost Report 2005). At this time, CMS provides the database at a cost of $100 in a CD format. Previously, one could download the database from the website, but CMS discontinued that capability because of the database size. CMS organizes the data in a relational database format.
Prior to 2002, HCRIS organized cost reports in a flat file format where one record (one row) contained one hospital's cost report for one fiscal year, and each variable (column) contained one worksheet value (Asper 2003). Each row may have included thousands of columns, and database management became difficult if even one variable was added to the cost report. In 2002, CMS changed to a relational file format to make it easier to add variables to cost reports, and easier to obtain information cross-sectionally.
Now, the cost report is in three separate files that are linked by a report record number (Asper 2003). A graphical representation is available in the DataModel file at the CMS website (Hospital Cost Report, 2005). The first file is the Report file (HOSP_RPT) and that contains descriptive information about the hospital. The second and third files are the alphanumeric file (HOSP_RPT_ALPHNMRC) and the numeric file (HOSP_RPT_NMRC) where each file has five columns: the report record number, the worksheet code, the column number (from the worksheet), the line number (from the worksheet) and the data value. If the data value in the worksheet consists of character and numbers such as street address, then one searches the alphanumeric file, while numbers-only data is in the numeric file.
Obviously, most researchers would use a subset of the entire database to focus on specific hospitals, in specific years, using specific data from the cost reports. Because a unique 'report record number' links the three large files, the researcher must first locate that number. To do that, she downloads the hospital name and address file (HOSPITAL_NAMES_ADDRESSES), which is an Excel spreadsheet, from the CMS website (Hospital Cost Report 2005). In that file, she first finds the state (listed alphabetically) where the hospital is located. Then, she finds the hospital within that state (not listed alphabetically). Once she has located the hospital, she has its unique 'provider number' that is located in the first column.
For each 'provider number,' the researcher opens the HOSP_RPT file, and scrolls down to the provider number (third column). Each provider number has several rows of information with a separate row for each fiscal year. She finds the fiscal year(s) of interest, and then finds the unique 'report record number' in the first column. This 'report record number' is the critical link to the database, as it is unique to a particular hospital for a particular fiscal year.
With that 'report record number,' the researcher is now ready to access information from the worksheets using either the alphanumeric file or the numeric file. Blank templates of the Worksheet Forms (2552-96 Worksheets), found in an Excel file at the CMS website, guide the researcher in choosing the correct form. Having described how to obtain the database and the way that it is organized, we turn to accessing the extremely large numeric file.
EXTRACTING THE HOSPITAL COST REPORT DATA
When a researcher uses data collected by others, converting that data into a form suitable for the type of analysis intended can pose significant obstacles. In our case, the obstacle was the size of individual files in the Medicare Hospital cost report database. One file in the database, the numeric file, contains about 75 million records that 'unzip' into 3,000 MB of comma-delimited data (csv), whereby data items are organized by fields (e.g., record number) in tabular form, separated by commas. We bought the data for our research, and found that from 75 million records, less than four thousand were germane to our study.
The normal process for analyzing a subset of the data in a large data set is to import the data into a statistical package, select the relevant records and run the appropriate statistical processes on those selected records. However, Stata 8.0, the statistical package we used to analyze the data, was unable to import a file of that size. An alternate approach is to accomplish the sub-selection task by loading the database into a database management package such as Sybase or SQL Server and use a database query to select and load the relevant data into the statistical package. While most standard database packages available for a standard modern desktop PC can hold and use a 3,000 MB table, Microsoft Access, Sybase and SQL Server were all unable to load a file of that size into a database table on such a PC.
Faced with this situation, we considered three alternatives. First, find a more powerful computer (server PC or mainframe) with enough memory to pursue the normal process for analyzing the data. If such a machine had been handy, then this would have been the easiest solution. In our case, it was not. Second, write a custom program to read the 75 million records, one record at a time, and select only the relevant records. While a program of this type is not difficult to write, it results in a rather "rigid" solution. The effort is not easily reusable for other data. In addition, it discourages further impromptu investigation of the data since any analysis involving additional records and fields might require additional programming.
The third alternative we considered, and selected, was to use an "off the shelf" utility to break the dataset into smaller pieces. Each piece could then be loaded into a standard database package. We then developed a flexible stepwise procedure using standard database queries to select the relevant records from each piece and merge them into a single file that Stata could absorb.
The alphanumeric file, HOSP_RPT_ALPHNMRC.csv, is 276,339 KB and the numeric file, HOSP_RPT_NMRC.csv, is 3,107,901 KB. As we could not process the 3,000 MB NMRC file directly, we first needed to break it into usable pieces. Following an "off the shelf" strategy, we made a quick trip to download.com where a search for "free" utilities to "split files" yielded a number of 'hits.' A short search through the search results and associated "User Reviews" eliminated several utilities that were limited to splitting files up to one gigabyte in size.
Of the remaining entries, we selected MaxSplitter v1.53. It is 'freeware,' has great user reviews, and can be downloaded from http://download.com.com/3120- 200.html?qt=MaxSplitter&tg=dl-2001. After a quick and painless install and answering a few questions, we chose to split the file into 10 pieces. We now had our NMRC data split into 10 usable pieces, each containing about 7.5 million records (310 million characters). A researcher can quickly and easily import each of these pieces into a Microsoft Access database.
Although MaxSplitter was quite satisfactory, it was not perfect. The resulting files were split on character, not record or field, boundaries. As such, a single record could end up split between two physical files. Looking at Exhibit 1, the first part of the last record in the table (record number 7,577,359) is at the end of HOSP_RPT_NMRC01, while the rest of that record (the remainder of Field4 and all of Field5) would be found at the beginning of HOSP_RPT_NMRC02. If that record had belonged to one of the hospitals that we were interested in, we would have had to patch it manually. That is a relatively easy task in Access. Since institution 21614 was not of interest, no patching was required in this case.
With the data split into usable pieces, the next step in our "off the shelf" strategy was to select the three thousand or so relevant records for our study from among the seventy five million (now distributed into 10 files of 7.5 million each). For this task, we chose Microsoft Access. It is ubiquitous and usable by those with little or no knowledge of structured query language (SQL). With Access, we first created a blank Access database and added two tables. The first table, named RecordNumbersWanted, simply contained the identifiers of the 140 institutions of interest (see Exhibit 2A). The second table, named ItemsWanted, contained the 26 attributes of each institution that were relevant to our research study (Exhibit 2B). Each attribute is specified by the worksheet, line, and column on which it is located.
We then imported the first piece of our split NMRC file (HOSP_RPT_NMRC_01.CSV) into Access (see Exhibit 3A). "Access" chose the wrong data type for the Field4 'column,' so we had to force it to a text rather than numeric type. Renaming the fields would have taken extra time without benefit, so we left the default (Field1, Field2, etc.) names intact. Note that the number in Field1 corresponds to the numbers in the RecordNumbersWanted table we created earlier. Similarly, Field2, Field3 and Field4 correspond to the Worksheet, Line and Column fields in the ItemsWanted Table. Finally, we created an Access "MakeTable" query that joins (technically, it does an inner join) the three tables described to produce a new table containing only the relevant attributes of the relevant records from the first of the 10 pieces of the NMRC data (see Exhibit 3B).
COMBINING THE THREE TABLES INTO A NEW TABLE USING "ACCESS"
To continue the process with all ten pieces of the numeric file, we deleted the table created by importing HOSP_RPT_NMRC_01.csv to make room for the next piece of the NMRC file. If we kept the imported tables as we continued, Access would eventually stall because of the size of our created database (.mdb). We then repeated the last three steps for each part of the split NMRC file: 1) Import HOSP_RPT_NMRC_02.CSV; 2) To add the NMRC data for the additional hospitals, we used an "Append" query" instead of a "MakeTable" query; 3) Delete the table created by importing HOSP_RPT_NMRC_02.CSV. Finally, we exported the table containing just the relevant NMRC data for the hospitals of interest in .csv format (see Exhibit 4).
The final table was easily imported into our statistics package, Stata. With the subset selection already done, the researcher can easily import the data into almost any statistical package or spreadsheet for analysis. After importing the final table into Stata, we produced a pivot table to summarize the data for each hospital on a single row. As an added check, we also imported the data into Excel and produced the same pivot table. An advantage of the statistical package method is that log files are readily produced for later review.
Although we describe the process that we used to extract data from the Medicare Hospital Cost Report Database, the method is generalizable to other large relational databases from CMS or other providers. For example, CMS offers many other databases that include the Hospital Outpatient Prospective Payment System (OPPS), and the Identifiable Data Set Medicare Provider Analysis and Review (MEDPAR). The importance of this paper is that any researcher can use our method to access data from large databases.
There is nothing particularly "high-tech" about the approach we used to extract the data used for our research study. Rather, it is the use of low tech "off the shelf" techniques that make this paper useful to researchers with a wide variety of technical knowledge. Most researchers today use "off the shelf" statistical packages to analyze their data. Clever use of "off the shelf" components can allow researches to overcome data accessibility problems without resorting to expensive mainframe (server) equipment, reliance on "high-tech" experts, or custom one-of-a-kind solutions.
Asper, F. (2003, Oct.). Cost Report Database Formats: Traditional Flat File Versus Relational Database.
Research Data Assistance Center. Accessed Apr. 16, 2004 at: www.resdac.umn.edu/Tools/TBs/TN-005b.pdf.
Cost Reports: General Information. Accessed May 25, 2005, at: www.cms.hhs.gov/data/cost_reports/cr_explanation.asp.
HCRIS Specifications for the HCFA2552-96. (1999, Nov. 10). Table 2 Worksheet Indicators.
Hospital Cost Report: CMS-2552-96. Accessed May 25, 2005, at: www.cms.hhs.gov/data/download/hcris_hospital/default.asp.
Alan C. Cutting
Roger Williams University
Gerson M. Goldberg
New Mexico State University
Kathryn J. Jervis
University of Rhode Island
Address for correspondence: Kathryn J. Jervis, College of Business Administration, University of Rhode Island, 7 Lippitt Road, Kingston, RI 02881-0802 USA, firstname.lastname@example.org.