An easier method to extract data from large databases: the Medicare Hospital Cost Report database.ABSTRACT
The purpose of this paper is to assist academic researchers to easily extract data from large databases like the Hospitals Medicare Cost Report Database that is published by the Center for Medicare and Medicaid Medicare and Medicaid
U.S. government programs in effect since 1966. Medicare covers most people 65 or older and those with long-term disabilities. Part A, a hospital insurance plan, also pays for home health visits and hospice care. Services. Though the data is inexpensive, the enormous size of the files may require a researcher to use a mainframe computer and hire a technical expert to extract the data in a form that is usable for a research study. Because we faced this dilemma after purchasing the database, we found a much easier means so that a single researcher could use a stand-alone personal computer to extract and manipulate the data. This paper describes the steps needed to accomplish that task.
Many academic researchers across a variety of disciplines need and use information contained in Medicare Hospital Cost Reports. This paper explains the type of information included in the Cost Reports and demonstrates how to extract and manipulate the data. The Medicare Cost Report database is inexpensive and easy to obtain. However, because of the enormous size of the database, the database provider suggests that the researcher load the data into a large database server running Microsoft SQL SQL
in full Structured Query Language.
Computer programming language used for retrieving records or parts of records in databases and performing various calculations before displaying the results. (Structured Query Language See SQL.
Structured Query Language - SQL ) Server, Oracle or Sybase. This recommendation precludes easy use of the data by a typical academic researcher on a stand-alone personal computer (PC). Thus, this paper outlines the steps needed for any researcher, regardless of database skills, to obtain, extract and use the data without sophisticated technology.
Hospitals that receive Medicare reimbursement Reimbursement
Payment made to someone for out-of-pocket expenses has incurred. are required to submit an annual cost report to the Center for Medicare and Medicaid Services (CMS (1) See content management system and color management system.
(2) (Conversational Monitor System) Software that provides interactive communications for IBM's VM operating system. ). The cost report contains hospital information such as facility characteristics, utilization data, cost and charges by cost center (in total and for Medicare), Medicare settlement data, and financial statement data (Cost Reports 2005). The hospital cost report, Form CMS-255296, is a large collection of worksheets that are lettered and numbered according to according to
1. As stated or indicated by; on the authority of: according to historians.
2. In keeping with: according to instructions.
3. the type of information contained in each worksheet (HCRIS HCRIS Hospital Cost Report Information System Specifications 1999). For example, Worksheet G includes balance sheet information, Worksheet G-2 has revenue and expense information and Worksheet S-2 reports hospital characteristics such as whether the hospital is in a rural or urban area.
Worksheets are lettered from A through S with multiple subparts. One hospital cost report contains about two hundred printed pages. The database includes cost reports for about 6,000 hospitals for fiscal years ending on or after September 30, 1996 through March 31, 2005. Hence, the central portions of the database contain 106.6 million rows of data in three separate files. Unzipped, these three files contain approximately 6,286 MB, with one file, the NMRC NMRC New Millennium Research Council
NMRC Naval Medical Research Center
NMRC National Microelectronics Research Centre (University College Cork, Ireland)
NMRC National Monuments Record Centre (UK) file (see below) containing a bit over 3,000 MB of information. The next section provides information about obtaining the database, followed by practical information for accessing the data.
OBTAINING AND UNDERSTANDING THE MEDICARE HOSPITAL COST REPORT DATABASE
CMS manages the database through the Healthcare Cost Report Information System (HCRIS) (Hospital Cost Report 2005). At this time, CMS provides the database at a cost of $100 in a CD format. Previously, one could download the database from the website, but CMS discontinued dis·con·tin·ue
v. dis·con·tin·ued, dis·con·tin·u·ing, dis·con·tin·ues
1. To stop doing or providing (something); end or abandon: that capability because of the database size. CMS organizes the data in a relational database relational database
Database in which all data are represented in tabular form. The description of a particular entity is provided by the set of its attribute values, stored as one row or record of the table, called a tuple. format.
Prior to 2002, HCRIS organized cost reports in a flat file format where one record (one row) contained one hospital's cost report for one fiscal year, and each variable (column) contained one worksheet value (Asper 2003). Each row may have included thousands of columns, and database management became difficult if even one variable was added to the cost report. In 2002, CMS changed to a relational file format to make it easier to add variables to cost reports, and easier to obtain information cross-sectionally.
Now, the cost report is in three separate files that are linked by a report record number (Asper 2003). A graphical representation is available in the DataModel file at the CMS website (Hospital Cost Report, 2005). The first file is the Report file (HOSP HOSP Hospital
HOSP House Sparrow
HOSP Hot Springs National Park (US National Park Service) _RPT RPT - Unify. Report Writer Language. ) and that contains descriptive information about the hospital. The second and third files are the alphanumeric alphanumeric (ăl'fənmĕr`ĭk) or alphameric (ăl'fəmĕr`ĭk), the set of letters and numbers. file (HOSP_RPT_ALPHNMRC) and the numeric numeric
see ten-key pad. file (HOSP_RPT_NMRC) where each file has five columns: the report record number, the worksheet code, the column number (from the worksheet), the line number (from the worksheet) and the data value. If the data value in the worksheet consists of character and numbers such as street address, then one searches the alphanumeric file, while numbers-only data is in the numeric file.
Obviously, most researchers would use a subset A group of commands or functions that do not include all the capabilities of the original specification. Software or hardware components designed for the subset will also work with the original. of the entire database to focus on specific hospitals, in specific years, using specific data from the cost reports. Because a unique 'report record number' links the three large files, the researcher must first locate that number. To do that, she downloads the hospital name and address file (HOSPITAL_NAMES_ADDRESSES), which is an Excel spreadsheet, from the CMS website (Hospital Cost Report 2005). In that file, she first finds the state (listed alphabetically al·pha·bet·i·cal also al·pha·bet·ic
1. Arranged in the customary order of the letters of a language.
2. Of, relating to, or expressed by an alphabet. ) where the hospital is located. Then, she finds the hospital within that state (not listed alphabetically). Once she has located the hospital, she has its unique 'provider number' that is located in the first column.
For each 'provider number,' the researcher opens the HOSP_RPT file, and scrolls down to the provider number (third column). Each provider number has several rows of information with a separate row for each fiscal year. She finds the fiscal year(s) of interest, and then finds the unique 'report record number' in the first column. This 'report record number' is the critical link to the database, as it is unique to a particular hospital for a particular fiscal year.
With that 'report record number,' the researcher is now ready to access information from the worksheets using either the alphanumeric file or the numeric file. Blank templates of the Worksheet Forms (2552-96 Worksheets), found in an Excel file at the CMS website, guide the researcher in choosing the correct form. Having described how to obtain the database and the way that it is organized, we turn to accessing the extremely large numeric file.
EXTRACTING THE HOSPITAL COST REPORT DATA
When a researcher uses data collected by others, converting that data into a form suitable for the type of analysis intended can pose significant obstacles. In our case, the obstacle was the size of individual files in the Medicare Hospital cost report database. One file in the database, the numeric file, contains about 75 million records that 'unzip' into 3,000 MB of comma-delimited data (csv), whereby data items are organized by fields (e.g., record number) in tabular form Same as table view with respect to printed output. , separated by commas. We bought the data for our research, and found that from 75 million records, less than four thousand were germane ger·mane
Being both pertinent and fitting. See Synonyms at relevant.
[Middle English germain, having the same parents, closely connected; see german2. to our study.
The normal process for analyzing a subset of the data in a large data set is to import the data into a statistical package, select the relevant records and run the appropriate statistical processes on those selected records. However, Stata 8.0, the statistical package we used to analyze the data, was unable to import a file of that size. An alternate approach is to accomplish the sub-selection task by loading the database into a database management package such as Sybase or SQL Server An earlier relational DBMS from Sybase and from Microsoft. Sybase introduced SQL Server in 1988 for various Unix versions. In that same year, with help from IBM, Sybase created an OS/2 version that Microsoft licensed and branded as Microsoft SQL Server. and use a database query to select and load the relevant data into the statistical package. While most standard database packages available for a standard modern desktop PC can hold and use a 3,000 MB table, Microsoft Access A database program for Windows, available separately or included in the Microsoft Office suite. Access is programmable using Visual Basic for Applications (VBA). Access can read Paradox, dBASE and Btrieve files, and using ODBC, Microsoft SQL Server, SYBASE SQL Server and Oracle data. , Sybase and SQL Server were all unable to load a file of that size into a database table on such a PC.
Faced with this situation, we considered three alternatives. First, find a more powerful computer (server PC or mainframe) with enough memory to pursue the normal process for analyzing the data. If such a machine had been handy, then this would have been the easiest solution. In our case, it was not. Second, write a custom program to read the 75 million records, one record at a time, and select only the relevant records. While a program of this type is not difficult to write, it results in a rather "rigid" solution. The effort is not easily reusable re·use
tr.v. re·used, re·us·ing, re·us·es
To use again, especially after salvaging or special treatment or processing.
re·us for other data. In addition, it discourages further impromptu A Windows query and reporting tool from Cognos with support for a large variety of databases. It is capable of generating cross tabs for spreadsheets such as Excel, Lotus for Windows and Quattro Pro for Windows. investigation of the data since any analysis involving additional records and fields might require additional programming.
The third alternative we considered, and selected, was to use an "off the shelf" utility to break the dataset into smaller pieces. Each piece could then be loaded into a standard database package. We then developed a flexible stepwise stepwise
incremental; additional information is added at each step.
stepwise multiple regression
used when a large number of possible explanatory variables are available and there is difficulty interpreting the partial regression procedure using standard database queries to select the relevant records from each piece and merge them into a single file that Stata could absorb.
The alphanumeric file, HOSP_RPT_ALPHNMRC.csv, is 276,339 KB and the numeric file, HOSP_RPT_NMRC.csv, is 3,107,901 KB. As we could not process the 3,000 MB NMRC file directly, we first needed to break it into usable pieces. Following an "off the shelf" strategy, we made a quick trip to download.com where a search for "free" utilities to "split files" yielded a number of 'hits.' A short search through the search results and associated "User Reviews" eliminated several utilities that were limited to splitting files up to one gigabyte One billion bytes. Also GB, Gbyte and G-byte. See giga and space/time.
(unit) gigabyte - 2^30 = 1,073,741,824 bytes = 1024 megabytes.
Roughly the amount of data required to encode a human gene sequence (including all the redundant codons).
See prefix. in size.
Of the remaining entries, we selected MaxSplitter v1.53. It is 'freeware,' has great user reviews, and can be downloaded from http://download.com.com/3120- 200.html?qt=MaxSplitter&tg=dl-2001. After a quick and painless pain·less
Free from complication or pain: a painless operation.
painless·ly adv. install and answering a few questions, we chose to split the file into 10 pieces. We now had our NMRC data split into 10 usable pieces, each containing about 7.5 million records (310 million characters). A researcher can quickly and easily import each of these pieces into a Microsoft Access database.
Although MaxSplitter was quite satisfactory, it was not perfect. The resulting files were split on character, not record or field, boundaries. As such, a single record could end up split between two physical files. Looking at Exhibit 1, the first part of the last record in the table (record number 7,577,359) is at the end of HOSP_RPT_NMRC01, while the rest of that record (the remainder of Field4 and all of Field5) would be found at the beginning of HOSP_RPT_NMRC02. If that record had belonged to one of the hospitals that we were interested in, we would have had to patch it manually. That is a relatively easy task in Access. Since institution 21614 was not of interest, no patching was required in this case.
With the data split into usable pieces, the next step in our "off the shelf" strategy was to select the three thousand or so relevant records for our study from among the seventy five million (now distributed into 10 files of 7.5 million each). For this task, we chose Microsoft Access. It is ubiquitous and usable by those with little or no knowledge of structured query language (SQL). With Access, we first created a blank Access database and added two tables. The first table, named RecordNumbersWanted, simply contained the identifiers of the 140 institutions of interest (see Exhibit 2A). The second table, named ItemsWanted, contained the 26 attributes of each institution that were relevant to our research study (Exhibit 2B). Each attribute is specified by the worksheet, line, and column on which it is located.
We then imported the first piece of our split NMRC file (HOSP_RPT_NMRC_01.CSV) into Access (see Exhibit 3A). "Access" chose the wrong data type for the Field4 'column,' so we had to force it to a text rather than numeric type. Renaming the fields would have taken extra time without benefit, so we left the default (Field1, Field2, etc.) names intact. Note that the number in Field1 corresponds to the numbers in the RecordNumbersWanted table we created earlier. Similarly, Field2, Field3 and Field4 correspond to the Worksheet, Line and Column fields in the ItemsWanted Table. Finally, we created an Access "MakeTable" query that joins (technically, it does an inner join) the three tables described to produce a new table containing only the relevant attributes of the relevant records from the first of the 10 pieces of the NMRC data (see Exhibit 3B).
COMBINING THE THREE TABLES INTO A NEW TABLE USING "ACCESS"
To continue the process with all ten pieces of the numeric file, we deleted Deleted
A security that is no longer included on a specified market. Sometimes referred to as "delisted".
Reasons for delisting include violating regulations, failing to meet financial specifications set out by the stock exchange and going bankrupt. the table created by importing HOSP_RPT_NMRC_01.csv to make room for the next piece of the NMRC file. If we kept the imported tables as we continued, Access would eventually stall because of the size of our created database (.mdb). We then repeated the last three steps for each part of the split NMRC file: 1) Import HOSP_RPT_NMRC_02.CSV; 2) To add the NMRC data for the additional hospitals, we used an "Append To add to the end of an existing structure. " query" instead of a "MakeTable" query; 3) Delete the table created by importing HOSP_RPT_NMRC_02.CSV. Finally, we exported the table containing just the relevant NMRC data for the hospitals of interest in .csv format (see Exhibit 4).
The final table was easily imported into our statistics package, Stata. With the subset selection already done, the researcher can easily import the data into almost any statistical package or spreadsheet for analysis. After importing the final table into Stata, we produced a pivot table See multidimensional views. to summarize sum·ma·rize
intr. & tr.v. sum·ma·rized, sum·ma·riz·ing, sum·ma·riz·es
To make a summary or make a summary of.
sum the data for each hospital on a single row. As an added check, we also imported the data into Excel and produced the same pivot table. An advantage of the statistical package method is that log files are readily produced for later review.
Although we describe the process that we used to extract data from the Medicare Hospital Cost Report Database, the method is generalizable gen·er·al·ize
v. gen·er·al·ized, gen·er·al·iz·ing, gen·er·al·iz·es
a. To reduce to a general form, class, or law.
b. To render indefinite or unspecific.
2. to other large relational databases from CMS or other providers. For example, CMS offers many other databases that include the Hospital Outpatient Prospective Payment System (OPPS OPPS Outpatient Prospective Payment System (hospital/medical) ), and the Identifiable Data Set Medicare Provider Analysis and Review (MEDPAR MEDPAR Medicare Provider Analysis and Review (Medicare)
MEDPAR Medical Patient Accounting and Reporting ). The importance of this paper is that any researcher can use our method to access data from large databases.
There is nothing particularly "high-tech" about the approach we used to extract the data used for our research study. Rather, it is the use of low tech "off the shelf" techniques that make this paper useful to researchers with a wide variety of technical knowledge. Most researchers today use "off the shelf" statistical packages to analyze their data. Clever use of "off the shelf" components can allow researches to overcome data accessibility problems without resorting to expensive mainframe (server) equipment, reliance on "high-tech" experts, or custom one-of-a-kind solutions.
Asper, F. (2003, Oct.). Cost Report Database Formats: Traditional Flat File Versus Relational Database.
Research Data Assistance Center. Accessed Apr. 16, 2004 at: www.resdac.umn.edu/Tools/TBs/TN-005b.pdf.
Cost Reports: General Information. Accessed May 25, 2005, at: www.cms.hhs.gov/data/cost_reports/cr_explanation.asp.
HCRIS Specifications for the HCFA HCFA
Health Care Financing Administration
n.pr See Health Care Financing Administration. 2552-96. (1999, Nov. 10). Table 2 Worksheet Indicators.
Hospital Cost Report: CMS-2552-96. Accessed May 25, 2005, at: www.cms.hhs.gov/data/download/hcris_hospital/default.asp.
Alan C. Cutting
Roger Williams University Roger Williams University, commonly abbreviated as RWU, is a private, coeducational American liberal arts university located on 120 acres in Bristol, Rhode Island, above Mt. Hope Bay. Founded in 1956, it was named for theologian and Rhode Island cofounder Roger Williams.
Gerson M. Goldberg
New Mexico State University New Mexico State University, at Las Cruces; land-grant and state supported; coeducational; chartered and opened 1889 as a college. It became New Mexico State Univ. of Engineering, Agriculture, and Science in 1958 and adopted its present name in 1960.
Kathryn J. Jervis
University of Rhode Island History
The University was first chartered as the state's agricultural school in 1888. The site of the school was originally the Oliver Watson Farm, and the original farmhouse still lies on the campus today.
Address for correspondence: Kathryn J. Jervis, College of Business Administration, University of Rhode Island, 7 Lippitt Road, Kingston, RI 02881-0802 USA, firstname.lastname@example.org.