Data Resources for Conducting Health Services and Policy Research

Rich federal data resources provide essential inputs for monitoring the health and health care of the U.S. population and for conducting health services policy research. The six household surveys we document cover a broad array of health topics: health insurance coverage (American Community Survey, Current Population Survey); health conditions and behaviors (National Health Interview Survey, Behavioral Risk Factor Surveillance System); health care utilization and spending (Medical Expenditure Panel Survey); and longitudinal data on public program participation (Survey of Income and Program Participation). New federal activities are linking federal surveys with administrative data to reduce duplication and response burden. In the private sector, vendors are aggregating data from medical records and claims to enhance our understanding of the treatment, quality, and outcomes of medical care. Federal agencies must continue to innovate to meet the continuing challenges of scarce resources, pressures for more granular data, and new multi-mode data collection methodologies.


Introduction
Health policy is increasingly complex as major efforts to reform the U.S. health care system are met with a partisan divide over core values about the role of government in health insurance coverage. The policies designed and promoted on both sides of the aisle depend on what we know about current levels of private and public health insurance coverage, the health care needs of the population, and the role of government in financing health care. Our national data resources serve as inputs for the policy analysis and projection models that inform this intense policy debate.
In this paper, we provide an overview of key federal household surveys that are used in health services policy research. We present information on the elements of each resource and describe their strengths and limitations, in the expectation that this synthesis will serve as a user's guide. We include a discussion of innovations in data collection, including an overview of data linkage projects and relatively new efforts to monitor cost and utilization through the aggregation of patient claims data. We conclude with observations on the future of U.S. household surveys.

American Community Survey (ACS)
The ACS is a continuous mixed-mode survey that collects data by mail and internet (added in 2013), with phone and in-person interviews used for non-response follow-up. The ACS is a mandatory survey, although there are efforts to make it voluntary (16). Making the survey voluntary would likely lower response rates and increase costs (87,105).
The greatest strength of the ACS is its large sample size and its ability to produce estimates for states and sub-state geographies, such as congressional districts, counties, and ZIP Code Tabulation Areas. The ACS samples about 3.5 million addresses annually (94). The key limitation for health services research is that it does not have in-depth health-related content. Every question on the ACS requires federal justification (99), which keeps the ACS from responding quickly to changes in policies. Proposed question changes must be approved by the Office of Management and Budget (OMB) and the Interagency Council on Statistical Policy and must go through a content test before implementation (93).
Unlike the other federal surveys discussed in this paper, the ACS did not update its questionnaire to accommodate health reform. Questions on exchange participation and subsidies are currently being evaluated in the 2016 Content Test (88), but, given the current political debates over health reform, these questions might be outdated by the time they are ready for implementation.

Current Population Survey (CPS)
The Current Population Survey (CPS) (95) provides monthly data on labor force participation and unemployment for the civilian non-institutionalized population. It is conducted by the U.S. Census Bureau on behalf of the Bureau of Labor Statistics. The Annual Social and Economic Supplement (ASEC) collects data on income and health insurance coverage by phone and in person from February to April of each year. The CPS has additional supplements fielded throughout the year on a variety of topics (107). The CPS samples about 100,000 addresses annually and provides estimates for all states (101).
After several years of research to improve the measurement of uninsurance (83,109), the Census Bureau revised the CPS questions on health insurance coverage and income in 2014. The revised survey also added new questions on health insurance exchange participation, point-in-time coverage, and employer offers of coverage (110). Monthly insurance coverage data were also added and are currently being assessed for quality and disclosure concerns (108).

Survey of Income and Program Participation (SIPP)
The SIPP was re-engineered for the 2014 panel (SIPP-EHC) to "reduce survey costs and respondent burden" (84,97), eliminating the use of topical modules (104). Households are interviewed in person and by phone once a year for four years. The 2014 SIPP-EHC panel began with a sample of about 53,000 households (104).
The SIPP-EHC has sample sizes designed to be representative for 20 states, but identifiers for all states are available on the public use files. The SIPP-EHC is designed specifically for longitudinal analysis, which is ideal for studying changes over time, but it has the longest lag for data release of any of the surveys discussed in this paper. The first wave from the 2014 panel (about coverage in 2013) was not released until early 2017.

Behavioral Risk Factor Surveillance System (BRFSS)
The Behavioral Risk Factor Surveillance System (BRFSS) (15) collects data on behavioral health and related risk factors for the adult civilian non-institutionalized population. It is sponsored by the Centers for Disease Control and Prevention and administered at the state level. In 2011, BRFSS updated its telephone-based sampling design to include cell phones (14). All states and the District of Columbia are required to use the survey's core standardized questionnaire, whose questions are either fixed or rotated on a biennial basis. States have the option to include additional modules or state-specific questions.
A strength of the BRFSS is the flexibility it gives states to add their own content and respond quickly to changing policies. This flexibility also creates challenges when working with the data. Comparisons can be difficult if content is asked in only a few states or is removed from the core questionnaire.

National Health Interview Survey (NHIS)
The National Health Interview Survey (NHIS) (77) collects data on health status, access, utilization, and health behaviors of the civilian non-institutionalized population and is sponsored by the CDC's National Center for Health Statistics (NCHS). The in-person survey is conducted continuously throughout the year with about 35,000 households sampled annually (69).
The 2014 and 2015 health insurance coverage estimates were published for all states and the District of Columbia. Due to funding constraints, the sample size was reduced in 2016 and insurance coverage estimates for all states are no longer available (17, 62). Researchers conducting analyses requiring state identifiers must go through the NCHS or Census Bureau's Research Data Center (RDC) network.
NCHS has proposed redesigning the NHIS questionnaire in 2018 to reduce response burden and "to establish a long-term structure of ongoing and periodic topics" (76). The family level questions will be discontinued, with much of the content incorporated into the sample adult and sample child questions. With most questions asked of only the sample adult and sample child, the redesigned survey will reduce the sample size available for many questions.

Medical Expenditure Panel Survey - Household Component (MEPS-HC)
The Medical Expenditure Panel Survey - Household Component (MEPS-HC) (4) collects data on health, health care access, and expenditures and is sponsored by the Agency for Healthcare Research and Quality (AHRQ). The MEPS-HC is a longitudinal survey with participants interviewed in person five times over a two-year period. Participants are selected from households included in the previous year of the NHIS. The sample includes about 13,500 families (5).
Information is also collected from each respondent's health care providers to supplement the household information. The MEPS-HC provides a breadth of health-related information but is not state-representative. Only select estimates are released for the larger states with sufficient sample size. Researchers conducting analyses requiring state identifiers must go through the AHRQ Data Center or the Census/NCHS RDC network.

Harmonization efforts: NHIS and MEPS-HC
The Minnesota Population Center (MPC) has made significant efforts to harmonize federal health data across years of data collection. The Integrated Health Interview Series (IHIS), funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development, provides easily accessible integrated microdata for 5.5 million persons surveyed over 50 years, with more than 14,000 variables describing population health. The MPC is currently working on expanding the IHIS to include the MEPS-HC, the most complete source of information on population health care use and expenditures.

Linking NHIS and MEPS-HC
The MEPS household sample is drawn from the pool of respondents to the previous year's NHIS, which allows each MEPS-HC annual release to be linked to the previous year's NHIS annual release, beginning with the 1995 NHIS and the 1996 MEPS (59). These linked data files are currently available only through the AHRQ Data Center or the NCHS/CDC RDC network (3).
The NHIS-MEPS linkage allows researchers to access a variety of sociodemographic, health status, and household characteristics included in the NHIS but not the MEPS. For example, researchers have used these combined data to analyze the medical expenditures of immigrants in the U.S. (51,89); the time and financial burden of caregiving to children with chronic conditions (112); and the effects of Medicaid eligibility on mental health services and out-of-pocket spending on mental health services (38).
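The linkage described above can be pictured as a one-to-one join between a MEPS-HC release and the previous year's NHIS release. The following is a minimal sketch only: the key name `link_id`, the fields `total_expenditures` and `nativity`, and the records themselves are hypothetical. The actual linkage files, crosswalk variables, and access procedures are defined by AHRQ and NCHS.

```python
def link_meps_to_nhis(meps_records, nhis_records, key="link_id"):
    """Attach prior-year NHIS fields to each MEPS-HC record that has a match.

    Returns (linked, unlinked). NHIS fields are prefixed with "nhis_" so they
    cannot overwrite MEPS fields of the same name.
    """
    # Index NHIS records by the (hypothetical) linkage key, enforcing uniqueness.
    nhis_by_key = {}
    for rec in nhis_records:
        if rec[key] in nhis_by_key:
            raise ValueError(f"duplicate NHIS key: {rec[key]}")
        nhis_by_key[rec[key]] = rec

    linked, unlinked = [], []
    for rec in meps_records:
        match = nhis_by_key.get(rec[key])
        if match is None:
            unlinked.append(rec)
            continue
        merged = dict(rec)
        for field, value in match.items():
            if field != key:
                merged[f"nhis_{field}"] = value
        linked.append(merged)
    return linked, unlinked


# Hypothetical records, for illustration only.
meps = [
    {"link_id": "A1", "total_expenditures": 1200.0},
    {"link_id": "B2", "total_expenditures": 0.0},
    {"link_id": "C3", "total_expenditures": 560.5},  # no NHIS match in this toy data
]
nhis = [
    {"link_id": "A1", "nativity": "foreign-born"},
    {"link_id": "B2", "nativity": "US-born"},
]

linked, unlinked = link_meps_to_nhis(meps, nhis)
linkage_rate = len(linked) / len(meps)
print(f"linked {len(linked)} of {len(meps)} records ({linkage_rate:.0%})")
```

Keeping unmatched records in a separate list, rather than silently dropping them, makes it possible to report a linkage rate and to check whether unlinked respondents differ systematically from linked ones.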

Linking Administrative Records and Survey Data
Federal survey data may be linked with administrative data to better understand the health impact of policies and to reduce respondent burden, improve accuracy, and reduce cost. We highlight data linkages that are currently maintained and updated for general use by extramural researchers. Some of these data are provided as public use files, while others can be accessed only through the Census Bureau/NCHS RDC network.

NHIS and Death Records Longitudinal Mortality Files (LMF)
The NHIS for 1985-2009 is linked to longitudinal mortality data from the National Death Index (NDI) through December 31, 2011 (72); the linkage was last updated in 2013 (105). A limited set of data is available through public use files, with more data available as restricted files through the Census Bureau/NCHS RDC network (19). Restricted use files include detailed mortality information for all survey participants (including children) and information on age and cause of death (105). Published studies have used the linked data to evaluate the association between mortality and various health behaviors, conditions, and treatments, and to compare mortality rates of different groups, adjusted for covariates (20).

NHIS and Centers for Medicare and Medicaid Services (CMS)
The 1994-2013 NHIS are linked to 1999-2013 Medicare enrollment and claims records collected by the Centers for Medicare and Medicaid Services (CMS) (70). In each of the last ten available NHIS samples, approximately 3,000-10,000 NHIS respondents are linked to Medicare administrative data (13-18% of linkage-eligible respondents) (33). Research files are available through the Census Bureau/NCHS RDC network (79).
Researchers have used these data to study hospitalization, readmission, and death among Medicare enrollees (20); to compare rates of self-reported diabetes in the NHIS with rates of diabetes identified in Medicare claims data (21); to study the relationship between moral hazard and the invasiveness of surgical procedures (57); and to assess NHIS measurement error of Medicare coverage (37).

NHIS and Social Security and Benefit History Data
The 1994-2005 NHIS are linked to Old-Age, Survivors, and Disability Insurance (OASDI) and Supplemental Security Income (SSI) records from the Social Security Administration (SSA) for respondents who agreed to provide their Social Security Number along with their name and date of birth (39,73). Available data include benefits received from and payments to the SSA, eligibility for SSI disability benefits, and the SSA's determination of disability status for individuals receiving or applying for disability benefits. In each of the last ten available NHIS samples, 30,000-46,000 NHIS respondents are linked to SSA administrative data (44-62% of any given NHIS sample) (39). SSA-linked data are confidential and are available only through the Census Bureau/NCHS RDC network (75).

NHIS and Department of Housing and Urban Development (HUD)
The 1990-2012 NHIS are linked to HUD administrative data from 1999 through 2014 (56,71). The administrative data include information on participation in public housing, type and timing of housing assistance received and housing structure (74). Approximately 1,300 to 2,600 respondents were linked in each of the last ten NHIS survey years (about 8-10% of linkage-eligible respondents) (66). HUD-linked data are confidential and are available only through the Census Bureau/NCHS RDC network (71).

New Aggregate Health Care Claims Data
The private sector has responded to the need for aggregate claims data to provide information on trends in the use and cost of health care services. Several firms serve as data aggregators that harmonize claims data across multiple payers, creating large-scale "Big Data" for data analytics and research. These data vendors vary in the number and type of participating health plans and in the accessibility of the data for health services research. We review four of the key data vendors and include information on sample coverage, sample size, geographic coverage, and key variables.

Fair Health
Fair Health was established in 2009 as part of a legal settlement concerning New York State health insurance industry reimbursement practices (29). Fair Health receives data from 60 contributors, including insurers and third-party administrators, and represents 150 million covered lives, approximately 75% of the privately insured population and 23.4% of national payments by privately insured patients. Costs for accessing the data are determined on a case-by-case basis.
Researchers have used Fair Health data to study the effects of medical malpractice damage caps on provider reimbursement (34), market power and provider consolidation (54), the effects of occupational licensing on health care prices (53), the effects of health coverage mandates on provider reimbursement (35), and the opioid crisis among the privately insured (28).

MarketScan Research Databases
MarketScan Research Databases are a product of Truven Health Analytics, a subsidiary of IBM (13), and include data from 150 employers, 21 commercial health plans, and Medicare and Medicaid (22). The databases cover 230 million unique patients since 1995, and the most recent year includes 50 million covered lives (1). Researchers can analyze MarketScan data with Truven's proprietary analytic tools or on their own through licensing agreements.
Researchers have used MarketScan data to investigate the relationship between physician practice competition and physician service prices (7), the effects of managed care on angioplasty procedure prices (22), the impact of cost sharing on treatment adherence and outcomes among patients with diabetes (36), the relationship between incentive-based drug formularies and drug selection and spending on hypertension (52), the impact of hospital market consolidation on health care prices (63), and the causes of the 2009-2011 slowdown in health care spending (85).

Health Care Cost Institute Database
The Health Care Cost Institute (HCCI) is a private non-profit established in 2011 to collect and disseminate health care cost and utilization data for Americans with private health insurance (45). The majority of its funding comes from its four data contributors: Aetna, Humana, Kaiser Permanente, and UnitedHealthcare (41). Its data cover 50 million individuals, or 25% of the nonelderly population with employer-sponsored coverage (43). Researchers can access HCCI data through its Data Enclave, a secure, virtual environment hosted by NORC at the University of Chicago that allows HCCI to maintain data security requirements (46).
Researchers have used HCCI data to examine the trends driving the growth in health spending among those with employer-sponsored coverage (50), the relationship between structural change in the health sector and health spending (23), the relationship between regional hospital prices and health spending (18), differences in reimbursement rates between Medicare Advantage and Medicare fee-for-service (6), and out-of-pocket spending on inpatient medical services among the non-elderly (2).

Optum Labs Data Warehouse
Optum Labs was founded as a partnership between Optum (a subsidiary of UnitedHealth Group, a public for-profit health insurer) and Mayo Clinic in 2013 and includes data from providers' electronic health record systems as well as from claims and enrollment systems of affiliated and non-affiliated health plans. The data cover 150 million lives, 19% of the population in commercial health plans, 19% of those in Medicare Advantage plans, 24% of those in Medicare Part D only plans, and 7% of the U.S. population with any health care utilization. So far, researchers have not published work using Optum Labs data to investigate questions related to health policy.

CONCLUSIONS
The U.S. federal survey resources are essential for monitoring trends in the nation's health, health insurance coverage and access to needed care, and health care spending. Yet federal surveys face significant challenges. The first comes from resource constraints and pressures to demonstrate their perceived utility in real time (16). Some federal surveys are more agile than others in responding to the need for timely data and in adding survey content in response to shifts in the policy environment. For example, the NHIS significantly expanded its content in anticipation of ACA implementation, and early release data from NCHS are the first indicators of health insurance coverage, released in the first quarter of the year and reporting on prior-year coverage (89).
There are also challenges arising from trends in survey research, including falling response rates, increases in non-response bias, and changing modes of data collection. For example, the NHIS household response rate was 92% in 1997 (67) but fell to about 70% in 2015 (68). Surveys must be responsive. The ACS has gone a long way to meet respondent preferences by offering mail and online survey options as well as telephone and in-person follow-up for non-responders. Response rates are but one measure of quality; to date, studies examining trends in bias between 1995 and 2015 suggest that bias has remained reassuringly stable (25).
There are also challenges in maintaining and assuring privacy as Americans signal a growing mistrust of government regarding privacy and confidentiality (57). Leveraging administrative data is one option to reduce respondent burden, improve accuracy, and reduce costs (98,100). This is particularly appealing for sensitive topics, such as income, that have high rates of missing data in surveys and could instead be obtained from an administrative data source. Again, this is not without challenges, given issues of confidentiality and the need for cooperation within agencies and across sectors (59).
Finally, some questions of interest to health services research require detailed data on the use and costs of services, data that stretch the limits of self-report in surveys. This need has motivated the development of new frontiers in the data resource landscape, such as aggregated claims data from private sources. Health services researchers have much to discover and document about the strengths and limits of new data resources, while continuing to rely on available federal surveys and linked data resources. Federal agencies will need to continue to innovate to keep pace with changes in survey methodology and trends in health and health care during a time of increased demand for data and limited resources.

Health-Related Content by Federal Survey
a The BRFSS includes content in the core questionnaire; states also have the option to add additional questions through optional modules.
All of the surveys include demographic and economic variables such as age, race/ethnicity, educational attainment, marital status, work status, and income.
Analysts can choose between "Early View data" with no minimum run-out, "Standard Updates" with a 3-month minimum run-out, and the "Annual File" with at least a 6-month run-out (1).

Timeliness of aggregate claims data (table fragment): annual claims are submitted at the end of the calendar year (44); claims have a 5-6 month run-out period, depending on payer (43).

LITERATURE CITED