Health outcome measures.
Over the past two decades increasing moves towards 'measuring' health outcome have influenced much of health care and indeed much of the physiotherapy profession. However, health outcome measures can have multiple purposes, are associated with evolving and sometimes confusing terminology, and may have perceived and actual barriers to use. As a result, the level of understanding and incorporation into physiotherapy practice is variable despite increasing national and international professional guidelines. In order to appropriately understand and use outcome measures (as well as interpret the information from them), it is essential to consider three key areas covered in this paper: conceptual frameworks within which to place an outcome measure, practical considerations regarding implementation, and finally the identification and description of the measurement qualities of an outcome measure (its psychometric properties). Horner D, Larmer PJ (2006): Health outcome measures. New Zealand Journal of Physiotherapy 34(1): 17-24.
Keywords: Health outcome measure, Physiotherapy, Conceptual frameworks, Psychometric properties
'Measuring outcome' is a term used by a large number of industries across the world to determine how well the specific goals of any one business activity are met. Within the health care arena, the measurement of outcome has become increasingly widespread over the past two decades in response to calls to move beyond mere 'appearance of benefit' as an indicator of therapeutic impact. The tools derived for this purpose are usually referred to as health 'outcome measures' (Duckworth 1999). A health outcome measure has been described as a measure of health change, at a defined point in time, as a result of one or more health care processes (Baumberg et al 1995, Wennberg and Glittelsohn 1982). The implementation, interpretation and evaluation of outcome measures have caused much debate and controversy within the health literature.
Internationally the physiotherapy profession has been actively involved in promoting the use of outcome measures (Chartered Society of Physiotherapy 2000, Cole et al 1994, Kendall 1997). Physiotherapy practice commonly uses health outcome measures for their evaluative purpose, particularly since the advent of evidence-based practice (Huijbregts et al 2002, Klassen et al 2001, Patrick and Chiang 2000). An evaluative outcome measure is used to aid in the measurement of effectiveness (or not, as the case may be) of physiotherapeutic interventions, indicating whether there has been a change in status since the last measurement. Robust and well targeted outcome measurement is therefore integral to clinical trial methodology and to determining whether change can actually be attributed to the intervention delivered.
This paper outlines the historical development of the outcome movement, and then specifically implementation within the physiotherapy profession. The paper introduces three conceptual frameworks and practical considerations for usage of outcome measures. Finally an emphasis has been placed on how best to test the quality of an outcome measure to enable an informed interpretation of the outcome measure result. To aid clarity and practical application, examples from physiotherapy practice are supplied with further references given where depth of discussion is beyond the scope of this paper.
DEVELOPMENT OF THE OUTCOME MOVEMENT-HISTORY
Relman (1988) suggests that there have been three distinct revolutions in western health care. The first revolution, the 'Era of Expansion', extended from the 1920s to the 1960s, when significant growth in hospitals and specialists occurred along with increasingly sophisticated technology. Inflationary pressures of the first revolution led to the second revolution, the 'Era of Cost Containment', from the 1960s to the 1980s. The third revolution in health care, the 'Era of Assessment and Accountability', was brought about by the need for information which would aid in the rationalisation, effectiveness and quality of health care. Relman (1988) reported that there were significant variations in practice across many sectors of health care, with differences in utilization rates, costs and the effects of interventions. These variations seemed to occur without any discernible measurable difference in patient outcomes (Gersten 1998). As a result, funders, particularly health insurance companies, had a significant role in driving the development of outcome measurements as a means of assuring that the treatment they were paying for was effective (Relman 1988). Health professionals recording that the patient reported 'feeling better' was no longer considered sufficient. Funders required objective measurements that could demonstrate that they were getting value for their money. One of the primary tools for obtaining this information was the use of outcome measures and hence this era has also been referred to as the 'Outcome Movement' (Epstein 1990).
DEVELOPMENT OF THE OUTCOME MOVEMENT-PHYSIOTHERAPY
Physiotherapy is reportedly often aligned with the traditional medical model (Nicholls and Larmer 2005). As medical interventions were put under the spotlight regarding treatment effectiveness, so this same process was applied to physiotherapy interventions. Hence, as the medical profession and those interacting with them saw outcome measures as a way of answering the critics, physiotherapists were also encouraged to incorporate standardised tests and measurements (outcome measures) in their practice (Rothstein et al 1991).
Physiotherapy professional organisations (international and national) are incorporating more overtly the use of outcome measures within core documentation. Specifically physiotherapy outcome measures have been described as '...a scale utilised and interpreted by physical therapists [physiotherapists] designed to measure a specific attribute [of interest to patient and therapist] that is expected to change owing to the intervention of a physical therapist [physiotherapist]' (Mayo et al 1993 p.81). In addition, for data from an outcome measure to be 'trustworthy', it must have been evaluated and results reported in peer reviewed literature demonstrating adequate measuring properties (Mayo et al 1993).
In 1994 the Chartered Society of Physiotherapy in the United Kingdom, as part of a quality assurance initiative, indicated the growing importance of taking accurate tests and measurements within general documentation (Chartered Society of Physiotherapy 1994). By 2000 the Chartered Society of Physiotherapy (2000) had identified 22 core standards of professional practice, one (Standard 6) requiring members to use appropriate and high quality outcome measures in their routine clinical practice. Naming outcome measures within the core standards raises their profile and reflects the increasing importance for physiotherapists of collecting and utilising appropriate information to inform their practice.
Nationally, the New Zealand Physiotherapy Board published the second edition of its registration requirements, a document that describes 10 competencies to enable a physiotherapist to register to practise in New Zealand (New Zealand Physiotherapy Board 1999). At least four of the 10 competencies (3, 4, 8 and 10) contain terminology indicating that outcome measures should be applied and evaluated.
Although there has been this growing understanding of outcome measures, research with physiotherapy practitioners internationally and within New Zealand suggests that there is not a clear understanding of the use and interpretation of outcome measures (Huijbregts et al 2002, Kendall 1997). The aim here therefore is to further enhance the understanding of the use and interpretation of outcome measures. The following section identifies three conceptual frameworks that health outcome measures can be aligned with to enhance the appropriate choice of which outcome measure(s) to use.
Three of the most dominant frameworks suggested for the measurement of health outcomes are: the International Classification of Functioning, Disability and Health (ICF), Health Related Quality of Life (HRQoL) and thirdly, cost (Finch et al 2002).
The first conceptual framework, the ICF, precursors of which were known as the International Classification of Impairment Disability and Handicap (ICIDH), is a comprehensive conceptual framework of outcomes in the measurement of health (World Health Organisation 2001). The ICF assigns the term 'functioning' as encompassing the positive components of health, and 'disability' as encompassing the negative components of health. Disability is further subdivided into impairments, activity limitations and participation restrictions within the context of environmental facilitators and barriers.
The second conceptual framework for the measurement of health outcomes is HRQoL. Although the precise definition of HRQoL is debated, there is agreement that HRQoL measures include multiple dimensions and that they are important to the individual and relevant to the particular health intervention (Oldridge 1997). HRQoL is purported to include dimensions that describe a person's physical, social and psychological health (Bulpitt 1997, Oldridge 1997, Stewart et al 1987). Some HRQoL instruments not only measure specific dimensions but also place a value on each dimension (Muldoon et al 1998).
The third conceptual framework that a health outcome could be identified with is the cost of the service, both direct and indirect. Direct cost relates to the resources consumed in providing the service and indirect cost refers to the costs to the patient in undergoing an episode of care. The cost to the patient may include family, whanau and other support. This framework is set within the demand for and increases in health care services against a finite health dollar. Physiotherapists may see themselves in the role of patient advocate rather than considering the economic implications of their decisions and therefore may not always consider cost as a primary outcome. However, economic components of health care influence (either explicitly or implicitly) all levels of decision making in health (Kernick 2003, Robinson 1999).
While the three conceptual frameworks have been outlined separately, they can co-exist in practice with areas of overlap and often with no definitive borders (Finch et al 2002, Jette 1993) or may even be in tension with one another. An example of this overlap is demonstrated in cost effectiveness analysis--the comparison of two interventions against a common outcome measure, which would aid the clinical decision as to which intervention to use (Kernick 2003). In this situation an outcome indicator from within the ICF framework would be compared to the cost outcome in the delivery of the two interventions being compared. Therefore, the choice of specific health outcome measures relating to these two frameworks will depend on multiple issues, including who the information is for and what it is being used for.
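The comparison described above can be illustrated with a small worked example. The sketch below computes an incremental cost-effectiveness ratio, which is one common way of relating a cost outcome to a shared health outcome; it is not a method prescribed by this paper. The function name and all figures are hypothetical and chosen purely for illustration.

```python
def incremental_cost_effectiveness_ratio(cost_a, effect_a, cost_b, effect_b):
    """Extra cost per extra unit of outcome gained by choosing B over A.

    Both effects must be expressed on the same outcome measure
    (for example, points of improvement on one functional scale).
    """
    delta_effect = effect_b - effect_a
    if delta_effect == 0:
        raise ValueError("The interventions have the same effect; the ratio is undefined.")
    return (cost_b - cost_a) / delta_effect


# Hypothetical figures: mean cost per patient (dollars) and mean
# improvement on a common outcome measure for two interventions.
icer = incremental_cost_effectiveness_ratio(
    cost_a=400.0, effect_a=10.0,
    cost_b=550.0, effect_b=15.0,
)
print(icer)  # 30.0 dollars per additional point of improvement
```

Whether 30 dollars per additional point represents good value is not something the ratio itself can answer; that judgement depends on who the information is for and what it is being used for.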
Being able to identify the conceptual framework underpinning an outcome measure is paramount for robust evaluation of health care. Once a health outcome measure is associated within conceptual frameworks the next issue to consider is how practical (or not) it is to complete within an identified setting.
PRACTICAL CONSIDERATIONS FOR USE OF OUTCOME MEASURES
Ideally, the burden for either the physiotherapist or patient in applying or using an outcome measure should be minimal. Burden could be perceived or actual, with examples for the physiotherapist reported previously as including cost, time, and limited knowledge of appropriate outcome measures (Cole et al 1994, Kendall 1997). Burden for the patient may include time, cultural barriers, perception of relevance, and understanding of the measure.
One of the most commonly used health outcome measures in current usage is the Short-Form 36 (SF-36) questionnaire which purports to measure health status (Ware and Sherbourne 1992). The SF-36 has been widely used and tested across countries, within different diseases and in different health states since it was first developed (Ware and Gandek 1998). Whilst this questionnaire has undergone extensive validation and improvements this outcome measure can be used to illustrate possible examples of burden.
The use of the SF-36 questionnaire requires obtaining user registration, purchasing a licence (approximately $US360), and incidental costs such as photocopying. Depending on the type of administration (self or interviewer-delivered) the questionnaire on average takes between 10 and 30 minutes to complete (Ware et al 2000). The analysis of the data then requires further manipulation of the numerical scores to allow interpretation of the results. Furthermore, although the SF-36 has a specific version for use within New Zealand, this version is only available in English (Sanson-Fisher and Perkins 1998). A patient must be sufficiently fluent in the English language to complete the questionnaire without consultation, as required by the standardised instructions (Ware et al 2000). Within New Zealand's multi-cultural society the language barrier may be too great a burden for some and may also be culturally unacceptable.
Having identified possible burdens to undertaking outcome measurement, each setting must assess its own needs (and constraints) to consider how practical any one outcome measure is to use. In the clinical setting time may be the greatest barrier; however, this barrier may be overcome in the research setting with adequate funding to employ staff to gather outcome measures.
Having identified conceptually where a health outcome may lie and highlighted practical issues surrounding the actual use of health outcomes, the final section of the paper describes how to assess and critique the measurement qualities or properties (psychometrics) of an outcome measure. This will in turn enable a more informed interpretation of the outcome measure result.
Psychometrics is the theory and rules of measurement (Nunnally 1978). For a measure to be used as an outcome measure certain psychometric properties need to be demonstrated. Figure 1 provides a schematic overview of psychometric properties. Two classic psychometric properties, reliability and validity, are described. Finally, a third property that is sometimes overlooked but is clearly important in evaluation of outcome is discussed--the ability to detect change (sensitivity to change and responsiveness).
[FIGURE 1 OMITTED]
All measurement involves some internal error or lack of precision, whether in measuring joint angle or blood pressure (which are sometimes referred to as 'hard' measurements) or HRQoL (sometimes called 'soft') (Fries 1983). It is therefore important for the potential user, clinician or researcher to assess the degree of validity and reliability of any measure for a specific population and for a specific purpose (Kirshner and Guyatt 1985). From a mathematical perspective, error of measurement can be in the form of either non-random error (which relates to validity) or random error (which relates to reliability).
Non-random error is a systematic biasing that affects measurement (Nunnally 1978). This can be explained by considering the situation of a physiotherapist measuring joint range of motion using a goniometer. Although the goniometer measures precisely, it may have been calibrated incorrectly (measuring 13 degrees higher than it should). The extent of validity of an outcome measure is dependent on the degree of non-random error. The larger the non-random error the less valid the measure is.
Random error can be defined as chance unexplained fluctuations in data (Nunnally 1978). An example of random error can be demonstrated using the previous example of measuring joint range of motion with a goniometer. Random error would occur if the goniometer was accurate, but the physiotherapist had eyesight problems and misread the angle whilst taking repeated measurements, on some occasions reading slightly higher and on some occasions slightly lower. The larger the random error the less reliable the measurement instrument is.
For measurement instruments to perform as successful outcome measures, both random and non-random error of measurement need to be minimised and this is done by ensuring the measure is valid and reliable.
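The distinction between the two kinds of error in the goniometer example can be made concrete with a short simulation. The sketch below is illustrative only: the true joint angle of 90 degrees, the reading error of 3 degrees (standard deviation), the number of readings and the function names are all assumptions introduced here, and only the 13 degree miscalibration comes from the example above.

```python
import random
import statistics


def simulate_readings(true_angle, bias, noise_sd, n, seed):
    """Simulate n goniometer readings with a fixed bias and random noise."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return [true_angle + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


def summarise(readings, true_angle):
    """Return (mean error, spread) for a set of repeated readings."""
    mean_error = statistics.mean(readings) - true_angle  # systematic offset
    spread = statistics.pstdev(readings)                 # scatter of readings
    return mean_error, spread


TRUE_ANGLE = 90.0  # assumed true joint angle in degrees

# Non-random error: a precise goniometer calibrated 13 degrees too high.
miscalibrated = simulate_readings(TRUE_ANGLE, bias=13.0, noise_sd=0.0, n=200, seed=1)

# Random error: an accurate goniometer that is misread, sometimes high
# and sometimes low (assumed standard deviation of 3 degrees).
misread = simulate_readings(TRUE_ANGLE, bias=0.0, noise_sd=3.0, n=200, seed=1)

for label, readings in (("miscalibrated", miscalibrated), ("misread", misread)):
    mean_error, spread = summarise(readings, TRUE_ANGLE)
    print(f"{label}: mean error = {mean_error:.2f}, spread = {spread:.2f}")
```

The miscalibrated instrument gives identical readings that are all wrong by the same amount (no scatter, but poor validity), whereas the misread instrument is right on average but scatters from one reading to the next (poor reliability).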
Validity of an Outcome Measure
Validity--or freedom from non-random error--indicates the extent to which an instrument measures what it is intended to measure (Jette 1993). Validity is a complex concept; it is not an all or nothing property and should be considered in relation to the specific purpose for use and in the specific population of interest. The description of the population should at minimum include the following baseline characteristics: age, gender, ethnicity, diagnosis and severity, and the presence of co-morbidities (Case and Smith 2000, Juni et al 2001).
Validity has historically been divided into three basic types: content validity, criterion validity, and construct validity (Carmines and Zeller 1979, McDowell and Newell 1996, Nunnally 1978). The classic definitions of the types of validity are presented in Table 1. In the development of a measure the concepts to be measured would need to be clearly and comprehensively defined. Content validity addresses whether the measure adequately covers all of the concepts previously defined. As there is no statistical analysis to assess content validity this commonly relies on critique from health care experts and patients from within a particular field (McDowell and Newell 1996).
Unlike content validity, the estimation of criterion and construct validity include statistical methods. Criterion validity would be demonstrated by the extent of correlation between the given instrument and an identified external criterion considered to be gold standard. Therefore, empirical evidence is required for identification of an adequate correlation; however, theory is required to demonstrate the selection of the external criterion.
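As an illustration of the correlation referred to above, the sketch below computes a Pearson correlation coefficient between a new instrument and an external criterion. The choice of the Pearson coefficient, the function name and the scores are assumptions made for this example; other statistics may be more appropriate depending on the type of data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two lists of equal length with at least two scores.")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("Correlation is undefined when one set of scores does not vary.")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for six patients on a new instrument and on an
# external criterion regarded as the gold standard.
new_instrument = [12, 15, 20, 22, 30, 35]
gold_standard = [10, 16, 19, 25, 28, 36]

r = pearson_r(new_instrument, gold_standard)
print(f"r = {r:.2f}")
```

A high coefficient supplies the empirical evidence; it does not by itself justify the choice of external criterion, which still rests on theory.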
Criterion validity at first seems a straightforward aspect of validity; however, challenges exist. First, in some situations there is no criterion gold standard. A relevant example is the absence of a gold standard criterion theoretically linked to the outcome measure of HRQoL (Guyatt 1993, Jette 1993, Nanda and Andresen 1998). Due to this limitation some authors have argued that individuals should act as their own judge for the external criterion (Deyo and Centor 1986, Ni et al 2000). Others advise that using more than one external criterion aids in the validation process, as long as the choice of additional criteria is based on appropriate theoretical evidence (Carmines and Zeller 1979, McDowell and Newell 1996). A second challenge is that original gold standard outcome measures need to be carefully critiqued in their own right as new measures are often evaluated against them (Saltzman et al 1998).
Construct validity concerns the theoretical concepts linking two measures and is determined by examining their empirical relationship. If a high correlation were established, this would add to the body of knowledge supporting the construct validity of a measurement. Construct validity is an aspect of validity where evidence accrues to support or refute the use of a specific instrument. Construct validity should be identified if there is no gold standard criterion or no universal content (validity) that is clearly accepted to define the measure. As the New Zealand population demonstrates considerable cultural diversity it is important to assess construct validity across differing ethnic and cultural groups. This concept has been referred to in the literature as equivalence (Bullinger et al 1993, Hahn and Cella 2003, Herdman et al 1997).
Reliability of an Outcome Measure
Reliability is the stability of a test over time when no important changes have occurred (Jette 1993). An outcome measure is considered to be reliable--to have minimal random error--when it gives the same results (or close to the same) over time when no change has occurred (Carmines and Zeller 1979, Nunnally 1978). Reliability of a measure must be identified for a specific population. If adequate reliability has not been demonstrated then it is unknown whether changes over a given time are the result of a specific intervention or of the poor reliability of the outcome measure.
There is no one single attribute to assess the reliability of a measure. Table 1 identifies and defines three key classical aspects of reliability: internal consistency, interrater (inter-observer), and intrarater (intra-observer, also sometimes referred to as test-retest, repeatability or reproducibility), which all support different aspects of reliability (McDowell and Newell 1996, Nunnally 1978).
It may not be appropriate or necessary to assess all three attributes of reliability for every outcome measure. Internal consistency is an aspect of reliability that is commonly associated with an instrument (such as a questionnaire measuring HRQoL) that has multiple items (questions/statements) and would demonstrate the degree of correlation/cohesion among items within the instrument (Cronbach 1951). The assessment of internal consistency has the practical advantage that it requires only one completion of a measure, whereas the assessments of intra- and interrater reliability both require the completion of a measure twice and hence in certain contexts may be more burdensome. Of course it only provides a single piece of information--the degree to which all items of the measure are addressing a related concept.
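To give a sense of how interrater reliability might be quantified, the sketch below computes Cohen's kappa, a chance-corrected index of agreement between two raters assigning categories. It is offered as one commonly used statistic and as an assumption of this example, since Table 1 is not reproduced here; the ratings and function name are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("Both raters must rate the same, non-empty set of patients.")
    # Proportion of patients on whom the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when both raters use a single category.")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of shoulder movement for eight patients.
physio_1 = ["normal", "limited", "limited", "normal",
            "limited", "normal", "normal", "limited"]
physio_2 = ["normal", "limited", "normal", "normal",
            "limited", "normal", "limited", "limited"]

print(cohen_kappa(physio_1, physio_2))  # 0.5
```

Here the raters agree on six of eight patients (0.75), but half of that agreement would be expected by chance alone, so kappa is 0.5.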
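Internal consistency is usually summarised with the coefficient alpha described by Cronbach (1951). The sketch below applies the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and questionnaire responses are hypothetical and serve only to show that a single administration is sufficient.

```python
import statistics


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("Need at least two items and two respondents.")
    if any(len(row) != k for row in responses):
        raise ValueError("Every respondent must answer every item.")
    items = list(zip(*responses))  # regroup scores item by item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("Alpha is undefined when total scores do not vary.")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five patients, four questionnaire items,
# each scored from 1 to 5, collected at a single administration.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]

print(f"alpha = {cronbach_alpha(responses):.2f}")
```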
Psychometric literature can be challenging as some authors identify and define terms with subtle differences. An example of this challenge is where a fourth attribute of reliability has been described; test-retest (Bowling 2001, Ottenbacher and Tomchek 1993, Rousson et al 2002). As described previously test-retest has been a term that has been used interchangeably with intrarater reliability (McDowell and Newell 1996, Nunnally 1978). However, Ottenbacher and Tomchek (1993) describe test-retest as a term associated with reliability studies when the outcome measure did not require rater observation or judgement. An example of such a situation would be where the measurement of HRQoL was completed via a questionnaire. The completion of the questionnaire relies on the subjectivity of the patient rather than a rater observation or judgement. This therefore reserves the term intrarater reliability for situations where rater observation or judgement is required such as the use of a goniometer. Particular attention is required to clearly define, and adequately reference terminology used in the context of reliability.
An Outcome Measure's Ability to Detect Change
A 'third' psychometric property, responsiveness, has been proposed and described in terms of the ability of an instrument to detect a minimally clinically important change over time, when one is present (Guyatt et al 1987, Kirshner and Guyatt 1985).
Terminology applied to literature exploring the ability to detect change is confusing. Internal responsiveness, external responsiveness, sensitivity and responsiveness tend to be used interchangeably (Hocking et al 1999, Husted et al 2000). Furthermore, some authors describe the ability to detect change as another aspect of validity rather than a separate psychometric property (Hays and Hadorn 1992, Liang 2000, McDowell and Newell 1996). To add to the confusion, sensitivity has a specific but different technical meaning when used in the field of epidemiology as it refers to '... the proportion of persons with a particular disease who are correctly classified as diseased by the test' (McDowell and Newell 1996 p.31). For the particular purpose of this paper and to aid in the ability to clarify the application of statistical tests, the terms sensitivity to change and responsiveness will be used to signify the ability to detect change, with their accompanying definitions as offered by Liang (2000) as included in Table 1.
Sensitivity describes a significant statistical change over time. However, this change may not mean anything to either the patient or the health professional. An instrument, therefore, may detect statistically significant change (ie be 'sensitive') but the patient or the health professional may not consider that change to be meaningful or important. Responsiveness encompasses the notion of a statistical difference, but importantly includes a focus on clinically important change as evaluated or defined from the perspective of either (singularly or a combination of) the person, carer, society or health professional. Responsiveness is therefore reliant on a criterion, external to the instrument, whereas sensitivity is not.
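The difference between the two terms can be illustrated numerically. In the sketch below, sensitivity to change is summarised by the standardised response mean (mean change divided by the standard deviation of change), and responsiveness by the proportion of patients whose change reaches an external criterion of clinically important change. Both indices are choices made for this example rather than the specific tests listed in Table 1, and the scores, the five-point criterion and the function names are hypothetical. Higher scores are assumed to indicate better health.

```python
import statistics


def change_scores(baseline, follow_up):
    """Follow-up minus baseline score for each patient."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("Need paired scores for at least two patients.")
    return [after - before for before, after in zip(baseline, follow_up)]


def standardised_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of change."""
    changes = change_scores(baseline, follow_up)
    sd = statistics.stdev(changes)
    if sd == 0:
        raise ValueError("Undefined when every patient changes by the same amount.")
    return statistics.mean(changes) / sd


def proportion_meeting_criterion(baseline, follow_up, important_change):
    """Share of patients whose change reaches the external criterion."""
    changes = change_scores(baseline, follow_up)
    return sum(c >= important_change for c in changes) / len(changes)


# Hypothetical scores for eight patients before and after treatment.
baseline = [40, 42, 38, 45, 50, 41, 39, 44]
follow_up = [43, 44, 41, 47, 53, 43, 42, 46]

# Hypothetical external criterion: patients judge a change of at least
# five points to be important.
IMPORTANT_CHANGE = 5

print(f"SRM = {standardised_response_mean(baseline, follow_up):.2f}")
print(proportion_meeting_criterion(baseline, follow_up, IMPORTANT_CHANGE))  # 0.0
```

Every patient improves by two or three points, so the change is highly consistent and the standardised response mean is large, yet no patient reaches the five-point criterion. The instrument has detected change that is statistically convincing but not clinically important by this criterion.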
Interpretation of the Outcome Measure Result
The next two steps to consider when interpreting outcome measures are: "How do we estimate psychometric properties?" and "What are the levels of acceptability?" There are many statistical approaches to evaluating the various psychometric properties of measures. Table 1 lists and gives examples of some common forms of statistical tests and guidelines on interpretations of findings. More information can be found in: Physical Rehabilitation Outcome Measures: A Guide to Enhancing Clinical Decision Making (Finch et al 2002), Measuring Health: A Guide to Rating Scales and Questionnaires (McDowell and Newell 1996) and Psychometric Theory (Nunnally and Bernstein 1994).
Reliability, validity (and responsiveness) are not finite concepts. Increasing confidence in the use of a measure would be achieved by evidence gained from multiple studies on differing (and adequately described) populations for clearly identified purposes.
As part of evidence-based practice and clinical decision making processes, physiotherapists are required to assess the effectiveness of their interventions, and the use of appropriate and high quality outcome measures aids this process. This paper has provided a framework for understanding where outcome measures have come from and outlined the important conceptual, practical and mathematical properties that measures should have. Just because an outcome measure is commonly used does not guarantee that it is a good measure.
Unless we as physiotherapists have knowledge of the reliability, validity (and responsiveness) of an outcome measure for a particular purpose in a particular population, how can we correctly, with confidence, interpret findings? Further evaluation of established outcome measures and the development of new outcome measures for differing interventions, in differing settings and with patients demonstrating diversity, should help build confidence in the use of outcome measures. However, knowledge and careful consideration of terminology and methods, in particular statistical analysis, are needed if the most appropriate use and interpretation of outcome measure results is to occur.
Physiotherapists and other health professionals have to balance a number of practical and professional issues in striving for excellence in clinical practice. Being able to wisely use and interpret outcome measures is one such issue given that services are increasingly being contracted for according to their contribution to health gain. Ensuring our patients receive the very best physiotherapeutic interventions means contributing to the appropriate use (and critique) of outcome measures for research and practice. This commitment to health outcome measures is a professional responsibility.
The authors wish to thank Dr Kathryn McPherson, Professor (Rehabilitation), Division of Rehabilitation and Occupation Studies, and Dr Jane Koziol-McLain, Associate Professor, Division of Health Care Practice, AUT University for their comments and advice in the development of this paper.
Baumberg L, Long A and Jefferson J (1995): International workshop: Culture and outcomes, Clearing Houses on Health Outcomes. http:www.leeds.ac.uk/nuffield/infoservices/UKCH/define.html [accessed February 1, 2006].
Bowling A (2001): Measuring Disease: A review of disease specific quality of life measurement scales (2nd ed.). Buckingham: Open University Press.
Bullinger M, Anderson R, Cella D and Aaronson N (1993): Developing and evaluating cross-cultural instruments from minimum requirements to optimal models. Quality of Life Research 2: 451-459.
Bulpitt CJ (1997): Quality of life as an outcome measure. Postgraduate Medicine 73: 613-616.
Carmines E and Zeller R (1979): Reliability and Validity Assessment. Newbury Park: Sage Publications.
Case L and Smith T (2000): Ethnic representation in a sample of the literature of applied psychology. Journal of Consultant Clinical Psychology 68: 1107-10.
Chartered Society of Physiotherapy (1994): Standards for Tests and Measures in Physiotherapy. London, United Kingdom.
Chartered Society of Physiotherapy (2000): Core Standards. London, United Kingdom.
Cohen J (1988): Statistical Power Analysis for the Behavioural Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cole B, Finch E, Gowland C and Mayo N (1994): Physical Rehabilitation Outcome Measures. Ontario: Canadian Physiotherapy Association.
Cronbach L (1951): Coefficient alpha and the internal structure of tests. Psychometrika 16: 297-334.
Deyo RA and Centor RM (1986): Assessing the responsiveness of functional scales to clinical change: An analogy to diagnostic test performance. Journal of Chronic Diseases 39: 897-906.
Duckworth M (1999): Outcome measurement selection and typology. Physiotherapy 85: 21-27.
Epstein A (1990): The outcome movement: Will it get us where we want to go? New England Journal of Medicine 323: 266-270.
Finch E, Brooks D, Stratford P and Mayo N (2002): Physical Rehabilitation Outcome Measures: A Guide to Enhance Decision Making (2nd ed.). Hamilton: Lippincott, Williams & Wilkins.
Fries JF (1983): Towards an understanding of patient outcome measurement. Arthritis and Rheumatism 26: 697-704.
Gersten P (1998): Outcome research: A review. Neurosurgery 43: 1146-1156.
Guyatt G (1993): Measurement of health-related quality of life in heart failure. Journal American College of Cardiology 22: 185-191.
Guyatt G, Walters S and Norman G (1987): Measuring change over time: assessing the usefulness of evaluative instruments. Journal of Chronic Diseases 40: 171-178.
Hahn E and Cella D (2003): Health outcomes assessment in vulnerable populations: Measurement challenges and recommendations. Archives of Physical Medicine and Rehabilitation 84: S35-S42.
Hanley JA and McNeil BJ (1982): The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29-36.
Hays R and Hadorn D (1992): Responsiveness to change: an aspect of validity, not a separate dimension. Quality of Life Research 1: 73-75.
Herdman M, Fox-Rushby J and Badia X (1997): Equivalence and the translation and adaptation of health-related quality of life questionnaires. Quality of Life Research 6: 237-247.
Hocking C, Williams M, Broad J and Baskett J (1999): Sensitivity of Shah, Vanclay and Cooper's Modified Barthel Index. Clinical Rehabilitation 13: 141-147.
Huijbregts M, Myer A, Kay T and Gavin T (2002): Systematic outcome measurement in clinical practice: Challenges experienced by physiotherapists. Physiotherapy Canada 54: 25-31.
Husted J, Cook R, Farewell V and Gladman D (2000): Methods for assessing responsiveness: A critical review and recommendations. Journal of Clinical Epidemiology 53: 459-468.
Jette AM (1993): Using health-related quality of life measures in physical therapy outcome research. Physical Therapy 73: 528-537.
Juni P, Altman D and Egger M (2001): Assessing the quality of controlled clinical trials. British Medical Journal 323: 42-46.
Kendall N (1997): Developing outcome assessments: A step-by-step approach. New Zealand Journal of Physiotherapy Dec: 11-17.
Kernick D (2003): Introduction to health economics for the medical practitioner. Postgraduate Medical Journal 79: 147-150.
Kirshner B and Guyatt G (1985): A methodological framework for assessing health indices. Journal of Chronic Diseases 38: 27-36.
Klassen L, Grzybowski W and Rosser B (2001): Trends in physical therapy research and scholarly activity. Physiotherapy Canada 53: 40-47.
Landis J and Koch G (1977): The measurement of observer agreement for categorical data. Biometrics 33: 159-174.
Liang MH (2000): Longitudinal construct validity: Establishment of clinical meaning in patient evaluation instruments. Medical Care 38: 84-90.
Mayo N, Cole B, Dowler J, Gowland C and Finch E (1993): Use of outcomes in physiotherapy: A survey of current practice. Canadian Journal of Rehabilitation 7: 81-82.
McDowell I and Newell C (1996): Measuring Health: A Guide to Rating Scales and Questionnaires (2nd ed.). New York: Oxford University Press.
Muldoon M, Barger S, Flory J and Manuck S (1998): What are quality of life measurements measuring? British Medical Journal 316: 542-545.
Nanda U and Andresen EM (1998): Health-related quality of life: A guide for the health professional. Evaluation & The Health Professions 21: 197-215.
New Zealand Physiotherapy Board (1999): Registrations Requirements: Competencies and Learning Objectives (2nd ed.). Wellington, New Zealand.
Ni H, Toy W, Burgess D, Wise K, Nauman DJ, Crispell K and Hershberger RE (2000): Comparative responsiveness of Short-Form 12 and Minnesota Living with Heart Failure Questionnaire in patients with heart failure. Journal of Cardiac Failure 6: 83-91.
Nicholls D and Larmer P (2005): Possible futures for physiotherapy: An exploration of the New Zealand context. New Zealand Journal of Physiotherapy 33: 55-60.
Nunnally J (1978): Psychometric Theory (2nd ed.). New York: McGraw-Hill.
Nunnally J and Bernstein I (1994): Psychometric Theory (3rd ed.). New York: McGraw-Hill.
Oldridge N (1997): Outcome assessment in cardiac rehabilitation: Health related quality of life and economic evaluation. Journal of Cardiopulmonary Rehabilitation 17: 179-194.
Ottenbacher K and Tomchek S (1993): Reliability analysis in therapeutic research: Practice and procedures. The American Journal of Occupational Therapy 47: 10-16.
Patrick D and Chiang Y (2000): Measurement of health outcomes in treatment effectiveness evaluation: Conceptual and methodological challenges. Medical Care 38: S14-S25.
Relman A (1988): Assessment and accountability: The third revolution in medical care. New England Journal of Medicine 319: 1220-1222.
Robinson R (1999): Limits to rationality: Economics, economists and priority setting. Health Policy 49: 13-26.
Rothstein J, Campbell S, Echternach J, Jette A, Knecht H and Rose S (1991): Standards for tests and measurements in physical therapy practice. Physical Therapy 71: 589-622.
Rousson V, Gasser T and Burkhardt S (2002): Assessing intrarater, interrater and test-retest reliability of continuous measurements. Statistics in Medicine 21: 3431-3446.
Saltzman C, Mueller C, Zwior-Maron K and Hoffman R (1998): A primer on lower extremity outcome measurement instruments. Iowa Orthopaedic Journal 18: 101-111.
Sanson-Fisher RW and Perkins JJ (1998): Adaptation and validation of the SF-36 health survey for use in Australia. Journal of Clinical Epidemiology 51: 961-967.
Stewart A, Greenfield S, Hays R, Wells K, Rodgers W, Berry S, McGlynn E and Ware J (1987): Functional status and well-being of patients with chronic conditions. Journal of the American Medical Association 267: 907-913.
Ware J and Gandek B (1998): Overview of the SF-36 health survey and the international quality of life assessment (IQOLA) project. Journal of Clinical Epidemiology 51: 903-912.
Ware J, Kosinski M and Gandek B (2000): SF-36 health survey: Manual and interpretation guide. Lincoln: QualityMetric Incorporated.
Ware J and Sherbourne C (1992): The MOS 36-item short-form health survey (SF-36). Medical Care 30: 473-483.
Wennberg J and Glittelsohn A (1982): Variations in medical care among small areas. Scientific American 246: 120-134.
World Health Organisation (2001): International Classification of Functioning, Disability and Health. http://www3.who.int/icf/icftemplate.cfm [Accessed November 20, 2002].
* Health outcome measures are being embedded into physiotherapy practice.
* Not all health outcome measures are good health outcome measures.
* Health outcome measures should be placed within conceptual frameworks and be practical.
* Health outcome measures should be reliable, valid and responsive for a particular purpose in a particular population.
Diana Horner MHSc(Hons), BSc(Hons) Physiotherapy, Senior Lecturer, School of Physiotherapy, Auckland University of Technology, Private Bag 92006, Auckland, New Zealand. Email email@example.com. Tel. 00 64 9 921 9999 ext 7083. Fax. 00 64 9 921 9620
Peter J Larmer, Senior Lecturer, Division of Rehabilitation and Occupation Studies, Auckland University of Technology
Table 1. Psychometric Properties: Definitions, Common Statistical Tests and Guidelines on Interpretation.

Validity

Content
Definition: The adequacy with which an instrument addresses/samples all relevant aspects that were defined in the conceptual definition of the instrument (Nunnally and Bernstein 1994).
Statistical tests: Not applicable.
Guidelines to interpretation: Not applicable.

Criterion
Definition: The ability of an instrument to estimate some important feature or behaviour that is external to the actual measuring tool itself, the feature or behaviour being known as the criterion (Nunnally 1978). If the criterion exists in the present this is referred to as concurrent validity, whereas the term predictive validity applies to an external criterion that is to be measured in the future.
Statistical tests: Correlation statistics: Pearson's Product-Moment Correlation Coefficient (r); Spearman's Rank Correlation Coefficient (r_s).
Guidelines to interpretation: r or r_s: .10 = small; .30 = medium; .50 = large (Cohen 1988). NB. Correlation coefficients range from -1 to +1. Negative coefficients indicate a negative correlation. Positive coefficients indicate a positive correlation.

Construct
Definition: The '... extent to which a particular measure relates to other measures consistent with theoretically derived hypotheses concerning the concepts (or constructs) that are being measured' (Carmines and Zeller 1979 p.23). Convergent validity tests for correlations with other instruments intending to measure the same or similar concepts; divergent validity tests for a lack of correlations with instruments that assess concepts that are opposite (McDowell and Newell 1996).

Reliability

Internal consistency
Definition: Consistency or homogeneity of a particular instrument or measurement across its items (Cronbach 1951).
Statistical tests: Cronbach's Alpha (α); Kuder-Richardson Formula 20 (KR-20).
Guidelines to interpretation: α or KR-20: < .70 = inadequate; ≥ .70 = good; ≥ .80 = excellent (Nunnally 1978). NB. Coefficients range from 0-1.

Interrater
Definition: Indicates the consistency/level of agreement of results when two or more raters/assessors complete the same measurement on the same patient(s) where there is no evidence of change (McDowell and Newell 1996).

Intrarater
Definition: Indicates the consistency/level of agreement of results when repeatedly completing the same measurement by the same rater/assessor where there is no evidence of change (McDowell and Newell 1996).

Interrater and intrarater
Statistical tests: Agreement statistics: Intraclass Correlation Coefficients (ICC); Kappa statistics (κ).
Guidelines to interpretation: ICC: < .70 = inadequate; ≥ .70 = good; ≥ .80 = excellent (Nunnally 1978). κ: < .40 = poor; .41-.60 = moderate; .61-.80 = substantial; > .80 = almost perfect (Landis and Koch 1977). NB. Coefficients range from 0-1.

Ability to detect change

Sensitivity
Definition: '... the ability of an instrument to measure change in a state regardless of whether it is relevant or meaningful to the decision maker' (Liang 2000 p.85).
Statistical tests: Effect sizes (ES).
Guidelines to interpretation: ES: .20 = small size; .50 = moderate size; .80 = large size (Cohen 1988). NB. ES have no upper limit.

Responsiveness
Definition: '... the ability of an instrument to measure a meaningful or clinically important change in a clinical state' (Liang 2000 p.85).
Statistical tests: Receiver Operating Characteristic (ROC) Curve; Correlation statistics.
Guidelines to interpretation: ROC auc: ≤ .50 = inadequate discrimination; > .60 = adequate discrimination; ≥ .80 = good discrimination (Hanley and McNeil 1982). NB. Coefficients range from 0-1. r or r_s: as above (see validity).
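To make two of the reliability statistics in Table 1 more concrete, the following is a minimal illustrative sketch in Python and is not part of the original paper. It computes Cronbach's alpha for a small set of item scores and an unweighted two-rater Cohen's kappa (assumed here as one common instance of the 'Kappa statistics' listed), then labels each result using the interpretation bands from Table 1. The function names and example data are hypothetical, and the handling of values falling exactly between the published kappa bands (for example .40 or .605) is an assumption, since the table leaves small gaps there.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of respondents,
    each a list of scores on the same k items (hypothetical layout)."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each respondent's total score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters rating the same patients."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's category frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


def interpret_alpha(alpha):
    """Bands from Table 1 (Nunnally 1978)."""
    if alpha >= 0.80:
        return "excellent"
    if alpha >= 0.70:
        return "good"
    return "inadequate"


def interpret_kappa(kappa):
    """Bands from Table 1 (Landis and Koch 1977).
    Assumption: values between published bands fall into the lower band."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical data: 4 respondents scored on 3 items.
    scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4]]
    alpha = cronbach_alpha(scores)
    print(f"alpha = {alpha:.2f} ({interpret_alpha(alpha)})")

    # Hypothetical data: two raters classifying 8 patients.
    a = ["improved"] * 4 + ["same"] * 4
    b = ["improved"] * 3 + ["same"] * 4 + ["improved"]
    kappa = cohen_kappa(a, b)
    print(f"kappa = {kappa:.2f} ({interpret_kappa(kappa)})")
```

With these made-up data the internal consistency would be rated excellent, while the two raters would show only moderate agreement. As the key points above note, such a label only has meaning for a particular purpose in a particular population.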