
Undercounts in offender data and closing the gap between Indigenous and other Australians.


The accuracy of administrative data for Indigenous populations and ethnic minorities is an important matter for researchers and policy-makers. For example, if governments do not have reliable data on who is Indigenous in administrative data that are used to plan services, then they cannot plan adequately to address Indigenous disadvantage. Consequently the major thrust of Indigenous policy that attempts to 'Overcome Indigenous Disadvantage' (OID) and 'Close the Gap' between Indigenous and non-Indigenous life expectancy is undermined (Steering Committee for the Review of Government Service Provision (SCRGSP) 2003; 2005; 2007).

Undercounts of Indigenous people are present in almost all data sets, even in the Census, and hence it is difficult to know the exact number of people covered by a particular policy. The Australian Bureau of Statistics (ABS) uses sophisticated techniques to correct for this tendency towards under-enumeration in the Indigenous population in Census data, but it is relatively rare to find similar corrections applied to data collected for administrative purposes by government agencies. In addition to undermining the credibility of the information provided from such sources, poor quality data have profound consequences for the policy settings. For example, the Commonwealth Grants Commission's (CGC's) horizontal fiscal equalisation formula is partially based on the Indigenous Estimated Residential Population (ERP), hence data quality for Indigenous mortality and fertility will fundamentally affect the distribution of resources in Australia's federal system.

While data quality is a rather dry topic, it is clearly very important. The objective of this paper is to present some analysis to illustrate how one might go about critically examining administrative data on the Indigenous population. While many administrative data sets collect information on the Indigenous status of clients, often the quality of such data is questionable.

One of the largest categories in both population Censuses and administrative data is that of Indigenous status listed as unknown. The fundamental premise of this paper is that it is important to understand this category lest the data quality be fundamentally compromised. If the unknowns are a substantial component of the population, then one cannot be certain one has correctly estimated the incidence of phenomena in the Indigenous and the residual Australian population. That is, if the unknowns are more like the non-Indigenous population than Indigenous Australians, then a comparison between the known Indigenous and non-Indigenous populations would overstate Indigenous disadvantage. Of course the obverse of this proposition is also true (that is, if the unknowns are more like those who clearly indicate their indigeneity, then Indigenous disadvantage is likely to be understated).

It is obviously important to understand who identifies as an Indigenous person at a particular point in time, but it is arguably even more important to understand how someone's identity might change over time. This is because, as alluded to above, administrative data are increasingly used to ascertain whether Indigenous outcomes are improving according to various indicators (for example, in OID reports). If data quality is unreliable and uncertain, then one should question the level of investment warranted to achieve desired outcomes. Real resources can be diverted from effective programs because of spurious trends in outcomes identified using unreliable data, especially when funding available for social policy is scarce.

This paper illustrates some of the issues involved in using administrative data from the Re-Offending Database (ROD) provided by the NSW Bureau of Crime Statistics and Research (BOCSAR) (Snowball & Weatherburn 2006). This data set has two features that make it suitable for a study of the quality of Indigenous data: first, it is collected over a period of time during which it is possible for Indigenous people to change their identification; and second, data on Indigenous status are collected from two 'independent' sources and this allows us to validate our estimates.

The standard approach adopted to estimate the number of people missing from any particular enumeration is that of a follow-up survey (Marks et al. 1974). Such a survey is undertaken after major Censuses in most developed countries; in Australia it is known as the Post-Enumeration Survey (PES). (1) Another method for estimating the potential Indigenous population is the Dual System Estimator (DSE), sometimes referred to as 'dual survey estimators' or 'dual record systems', which can be used to benchmark the estimate of the Indigenous population within the NSW criminal court system.

The next section describes broader issues regarding historical changes in the Indigenous population, followed by a detailed introduction to the ROD data. It is proposed that a statistical model be used to predict whether the unknowns are more like the Indigenous or non-Indigenous population. After presenting a summary of the results estimated from this model, the penultimate section benchmarks these results using a simple DSE that has been used to estimate populations for various groups in many countries (see Hunter & Dungey 2006). The final section provides some concluding remarks about the utility of the estimators used and points to future research directions that might prove useful for policy makers and researchers. While this paper attempts to build confidence in the data, we also hope to raise basic issues that can and should be considered by anyone who collects and analyses data on Indigenous Australians.

The big picture for Indigenous population changes

In general, population levels change over time according to the demographic balancing equation, which is an accounting identity that measures the flows into and out of a particular population (Shryock, Siegel & Associates 1976: 4):

$$\mathrm{ERP}_{t+1} = \mathrm{ERP}_t + \mathrm{Births}_t - \mathrm{Deaths}_t + \mathrm{Net\ Migration}_t + \mathrm{Census\ Procedure}_t + E_t \qquad (1)$$

$\mathrm{ERP}_t$ is the time-specific Estimated Residential Population and $E_t$ is an error or residual term. The term $\mathrm{ERP}_t$ takes into account the tendency to miss some people when counting populations--that is, net undercount at a particular point of time. The recognition of this tendency does not deny the existence of double counting in some circumstances, but the reality is that many Indigenous people do not identify, or choose not to identify, as Indigenous, for reasons that could include past experiences of racism or discrimination. The balancing equation can be characterised as an accounting identity when one examines the total population because the residual term means that this equation is always true by definition. For sub-populations such as the Indigenous and non-Indigenous populations the changes are complicated by non-biological population growth (sometimes called an 'error of closure', which is largely embodied in the term $E_t$). This non-biological growth can include components such as those due to increased (or decreased) propensity to identify as Indigenous, to inter-marriage between various sub-populations and to identification of the resulting progeny from such marriages. Another factor that affects non-biological growth is the change in both coverage of sub-populations (Guimond 1999) and the Census editing procedures (Ross 1999). The Census procedures in equation (1) are sometimes included in estimates of the 'error of closure' because it is difficult to get a precise measure of the effect of collection methodology on estimated populations.
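To make the role of the residual term concrete, the following minimal sketch rearranges equation (1) to solve for the error of closure. The function name and every figure in it are hypothetical and chosen only for illustration; they are not ABS estimates.

```python
def error_of_closure(erp_next, erp_now, births, deaths,
                     net_migration, census_procedure=0):
    """Residual E_t implied by the balancing equation (1)."""
    # Population expected from the measured components alone.
    expected = erp_now + births - deaths + net_migration + census_procedure
    # Whatever is left over is attributed to the residual term.
    return erp_next - expected


# Hypothetical figures, for illustration only.
residual = error_of_closure(
    erp_next=120_000,
    erp_now=100_000,
    births=12_000,
    deaths=3_000,
    net_migration=500,
)
print(residual)  # 10500
```

In this made-up case just over half of the measured growth of 20,000 cannot be explained by births, deaths or migration, which is the kind of non-biological growth the text describes.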

There are three possible sources of change in responses to ethnic affiliation, sometimes called Ethnic Mobility (Brown et al. 2010: 46; also see Westbrooke & Jones 2002):

* unreliability in measurement;

* changes due to alterations in ethnicity questions; and

* conscious changes in ethnic affiliation.

Switching affiliation between ethnic groups can be the result of changing incentives, both positive and negative (Brown et al. 2010: 46). Overall, 8 per cent of respondents to New Zealand's Survey of Family, Income and Employment (SoFIE) changed ethnicity at least once during three recent waves of that survey between 2002-05 (Carter et al. 2009).

The substantial uncertainty about the size of the Indigenous population means that trends in related outcomes are particularly difficult to interpret. This is because changes in an individual's Indigenous status over time lead to changes in the composition of the Indigenous population, which is difficult, if not impossible, to take into account before the population has identified itself at a particular point in time. Predicting future outcomes for the Indigenous Australian population is fraught for the same reason: one can never be sure about the extent of the non-biological growth. The main point is that it is difficult to evaluate long-running initiatives and changes in a policy regime because researchers can never be sure that cross-sectional data measured over time relate to the same group of individuals. That is, longitudinal data are required in order to make sense of trends in Indigenous welfare and other outcomes.

The Census counts of Indigenous people, and the related issues of undercounts and the increasing propensity to identify as Indigenous, illustrate that many who do not self-identify as Indigenous in one Census may eventually identify as Indigenous in a later Census.

Two questions arise in this context: 1. who is likely to change their Indigenous status?; and 2. is the process of increasing propensity to identify as Indigenous beginning to wane? With respect to the latter question the non-biological increase in the Indigenous population is likely never to be finalised because intermarriage between Indigenous and non-Indigenous people means that the resulting children are likely to self-identify as Indigenous in some circumstances. There is a high and increasing rate of intermarriage especially in urban areas.

With respect to question 1, it is worth noting that the level of non-response to the Indigenous status question in the Censuses is over twice the size of the actual number of people who identified as Indigenous directly. ABS (2007a) reports that 5.7 per cent of Australians did not respond to the question on Indigenous status in the 2006 Census (that is, using usual residents counts).

Another general observation that can be made from the ABS (2007a) results is that the Census questions to which respondents are more likely not to respond are questions which relate only to part of the population or for which respondents are uncertain about the appropriate response (for example, residential status in non-private dwelling, unpaid domestic work and unpaid assistance to a person with a disability). The size of the non-response rates for the Indigenous status question is not unduly large compared to other questions, but the overall number of non-responses to that question is particularly significant given the small overall number of people who identified as Indigenous.

This paper argues that those who do not indicate their Indigenous status in one year are the most likely part of the population to indicate they are Indigenous in future. Even if a relatively small proportion of these 'unknown' respondents change their status to Indigenous in the future, the Census-based estimates of the Indigenous population will be measured with considerable error, because there are so many respondents who are 'unknowns'.

Figure 1 shows the demographic profiles for Indigenous, non-Indigenous and unknown-Indigenous-status populations in the Census and the unknown Indigenous status in the court data used in the ROD. People with unknown Aboriginal and Torres Strait Islander (ATSI) status in the ROD are actually closer to the Indigenous Census profile than that for other Australians. That may be partially driven by the younger profile of the population involved in the criminal justice system. The importance of this figure is that it illustrates that it is still worth asking questions about the unknown categories and the remainder of this paper does that. Note that the following analysis refers to both ATSI status in criminal court data at a particular point in time, and the consolidated Indigenous status in the ROD. The ROD status indicates whether a person identified as an Indigenous person at any appearance in the ROD.


Another observation that can be made in Figure 1 is that the demographic profile of the not stated category in the Census is not substantially different to the self-identified non-Indigenous population. Given that Indigenous people are such a small minority of the Australian population this is not unexpected.

Data and Method

Overview of the ROD

The ROD is not a simple data source that has been consistently collected and collated; it is a compilation of several data sets all potentially constructed using different criteria. The ROD has evolved over time in response to data quality issues as they become apparent. BOCSAR has done an invaluable job in combining the data so that they are broadly comparable for major socio-demographic characteristics. Data from the Children's, Local and Higher Courts include a profile of sex, age, ATSI status and current location (measured at various levels--including Postcode, Local Government Area and Statistical Divisions) for each particular court appearance. These data sources also include information on offence and penalty but such data was not used in this study since we were primarily interested in a person's individual characteristics and identity to better estimate the Indigenous populations. Data from Youth Justice Conferences and custodial data from the Department of Correctional Services were also available but were not used for similar reasons.

The courts' ATSI status indicator is sourced from the police records (so when the matter goes to court, the ATSI status is filled in from the police file rather than the courts recording it separately). There is no audit between the court records and police records.

In the remainder of this paper we refer to the 'consolidated Indigenous identifier' whereas the raw police data will be referred to as the 'ATSI indicator'. Police data, and hence court data, on ATSI status vary over time because of the failure of police to record that information and the difficulty in reconciling names provided at different points in time. The quality of the police use of this flag apparently increased after 1995 when the Department began to increase the emphasis on the gathering of ATSI status (personal communication from Don Weatherburn, Director of BOCSAR). (2)

The total number of unique individuals appearing each year in the ROD data increased gradually over the period examined in this paper from around 90,000 in 1994 to almost 110,000 in 2006.

It is important to understand the structure of the data used in this study. We did not follow the characteristics of individuals for every court appearance (at any point in time) in the study period, as this would have entailed an enormous amount of data that would be difficult to manage. While BOCSAR had access to all the criminal court records in the ROD we reduced the size and dimensionality of the data management exercise by focussing on the characteristics of people for their first appearance in a court in each year between 1994 and 2006. Obviously, this means that our analysis did not capture all the information on the ROD where there were multiple court appearances in a particular year. However, we would argue that this simplification is justifiable in that the information used in our study either does not change or is slow to change over time (demographic characteristics, Indigenous status and location of residence). By focusing on annual changes we can capture most of the variation in the data.
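The data reduction described above can be sketched as follows. This is only an illustration of keeping each person's first court appearance in each calendar year; the record layout, identifiers and values are hypothetical and do not reflect the actual structure of the ROD.

```python
from datetime import date

# Hypothetical records: (person_id, appearance_date, atsi_status).
appearances = [
    ("p1", date(1994, 3, 2), "unknown"),
    ("p1", date(1994, 9, 15), "ATSI"),
    ("p1", date(1995, 1, 20), "ATSI"),
    ("p2", date(1994, 6, 7), "non-ATSI"),
]


def first_appearance_per_year(records):
    """Keep each person's earliest appearance in every calendar year."""
    first = {}
    for person_id, when, status in records:
        key = (person_id, when.year)
        # Replace the stored record only if this one is earlier.
        if key not in first or when < first[key][1]:
            first[key] = (person_id, when, status)
    return sorted(first.values(), key=lambda r: (r[0], r[1]))


reduced = first_appearance_per_year(appearances)
for row in reduced:
    print(row)
```

The hypothetical person p1 shows the cost of the simplification: the later 1994 appearance, where ATSI status was recorded, is dropped in favour of the earlier one where it was unknown.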

Of the 89,945 individuals recorded in our ROD data at some stage in 1994, fewer than 20,000 also appeared in court in 1995. The number of these individuals in subsequent years declined over time but was generally over 10,000 people reappearing in any given year (n.b., the only exception was 2006 when the number of the original individuals appearing at least once was 9,579). While the rate of re-offence was quite high over the period examined, some of the original individuals in the ROD in 1994 did not appear in later years. Of course, other people appeared for the first time in the ROD in 1995 and later years. That is, the ROD data are intrinsically dynamic, and hence it is not easy to provide a summary overview.

The fluctuations in the percentage of unknowns in ROD provide prima facie evidence of the data quality issues. Of course, if the percentage of unknowns in the administrative data is over 50 per cent there is literally more you do not know about the Indigenous population than you do know. There is no definitive rule whereby one can establish that administrative data are reliable, but if you take this time series in Figure 2 as a starting point, then the level of unknowns decreases around 1997 or 1998 to around 30 per cent and stays stable at around 20-30 per cent thereafter. This observation accords with anecdotal evidence of a NSW Police push to use the ATSI indicator more consistently.


At the time of writing, a person for whom there is no information recorded on their ATSI status by the police is treated as non-Indigenous for the purposes of comparative statistics calculated from the ROD (Snowball & Weatherburn 2006: 6). The argument of this current paper is that the unknown category should be interrogated to see if there is some unused information that would allow more robust and reliable comparative statistics to be calculated for both Indigenous and non-Indigenous populations. The simplest way to do this is to randomly allocate the unknown category between Indigenous and non-Indigenous groups, using a uniform statistical distribution, on the seemingly reasonable grounds that we do not know whether such people are more likely to be one or the other. In other words, random allocation assumes the same percentage of Indigenous and non-Indigenous persons in both the known and unknown populations.
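The difference between the two treatments of unknowns can be sketched as below. The sketch assumes that random allocation amounts, in expectation, to splitting the unknowns in the same proportion as the known population, as the preceding sentence states; the function names and counts are hypothetical.

```python
def share_unknowns_as_non_indigenous(atsi, non_atsi, unknown):
    """ATSI share when every unknown is counted as non-Indigenous."""
    return atsi / (atsi + non_atsi + unknown)


def share_unknowns_allocated(atsi, non_atsi, unknown):
    """ATSI share when unknowns are split like the known population."""
    known_share = atsi / (atsi + non_atsi)
    allocated_atsi = atsi + known_share * unknown
    return allocated_atsi / (atsi + non_atsi + unknown)


# Hypothetical counts with half of all records unknown.
atsi, non_atsi, unknown = 1_000, 9_000, 10_000

baseline = share_unknowns_as_non_indigenous(atsi, non_atsi, unknown)
allocated = share_unknowns_allocated(atsi, non_atsi, unknown)
wedge = allocated - baseline

print(f"{baseline:.1%} {allocated:.1%} {wedge:.1%}")  # 5.0% 10.0% 5.0%
```

Under this assumption the allocated share always equals the share among the known records, so the wedge between the two estimates grows with the proportion of unknowns.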

Figure 3 reports the results of this calculation. In 1994, random allocation produced a 15 percentage point higher estimate of the ATSI population than that which treated the unknowns as all being non-Indigenous. The size of this wedge is due to the high levels of unknowns. Note that the difference in the estimated rate of ATSI is greatly reduced by 1997 and seems to stabilise at just over two per cent of the population who appeared in court at some stage during a particular year. There will always be a wedge between these estimates as long as there is some non-response to the questions about ATSI status. Given that we do not have any reason for expecting that the proportion of ATSI is going to vary much in the short-run, Figure 3 would seem to add weight to our assertion that the data quality for ATSI status was not very good before 1997 and one should probably discount results generated from the ROD for that period.


Figure 4 compares the consolidated Indigenous identifier, both with and without the unknowns randomly allocated, against the ATSI indicator with unknowns randomly allocated. The consolidated identifier usually implies a higher predicted Indigenous population within the ROD than would be predicted using the percentage of ATSI estimated after the random allocation of unknowns. This is understandable since people who indicate they are ATSI in later (or previous) court appearances, but did not do so in the current year, are reclassified as Indigenous. However, once the unknown category is randomly allocated for both the ATSI indicator and the consolidated Indigenous indicator, there is no necessary reason why this should be the case. In practice, the only time when the allocated per cent of ATSI is greater than the per cent with consolidated Indigenous status is in 1995. It is likely that this was due to the large number of people with unknown ATSI status in the court data in that year.


In addition to providing the best guess of the true Indigenous population without any sophisticated statistical treatment, Figure 4 confirms that the proportion of people appearing in the ROD with some Indigenous identity is relatively stable after 1997 or 1998. This is consistent with the above suggestion that the quality of the ROD data with respect to Indigenous status is credible and reasonably robust after 1997.

Geographic Issues

Indigenous Australians are more likely to live in less accessible areas than other Australians, and hence geographic information may be a predictor of the true population of Indigenous offenders. Furthermore, many researchers argue that the processes of identification are fundamentally different in more remote environments (see Ross 1999). Postcode information is nominally available for the ROD. However, there are several problems when using it in the context of Indigenous Australians. The main issue is that there is an insufficient number of Indigenous Australians in the average Australian postcode for many statistical purposes (see Hunter 1996). Neighbourhood analyses of postcode data are sometimes attempted for the total Australian population, but the lack of credible population estimates for Indigenous Australians at this geographic level means that such analyses cannot be conducted for Indigenous communities.

The ABS draws boundaries and estimates the population distribution for other geographic levels of analysis. Local Government Areas (LGAs), which can be aggregated from Statistical Local Area (SLA) boundaries relatively easily, are also revised using the latest Census data and the changes are recorded in the hierarchical Australian Standard Geographic Classification (ASGC). The LGA boundaries are relatively large and stable over time, compared with postcodes, and hence are large enough to minimise measurement error. (3)

Another way to limit the error introduced by uncertainty over geographic boundaries is to use a robust indicator that captures broad spatial differences, such as the ABS's Accessibility/Remoteness Index of Australia (ARIA). One ARIA-style indicator, the Levels of Relative Isolation (LORI) index that was used in the Western Australian Aboriginal Child Health Survey (WAACHS), characterises the accessibility of local Indigenous communities (Zubrick et al. 2004). Aggregations of areas classified by LORI also provide a meaningful indication of the accessibility of services in an area and hence the LORI index values for SLAs are aggregated to LGAs for use in what follows.

Statistical Model of Unknowns

Figure 1 provides some evidence that individuals in the ROD with unknown ATSI status may be closer to the Indigenous population in the Census, at least in terms of their basic demographic age profile. However, this similarity may be a result of the distinctive nature of the court data and hence it is necessary to estimate the basic demographic profiles for those for whom we have direct information regarding ATSI status. At a minimum one would expect that the distinctive demographic profiles of males and females also be taken into account when trying to estimate whether unknown ATSI status profiles are closer to ATSI or other residents of NSW.

The following analysis uses a simple binary logistic framework to predict whether an individual with unknown status is ATSI or non-ATSI using a series of socio-demographic and geographic characteristics for their first court appearance in a particular year. (4) Logistic regressions are often used where the dependent variable has two possible values, zero or one--for example, a person can identify themselves as having either ATSI or non-ATSI status. To overcome the fact that this is a limited dependent variable, a logit transformation is used to ensure that the predicted probabilities lie between zero and one. The basic formulation of the binomial logistic regression model is

$$\operatorname{logit}(P_i) = \log\left(\frac{P_i}{1 - P_i}\right) = bX_i + e_i$$

where b is a coefficient vector, $X_i$ are the explanatory variables and $e_i$ are the error terms, which approximate a normal distribution. See Agresti (1984) and Hosmer and Lemeshow (2000) for fuller discussions. Logit P, also known as the log odds ratio, is the dependent variable in the logistic regression. The logistic regression models are estimated using maximum likelihood estimation techniques.

Often the coefficients of the binomial logistic model are interpreted using the log odds ratio. Hosmer and Lemeshow (2000) show that the log odds, or rather the natural log of the odds ratio, equals the individual coefficient of the respective variables. (5) When the explanatory variables are also categorical, the coefficients in a logistic model must be interpreted as relative to a reference person defined by the omitted categories of the respective groups of explanatory variables. The reference person, or base case, in the following analysis is a female, aged 25 to 34 years and living in a highly accessible SLA. Therefore, if the interest is in the effect of being male on the probability of being ATSI, then a negative coefficient implies that males are less likely to be identified as ATSI than females (that is, the odds ratio is less than one).
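The link between a coefficient, its odds ratio and the implied probabilities can be sketched as follows. The intercept and the coefficient on being male are hypothetical values picked for illustration; they are not estimates from the ROD.

```python
import math


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Map a log-odds value back to a probability between zero and one."""
    return 1 / (1 + math.exp(-x))


# Hypothetical values: the reference person (female, aged 25 to 34,
# highly accessible SLA) is given a 10 per cent chance of being ATSI,
# and the male coefficient is set so that its odds ratio is 0.5.
intercept = logit(0.10)
b_male = math.log(0.5)

odds_ratio_male = math.exp(b_male)
p_reference = inverse_logit(intercept)
p_male = inverse_logit(intercept + b_male)

print(round(odds_ratio_male, 3), round(p_reference, 3), round(p_male, 3))
```

A negative coefficient gives an odds ratio below one, so the predicted probability for an otherwise identical male falls below that of the reference person. Note that halving the odds does not exactly halve the probability.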

This regression model helped us to determine the differences between the ATSI and non-ATSI populations in the ROD data given that a person has some valid ATSI status identified for their records. Hunter and Ayyar (2009) report the odds ratios, and associated standard errors, for a basic set of demographic and geographic characteristics. (6) Males are about half as likely to be identified as ATSI as females. Also, in general the older a person is the less likely they are to identify as ATSI, with the youngest age group, 15 to 17 year olds, being around three to four times as likely to identify as ATSI compared to the base age group in all years of the analysis. Finally, increasing the level of remoteness is significantly (and substantially) associated with being more likely to identify as ATSI.

The next step in the analysis is to provide an 'out-of-sample' prediction of the proportion of ATSI for those with unknown ATSI status in the respective years. That is, we then ask whether the people with unknown ATSI status are more like the ATSI or the non-ATSI populations, and classify the person as such for the purposes of estimating the true population.

Before reporting the results we reflect briefly on the implications of the fact that the data used are highly grouped and hence there are relatively few cells from which to impute ATSI status. As a result the predicted probabilities for such cells will be classified as either ATSI or non-ATSI (depending on whether the probability of being ATSI is greater than 0.5). There is nothing particularly wrong with this procedure except that the small number of cells means that it will lead to less reliable estimates than would otherwise be the case. Consequently, we use an alternative method, whereby we first estimate the probability of being ATSI for each cell (that is, conditioned on relevant demographic and geographic characteristics), and then multiply by the number of those with unknown status in that cell. (7)
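The two imputation approaches contrasted above can be sketched side by side. The cells, predicted probabilities and counts are hypothetical; only the 0.5 cut-off and the idea of multiplying each cell's probability by its number of unknowns come from the text.

```python
# Hypothetical cells: (label, predicted P(ATSI), number with unknown status).
cells = [
    ("female, 15-17, remote", 0.60, 100),
    ("male, 25-34, accessible", 0.10, 1_000),
    ("male, 45+, accessible", 0.05, 400),
]


def threshold_allocation(cells, cutoff=0.5):
    """Classify a whole cell as ATSI when its probability exceeds the cut-off."""
    return sum(n for _, p, n in cells if p > cutoff)


def expected_allocation(cells):
    """Allocate the expected number of ATSI persons within each cell."""
    return sum(p * n for _, p, n in cells)


by_threshold = threshold_allocation(cells)
by_expectation = expected_allocation(cells)

print(by_threshold, round(by_expectation))  # 100 180
```

With so few cells the cut-off rule moves entire cells at once, whereas the expected-count approach lets low-probability but populous cells contribute their share of the estimate.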

Figure 5 reports the levels of reclassification of Indigenous status using the logistic and random allocation techniques after restricting the focus solely to those individuals for whom we have complete information on sex, age and geography. That is, it uses only the data included in the regression analysis.


When Figures 3-4 are compared with Figure 5 the estimated per cent of ATSI is higher in the latter after the random allocation procedure is performed on the unknown ATSI category. This is mainly because the latter was confined to those for whom all the relevant demographic and geographic data were available. For such data the proportion of people who identified as ATSI increased by around three percentage points, irrespective of the allocation of the unknown categories.

We surmise from the small size of the difference between the random allocation and logistic allocation of ATSI status that the diligent recording of demographic and geographic data by police for the courts is associated with more accuracy in recording ATSI status. Hence the auditing of records to improve the overall quality of demographic data will also improve the reliability of evidence for Indigenous over-representation in the criminal justice system.

As indicated above, the ROD includes a consolidated Indigenous identifier that takes into account previous identification patterns in NSW court data. The above analysis can also be conducted for the distribution of this indicator. Note that one would expect this indicator to deliver a higher level of Indigenous identification, with less likelihood of an undercount. In most circumstances it would be expected that this measure of Indigenous identification should be closer to the true Indigenous population within ROD. A priori, however, one cannot reject the hypothesis that changes in Indigenous status are occurring in an arbitrary fashion that is unrelated to an individual's true identity. Figure 5 illustrated that the extent of reclassification when using regression estimates is only slightly higher than that done when unknowns are randomly assigned (using a uniform distribution). Given that there are fewer unknowns to assign when the consolidated Indigenous identifier is used, it is reasonably certain that there will not be much difference between the randomly allocated and regression adjusted estimates of the proportion of Indigenous people in ROD. Another reason not to estimate the regression adjustments for the consolidated Indigenous identifier is that these are cross-sectional regressions and the ROD Indigenous status variable uses information across time. Therefore any regression adjustments are likely to be correlated over time, which would certainly induce less reliability into the resulting estimates and may induce some bias with more recent offenders having had less time for their consolidated Indigenous identifier to be updated or 'consolidated'. Notwithstanding this, as indicated above, the estimates based on the random allocation of unknowns for the consolidated Indigenous identifier are likely to provide a close approximation of our best guess for the true Indigenous population within NSW courts.

The DSE Methodology

The following uses the DSE to validate the above estimates of the Indigenous offenders appearing in court within the ROD data. The simplest DSE is a two-sample model. The first 'sample' identifies certain individuals who are returned to the population of all offenders after the 'survey' is complete, while the second 'sample' provides an independent measure of the population. Using the numbers of individuals in both samples and the numbers identified in just one sample, it is possible to estimate the number not captured in either sample, thus providing an estimate of the total population size. The assumptions required for such an estimate to be valid are that:

1. there is no change to the population during the investigation (that is, the population is closed);

2. individuals can be matched from one sample to the next;

3. the chance of being in each sample is uncorrelated for each individual; and

4. the two samples are independent.
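The two-sample logic can be sketched as follows. This section does not state the estimator's formula, so the sketch assumes the standard two-sample form, in which the total is estimated as the product of the two sample counts divided by the number matched in both. The function names and counts are hypothetical.

```python
def dual_system_estimate(n_first, n_second, n_both):
    """Estimated total population: n_first * n_second / n_both."""
    if n_both <= 0:
        raise ValueError("at least one matched individual is required")
    return n_first * n_second / n_both


def missed_by_both(n_first, n_second, n_both):
    """Estimated number of individuals captured in neither sample."""
    total = dual_system_estimate(n_first, n_second, n_both)
    observed = n_first + n_second - n_both  # seen in at least one sample
    return total - observed


# Hypothetical counts: 800 identified in the first sample, 600 in the
# second, and 480 matched in both.
total = dual_system_estimate(800, 600, 480)
missed = missed_by_both(800, 600, 480)

print(total, missed)  # 1000.0 80.0
```

The estimate relies on all four assumptions listed above. If the samples are positively correlated, the matched count is inflated and the estimated total is understated.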

Sekar and Deming (1949) were the first to adapt this method for human populations when they used it to estimate birth and death rates, and the extent of their registration in 1949, with hospital data from India. There is also a substantial literature going back to the 1940s, dealing with the application of the two-sample method to Census data (Fienberg 1992). By taking another sample in addition to the Census, the method can be used for estimating undercount by the Census (Hogan 1993).

In terms of the validity of assumptions for estimating the potential numbers of Indigenous Australians, it was necessary to confine our attention to closed populations. Even populations with high mobility, such as people in remote Indigenous communities, may be considered 'closed' so long as the PES or follow-up survey takes place shortly after the initial survey or Census (Paradies et al. 2000).

With respect to assumption (2), matching will depend on the quality of the records and the uniqueness of respondents' names. BOCSAR has given a detailed assurance that all due care is taken to match the court data for individuals by spending considerable time and resources constructing a unique person identifier.

Another of the assumptions required for DSEs to be valid is the homogeneity of the population (assumption 3 above). That is, all the Indigenous population should have the same chance of being sampled in the follow-up survey. This assumption is unlikely to be violated as it is not a choice for most offenders. Both the police and the court records report information on ATSI status, although there is obviously room for doubt about the certainty of the categories, as is evidenced by the existence of the unknown category for both 'surveys'.

The question of independence is discussed by Sekar and Deming (1949) in some detail (also see Marks et al. 1974). ROD directly uses police data on ATSI identification, but subsequent identification is arguably independent in that the names of offenders are often matched statistically on ROD because of the systematic difficulties encountered in ensuring that the record refers to the same person over time. However, if one does reject the assumption of independence on the grounds of correlation between 'samples' (that is, police remember individual offenders and are consistent in their identification and marking of records), then one should expect that the DSE provides a small or understated adjustment to the estimate of Indigenous offenders (because relatively few offenders will change status over time).

Another issue that can cause problems for the DSE methodology is that of coverage. If there are individuals who are not sampled in both samples, this results in potential upwards bias of the estimates (Shryock et al. 1976). For Census data and administrative data which in principle cover the whole population, this source of error should be relatively small.

The key to the DSE method is an ability to match individual records on some different criteria (that is, different to the one of immediate interest), and then check the observation of interest for consistency. In a two-outcome situation, such as a yes/no question, four potential outcomes occur, as illustrated in Table 1. First, the record can be 'yes' on both the initial and second surveys, designated by the cell $x_{11}$. Second, the record can be 'yes' on the first and 'no' on the second, designated by the cell $x_{12}$. Third, the record can be 'no' on the first and 'yes' on the second, denoted by cell $x_{21}$, and finally the record can be 'no' on both surveys, given by $x_{22}$. This method cannot, of course, pick up information that has been incorrectly recorded on both surveys (e.g. respondents answering 'yes' on both surveys when the true observation was 'no').
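The cross-classification described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the toy data are hypothetical, and each matched record is assumed to be available as a pair of yes/no responses (first survey, second survey).

```python
from collections import Counter


def cross_classify(pairs):
    """Tally matched (first, second) yes/no responses into four cells.

    True stands for 'yes' and False for 'no'. The keys follow the cell
    labels used in the text: x12 is 'yes' then 'no', x21 is 'no' then 'yes'.
    """
    counts = Counter(pairs)
    return {
        "x11": counts[(True, True)],
        "x12": counts[(True, False)],
        "x21": counts[(False, True)],
        "x22": counts[(False, False)],
    }


# Hypothetical matched records, one tuple per person.
matched = [(True, True), (True, False), (False, True),
           (False, False), (True, True)]
cells = cross_classify(matched)
print(cells)
```

Only records that can be matched across the two sources enter the tally, which is why the quality of the person identifier (assumption 2) matters so much for the method.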

Using the Sekar-Deming (1949) formula, the revised population estimate is:

$$\hat{N} = x_{11} + x_{12} + x_{21} + \frac{x_{12}\,x_{21}}{x_{11}} \tag{3}$$

If Table 1 refers to the response to a question about Indigenous status, then only $x_{22}$ people always deny they are Indigenous. Consequently, Hunter (1998) referred to the potential Indigenous population as being equal to $x_{11} + x_{12} + x_{21}$. The consolidated Indigenous identifier on ROD is closely related to this 'potential Indigenous population'. The main difference is that the ROD estimate is potentially based on repeated appearances in court and hence takes into account more than two 'surveys'. However, some of the $x_{22}$ people may also admit to being Indigenous in other circumstances (that is, if other similar independent surveys were conducted repeatedly). The fourth term on the right-hand side of equation 3 is the number expected to identify as Indigenous at least once if all surveys are 'independent' (in statistical terms). (8)
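A small numerical sketch may help fix ideas about equation 3. The function name and the cell counts below are hypothetical and are not drawn from ROD; the sketch simply evaluates the formula and separates the 'potential Indigenous population' from the fourth term.

```python
def sekar_deming(x11, x12, x21):
    """Evaluate equation 3 from three cell counts (illustrative only).

    Returns the 'potential' population (the first three terms), the
    fourth term (the estimate of those missed on both occasions) and
    their sum, the revised population estimate.
    """
    if x11 <= 0:
        raise ValueError("x11 must be positive for the estimator to exist")
    potential = x11 + x12 + x21
    missed = x12 * x21 / x11
    return potential, missed, potential + missed


# Hypothetical counts chosen only to make the arithmetic easy to follow.
potential, missed, n_hat = sekar_deming(x11=80, x12=20, x21=10)
print(potential, missed, n_hat)  # 110 2.5 112.5
```

Expanding the product shows that the same estimate can be written as $(x_{11} + x_{12})(x_{11} + x_{21})/x_{11}$, the two marginal 'yes' totals multiplied together and divided by the number answering 'yes' on both occasions. This form makes it clear why a large $x_{11}$ relative to the margins, as would arise if the two sources were positively correlated, produces only a small adjustment.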


Figure 6 reports the DSE of the percentage of people appearing in the ROD who are Indigenous between 1997 and 2006. The estimates are remarkably stable and we would argue that this approach provides the most accurate picture of the true population of Indigenous offenders in the ROD. It is probably not a coincidence that the DSE estimate based on any previous appearance is very similar to the estimates based on the consolidated Indigenous identifier (after random allocation) in 1997. Prior to that year, the high level of unknowns, due in part to poor data quality of court records, especially with respect to Indigenous status, led to unreliable estimates for both the consolidated Indigenous estimates and the DSEs. The DSE is particularly affected in those years because it is driven by the exceptionally small number of people specifically identifying as ATSI in the earlier years. (9)

Observant readers will note that there is a gradual decline in the proportion of people appearing in the ROD estimated to be Indigenous after the consolidated Indigenous identifier is allocated randomly. We suspect that this is due to the fact that the consolidated Indigenous identifier is less likely to assign an Indigenous status if there have been fewer years over which people could change their status. In a sense, the data are more affected by right-censoring when individuals only enter the court system for the first time in the years immediately leading up to 2006. The DSE seems to be less affected by this distortion because it is only defined for people who have appeared at least twice (that is, over several years).

Note that DSE estimates are always higher than the logistic regression adjusted estimates. One explanation for this is that the logistic adjustments only use information available in a particular year (that is, they are cross-sectional in nature) and do not use any of the information on changes in Indigenous status over time. Another reason why logistic estimates are lower than the other estimates is that DSE estimates are only defined for those offenders who go to court rather than for all offenders identified by police. That is, if Indigenous offenders are more likely to appear in court than other offenders, then one should expect the DSE estimates to be higher. Accordingly, it can be argued that the logistic adjusted estimates provide a conservative estimate of the Indigenous offenders in the NSW criminal court system.

In summary, the DSE estimates are more stable than the consolidated Indigenous identifier (after allocation) and the logistic adjusted estimates for each year, but there is very little difference in 1997. While the DSE estimates are always higher than the alternative techniques, the gap between estimates increases over time because of biases in the other techniques. Notwithstanding, it is not possible to discount the possibility that DSE over-enumerates the number of Indigenous offenders in the court system and hence a number of adjusted estimates can be justified.

Reflections on Knowing Something about the Unknowns

This paper argues that it is important to understand the processes that determine who is identified as, or rather chooses to identify as, Indigenous. If nothing else the size of Indigenous involvement in the criminal justice system may be severely underestimated if no attempt is made to establish or estimate the true identity of the large number of people with unknown ATSI status within the criminal justice system. It is unlikely that any auditing process of administrative data will entirely remove uncertainty because individuals are likely to have an incentive to not reveal their Indigenous status given widespread perceptions of discrimination against Indigenous people. In the presence of systematic undercounts, statistical procedures like those used in this paper are likely to be important for estimating the true Indigenous population of offenders.

One of the main implications of this paper is that Indigenous disadvantage may be understated if administrative data are not corrected to account for those who may at some later stage identify as Indigenous or whose Indigenous status is unrecorded or unknown. The secondary message is that data quality issues not only decrease the reliability of resulting estimates for Indigenous and other Australians, but also result in the potential for systematic biases which could affect conclusions about the size of Indigenous disadvantage and the ability of policy makers to 'close the gap'. These observations are particularly important for the administrative data collections reported in the Overcoming Indigenous Disadvantage Framework and for the policies that arise from such statistical reportage (for example, SCRGSP 2007).

Given that the trends in the adjusted estimates of the proportion of offenders who are Indigenous vary depending on which technique is used, governments should be particularly cautious about making policy on the basis of trends in administrative data on Indigenous Australians. Estimates of trends are more unreliable than is generally appreciated, as the assumptions underlying such trends are potentially problematic given the vagaries of the processes of Indigenous identification.

These conclusions are underscored by the significance of geographic factors in the processes that determine Indigenous status. The geographic distributions of ATSI status will be systematically biased with respect to the incidence of unknown status and the incidence of ATSI (given that ATSI status is 'known') as both are significantly correlated with accessibility of the local geographic area. In this case, inferences from the unadjusted ROD data about relative crime rates for Indigenous and non-Indigenous people will not be valid.

The reliability of measures of Indigenous disadvantage is further complicated by the need to estimate local ERPs for the Indigenous and other populations for use in the calculations of rates of offences in the respective populations. The calculation of accurate ERPs for the Indigenous population is itself hotly debated in academic and administrative areas (Taylor 1997), but the failure to use ERPs for the local Indigenous population will result in distorted pictures of Indigenous involvement in the criminal justice system. Given that the Indigenous ERPs usually experience higher undercounts than is evident for general ERPs, Indigenous rates are highly likely to be overstated by more than they are for the rest of the population (ABS 2007b). While some might argue that one should not worry too much at an aggregate level as Indigenous disadvantage is such a manifest problem, such distortions may disproportionately affect certain regions, and hence administrative data need to be as accurate and reliable as possible. However, a more accurate estimate of Indigenous offender populations could be achieved by alternative methods, including the allocation of unknowns or by using a DSE methodology. Such estimates should be combined with local ERP estimates for Indigenous and other Australians to ensure that policy to address relative offence rates is only based on valid empirical evidence.

When information on Indigenous undercount in court data and ERPs is taken into account, the over-representation of Indigenous offenders in the NSW criminal justice system more than doubles. The Indigenous rates of offence increase from 119 to 243 offenders in every 1,000 Indigenous residents (measured by the relevant ERPs for NSW). The non-Indigenous rates of offence do not change substantially as the Indigenous population is still an ethnic minority. In summary, the over-representation of Indigenous people in the criminal justice system increases from 5.1 (according to the methodology historically used) to 11.5 (according to the DSE methodology); that is, Indigenous people are almost 11.5 times more likely to be an offender than non-Indigenous people in the NSW court data. More importantly, the magnitude of Indigenous disadvantage in justice outcomes is clearly larger than has historically been appreciated.
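The arithmetic behind these figures can be checked with a short sketch. The non-Indigenous rates below are not reported in the text; they are back-calculated here on the assumption that each over-representation ratio is simply the Indigenous rate divided by the non-Indigenous rate, and the function name is hypothetical.

```python
def implied_comparison_rate(indigenous_rate, ratio):
    """Back out the non-Indigenous rate implied by a rate and a ratio."""
    return indigenous_rate / ratio


# Rates per 1,000 residents and ratios as quoted in the text.
historical_other = implied_comparison_rate(119, 5.1)  # about 23.3
dse_other = implied_comparison_rate(243, 11.5)        # about 21.1

rate_growth = 243 / 119    # Indigenous rate grows by a factor of about 2.04
ratio_growth = 11.5 / 5.1  # over-representation grows by a factor of about 2.25

print(round(historical_other, 1), round(dse_other, 1))
print(round(rate_growth, 2), round(ratio_growth, 2))
```

On these assumptions the implied non-Indigenous rate moves only from roughly 23 to roughly 21 per 1,000, which is consistent with the statement that the non-Indigenous rates do not change substantially while both the Indigenous rate and the over-representation ratio more than double.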

There are other methods for estimating offender populations which could be considered (Collins & Wilson 1990). However, such methods are only valid for estimating the unobserved Indigenous and non-Indigenous rates by estimating the Indigenous and other Australians who do not appear in the court or are identified in police records (using count data models). While such methods are invaluable for estimating consistent offender rates in the relevant populations, and obviate the need to generate consistent and comparable ERPs, the above analysis is justified solely as an exercise in validating the quality of the ROD Indigenous identifier, given that a person has been observed in the administrative data source (for example, the court system).

Much humorous comment has been made about Donald Rumsfeld's observation that it is intrinsically difficult to 'know the unknown' (Hunter & Ayyar 2009), but the cost of not attempting to understand the consequences of the category 'unknown' is likely to be particularly high given the potential to misallocate resources when attempting to design effective Indigenous policy.


ABS (Australian Bureau of Statistics) (2007a) Non-response Rates, AUST 2006 Usual Residence and Place of Enumeration, Cat. No. 2914.0.55.001.

ABS (Australian Bureau of Statistics) (2007b) Population Distribution, Aboriginal and Torres Strait Islander Australians 2006, Cat. No. 4705.0.

Agresti, A. (1984) Analysis of Ordinal Categorical Data, New York, Wiley.

Altman, J.C., Biddle, N. & Hunter, B.H. (2008) 'Prospects for "closing the gap" in socioeconomic outcomes for Indigenous Australians?', Australian Economic History Review, 49 (3), 225-51.

Brown, P., Callister, P., Carter, K. & Engler, R. (2010) 'Ethnic mobility: Is it important for research and policy analysis?', Policy Quarterly, 6 (3), 45-51.

Carter, K.N., Hayward, M., Blakely, T. & Shaw, C. (2009) 'How much and for whom does self-ethnicity change over time in New Zealand? Results from a longitudinal study', Social Policy Journal of New Zealand, 36, 32-45.

Collins, M.F. & Wilson, R.M. (1990) 'Automobile theft: Estimating the size of the criminal population', Journal of Quantitative Criminology, 6 (4), 395-409.

Fienberg, S.E. (1992) 'Bibliography on capture-recapture modeling with application to Census undercount adjustment', Survey Methodology, 18 (1), 143-54.

Greene, W.H. (2000) Econometric Analysis (4th edn), New Jersey, Prentice Hall.

Guimond, E. (1999) Ethnic Mobility and the Demographic Growth of Canada's Aboriginal Populations from 1986 to 1996, Ottawa, Cat. No. 91-209-XPE, Statistics Canada.

Hogan, H. (1993) 'The 1990 post-enumeration survey: Operations and results', Journal of the American Statistical Association, 88, 1047-60.

Hosmer, D. & Lemeshow, S. (2000) Applied Logistic Regression (2nd edn), New York, John Wiley & Sons.

Hunter, B.H. (1996) Indigenous Australians and the Socioeconomic Status of Urban Neighbourhoods, CAEPR Discussion Paper No. 106, Canberra, CAEPR, ANU.

Hunter, B.H. (1998) Assessing the Utility of 1996 Census Data on Indigenous Australians, CAEPR Discussion Paper No. 154, Canberra, CAEPR, ANU.

Hunter, B.H. & Ayyar, A. (2009) Some Reflections on the Quality of Administrative Data for Indigenous Australians: The Importance of Knowing Something about the Unknown(s), Canberra, CAEPR, ANU.

Hunter, B.H. & Dungey, M.H. (2006) 'Creating a sense of "CLOSURE": Providing confidence intervals on some recent estimates of indigenous populations', Canadian Studies in Population, 33 (1), 1-23.

Marks, E.S., Seltzer, W. & Krotki, K.J. (1974) Population Growth Estimates: A Handbook of Vital Statistics Measurement, New York, The Population Council.

Paradies, Y., Huppatz, S., Warnsey, J. & Barnes, T. (2000) 'Population and globalisation: Australia in the 21st century'. Paper delivered at the 10th Biennial Conference of the Australian Population Association, Melbourne.

Ross, K. (1999) Occasional Paper: Population Issues, Indigenous Australians, Cat. No. 4708.0. Canberra, Australian Bureau of Statistics.

Sekar, C.C. & Deming, W.E. (1949) 'On a method of estimating birth and death rates and extent of registration', Journal of the American Statistical Association, 44 (1), 101-15.

Shryock, H.S., Siegel, J.S. & Associates (1976) The Methods and Materials of Demography, London, Academic Press.

Snowball, L. & Weatherburn, D. (2006) 'Indigenous over-representation in prison: The role of offender characteristics', Crime and Justice Bulletin, 99 (September), 1-20.

SCRGSP (Steering Committee for the Review of Government Service Provision) (2003) Overcoming Indigenous Disadvantage: Key Indicators 2003 Report, Melbourne, Productivity Commission.

SCRGSP (Steering Committee for the Review of Government Service Provision) (2005) Overcoming Indigenous Disadvantage: Key Indicators 2005 Report, Melbourne, Productivity Commission.

SCRGSP (Steering Committee for the Review of Government Service Provision) (2007) Overcoming Indigenous Disadvantage: Key Indicators 2007 Report, Melbourne, Productivity Commission.

Taylor, J. (1997) 'The contemporary demography of Indigenous Australians', Journal of the Australian Population Association, 14 (1), 77-114.

Westbrooke, I. & Jones, L. (2002) 'Imputation of Maori Descent for Electoral Calculations in New Zealand', Australian and New Zealand Journal of Statistics, 44 (3), 257-65.

Zubrick, S.R., Lawrence, D.M., Silburn, S.R., Blair, E., Milroy, H., Wilkes, T. et al. (2004) The Western Australian Aboriginal Child Health Survey: The Health of Aboriginal Children & Young People, Perth, Telethon Institute for Child Health Research.


(1.) The Australian PES is an interviewer-based survey conducted three weeks after Census night which allows comparison of the responses in the Census and the PES to identify whether they have changed. A matched sample of those who responded to both the Census and the PES is used. It is also possible that the PES may pick up some uncounted population from the Census, as both samples are drawn from the population as a whole. Information is collected to determine whether persons have been missed or double counted in the Census and whether dwellings were missed. The PES collects personal information on Indigenous origin, age, sex, marital status and birthplace. Note that there are several differences between the Census and PES collections. For example, the Census question on Indigenous status is based on self-identification whereas the PES involves an interviewer. In addition, there were slight differences in the wording of the question. More importantly, the PES question is asked of the entire household whereas the Census is asked of each person individually.

(2.) Notwithstanding, from July 2003, Police stopped asking ATSI status for alleged offenders in traffic offences (as opposed to serious driving matters). This change in procedure and the resultant increase in unknown ATSI status for a group that comprised a significant proportion of all court data was the reason why BOCSAR implemented the "ever identified ATSI" variable in ROD.

(3.) In contrast to postal areas, LGA boundaries are relatively slow to change over time. LGA-level ROD data are estimated by BOCSAR.

(4.) It could be argued that the analysis of the unknown category in ROD should investigate all the information available on the unknown category that is associated with Indigenous status. For example, the incidence of being recorded as unknown Indigenous status seems highly associated with offence type and the propensity to offend. The inclusion of such factors may ultimately enhance the specification, but it is also possible that the processes for recording Indigenous status are correlated with the processes for recording offence type (especially driving offences) and the statistical processes that drive the propensity to offend. In econometrics this would raise the possibility of simultaneous equation bias (Greene 2000: 710). In an attempt to avoid such issues, this paper uses an extremely parsimonious specification that should be accurate on average. A more sophisticated analysis might be able to discount the possibility of simultaneity bias and hence could confidently use additional information on offence type and propensity to offend.

(5.) See Hosmer and Lemeshow (2000) for details of the interpretation of these ratios.

(6.) Concordance statistics were estimated to provide an indication of the adequacy of the models for prediction in the respective years (available from the authors on request). The statistic gives the percentage of all possible pairs of cases in which the model assigns a higher probability to a correct case than to an incorrect case. Hosmer and Lemeshow (2000: 162) provide guidelines for the concordance statistic, which indicate that any value over 0.7 is evidence that the model is adequate. All reported results are based on adequate models according to this criterion.

(7.) While this should provide a reliable and robust estimate, it is rather difficult to estimate the standard error for the overall estimates. This does not matter excessively since this exercise is designed to illustrate the potential importance of the issue. However, given the large number of unknowns and the associated low standard errors for the estimates, it is anticipated that this estimator provides a highly accurate estimate of the number of unknowns re-classified as ATSI.

(8.) The variance of N can be estimated using the standard binomial approach (see Sekar & Deming 1949).

(9.) In the context of this DSE, the process of predicting whether unknowns could be assigned to ATSI or non-ATSI is more complex than for the cross-sectional estimates reported above. Assigning unknowns in the DSE estimates requires a more sophisticated technique and hence is left for another paper.

Table 1: A two-outcome example of DSE methodology

                      Response A
Response B    Yes                  No                   Total
Yes           $x_{11}$             $x_{12}$             $x_{11} + x_{12}$
No            $x_{21}$             $x_{22}$             $x_{21} + x_{22}$
Total         $x_{11} + x_{21}$    $x_{12} + x_{22}$    $x_{11} + x_{12} + x_{21} + x_{22}$

COPYRIGHT 2011 Australian Council of Social Service
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2011 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Hunter, Boyd; Ayyar, Aarthi
Publication:Australian Journal of Social Issues
Article Type:Report
Geographic Code:8AUST
Date:Jul 9, 2011