
An Analysis of Inadvertent Data Disclosure Incidents, 2005-2017.


End-user error remains a very serious security problem. Inadvertent data disclosures account for a large percentage of end-user error. This article reports on a detailed analysis of the unintended data disclosures reported in the Privacy Rights Clearinghouse's comprehensive database of data breaches since 2005. This database contains descriptions of more than 5400 data breaches that have occurred in the United States in that time span. This article identifies the most common modalities of breach by organization type and provides guidance on particulars of how breaches occur. These data are important because they elucidate the nature and scope of the problem across organization type and they provide details that can yield highly targeted training programs to ameliorate unintended data disclosures.


unintended disclosure, root cause, end user error, training, data breach reporting, organization type


End-user error remains a significant problem area in cybersecurity. End-user error stems from social engineering attacks in which people are duped into taking dangerous actions, but it also arises from simple mistakes with no real precipitating root cause other than lack of attention or being rushed to complete a task. Unintended disclosure of personal data carries important potential consequences including monetization or weaponization of the data.

End-user training programs are often one-size-fits-all for the large majority of employees of an organization. The Payment Card Industry (PCI) has created a document citing best practices for end user security training programs [1]. That report specifies that it is not enough to have standard, basic cybersecurity training programs, but rather that tailored training programs are necessary in all organizations that hold sensitive personal data.

Best practices cited by PCI include basic security awareness training for all employees, specialized training for those who routinely handle sensitive information, and highly tailored training for certain groups such as system administrators and security managers. The PCI best practices report does not provide specifics regarding content for such training programs.

The purpose of the current work is to gain greater insight into the nature of end user errors as mediating factors in data breaches. This article will focus on unintended disclosure of sensitive information as reported in the Privacy Rights Clearinghouse database of data breaches [2], a repository describing more than 5400 data breaches that have occurred between 2005 and the present. This paper presents data on breaches by organization type and the various forms that the breaches took. The paper will also provide some insight on breach reporting requirements in the United States.


IBM's 2016 Cyber Security Intelligence Index [3] contained many interesting statistics on sources of cyber security attacks. The report cited a 5% increase in attacks coming from inside organizations (from 55% to 60%). Of the 55% in 2014, 31.5% were deliberate and 23.5% involved human error. In 2016, the percentage of error-mediated attacks decreased, but still accounted for 15.5% of all attacks. IBM's report showed an overall increase in insider-mediated attacks with a slight shift toward malicious attacks from those caused by end user error. Still, since the absolute number of attacks overall increased significantly, the absolute number of end user error mediated attacks increased as well.

Howarth [4] describes a range of human errors involving people inside organizations who do dangerous things either accidentally or deliberately. Major categories of human error include the inadvertent exposure of sensitive data, creating conditions that allow the introduction of malware into mission-critical systems, and creating conditions that allow theft of intellectual property or sensitive information.

Howarth concludes that organizations implementing strong technological security procedures still often pay insufficient attention to human sources of vulnerability, including errors made by end users. He strongly advocates for enhanced security training to decrease human error. Armerding [5] cites a report that indicates that as of 2014, 56% of workers who use the Internet on their jobs received no security training at all. While malicious insiders remain a significant threat to cyber security, it is clear that enormous problems arise from people with no malicious intent performing dangerous behaviors or being tricked into compromising sensitive information. Of the various sources of end-user error, unintended disclosures, while common, are potentially among the most preventable.

Furman, Theofanos, Choong and Stanton [6] found that end users tended to be aware of and concerned with cybersecurity but they lacked comprehensive understanding of existing threats and of how to protect themselves. When asked to define terms pertaining to security threats such as key logger, spoofing, virus, botnet, etc., participants in Furman's study professed familiarity with the terms but often failed to define the terms correctly. Obviously, without knowing precisely what these threats mean, the participants lacked necessary knowledge of how to counter them. While potentially very interesting, the study did not go into participant knowledge of prevention strategies at all.

Ruiz et al. [7] state that inadvertent data disclosures can sometimes occur when end users have some expectation of privacy, as when browsing in a private browsing mode. They enumerate four different potential gatherers of information in the browsing chain that could compromise privacy and cause unintended disclosures. One might consider this situation to be another example of human error, although this time the error lies in trusting software rather than in outright dangerous behaviors or negligence.


The Privacy Rights Clearinghouse (PRC) has been in existence since 1992. Its mission has been to collect data that are relevant to privacy concerns in the United States. This section contains a description of the scope of the PRC database of data breaches. In the following sections, the organization of the data and options for retrieval are described. Additionally, a discussion of the quality of the data is presented along with a discussion of the highly variable state-by-state reporting requirements that contribute to wide variations in the quality of entries in databases such as this one.

3.1 Organization of the Data

The database enumerates breaches that occurred from 2005 to the present. It documents 5400+ data breach incidents in that time period and conservatively places the total number of records breached at 910,600,000. This estimate is highly conservative because the number of records breached is unknown in a large percentage of cases. The website reports on the data breaches made public in states and territories within the United States that have data breach notice laws. No international statistics are documented in this collection.

The database site allows querying by checking types of organization, breach type of interest, and year or years of interest. The taxonomy of organization types used by the PRC is as follows:

* BSF--Financial and Insurance Businesses (except Medical Insurance)

* BSR--Retail Businesses including online retail

* BSO--Other Businesses

* EDU--Educational Institutions

* GOV--Government including Military

* MED--Healthcare, Providers and Medical Insurance Companies

* NGO--Non-governmental Organizations

Breach types include payment card fraud, hacking or malware attacks, malicious insider attacks, physical loss of data records, loss of portable devices, losses on stationary devices, unintended disclosure of electronic or physical data, and "other."

The major categories of breach type and search words used to query are these:

* CARD--Payment Card Fraud

* HACK--Hacking or malware

* INSD--Malicious insider

* PHYS--Physical loss including paper documents

* PORT--Loss of portable devices including laptop, smartphone, jump drive, CD, etc.

* STAT--Losses on stationary devices, computers or servers

* DISC--Unintended disclosure not involving hacking or intentional breach (analyzed here)

* Unknown--no indication of how the breach occurred.

3.2 Quality of Data in the Database

The database contains fields for the date on which the breach was made public, the organization involved, category of breach, type of organization (both from the lists above), total records breached, grand total records that might have been breached, a narrative description of each breach, sometimes including updates, and the source of the information. Data comes from a variety of reporting entities including Dataloss DB, the California Attorney General's office, general media, and many others.

Not surprisingly, the data is of highly variable quality. In many cases, the number of records breached is estimated; often the number is unknown. For example, of the 984 cases analyzed in the unintentional disclosure set, 279 cases (28%) involved an unknown number of records. A significant number of cases involving medical records report an unknown number of records and give no particulars regarding how the breach occurred.

In some cases, the total number of records breached is reported but without details regarding the amount of unique sensitive information such as number of unique SSNs. Sometimes the narratives provide ranges such as "between 5,600 and 23,000 patients were affected." In some cases, a total is given as "estimated that more than half exposed social security numbers." Some of the narratives are vague regarding whether physical or electronic data was compromised.

3.3 Data Breach Reporting Requirements

The National Conference of State Legislatures publishes state reporting guidelines for data breaches in the United States [8]. Currently 48 states, the District of Columbia, Guam, Puerto Rico and the Virgin Islands have reporting requirements. Alabama and South Dakota do not. The laws typically specify a taxonomy of organizations and reporting requirements by organization type, specific definitions of what constitutes personal information, what type of event meets criteria as a breach, the type of notice that must be given, and exemptions from reporting, if any.

A typical reporting policy is long, technical, and contains significant legalese. It is possible to go directly to self-contained data breach laws of some states (for instance, in Alaska and Arizona), but it is necessary to go into several different sections of the state codes to see all relevant law in others (for instance, California, Georgia, and Florida).

A consequence of the widely varying requirements for breach reporting is that estimates of the number of records that have been exposed vary widely, with an unknown number being cited in many reports. Additionally, some reports contain detailed narratives regarding what happened, and others provide only sketchy accounts. It seems obvious that data breach reporting should be mandatory, standardized, and detailed. Holm and Mackenzie [9] provide a good discussion of costs and benefits of data breach reporting and of the criteria used to determine when a report must be made. They state that compromised personal data, whether obtained through unintended disclosure or other means, is damaging because of its subsequent use in identity crime. They conclude by stating that mandatory breach notification laws can be expected to have a favorable impact on identity crime.


The current study was conducted in the context of end-user error. The focus is on unintended disclosures of sensitive information. The goal is to provide informative statistical summaries of these disclosures and detailed accounts of how they occurred, in order to inform highly targeted training programs for end users. The following sections contain descriptions of methods and results of the study.

4.1 Methods

The dataset was extracted from the Privacy Rights Clearinghouse database by issuing the following query:

* Breach Type: DISC

* Organization Type: BSF, BSO, BSR, EDU, GOV, MED, NGO

* Year(s) of Breach: 2005 - 2017

* Company or Organization: all

Results of the query were:

* Breaches made public fitting these criteria: 984

* Records total: 215,917,142

After issuing the query, the query result dataset is downloadable as a .csv file. The study started with an inductive method of determining categories of root causes of the disclosures. Each of the 984 case narratives was evaluated and the cause or the means through which the disclosure occurred was noted. After this initial analysis, the various individual causes were aggregated into 11 categories of root causes. The categories (with cases-in-category/total) are:

* An application error such as a bug in an app or an unprotected publicly accessible database (60/984)

* A disclosure through email, typically via an attachment (132/984)

* Posting to a website or sharing a file on the internet (420/984)

* An error in a mailing through regular mail (126/984)

* A login deficiency such as a legitimate user logging in but having improper access to sensitive data (23/984)

* Physical data disclosed such as improperly discarded printed records or lost physical media (32/984)

* An accidental improper sale or transfer of data (64/984)

* An error faxing data (5/984)

* An error in producing printed material (9/984)

* Unknown error (103/984)

* Other, a catchall for any not fitting the other categories (7/984)

A single-character code was established for each of the identified root cause categories (e.g., e = email, r = regular mail, etc.), and an additional column was added to the file containing the case descriptions to indicate the cause category for each case. The cases were evaluated again and the cause category coded for each one. Once coded in this fashion, the number of cases that fell into each category could be determined.

After the coding, the file was sorted by organization type and the data in the column containing the number of records that were disclosed was extracted for each organization type and placed in its own column in the spreadsheet. This action facilitated counting the number of cases for which the number of records disclosed was known. Results of these analyses are presented in the next section.
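The coding and counting steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure actually used in the study: the column names (org_type, cause_code) and the single-character codes other than e and r are hypothetical, and the sample rows are invented placeholders rather than entries from the PRC file.

```python
import csv
import io
from collections import Counter

# Hypothetical coded file: the real export has more columns, and these
# rows are placeholders, not actual PRC entries.
SAMPLE_CSV = """org_type,cause_code
EDU,w
EDU,e
MED,r
MED,u
BSF,r
GOV,w
"""

# Assumed single-character codes; only e and r are named in the text.
CAUSE_LABELS = {
    "w": "website/file sharing",
    "e": "email",
    "r": "regular mail",
    "u": "unknown",
}


def tally_causes(csv_text):
    """Count cases per root-cause code and per (org type, code) pair."""
    by_cause = Counter()
    by_org_and_cause = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_cause[row["cause_code"]] += 1
        by_org_and_cause[(row["org_type"], row["cause_code"])] += 1
    return by_cause, by_org_and_cause


by_cause, by_org_and_cause = tally_causes(SAMPLE_CSV)
for code, count in by_cause.most_common():
    print(f"{CAUSE_LABELS[code]}: {count}")
```

Keeping a second counter keyed by organization type and cause code gives the per-organization breakdown without re-sorting the file.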

4.2 Results

Table 1 presents data on the percentage of cases by organization type in which the number of records disclosed was known. It can be seen in Table 1 that financial services and retail businesses had the highest percentage of cases for which the number of disclosed records was unknown, at 57% each. Among non-business organizations with a substantial number of cases, the highest percentage was 25% for governmental organizations (non-governmental organizations reached 36%, but on only 11 cases). These data indicate that businesses were significantly more likely than non-business concerns not to know the number of records breached.
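The comparison between business and non-business organizations can be checked directly from the counts in Table 1. The sketch below pools the three business categories and the four remaining categories; the pooling is an illustrative choice and is not a calculation reported in the article.

```python
# (cases with unknown record count, total cases) per organization type,
# copied from Table 1.
TABLE_1 = {
    "BSF": (64, 112),
    "BSO": (48, 100),
    "BSR": (28, 49),
    "EDU": (35, 227),
    "GOV": (54, 214),
    "MED": (44, 270),
    "NGO": (4, 11),
}
BUSINESS = {"BSF", "BSO", "BSR"}


def pooled_unknown_rate(org_types):
    """Share of cases with an unknown record count across the given types."""
    unknown = sum(TABLE_1[t][0] for t in org_types)
    total = sum(TABLE_1[t][1] for t in org_types)
    return unknown / total


business_rate = pooled_unknown_rate(BUSINESS)
other_rate = pooled_unknown_rate(set(TABLE_1) - BUSINESS)
print(f"business: {business_rate:.0%}, non-business: {other_rate:.0%}")
```

Pooled this way, roughly 54% of business cases (140 of 261) lack a record count, against roughly 19% of non-business cases (137 of 722).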

Table 2 contains aggregated data on the most common root causes of unintended disclosure. The top three causes (website, email and regular mail) enumerated in Table 2 account for 69% of all disclosures and the top six causes account for almost 92%. As would be expected, there are significant differences in frequency among the 11 root cause categories, and such data have clear implications for highly targeted training programs to prevent such disclosures.
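The cumulative shares quoted above follow from the case counts in Table 2 and the 984-case total. A short sketch of the arithmetic, with the category labels paraphrased from the list in Section 4.1:

```python
TOTAL_CASES = 984

# Case counts for the six most frequent root-cause categories (Table 2).
CAUSES = [
    ("website/file sharing", 420),
    ("email", 132),
    ("regular mail", 126),
    ("unknown", 103),
    ("improper sale or transfer", 64),
    ("application error", 60),
]


def cumulative_share(causes, top_n, total=TOTAL_CASES):
    """Fraction of all cases covered by the top_n most frequent causes."""
    ranked = sorted(causes, key=lambda item: item[1], reverse=True)
    return sum(count for _, count in ranked[:top_n]) / total


print(f"top 3: {cumulative_share(CAUSES, 3):.1%}")
print(f"top 6: {cumulative_share(CAUSES, 6):.1%}")
```

The top three causes cover 678 of 984 cases and the top six cover 905 of 984, which round to the 69% and 92% figures in the text.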

Another interesting finding regarding the most common root cause of a disclosure is that 89 of the 103 unclassifiable (unknown) root cause cases were in the MED category - healthcare organizations. This meant healthcare organizations, which had the largest absolute number of unintended disclosures of all organization categories, also had by far the most cases for which the root cause of the disclosure was unknown.

Without healthcare's contribution to this category, only 14 of the 984 cases would have had unknown causes.

Table 3 provides a detailed breakdown by organization type of the top three causes of disclosures. Email errors were consistent across organization types, but inadvertent release by website or file sharing varied substantially: educational organizations released sensitive information this way more frequently than other organization types did. A chi-square test for goodness of fit performed on these data (n = 420, df = 6) was significant at p < 0.01. Regular mail disclosures were higher in the financial services category than in the other categories; a chi-square test for goodness of fit performed on these data (n = 126, df = 6) was significant at p < 0.05.
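The article does not state which expected distribution the goodness-of-fit tests used. One plausible reading, assumed here, is that each organization type is expected to contribute cases in proportion to its share of all disclosure cases (the Total column of Table 1). Under that assumption the statistic can be computed by hand and compared against the standard chi-square critical values for df = 6 (12.592 at the 0.05 level, 16.812 at the 0.01 level). The function name and data layout below are illustrative only.

```python
# Per-organization case totals (Table 1) and observed counts (Table 3).
ORG_TOTALS = {"BSF": 112, "BSO": 100, "BSR": 49, "EDU": 227,
              "GOV": 214, "MED": 270, "NGO": 11}
OBSERVED = {
    "email": {"BSF": 19, "BSO": 15, "BSR": 6, "EDU": 36,
              "GOV": 27, "MED": 28, "NGO": 1},
    "web": {"BSF": 41, "BSO": 40, "BSR": 15, "EDU": 136,
            "GOV": 104, "MED": 79, "NGO": 5},
    "mail": {"BSF": 26, "BSO": 8, "BSR": 5, "EDU": 22,
             "GOV": 25, "MED": 37, "NGO": 3},
}
# Standard chi-square critical values for 6 degrees of freedom.
CRITICAL_DF6 = {0.05: 12.592, 0.01: 16.812}


def chi_square_gof(observed, weights):
    """Pearson statistic with expected counts proportional to weights.

    ASSUMPTION: proportional expectation; the article does not say
    which expected distribution was used.
    """
    n = sum(observed.values())
    weight_sum = sum(weights.values())
    stat = 0.0
    for org, obs in observed.items():
        expected = n * weights[org] / weight_sum
        stat += (obs - expected) ** 2 / expected
    return stat


stats = {cause: chi_square_gof(counts, ORG_TOTALS)
         for cause, counts in OBSERVED.items()}
for cause, stat in stats.items():
    print(f"{cause}: chi2 = {stat:.2f}")
```

Under this assumption the web/file-sharing statistic exceeds the 0.01 critical value and the regular-mail statistic falls between the 0.05 and 0.01 critical values, which is consistent with the significance levels reported above, while the email statistic stays below both. The expected count for the NGO category is small, so the chi-square approximation is rough for that cell.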


These data provide clear guidance for security managers who are responsible for the establishment or maintenance of cybersecurity awareness training programs. Uploading documents to websites or file sharing sites was by far the most common root cause of unintended disclosures. The second and third most common causes were errors in sending emails containing sensitive data and errors in regular mailings.

Website/file sharing disclosures usually entailed variations on placing sensitive data on a site that was accessible by the public, placing sensitive data correctly on a secured site but including other data that should not have been shared with users having access to that site, and having a private server reconfigured and made public.

Email mistakes took the form of simply sending a document to the wrong recipient, sending a groupmail to a combination of legitimate and unauthorized recipients, and sending legitimate sensitive information to a legitimate recipient, but having additional information in the document that should not have been disclosed.

Regular mail mistakes took many forms, but a surprising number of them involved having sensitive information such as social security numbers printed on the mailing labels on the outside of the letter. Additionally, erroneous mailings included sending data on multiple people in a mailing that should have only had information on a single individual (such as school grades), and mailing sensitive information to the wrong address. Large group mailings in both email and regular mail appear to be particularly error-prone.

It is troubling that the three categories of business concerns have a significantly larger percentage of cases in which the number of records that was disclosed is not known. Results of the current study would suggest that businesses particularly need better forensics following data breaches.

The healthcare field has some recurring problems that suggest a strong need for remediation. The healthcare field had the largest absolute number of unintended disclosure cases in this study (270) in the period from 2005 to present. The IBM 2016 Cyber Security Intelligence Index [3] indicates that more than 100 million healthcare records were breached in 2015, more than any other organizational category. IBM's result independently corroborates findings from this study. It is also troubling that the healthcare field has such poor reporting on root causes, with 33% of its cases in that time frame not citing a root cause. This statistic again suggests the need for improved forensics. Concerns with healthcare security are significant in the United States because of the privacy laws that have been enacted in HIPAA [10].

Kapis and Kambo [11] describe some of the special issues with privacy in the healthcare field. It is likely that disclosures through websites are not as significant a problem for healthcare organizations, but with increasing computerization of healthcare data in patient portals, that could change. Unintended disclosures through regular mail are likely a significant problem. Kapis and Kambo suggest ways to improve data security in the healthcare field.


Unintended disclosure of sensitive information is a serious security problem. Findings in this work provide guidance for highly targeted training to ameliorate the most common causes of unintended disclosures. This article cites the top root causes of unintended disclosures, particulars regarding the most common forms that those errors take, and problematic levels of failure in data breach forensics in some categories of organizations. This work also describes the variability in data breach reporting requirements in the United States. These inconsistencies and deficiencies in reporting requirements have negative impacts on the quality of data in studies of this type and on the prevention of further cybercrime.


[1] The PCI Security Standards Council. Best Practices for Implementing a Security Awareness Program. Online. Available:

[2] Privacy Rights Clearinghouse. Data Breaches Database. Online. Available:

[3] IBM Security Services 2016 Cyber Security Intelligence Index. Online. Available:

[4] F. Howarth. The Role of Human Error in Successful Security Attacks. Online. Available:

[5] T. Armerding. Security training is lacking: Here are tips on how to do it better. Online. Available:

[6] Furman, S., Theofanos, M. F., Choong, Y.-Y., and Stanton, B. Basing Cybersecurity Training on User Perceptions. IEEE Security and Privacy 10 (2). pp 40-49. (2011). DOI: 10.1109/MSP.2011.180

[7] Ruiz, R., Amatte, F. P., Brandini, K. J., and Winter, R. Overconfidence: Personal Behaviors Regarding Privacy that Allows the Leakage of Information in Private Browsing Mode. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 4 (3). pp 404-416. (2015). ISSN: 2305-0012.

[8] National Conference of State Legislatures. Security Breach Notification Laws. Online. Available:

[9] Holm, E., and Mackenzie, G. The Significance of Mandatory Data Breach Warnings to Identity Crime. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 3 (3). pp 141-152. (2014). ISSN: 2305-0012.

[10] Health and Human Services, HIPAA. Health Information Privacy. Online. Available:

[11] Kapis, K., and Kambo, E. Enhancing the Security of Electronic Medical Records Using Forward Secure Secret Key Encryption Scheme. International Journal of Cyber-Security and Digital Forensics (IJCSDF) 5 (3). pp 132-141. (2016). ISSN: 2305-0012.

John W. Coffey

Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA

Table 1. Percentage of data breaches by organization type for which the
number of compromised records was known.

Org   Unknown  Known    Total  Percent
Type  Records  Records         Unknown
BSF   64        48    112    57%
BSO   48        52    100    48%
BSR   28        21     49    57%
EDU   35       191    227    15%
GOV   54       160    214    25%
MED   44       226    270    16%
NGO    4         7     11    36%

Table 2. The top root cause categories for unintended data disclosures.

Root Cause Category        Number of Cases  Percentage of Cases
Website/File Sharing       420              43%
Email Error                132              13%
Regular Mail               126              13%
Unknown Cause              103              10%
Improper Sale or Transfer  64               7%
Error in an Application    60               6%
Totals                     905              92%

Table 3. Absolute numbers and percentages of unintended disclosures due
to email, Web posting/file sharing, and regular mail.

Org Type  Email  Email Pct  Web/Sharing  Web/Sharing Pct  Mail  Mail Pct  Combined Pct
BSF       19     17%        41           37%              26    23%       77%
BSO       15     15%        40           40%              8     8%        63%
BSR       6      12%        15           31%              5     10%       53%
EDU       36     16%        136          60%              22    10%       85%
GOV       27     13%        104          49%              25    12%       73%
MED       28     10%        79           29%              37    14%       53%
NGO       1      9%         5            45%              3     27%       82%

COPYRIGHT 2017 The Society of Digital Information and Wireless Communications
No portion of this article can be reproduced without the express written permission from the copyright holder.

Article Details
Author:Coffey, John W.
Publication:International Journal of Cyber-Security and Digital Forensics
Article Type:Report
Date:Apr 1, 2017