Using Machine Learning to Aid the Interpretation of Urine Steroid Profiles.

In its simplest form, the clinical interpretation of laboratory test results requires the integration of discrete numerical values taken in the context of population-based reference intervals, clinical presentation, and medical knowledge. Testing procedures that produce multiple analyte measurements within a single test represent an additional level of complexity. An example of this is urine steroid profiling (USP), a test used for the diagnosis and monitoring of disorders of steroidogenesis and adrenal pathologies. The interpretation of USPs requires the integration of multiple discrete steroid measurements in the context of specific diagnostic questions. This requires a high degree of specialist clinical and technical knowledge to produce a subjective diagnostic judgment. Such tests are, therefore, largely inaccessible to many laboratories and are hindered by interpretative subjectivity and the resulting variability between practitioners and laboratories (1). As analytical technologies advance, multivariate ("profiling") diagnostics will become more commonplace in clinical laboratories (2). As such, clinical laboratories should be motivated to adopt new and better ways of handling, and adding value to, the multivariate data they produce.

Clinical decision support (CDS) systems are useful tools for aiding test result interpretation and reducing interpretative subjectivity and inconsistency (3). Although most CDS systems rely on rule-based approaches built on previously established knowledge, CDS systems can also be constructed with learning algorithms. Many of these learning algorithms have become increasingly accessible, rendering their application to clinical data more tractable. Indeed, several such applications have been reported in the field of clinical chemistry (4-10), including a recent study demonstrating the utility of machine learning (ML) for the separation of prostate cancer, benign prostate hypertrophy, and controls (11). A learning algorithm-based decision support system has not yet been applied, however, to the profiling data routinely produced by clinical chemistry laboratories, although this has been previously proposed in the context of USPs (12).

ML focuses on the data-driven recognition of patterns within large data sets. As such, learning algorithms allow the "machine" to learn and recognize patterns without the need for the explicit programming of previously established rules and mathematical relationships by an expert operator. A plethora of supervised ML algorithms, each with their unique strengths and weaknesses, have been developed and applied in a variety of contexts (13-15). Each algorithm, however, operates on the same fundamental principles: (a) the construction of a model from a representative "training" data set and (b) using the derived model to make predictions on new unseen data of the same structure.

The aim of this study was to detail the training and assessment of several ML algorithms within a routine clinical laboratory environment and, using USPs as a model data set, demonstrate the algorithms' utility for the interpretation of complex biochemical data. Taken together, we present a proof-of-concept study for the application and utility of ML-based CDS systems to support routine interpretation of biochemical profiling data within the clinical laboratory.

Materials and Methods


This study was based on a data set of USPs compiled during routine clinical service provision between June 2012 and October 2016. Each profile consisted of up to 45 different features, including steroid metabolites quantified by GC-MS and demographic data. Extraction of anonymized data from the laboratory information management system and its processing and classifier derivation were retrospectively performed as described below. This was done in 2 stages: (a) as a binary classifier ("No significant abnormality" vs "?Abnormal") and (b) as a disease-specific classifier. The utility of the derived interpretation support tool was then further validated with external quality assurance (EQA) samples.


All USP data processed during routine clinical practice between June 2012 and October 2016 were extracted from the laboratory information management system by means of an integrated query function (4619 profiles). These data were imported into and processed within the R statistical computing environment (version 3.4.1). The data were preprocessed and cleaned as follows: (a) missing values, representing an undetected steroid, and values reported as "Undetected," "Not detected," or "None" were replaced with zeros; (b) values reported as "Present," "Detected," or "Trace" were replaced with an arbitrarily low value of 10 µg/24 h; (c) values reported as >x or <x were replaced with the value of x; (d) accurate ages were retrospectively calculated from the sample date and the patient's date of birth; and (e) random urine samples and those with a collection time recorded as <24 h were excluded from the analysis (2814 profiles).
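Cleaning rules (a) to (c) can be sketched in a few lines. The study performed this step in R; the Python below is only an illustration of the stated rules, and the names `clean_value` and `TRACE_VALUE`, together with the example profile values, are hypothetical.

```python
from typing import Optional

# Placeholder for qualitative "present" results: 10 µg/24 h, as in rule (b).
TRACE_VALUE = 10.0

NOT_DETECTED = {"undetected", "not detected", "none"}
DETECTED = {"present", "detected", "trace"}


def clean_value(raw: Optional[str]) -> float:
    """Map one reported steroid result to a number using rules (a)-(c)."""
    if raw is None or raw.strip() == "":
        return 0.0                      # (a) missing value: steroid not detected
    text = raw.strip()
    lowered = text.lower()
    if lowered in NOT_DETECTED:
        return 0.0                      # (a) explicit "not detected" wording
    if lowered in DETECTED:
        return TRACE_VALUE              # (b) qualitative "present" result
    if text[0] in "<>":
        return float(text[1:])          # (c) censored value: keep the limit x
    return float(text)                  # ordinary numeric result


# Made-up example profile (values are illustrative only).
profile = {"androsterone": "1250", "pregnanetriol": "<50",
           "cortisol": "Trace", "cortisone": None}
cleaned = {name: clean_value(value) for name, value in profile.items()}
```

Replacing a censored result with its limit, as in rule (c), discards the direction of the inequality; the sketch reproduces that choice rather than improving on it.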


Interpretative comments for each urine steroid profile were provided by trained Health and Care Professions Council--registered clinical scientists during routine clinical practice. These interpretations were assumed to represent the "ground truth" in this study, i.e., the "reference interpretation" to which the ML algorithms' classifications could be compared and their performances assessed. Each interpretive comment was manually condensed and assigned to a discrete interpretative category for the purposes of training the classifiers (including "No significant abnormality," "?Adrenal suppression," "?Adrenal tumor," "?Cushing's," "?21-OH congenital adrenal hyperplasia (CAH)," and "?5α-reductase inhibition"). These interpretive categories were a high-level representation of each disease, in which ?Adrenal tumor included all forms of adrenocortical tumor such as adenomas or carcinomas, and no distinction in Cushing's subtype was made in the ?Cushing's class (i.e., primary vs secondary hypercortisolism). As with the other categories, ?Adrenal suppression included all causes of adrenal insufficiency that were not attributed to a known profile pattern, such as a disorder of steroidogenesis. All interpretative categories were based solely on the expert interpreter's comment. After initial data review, profiles that represented interpretative classes with too few observations for robust analysis (e.g., 11β-hydroxylase CAH, after mitotane treatment, and after synacthen administration); those that represented potential undercollections or overcollections (i.e., not bona fide 24-h collections); those where quantification was unable to be reported owing to analytical interference; and those where no interpretative comment was given were excluded (491 profiles).
The final data set consisted of 6 USP interpretive classes, representing the most commonly encountered profile interpretations (see Table 1 in the Data Supplement that accompanies the online version of this article at http://), and comprised 1314 profiles. The data were first analyzed in an unsupervised manner using principal component analysis within the R statistical computing environment.


The random forest (RF), weighted-subspace random forest (WSRF), and extreme gradient boosted tree (XGBT) algorithms were trained using the caret package within the R statistical computing environment (version 3.4.1) (16-19). Although several ML algorithms are available, tree-based methods were selected for this study because of their ease of implementation, robustness to overfitting, and excellent performance on well-structured data such as those presented here (20-25). The predictive performances of the algorithms were assessed by means of nested, stratified, k-fold repeated cross-validation (herein referred to as nested CV) (Fig. 1). The inner loop of the nested CV (k = 4, repeated 5 times) procedure was used to tune the relevant hyperparameters of the models using the caret default tuning settings. The hyperparameters that provided the highest area under the ROC curve (for the binary classifiers) or mean balanced accuracy (for the multiclass classifiers) (26, 27) on the relevant validation set were selected in each case. The outer loop of the nested CV procedure (k = 4, repeated 5 times) was then used to assess the generalization performance of the tuned models on the testing data (Fig. 1). Point estimates of these values presented herein represent the mean ± 95% CIs of each repeated fold in the outer CV loop (n = 20). Generic downsampling was performed within each inner CV fold/repeat using the downSample function within the caret package. In brief, this randomly subsamples all the interpretative classes to match the frequency of the least abundant class (herein referred to as down₁). Customized random downsampling, whereby the No significant abnormality cases were randomly downsampled to match the frequency of the most abundant abnormal class (?Adrenal suppression) while maintaining the relative frequencies of the other abnormal classes, was also performed within each inner CV fold/repeat (herein referred to as down₂).
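The index bookkeeping behind nested, stratified, repeated k-fold CV can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the caret-based procedure used in the study: it only generates the splits (no model fitting or tuning), and the function names and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds that preserve class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal indices round-robin
    return folds


def nested_cv_splits(labels, k=4, repeats=5, seed=0):
    """Yield (inner_train, validation, test) index lists for nested CV."""
    rng = random.Random(seed)
    for _ in range(repeats):                     # outer repeats
        for test in stratified_folds(labels, k, rng):
            held_out = set(test)
            outer_train = [i for i in range(len(labels)) if i not in held_out]
            outer_labels = [labels[i] for i in outer_train]
            for _ in range(repeats):             # inner repeats, for tuning
                for fold in stratified_folds(outer_labels, k, rng):
                    validation = [outer_train[j] for j in fold]
                    chosen = set(validation)
                    inner_train = [i for i in outer_train if i not in chosen]
                    yield inner_train, validation, test


# Toy labels: 12 normal and 4 abnormal profiles (illustrative only).
labels = ["normal"] * 12 + ["abnormal"] * 4
splits = list(nested_cv_splits(labels))
```

The property that matters is that each outer test fold never appears in any inner training or validation set, so hyperparameter tuning cannot see the data used to estimate generalization performance.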
Class probabilities for the binary classifiers were calibrated by determining the threshold with the maximal Youden's index (sensitivity + specificity - 1) within each outer CV fold/repeat. Confusion matrices for the nested CV runs were determined by taking the sum of the confusion matrices for each fold within a repeat and taking the mean of these across each repeat.
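Threshold selection by Youden's index can be illustrated with a short sketch. The code below is an assumption-laden Python example rather than the study's R implementation: the function name, the scores, and the labels are made up, and it assumes both classes are present in the data.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: predicted probability of the positive ("?Abnormal") class.
    labels: True for positive cases, False for negative cases.
    A case is called positive when its score is >= the threshold.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
        true_neg = sum(1 for s, l in zip(scores, labels) if not l and s < threshold)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:                  # ties keep the lowest threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Made-up class probabilities and reference labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [False, False, False, True, False, True, True, True]
threshold, j = youden_threshold(scores, labels)   # threshold 0.35, J = 0.75
```

In this toy example two thresholds (0.35 and 0.60) reach the same J; the sketch keeps the lower one, which favors sensitivity. Which tie-breaking rule is appropriate depends on the relative cost of false-negative and false-positive findings.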


The importance of different features in the data set was assessed with the Boruta algorithm, which generates synthetic "shadow" features for each true feature in the data set, as previously described (28). The Boruta algorithm was run on the data twice: first using the binary class labels and second, using the multiclass labels. In each case, the maximum iteration parameter was set to 250, and all other parameters were set to their default values. The features used in the data set are summarized in Table 3 of the online Data Supplement.
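The shadow-feature idea can be sketched as below. This is not the Boruta algorithm itself: Boruta derives importances from random forests and applies a statistical test over its iterations, whereas this Python sketch substitutes a simple stand-in importance (the standardized gap between class means) and merely counts "hits." All names and data are hypothetical.

```python
import random
from statistics import mean, pstdev


def importance(column, labels):
    """Stand-in importance: standardized gap between the two class means."""
    group_a = [x for x, y in zip(column, labels) if y]
    group_b = [x for x, y in zip(column, labels) if not y]
    spread = pstdev(column) or 1.0
    return abs(mean(group_a) - mean(group_b)) / spread


def shadow_hits(features, labels, iterations=50, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(iterations):
        shadow_scores = []
        for column in features.values():
            shadow = column[:]
            rng.shuffle(shadow)          # permuting breaks any link to labels
            shadow_scores.append(importance(shadow, labels))
        bar = max(shadow_scores)         # best score achievable by chance
        for name, column in features.items():
            if importance(column, labels) > bar:
                hits[name] += 1
    return hits


# Toy data: one feature separates the classes, the other does not.
labels = [True] * 10 + [False] * 10
features = {
    "informative": [5.0 + 0.1 * i for i in range(10)]
                   + [1.0 + 0.1 * i for i in range(10)],
    "noise": [float(i % 2) for i in range(20)],
}
hits = shadow_hits(features, labels)
```

A feature that repeatedly outscores the best permuted copy is unlikely to owe its importance to chance, which is the intuition behind retaining it.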


To demonstrate the utility of the developed classifiers in practice, we first trained the best performing ML algorithms and subsampling strategies for the binary (WSRF without subsampling; Table 1) and multiclass models (RF with down₂ sampling; Table 2) on the whole data set to produce a "finalized" model. We then applied the resulting classifiers to EQA samples processed during routine laboratory service. Importantly, these profiles were processed after retrospective collection of the data used for training. We then compared their predictions with the consensus interpretation of multiple laboratories as defined by the EQA scheme. The scheme is a joint enterprise between and University College London Hospitals, consisting of 26 participating laboratories around the world. Twelve urine samples from patients with known disorders affecting the urine steroid profile are analyzed across the year and scored for quantitative results and associated interpretative comments. The assessors are a panel of clinicians and scientists with expertise in the field, and a consensus score is given to each participant. The correct answer in this scheme is based on clinical diagnosis.



Results

To gain an understanding of the underlying structure of the data set, we first analyzed the age and sex distributions of the patients within the data set (see Fig. 1 in the online Data Supplement). It was evident that the age distribution was skewed toward younger patients (median, 10.0 years; interquartile range, 7.5-20.2) (see Fig. 1A in the online Data Supplement). In addition, the data set was heavily biased toward samples from female patients (see Fig. 1B in the online Data Supplement). To assess the separation of profiles in an unsupervised manner, we performed principal component analysis (see Fig. 1, C-F, in the online Data Supplement). This demonstrated that the profiles separated to some extent in principal component space according to their original biochemical interpretation, with the separation being greatest for the ?Adrenal tumor, ?Cushing's, and ?21-OH CAH profiles (these representing the most aberrant profiles). The feature distributions between each of the interpretative classes (see Fig. 2 in the online Data Supplement) showed the patterns of urinary steroid excretion that would be expected for each class: raised 11-oxo-pregnanetriol, 17-hydroxypregnanolone, and pregnanetriol in the ?21-OH CAH profiles; raised total androgens, dehydroepiandrosterone, tetrahydro-11-deoxycortisol, pregnenediol, and pregnenetriol in the ?Adrenal tumor profiles; raised total cortisol metabolites and reduced tetrahydrocortisol ratio (5α-THF/THF) in the ?Cushing's profiles; reduced total androgens and total cortisol metabolites in the ?Adrenal suppression profiles; and reduced androsterone/etiocholanolone and 5α-THF/THF ratios in the ?5α-reductase inhibition profiles.


The abilities of the different tree-based models to separate No significant abnormality from ?Abnormal USP interpretations are shown in Table 1. ROC curves for each fold and repeat from the nested CV procedure (Fig. 1) for the best performing algorithm (WSRF without subsampling) are shown in Fig. 2A, and an averaged confusion matrix of this algorithm's performance across the outer CV loop is shown in Fig. 2B. Individual confusion matrices for each repeat of the CV procedure are shown in Fig. 3 in the online Data Supplement. The results of these comparisons demonstrated that the 3 ML algorithms were largely comparable, each producing classifiers with high predictive performance (Table 1). In addition, this analysis demonstrated that subsampling had little impact on the predictive performance of these algorithms in this setting, in some cases even reducing their predictive power (Table 1). To gain insight into which of the features were most useful for the separation of the No significant abnormality and ?Abnormal classes, we assessed feature importance using the Boruta algorithm. This demonstrated that 33 of the 45 features were deemed important for the separation of the classes (see Fig. 4 in the online Data Supplement) and that the most important feature was the ratio between androgen and cortisol metabolites, followed closely by 11-oxopregnanetriol. This analysis also showed that the less prevalent 6-hydroxylated steroids were not as important to the separation of the classes (see Fig. 4, gray boxes, in the online Data Supplement).
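The averaged confusion matrix is, per the Methods, the sum of the fold-level matrices within each repeat, averaged across repeats. A minimal Python sketch of that aggregation follows; the function name and the counts are made up for illustration and do not reproduce the study's data.

```python
def average_confusion(matrices_by_repeat):
    """Sum fold matrices within each repeat, then average across repeats.

    matrices_by_repeat[r][f][i][j] counts cases of reference class i that
    were predicted as class j in fold f of repeat r.
    """
    size = len(matrices_by_repeat[0][0])
    repeat_totals = []
    for folds in matrices_by_repeat:
        total = [[sum(fold[i][j] for fold in folds) for j in range(size)]
                 for i in range(size)]
        repeat_totals.append(total)
    repeats = len(repeat_totals)
    return [[sum(total[i][j] for total in repeat_totals) / repeats
             for j in range(size)] for i in range(size)]


# Two repeats of a 2-fold CV on 30 cases (illustrative counts only).
repeat_1 = [[[8, 2], [1, 4]], [[9, 1], [2, 3]]]
repeat_2 = [[[7, 3], [0, 5]], [[9, 1], [1, 4]]]
averaged = average_confusion([repeat_1, repeat_2])
# averaged == [[16.5, 3.5], [2.0, 8.0]]
```

Because every case is tested exactly once per repeat, each repeat's summed matrix covers the whole data set, and the averaged cells can be read as expected counts per full pass.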


To determine whether it was possible to differentiate between each of the abnormal disease classes, we compared the ability of the same 3 ML algorithms used for the binary analysis to do so. The results of these comparisons are shown in Table 2. These demonstrated that, as with the binary classifiers, the performances of the 3 ML algorithms were largely comparable. The subsampling method did, however, have a substantial impact on model performance, with the down₂ method producing the highest mean balanced accuracy (Table 2). An averaged confusion matrix for the best performing multiclass classifier (RF with down₂ subsampling) is shown in Fig. 3. Individual confusion matrices for each repeat of the CV procedure are shown in Fig. 5 of the online Data Supplement, and these demonstrated the overall high performance of the algorithm. An expected degree of overlap between ?Adrenal tumor and ?Cushing's cases was observed, however, and several No significant abnormality cases were misclassified as representing an abnormal class. In addition, several abnormal cases were classified as No significant abnormality--this was most prominent for the ?Adrenal suppression class. We also performed an analysis of feature importance for the separation of the individual interpretative classes, and this showed similar results to those of the binary classifier (see Fig. 6 in the online Data Supplement). Taken together, these data suggested that the multiclass classifier was less sensitive and specific than its binary counterpart but was still able to provide useful predictions for the interpretation of USPs.
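Mean balanced accuracy for a multiclass classifier can be computed from a confusion matrix as sketched below. The sketch assumes the one-vs-rest definition, in which each class's balanced accuracy is the mean of its sensitivity and specificity and the per-class values are then averaged; the function name and the counts are hypothetical.

```python
def mean_balanced_accuracy(confusion):
    """Average the one-vs-rest balanced accuracy over all classes.

    confusion[i][j] counts cases of reference class i predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for k in range(size):
        true_pos = confusion[k][k]
        actual = sum(confusion[k])                       # reference class k
        predicted = sum(row[k] for row in confusion)     # predicted class k
        false_pos = predicted - true_pos
        sensitivity = true_pos / actual
        specificity = (total - actual - false_pos) / (total - actual)
        per_class.append((sensitivity + specificity) / 2)
    return sum(per_class) / size


# Imbalanced 3-class example: 100, 10, and 10 cases (illustrative counts).
confusion = [[90, 5, 5],
             [2, 8, 0],
             [1, 1, 8]]
score = mean_balanced_accuracy(confusion)   # 0.875
```

Plain accuracy for the same matrix is 106/120 (about 0.883) and is dominated by the majority class, which is why a class-balanced metric is the more informative target when classes are as unevenly represented as they are here.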


Having demonstrated the power of both the binary and multiclass classifiers for the prediction of the biochemical interpretation of USPs, we next sought to demonstrate the utility of the developed classifiers in practice. To achieve this, we applied both classifiers to EQA samples processed within our laboratory. The predicted class probabilities from both classifiers and the EQA scheme's consensus interpretation are shown in Fig. 4. These analyses demonstrated that the tools could accurately predict the interpretation of the EQA USPs and produced an intuitive graphical representation of class membership. Consistent with the EQA consensus interpretation, the classifiers could correctly predict the binary interpretation of all the profiles (11 of 11 correctly identified as No significant abnormality or ?Abnormal) and the multiclass interpretation for most of the profiles when compared with the consensus EQA comment (7 of 11 correctly classified).

Discussion and Conclusions

ML algorithms have proved successful in a wide variety of contexts for the recognition of patterns within large, complex data sets (29). As a result, they are well suited to the task of aiding the interpretation of complex biochemical profiling data produced by clinical laboratories. The prevalence of these kinds of data is likely to increase in parallel with advances in assay and information technologies. Despite this, their application to laboratory medicine practice has (thus far) been limited, particularly in clinical biochemistry (30). Encouragingly, however, reports of the application of ML to clinical biochemistry and hormone analysis are increasing and have demonstrated substantial utility, with a recent study showing the ability of ML and multivariate statistics to stratify patients with prostate cancer from cases of benign prostate hypertrophy and controls using data from LC-MS/MS-based serum steroid analyses (11). Perhaps the most striking demonstration of the application of ML to steroid analysis, however, is the advent of the athlete biological passport (10), which continues to demonstrate utility and novel applications, such as the categorization of testosterone-treated transgender men in the context of antidoping detection tests (31). To this end, an ML-based decision support system for an assay used in our laboratory was developed and assessed for routine application within a clinical laboratory setting. This was aimed at providing a proof of principle for the application of ML within the clinical laboratory.

The diagnostic potential of USPs and other profiling tests arises from their multivariate nature, with multiple features representing unique diagnostic patterns. The interpretation of these patterns, however, requires a substantial amount of resources with respect to both time and training, and is limited by the investigator's capacity to integrate large amounts of complex information. Thus, the appeal of an intelligent decision support tool is unsurprising. Indeed, the concept is not without precedent, having already been applied to areas such as inherited metabolic disease screening (32) and for the differentiation of adrenocortical adenomas from carcinomas (9). The application of an ML classifier as a routine interpretive support tool, however, has not previously been demonstrated in "profile" biochemistry, although the concept has early precedent (12).

The demographics and underlying data structure in this study were heavily skewed toward younger patients (see Fig. 1A in the online Data Supplement) and those of female sex (see Fig. 1B in the online Data Supplement). This perhaps reflects the age at which the disorders of steroidogenesis initially present clinically, particularly with regard to CAH (33-35), in which female infants present with ambiguous genitalia, whereas males may not have overt signs of the disease at early stages (if not salt-wasting). Interestingly and unexpectedly, the assessment of feature importance for the binary classifiers did not deem the patient's sex to be an important feature in predicting the abnormality of a USP (see Fig. 4 in the online Data Supplement), which appears contrary to contemporary understanding. Our initial analyses demonstrated that the underlying differences between the most aberrant profiles separated well in principal component space (see Fig. 1, C-F, in the online Data Supplement). This finding is in agreement with previously published data, with the expected features of steroid excretion being present within each interpretative class (see Fig. 2 in the online Data Supplement) (9,36-38). The overlap observed between each of the classes, however, emphasized the inability of unsupervised methods, such as principal component analysis, to adequately separate the disease classes.

We demonstrated the ability of the WSRF and traditional RF algorithms to accurately predict the biochemical interpretation of USPs, despite the relatively small size of the data set and substantial class imbalance (Figs. 2 and 3; Tables 1 and 2). Imbalances in the data used to train ML algorithms are known to reduce their predictive performance (39). Several methods exist by which this can be alleviated, with one of the simplest being downsampling. This involves random subsampling (without replacement) of the majority class (being No significant abnormality in this case; see Table 2 in the online Data Supplement) to match the frequency of the minority class. Importantly, this is performed within each fold of the inner CV loop (Fig. 1). It seemed that downsampling had little impact on the performance of the binary classifiers (Table 1) yet substantially improved the performance of the multiclass classifiers (Table 2). This is likely to be because the imbalance between No significant abnormality and ?Abnormal cases was only approximately 3:1 (see Table 2 in the online Data Supplement), whereas in the multiclass context this was, at its worst, 36:1 (see Table 1 in the online Data Supplement). Overall, the binary classifiers had higher predictive performance than their multiclass counterparts, suggesting that they could be the first-line classifier to use in practice, followed by the less accurate, but nevertheless informative, prediction of the individual interpretative class using the second classifier. Indeed, these models could be built into a "stack," which may provide improved performance (40), although this would require additional validation on a larger data set.
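The two downsampling schemes described in the Methods (generic and customized) can be contrasted in a short sketch. The Python below is an illustration under stated assumptions, not the caret `downSample` call used in the study; the function names and class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_class(labels):
    """Map each class label to the list of sample indices carrying it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return groups


def down1(labels, rng):
    """Generic: subsample every class to the size of the rarest class."""
    groups = group_by_class(labels)
    target = min(len(members) for members in groups.values())
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, target))   # without replacement
    return sorted(kept)


def down2(labels, majority, rng):
    """Customized: subsample only `majority` to the next largest class."""
    groups = group_by_class(labels)
    target = max(len(m) for label, m in groups.items() if label != majority)
    kept = rng.sample(groups[majority], min(target, len(groups[majority])))
    for label, members in groups.items():
        if label != majority:
            kept.extend(members)         # other classes are left untouched
    return sorted(kept)


# Illustrative class counts only (not the study's data).
labels = (["No significant abnormality"] * 40 + ["?Adrenal suppression"] * 10
          + ["?Adrenal tumor"] * 4 + ["?Cushing's"] * 2)
rng = random.Random(0)
counts1 = Counter(labels[i] for i in down1(labels, rng))
counts2 = Counter(labels[i] for i in down2(labels, "No significant abnormality", rng))
```

With these toy counts the generic scheme keeps 2 cases per class (8 in total), whereas the customized scheme keeps 26, retaining every abnormal case. That difference in retained training data is one plausible reason the customized scheme could perform better in the multiclass setting.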

Although the classifiers performed well for the interpretation of USPs, in both the binary and multiclass contexts, some incorrect classifications were evident. The multiclass classifier appeared to classify some ?Adrenal suppression profiles as No significant abnormality (Fig. 3). This may reflect some inherent subjectivity of the initial interpretation and illustrates the inconsistencies that may arise from a subjective "ground truth," for which no clear cutoff exists for normal and adrenal suppression in the context of USPs. Indeed, it must be highlighted that between-interpreter agreement could not be assessed in this study. As such, the performance of the ML algorithm should be considered carefully, given that between-expert variation is likely to be high and that the binary ML algorithm may perform better than the average interpreter yet worse than the "best" interpreter. Further studies to assess expert agreement when including an ML algorithm within interpretive work flows should be conducted, with a view to demonstrating improved harmonization across expert centers and improving consensus. The objectivity that ML algorithms could bring to these situations may prove useful in the future.

In some instances, the classifier misclassified ?Adrenal tumor as ?Cushing's (Fig. 3). This may be expected to some extent, given that adrenally driven Cushing disease may form part of the adrenal tumor class, with increased adrenal androgen and cortisol metabolites. Admittedly, some disease classes were identified as normal in a few cases, particularly in the case of some tumors, Cushing disease, and 21-hydroxylase deficiency profiles. On further inspection of these cases, some could be justified by representing posttreatment samples or partial deficiencies (e.g., a 21-hydroxylase CAH profile showing the presence of some cortisol metabolites and low levels of CAH-related metabolites), and although the initial interpretative comment described these as specific disease profiles, they did not necessarily represent samples taken at diagnosis. This again highlights the importance of validating the training data set and refining mechanisms for supervised classification. Hard outcomes (i.e., genetics or histology) are the desired ground truths, and future studies should endeavor to include these wherever possible. Training the algorithms on larger data sets (with a greater variety of cases) against definitive patient outcomes would likely improve on these aspects, and clinical laboratories are in an excellent position to collect these data. Nonetheless, the classifier's ability to accurately predict the interpretation of the major disease classes was clear.

It must be emphasized that this study was an evaluation of the system to predict the interpretation of profiles by experienced practitioners and not a model of diagnostic accuracy itself (i.e., not using definitive clinical outcome data). In other words, this was a model of the "interpreter's brain" and aimed to demonstrate the utility of ML algorithms in building CDS systems from readily available laboratory data to help streamline and improve our clinical service. To make claims of the ability of ML algorithms to predict clinical outcomes would require further work using gold standard diagnostic outcome data (e.g., histological, radiological, or genetic analyses) as the class labels.

It must be appreciated that ML-derived classifiers and their performances are heavily dependent on the data on which they are trained and the data to which they are applied. As such, they will perform poorly on patterns that are not included in the training data set (e.g., the EQA pregnancy profile in Fig. 4) and are sensitive to the biases of the assay used to generate the data. It must be ensured, therefore, that the data to which these models are applied are of the same type and structure as the training data, and that classifiers are developed using the same kind of data to which they will be applied. It is also important to note that the selection of the performance metric by which the algorithms are assessed should be carefully considered, as this has a large impact on the balance of false-positive and false-negative findings provided by the final decision tool.

The use of ML-based classifiers in routine clinical practice appears particularly feasible, as it distills the entire profile into numerical representations. Diagnostic clarification by practitioners may be required only in cases in which the classifier returns homogeneous class probabilities (i.e., those that are of less certain classification), whereas profiles with a high class probability may not require human intervention. This could liberate a substantial amount of time for the practitioner to focus on the interpretation of the profiles more likely to represent disease or that are unusual.
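One way such a triage step might look is sketched below. This is a hypothetical Python illustration: the function names, the example probabilities, and in particular the 0.8 confidence cutoff are assumptions, and any real cutoff would need to be validated locally before use.

```python
def needs_review(class_probabilities, min_confidence=0.8):
    """Flag a profile for expert review when no class clearly dominates."""
    return max(class_probabilities.values()) < min_confidence


def triage(profiles, min_confidence=0.8):
    """Split profile ids into (confident, for_review) lists."""
    confident, for_review = [], []
    for profile_id, probabilities in profiles.items():
        if needs_review(probabilities, min_confidence):
            for_review.append(profile_id)
        else:
            confident.append(profile_id)
    return confident, for_review


# Made-up class probabilities for two profiles.
profiles = {
    "A": {"No significant abnormality": 0.93,
          "?Adrenal suppression": 0.04, "?Adrenal tumor": 0.03},
    "B": {"No significant abnormality": 0.40,
          "?Adrenal suppression": 0.35, "?Adrenal tumor": 0.25},
}
confident, for_review = triage(profiles)   # (["A"], ["B"])
```

Profile B, whose probabilities are spread evenly across classes, is routed to a practitioner, whereas profile A carries a single dominant class. Whether confidently classified abnormal profiles should also be reviewed is a policy decision that the sketch leaves open.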

Taken together, our analysis demonstrates the applicability of ML tools to the kinds of data produced by clinical laboratories and provides a proof of concept for their use in the interpretation of complex biochemical profiling data. Here, we have presented the use of such tools for the interpretation of USP data, an admittedly specialist assay; however, we believe many other profiling assays commonly used in clinical practice would be amenable to such an approach.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 4 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved.

E.H. Wilkes, statistical analysis.

Authors' Disclosures or Potential Conflicts of Interest: No authors declared any potential conflicts of interest.

Role of Sponsor: No sponsor was declared.

Acknowledgments: The authors thank both past and present members of our laboratory for their contributions of processing, analysis, and biochemical interpretation of the steroid profiles presented here. The authors also thank the patients whose anonymous samples constitute the data set analyzed and presented here.

Received May 21, 2018; accepted July 23, 2018.

Previously published online at DOI: 10.1373/clinchem.2018.292201


(1.) Phillips I, Conway E, Hodkinson R. External quality assessment of urinary steroid profile analysis. Ann Clin Biochem 2004;41:474-8.

(2.) Bennett A, Garcia E, Schulze M, Bailey M, Doyle K, Finn W, et al. Building a laboratory workforce to meet the future: ASCP task force on the laboratory professionals workforce. Am J Clin Pathol 2014;141:154-67.

(3.) Bright T, Wong A, Dhurjati R, Bristow E, Bastian L, Coeytaux RR, et al. Effect of clinical decision-support systems: a systematic review. Ann Intern Med 2012;157:29-43.

(4.) Matheny ME, Ohno-Machado L. Generation of knowledge for clinical decision support. Statistical and machine learning techniques. In: Greenes RA, editor. Clinical decision support: the road to broad adoption. 2nd Ed. London (UK): Elsevier; 2014. p. 309-37.

(5.) Baron JM, Mermel CH, Lewandrowski KB, Dighe AS. Detection of preanalytic laboratory testing errors using a statistically guided protocol. Am J Clin Pathol 2012;138:406-13.

(6.) Baron JM, Cheng XS, Bazari H, Bhan I, Lofgren C, Jaromin RT, et al. Enhanced creatinine and estimated glomerular filtration rate reporting to facilitate detection of acute kidney injury. Am J Clin Pathol 2015;143:42-9.

(7.) Luo Y, Szolovits P, Dighe AS, Baron JM. Using machine learning to predict laboratory test results. Am J Clin Pathol 2016;145:778-88.

(8.) Altinier S, Sarti L, Varagnolo M, Zaninotto M, Maggini M, Plebani M. An expert system for the classification of serum protein electrophoresis patterns. Clin Chem Lab Med 2008;46:1458-63.

(9.) Arlt W, Biehl M, Taylor AE, Hahner S, Libe R, Hughes BA, et al. Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors. J Clin Endocrinol Metab 2011;96:3375-84.

(10.) Van Renterghem P, Sottas P-E, Saugy M, Van Eenoo P. Statistical discrimination of steroid profiles in doping control with support vector machines. Anal Chim Acta 2013;768:41-8.

(11.) Albini A, Bruno A, Bassani B, D'Ambrosio G, Pelosi G, Consonni P, et al. Serum steroid ratio profiles in prostate cancer: a new diagnostic tool toward personalized medicine approach. Front Endocrinol 2018;9:110.

(12.) Dybowski R, Taylor NF. Towards a steroid-profiling expert system. Chemom Intell Lab Syst 1988;5:65-72.

(13.) Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet 2015;16:321-32.

(14.) Cao L, Tay FEH. Financial forecasting using support-vector machines. Neural Comput Appl 2001;10:184-92.

(15.) Hamilton EF, Dyachenko A, Ciampi A, Maurel K, Warrick PA, Garite TJ. Estimating risk of severe neonatal morbidity in preterm births under 32 weeks of gestation. [Epub ahead of print] J Matern Fetal Neonatal Med July 10, 2018 as doi: 10.1080/14767058.2018.1487395.

(16.) Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2:18-22.

(17.) Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28(i05).

(18.) Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA; 2016. p. 785-94.

(19.) Zhao H, Williams G, Huang J. WSRF: an R package for classification with scalable weighted subspace random forests. J Stat Softw 2017;77:1-30.

(20.) Breiman L. Random forests. Mach Learn 2001;45:5-32.

(21.) Caruana R, Karampatziakis N, Yessenalina A. An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning. New York, NY; 2008. p. 96-103.

(22.) Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, et al. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003;19:1636-43.

(23.) Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008;9:319.

(24.) Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 2005;48:869-85.

(25.) Maroco J, Silva D, Rodrigues A, Guerreiro M, Santana I, de Mendonca A. Data mining methods in the prediction of dementia: a real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests. BMC Res Notes 2011;4:299.

(26.) Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. In: Proceedings of the 20th International Conference on Pattern Recognition. Istanbul, Turkey; 2010. p. 3121-4.

(27.) Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, et al. A balanced accuracy function for epistasis modeling in imbalanced data sets using multifactor dimensionality reduction. Genet Epidemiol 2007;31:306-15.

(28.) Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw 2010;36:1-13.

(29.) Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 2000;22:4-37.

(30.) Richardson A, Signor BM, Lidbury BA, Badrick T. Clinical chemistry in higher dimensions: machine-learning and enhanced prediction from routine clinical chemistry data. Clin Biochem 2016;49:1213-20.

(31.) Savkovic S, Lim S, Jayadev V, Conway A, Turner L, Curtis D, et al. Urine and serum sex steroid profile in testosterone-treated transgender and hypogonadal and healthy control men. J Clin Endocrinol Metab 2018;103:2277-83.

(32.) Baumgartner C, Bohm C, Baumgartner D, Marini G, Weinberger K, Olgemoller B, et al. Supervised machine learning techniques for the classification of metabolic disorders in newborns. Bioinformatics 2004;20:2985-96.

(33.) Merke DP, Bornstein SR. Congenital adrenal hyperplasia. Lancet 2005;365:2125-36.

(34.) Speiser PW, White PC. Congenital adrenal hyperplasia. N Engl J Med 2003;349:776-88.

(35.) Miller WL, Auchus RJ. The molecular biology, biochemistry, and physiology of human steroidogenesis and its disorders. Endocr Rev 2011;32:81-151.

(36.) Shackleton CHL, Taylor NF, Honour JW. An atlas of gas chromatographic profiles of neutral urinary steroids in health and disease. Delft (the Netherlands): Packard Becker; 1980.

(37.) Phillipou G. Investigation of urinary steroid profiles as a diagnostic method in Cushing's syndrome. Clin Endocrinol (Oxf) 1982;16:433-9.

(38.) Christakoudi S, Cowan DA, Taylor NF. A new marker for early diagnosis of 21-hydroxylase deficiency: 3β,16α,17α-trihydroxy-5α-pregnane-7,20-dione. J Steroid Biochem Mol Biol 2010;121:574-81.

(39.) Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intell Data Anal 2002;6:429-49.

(40.) Dzeroski S, Zenko B. Is combining classifiers with stacking better than selecting the best one? Mach Learn 2004;54:255-73.

Edmund H. Wilkes, [1] Gill Rumsby, [1] and Gary M. Woodward [1] *

[1] Department of Clinical Biochemistry, University College London Hospitals, London, UK.

[2] Nonstandard abbreviations: USP, urine steroid profiling; CDS, clinical decision support; ML, machine learning; EQA, external quality assurance; CAH, congenital adrenal hyperplasia; RF, random forest; WSRF, weighted-subspace random forest; XGBT, extreme gradient boosted tree; CV, cross-validation; THF, tetrahydrocortisol.

* Address correspondence to this author at: Department of Clinical Biochemistry, University College London Hospitals, 60 Whitfield Street, London W1T 4EU, UK. Fax +4407891500610; e-mail

Caption: Fig. 1. Nested cross-validation procedure for the tuning of hyperparameters and assessment of model performance. The total data set is first randomly partitioned into [k.sub.outer] outer folds (where [k.sub.outer] = 4). Three of the outer folds are then combined and further randomly subdivided into [k.sub.inner] inner folds (where [k.sub.inner] = 4) to form the training set. Models are built using this training set with a given set of hyperparameters and their performances assessed on the inner fold held out from training (the validation set, light gray). This is then repeated for each inner fold and then further repeated a total of 5 times (with different partitions of folds) to achieve stable estimates of training set performance. A classifier is then trained on the total inner data set with the best performing set of hyperparameters (as determined by the inner CV loop) and used to predict the class of the samples in the outer fold held out from the inner CV process (the testing set, dark gray). This is repeated for each outer fold and then further repeated (with different fold partitions) a total of 5 times.
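The procedure described in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`k_fold_indices`, `nested_cv`, `fit_threshold`, `accuracy`), the toy threshold "model", and the synthetic data are all hypothetical, and only the fold counts (4 outer, 4 inner) and the 5 repeats are taken from the caption.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, rng):
    """Randomly partition range(n_samples) into k roughly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def subset(values, idx):
    return [values[i] for i in idx]


def nested_cv(X, y, fit, score, param_grid,
              k_outer=4, k_inner=4, repeats=5, seed=0):
    """Return one held-out test score per outer fold per repeat."""
    rng = random.Random(seed)
    test_scores = []
    for _ in range(repeats):  # repeat the outer loop with new partitions
        for test_idx in k_fold_indices(len(X), k_outer, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]

            def inner_score(params):
                # Repeated inner k-fold CV, using the training set only.
                scores = []
                for _ in range(repeats):
                    folds = k_fold_indices(len(train_idx), k_inner, rng)
                    for fold in folds:
                        val_idx = [train_idx[j] for j in fold]
                        val_set = set(val_idx)
                        fit_idx = [i for i in train_idx if i not in val_set]
                        model = fit(subset(X, fit_idx), subset(y, fit_idx), params)
                        scores.append(
                            score(model, subset(X, val_idx), subset(y, val_idx)))
                return mean(scores)

            # Pick the hyperparameters with the best inner-loop performance,
            # refit on the whole training set, then score the untouched fold.
            best = max(param_grid, key=inner_score)
            model = fit(subset(X, train_idx), subset(y, train_idx), best)
            test_scores.append(
                score(model, subset(X, test_idx), subset(y, test_idx)))
    return test_scores


# Toy usage (hypothetical): the "model" is just a decision threshold.
def fit_threshold(X_train, y_train, params):
    return params["threshold"]


def accuracy(model, X_eval, y_eval):
    predictions = [1 if x > model else 0 for x in X_eval]
    return sum(p == t for p, t in zip(predictions, y_eval)) / len(y_eval)


X = [i / 39 for i in range(40)]
y = [1 if x > 0.5 else 0 for x in X]
grid = [{"threshold": t} for t in (0.2, 0.5, 0.8)]
scores = nested_cv(X, y, fit_threshold, accuracy, grid)
```

With 4 outer folds and 5 repeats, `scores` holds 20 test-set values, which corresponds to the n = 20 reported in the tables. The key design point is that the samples in each outer test fold never take part in hyperparameter selection.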

Caption: Fig. 2. ML algorithms separate No significant abnormality and ?Abnormal profiles with high predictive power. (A), ROC curves for the best performing binary classifier. Each curve represents a specific fold within each repeat from the outer CV loop (see Fig. 1). The red line represents the overall performance across all folds and repeats. AUC, area under the curve. (B), A confusion matrix representing the results of the best performing classifier (WSRF without subsampling). This was obtained by taking the sum of the confusion matrices of each fold (n = 4) and averaging these across repeated CV runs (n = 5). Inset numbers represent the number of individual cases. Colors represent the percentage of expected cases for each interpretative class.

Caption: Fig. 3. ML algorithms differentiate individual abnormal classes with high predictive power. A confusion matrix representing the results of the best performing classifier (RF with [Down.sub.2] subsampling). This was obtained by taking the sum of the confusion matrices of each fold (n = 4) and averaging these across repeated CV runs (n = 5). Inset numbers represent the number of individual cases. Colors represent the percentage of expected cases for each interpretative class.
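The aggregation described for the confusion matrices in Figs. 2 and 3 (sum over folds, then average over repeated CV runs) can be illustrated as follows. This is a sketch only: the counts are invented, the function names are hypothetical, and the orientation (rows as expected class, columns as predicted class) is an assumption.

```python
def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def aggregate_confusion(runs):
    """runs: list of CV runs, each a list of per-fold confusion matrices."""
    run_totals = []
    for fold_matrices in runs:
        total = fold_matrices[0]
        for matrix in fold_matrices[1:]:
            total = add_matrices(total, matrix)  # sum across folds
        run_totals.append(total)
    summed = run_totals[0]
    for matrix in run_totals[1:]:
        summed = add_matrices(summed, matrix)
    # Average the per-run totals across the repeated CV runs.
    return [[cell / len(run_totals) for cell in row] for row in summed]


def row_percentages(matrix):
    """Percentage of each expected (row) class assigned to each prediction."""
    return [[100 * cell / sum(row) if sum(row) else 0.0 for cell in row]
            for row in matrix]


# Hypothetical counts: 2 CV runs of 2 folds each, 2 classes.
runs = [
    [[[5, 1], [0, 4]], [[4, 0], [2, 4]]],
    [[[5, 0], [1, 4]], [[5, 0], [1, 4]]],
]
averaged = aggregate_confusion(runs)
percentages = row_percentages(averaged)
```

Here `averaged` plays the role of the inset case counts and `percentages` the role of the color scale.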

Caption: Fig. 4. The developed tools find utility in routine clinical practice as decision support systems. EQA samples from October 2016 to September 2017 routinely processed in the laboratory were processed through the "finalized" binary and multiclass classifiers and their interpretative classes predicted. The predicted class in each case is shown inset within each box. The consensus comment from the EQA scheme is shown for each sample. The red line represents the decision threshold for the binary classifier, with probabilities for the ?Abnormal class above this threshold being classified as ?Abnormal. ACC, adrenocortical carcinoma; ACA, adrenocortical adenoma.

Table 1. ML algorithm performance for binary classification. (a)

Model   Subsampling    Mean AUROC (b)

RF      None           0.952 (0.946-0.958)
        [Down.sub.1]   0.949 (0.943-0.955)
WSRF    None           0.955 (0.949-0.961) #
        [Down.sub.1]   0.950 (0.944-0.956)
XGBT    None           0.945 (0.939-0.951)
        [Down.sub.1]   0.940 (0.934-0.946)

(a) The algorithm and subsampling method with the highest performance is indicated with #. Values represent the mean across the 4 outer CV folds, repeated 5 times (n = 20). 95% CIs are shown in parentheses.

(b) Area under the ROC curve.
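As an illustration of how a mean and 95% CI over n = 20 cross-validation scores might be summarized, the sketch below uses a Student's t interval. This is an assumption: the table footnote does not state how the intervals were derived. The function name and the AUROC values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 19 degrees of freedom
# (n = 20 scores). Using a t interval here is an assumption.
T_CRIT_DF19 = 2.093


def mean_with_ci(values, t_crit=T_CRIT_DF19):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre, centre - half_width, centre + half_width


# Hypothetical AUROC values: 4 outer folds x 5 repeats.
aurocs = [0.94, 0.96] * 10
centre, lower, upper = mean_with_ci(aurocs)
summary = f"{centre:.3f} ({lower:.3f}-{upper:.3f})"
```

The `summary` string follows the "mean (lower-upper)" layout used in the tables.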

Table 2. ML algorithm performance for multiclass classification. (a)

Model   Subsampling    Mean balanced accuracy

RF      None           0.835 (0.826-0.845)
        [Down.sub.1]   0.858 (0.848-0.869)
        [Down.sub.2]   0.873 (0.865-0.880) #
WSRF    None           0.840 (0.83.-0.848)
        [Down.sub.1]   0.850 (0.840-0.86.)
        [Down.sub.2]   0.866 (0.860-0.873)
XGBT    None           0.838 (0.828-0.848)
        [Down.sub.1]   0.852 (0.842-0.86.)
        [Down.sub.2]   0.872 (0.864-0.880)

(a) The algorithm and subsampling method with the highest performance is indicated with #. Values represent the mean across the 4 outer CV folds, repeated 5 times (n = 20). 95% CIs are shown in parentheses.

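For readers unfamiliar with the metric in Table 2, balanced accuracy for a binary problem is the mean of sensitivity and specificity. The sketch below extends this to several classes by computing the binary value one-vs-rest for each class and averaging. That extension is an assumption, as the table does not state which multiclass generalization was used. The function name and counts are hypothetical.

```python
def one_vs_rest_balanced_accuracy(matrix):
    """matrix[i][j]: number of class-i samples predicted as class j.

    Returns (mean balanced accuracy, per-class balanced accuracies).
    Assumes every class has at least one expected and one "rest" sample.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_class = []
    for k in range(n_classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(matrix[i][k] for i in range(n_classes)) - tp
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        per_class.append((sensitivity + specificity) / 2)
    return sum(per_class) / n_classes, per_class


# Hypothetical 3-class confusion matrix (rows: expected, columns: predicted).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
overall, per_class = one_vs_rest_balanced_accuracy(confusion)
```

Because each class contributes equally to the average, a classifier cannot score well simply by favoring the most common class, which is why the metric suits imbalanced data.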
COPYRIGHT 2018 American Association for Clinical Chemistry, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2018 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:Informatics and Statistics
Author:Wilkes, Edmund H.; Rumsby, Gill; Woodward, Gary M.
Publication:Clinical Chemistry
Date:Nov 1, 2018