
Using genetic programming to transform from Australian to USDA/FAO soil particle-size classification system.


Two major soil textural classifications are used in the world, the International and the USDA/FAO systems. They differ in the limit between the silt and sand particle sizes: 20 µm for the International system and 50 µm for the USDA/FAO system. This becomes a problem when a pedotransfer function (PTF) generated in one system is used with data from the other; thus, a conversion between the two systems is necessary. Several attempts to achieve this have been made (Marshall 1947; Shirazi et al. 1988; Buchan 1989; Rousseva 1997). Minasny et al. (1999) predicted the fraction $P_{20\text{-}50}$ to convert from the 2-20 to the 2-50 µm fraction with the model:

$$\hat{P}_{2\text{-}50}(\%) = 48.4593 - 0.2225\,P_{20\text{-}2000} - 0.0029\,(P_{20\text{-}2000})^2 - 0.6952\,P_{<2} + 0.0018\,(P_{<2})^2 \qquad (R^2 = 0.76) \tag{1}$$

In order to achieve better prediction, Minasny and McBratney (2001) used a larger dataset than that used for model 1 (Eqn 1), and generated a model (model 2) using a multiple linear regression:

$$\hat{P}_{2\text{-}50}(\%) = -18.3914 + 2.0971\,P_{2\text{-}20} + 0.6726\,P_{20\text{-}2000} - 0.0142\,(P_{2\text{-}20})^2 - 0.0049\,(P_{20\text{-}2000})^2 \qquad (R^2 = 0.823) \tag{2}$$

If $\hat{P}_{2\text{-}50} < 0$ then $\hat{P}_{2\text{-}50} = 0.8289\,P_{2\text{-}20} + 0.0198\,P_{20\text{-}2000}$

This model was reported to produce unreasonable estimates at high clay and low sand contents. It is also a two-part model that produces an unnatural 'break'. The aim of this work is to improve model 2 with a new tool based on genetic programming (GP).
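To make the two-part structure of model 2 concrete, the following minimal Python sketch evaluates Eqn 2 together with its fallback branch. The function and argument names are hypothetical; the coefficients are those printed above, and inputs are assumed to be percentages by mass.

```python
def model2_p2_50(p2_20, p20_2000):
    """Estimate the 2-50 µm fraction (%) from Eqn 2, including its fallback branch."""
    estimate = (-18.3914
                + 2.0971 * p2_20
                + 0.6726 * p20_2000
                - 0.0142 * p2_20 ** 2
                - 0.0049 * p20_2000 ** 2)
    if estimate < 0:
        # Second part of the model: replaces negative estimates of the quadratic.
        estimate = 0.8289 * p2_20 + 0.0198 * p20_2000
    return estimate


# Hypothetical samples: (2-20 µm %, 20-2000 µm %)
print(round(model2_p2_50(20.0, 55.0), 2))  # quadratic branch
print(round(model2_p2_50(1.0, 5.0), 2))    # fallback branch
```

The `if` statement is where the 'break' arises: the two branches are separate functions, so the estimate can jump where the quadratic crosses zero.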


Three datasets were used in this work. Their sizes and uses are shown in Table 1. The USDA/NRCS dataset corresponds to the National Soil Characterisation database. The samples had data on soil texture measurements at <2, 2-20, 20-50, 50-100, 100-250, 250-500, 500-1000, and 1000-2000 µm fractions. The Australian dataset contains data from soil profile observations collected by CSIRO from various soil projects in Australia that had measurements of <2, 2-20, 2-50, 20-200, and 200-2000 µm. The IGBP-DIS dataset contains global data of soil properties that can be used for the development of PTFs, with particle-size measurements at <2, 2-20, 20-50, 50-100, 100-250, 250-500, 500-1000, and 1000-2000 µm.

The USDA/NRCS and IGBP-DIS datasets were standardised to <2, 2-20, 2-50, 20-200, and 200-2000 µm. Particles <200 µm were estimated from a log-linear interpolation between <100 and <250 µm.
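The interpolation step can be sketched as follows. This is an illustration only: it assumes that 'log-linear' means the cumulative mass percentage is taken as linear in the logarithm of particle diameter between the two neighbouring size limits. The function name and the sample values are hypothetical.

```python
import math


def cumulative_below(target_um, lower_um, lower_pct, upper_um, upper_pct):
    """Interpolate a cumulative mass percentage linearly in log(diameter)."""
    weight = (math.log(target_um) - math.log(lower_um)) / (
        math.log(upper_um) - math.log(lower_um)
    )
    return lower_pct + weight * (upper_pct - lower_pct)


# Hypothetical sample: 40% of mass finer than 100 µm, 60% finer than 250 µm.
below_200 = cumulative_below(200.0, 100.0, 40.0, 250.0, 60.0)
print(round(below_200, 2))
```

Under this assumption, the 20-200 µm fraction is the interpolated value minus the cumulative percentage finer than 20 µm, and the 200-2000 µm fraction is 100 minus the interpolated value.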

All of the outliers and abnormal observations were removed. Statistics of particle fractions are presented in Table 2.

Genetic programming

Genetic programming is a machine-learning method for evolving computer programs, following the concepts of natural selection and genetics, to solve problems. It is generally used to infer the underlying structure of a natural or experimental process in order to model it numerically. Applications of GP to soil science are varied. They range from determining soil characteristics (Makkeasorn et al. 2006; Parasuraman et al. 2007a), to water and nutrient management in agriculture (Ines et al. 2006; Sharma and Jana 2009), to development of PTFs (Johari et al. 2006; Parasuraman et al. 2007b). In a recent work, Selle and Muttil (2011) test the structure of a hydrological model using GP and give a good description of how the GP process works.

Genetic programming works with several solutions, known collectively as a 'population', rather than a single solution at any one time; thus, the possibility of getting trapped in a 'local optimum' is reduced. GP differs from traditional genetic algorithms in that it typically operates on 'parse trees' instead of bit strings. A parse tree is built up from a 'terminal set' (the input variables in the problem and randomly generated constants, i.e. empirical model coefficients) and a 'function set' (the basic operators used to form the GP model). The function set is user-defined and can include algebraic operators, such as {+, -, *, %}, logical rules ({IF, OR, AND}), or more complex operators ({sin, cos, exp}). An example of an initial population of parse trees can be found in Fig. 1.
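As an illustration of how a parse tree combines the terminal and function sets, the sketch below stores a tree as nested tuples and evaluates it recursively. The representation, the variable names, and the reading of `%` as protected division (returning 1 when the divisor is close to zero) are assumptions made for this example, not details taken from the software used in this work.

```python
import operator


def protected_div(a, b):
    # Assumed convention: avoid division by zero by returning 1.0.
    return a / b if abs(b) > 1e-12 else 1.0


# Function set: each symbol maps to a two-argument operator.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "%": protected_div,
}


def evaluate(tree, variables):
    """Evaluate a parse tree given as nested (op, left, right) tuples."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, variables), evaluate(right, variables))
    if isinstance(tree, str):
        return variables[tree]  # terminal: input variable
    return float(tree)          # terminal: constant


# Tree for (silt * 2.0) + (sand % 10.0)
tree = ("+", ("*", "silt", 2.0), ("%", "sand", 10.0))
value = evaluate(tree, {"silt": 20.0, "sand": 50.0})
print(value)
```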

Once the initial population of random parse trees is generated, GP calculates their fitness using the user-defined 'fitness function', e.g. absolute error, and subsequently selects the better parse trees for reproduction and variation to form a new population. This process of selection, reproduction, and variation iterates until a user-defined 'stopping criterion' is satisfied. The solutions in each iteration are collectively known as a 'generation'. As the population evolves from one generation to another, new solutions replace the older ones and are supposed to perform better. The solutions in a population associated with the best-fit individuals will, on average, be reproduced more often than the less-fit solutions. This is known as the Darwinian principle of 'natural selection'.
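The fitness evaluation and selection steps can be sketched as below. Mean absolute error is used as the fitness function, following the example in the text. Tournament selection is an assumption chosen for illustration, since the selection scheme is not specified here. Candidate solutions are represented as plain Python callables, and all names are hypothetical.

```python
import random


def mean_absolute_error(model, samples):
    """Fitness: average absolute difference between prediction and target."""
    return sum(abs(model(x) - y) for x, y in samples) / len(samples)


def tournament_select(population, errors, k, rng):
    """Pick k random candidates and return the one with the lowest error."""
    contenders = rng.sample(range(len(population)), k)
    best = min(contenders, key=lambda i: errors[i])
    return population[best]


# Hypothetical data generated by y = 2x, and three candidate models.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
population = [lambda x: x, lambda x: 2 * x, lambda x: x + 1]
errors = [mean_absolute_error(model, samples) for model in population]

rng = random.Random(0)
parent = tournament_select(population, errors, k=2, rng=rng)
print(errors)
```

Smaller tournaments (low `k`) give weaker candidates more chance to reproduce, whereas `k` equal to the population size always returns the best candidate.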

During each successive generation, a proportion of the existing population is 'selected' to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions are typically more likely to be selected. The next step is to generate a second generation population of solutions from those selected, through the two variation operators--crossover and mutation. Crossover is the random swapping of sub-trees between the selected 'parent' parse trees to generate the new 'children'. The crossover tends to enable the evolutionary process to move towards promising regions of the solution space. In contrast to crossover, in mutation, a single parent parse tree is selected and random changes are made to it. The mutation operator is introduced to prevent premature convergence to local optima. A high crossover rate is usually used so that useful sub-trees from the previous generations are transmitted to the new generation. In contrast, the mutation rate is usually kept low since a high mutation rate can cause a big loss of useful sub-trees evolved in previous generations. This process of selection, reproduction, and variation continues until a new population of solutions of appropriate size is generated. From generation to generation, the best solution evolved in previous generations is usually preserved, a process called 'elitism'.
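The two variation operators can be sketched on parse trees stored as nested tuples of the form `(operator, left, right)`. This is a minimal illustration under that assumed representation. The mutation shown replaces a random sub-tree with a random terminal, which is only one of several possible mutation schemes. All names are hypothetical.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen sub-tree between two parents."""
    path_a = rng.choice(list(paths(parent_a)))
    path_b = rng.choice(list(paths(parent_b)))
    child_a = replace_subtree(parent_a, path_a, get_subtree(parent_b, path_b))
    child_b = replace_subtree(parent_b, path_b, get_subtree(parent_a, path_a))
    return child_a, child_b


def mutate(tree, terminals, rng):
    """Replace one randomly chosen sub-tree with a random terminal."""
    path = rng.choice(list(paths(tree)))
    return replace_subtree(tree, path, rng.choice(terminals))


rng = random.Random(1)
parent_a = ("+", ("*", "silt", 2.0), "sand")
parent_b = ("-", "clay", ("%", "sand", 3.0))
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, ["clay", "silt", "sand"], rng)
print(child_a, child_b, mutant)
```

Because crossover only swaps sub-trees, the two children together contain exactly the same nodes as the two parents. Elitism would be added outside these functions by copying the best individual unchanged into the next generation.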

In this work we used a specific method called symbolic regression, which uses GP to fit a function to a specific dataset, going from simple functions like those in Fig. 1 to a complex function like the solution proposed (see Eqn 4).

For further reading about GP, see Koza (1992, 1994) and Koza et al. (1999).

Particle-size conversion

In a routine soil survey in Australia, particle size may be measured at clay, silt, and sand fractions (<2, 2-20, 20-2000 µm) or with an extra intermediate fraction of fine sand (20-200 µm). A symbolic regression was attempted (using the program Formulize v0.96b) for both cases, using F = {+, -, *, %} as the function set for the GP routine, generating a model:

$$P_{2\text{-}50} = F(P_{\mathrm{frac}}) + \varepsilon$$

with $P_{\mathrm{frac}}$ as the available particle fractions of the Australian classification system, expressed in percentage, and $\varepsilon$ as the error of prediction. Data were randomly split into two groups (50% for training and 50% for internal validation) and, with the absolute error minimised as the error metric, we obtained an approximate conversion as:

$$\hat{P}_{2\text{-}50}(\%) = 2.26\,P_{2\text{-}20} + \frac{5.55\,P_{2\text{-}20} + 1.513\,(P_{2\text{-}20})^2}{0.9966 - 1.236\,P_{2\text{-}20} - 1.349\,P_{20\text{-}2000}} \tag{3}$$

for survey data without the 20-200 µm fraction, with an R² of 0.82 and a root mean squared error (RMSE), which measures the average error of the prediction, of 8.54% (internal validation). A surface of its predictions as a function of clay (<2 µm) and sand (20-2000 µm) is shown in Fig. 2a. For survey data with measured 20-200 µm (fine-sand) fraction, a different solution was generated:

$$\hat{P}_{2\text{-}50}(\%) = 1.561 + 0.9664\,P_{2\text{-}20} + 0.0003932\,P_{<2}\,P_{2\text{-}20}\,P_{2\text{-}200} + 0.0003634\,P_{2\text{-}20}\,(P_{2\text{-}200})^2 \tag{4}$$

with an R² of 0.91 and an RMSE of 5.91% (internal validation). A surface of its predictions as a function of clay (<2 µm) and sand (20-2000 µm) is shown in Fig. 2b.

The surface plot of Eqn 3 (Fig. 2a) shows decreasing predictions of the 2-50 µm fraction as the content of clay (<2 µm) or sand (20-2000 µm) increases, with a slightly higher responsiveness to changes in sand content. The model including the 20-200 µm fraction (Eqn 4; Fig. 2b) shows the same trend, also presenting instability at high silt contents. Note the surface in Fig. 2b is for the average 20-200 µm content as a function of clay and sand.
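For readers who want to apply the conversion without the fine-sand fraction, Eqn 3 can be evaluated directly, as in the minimal sketch below. The function and argument names are hypothetical, and inputs are assumed to be percentages by mass. The denominator is zero only when both fractions are close to zero, a case this sketch does not guard against.

```python
def eqn3_p2_50(p2_20, p20_2000):
    """Estimate the 2-50 µm fraction (%) from the 2-20 and 20-2000 µm fractions (Eqn 3)."""
    numerator = 5.55 * p2_20 + 1.513 * p2_20 ** 2
    denominator = 0.9966 - 1.236 * p2_20 - 1.349 * p20_2000
    return 2.26 * p2_20 + numerator / denominator


# Hypothetical samples with 25% clay: sand rises from 45% to 55%,
# so silt (2-20 µm) falls from 30% to 20%.
low_sand = eqn3_p2_50(30.0, 45.0)
high_sand = eqn3_p2_50(20.0, 55.0)
print(round(low_sand, 2), round(high_sand, 2))
```

With clay held fixed, the estimate for the sandier sample is lower, in line with the trend described for Fig. 2a.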

Table 3 presents the RMSE and R² between predicted and measured values in the external validation sets and a comparison with the previous model (Eqn 2). Comparing with the model of Minasny and McBratney (Eqn 2), this work has a better performance when the 20-200 µm fraction data are available. The model presents some limitations (higher absolute error) at low clay and high sand contents as shown in Fig. 3.
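The two validation statistics can be computed as in the sketch below. RMSE follows its usual definition. For R², the text does not state which definition was used, so the coefficient of determination (1 minus the ratio of residual to total sum of squares) is assumed here. The function names and sample values are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root mean squared error between predictions and measurements."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(predicted, measured):
    """Coefficient of determination (assumed definition of R²)."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical predicted and measured 2-50 µm fractions (%).
predicted = [1.0, 2.0, 3.0]
measured = [1.0, 2.0, 4.0]
print(round(rmse(predicted, measured), 3), round(r_squared(predicted, measured), 3))
```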


The use of a larger dataset in conjunction with genetic programming techniques reduced the RMSE (%) of the model that includes the 20-200 µm fraction (Eqn 4) by 14.96% (from 8.69 to 7.39) in the IGBP-DIS dataset and by 23.62% (from 10.67 to 8.15) in the Australian dataset, compared with the previous model of Minasny and McBratney (2001).
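The reported reductions are relative changes in RMSE and follow from the Table 3 values for Eqn 2 and Eqn 4:

```latex
\frac{8.69 - 7.39}{8.69} \times 100 \approx 14.96\%
\qquad
\frac{10.67 - 8.15}{10.67} \times 100 \approx 23.62\%
```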

Received 24 May 2012, accepted 3 August 2012, published online 18 September 2012


Buchan G (1989) Applicability of the simple lognormal model to particle-size distribution in soils. Soil Science 147, 155-161. doi:10.1097/00010694-198903000-00001

Ines AV, Honda K, Gupta AD, Droogers P, Clemente RS (2006) Combining remote sensing-simulation modeling and genetic algorithm optimization to explore water management options in irrigated agriculture. Agricultural Water Management 83, 221-232. doi:10.1016/j.agwat.2005.12.006

Johari A, Habibagahi G, Ghahramani A (2006) Prediction of soil water characteristic curve using genetic programming. Journal of Geotechnical and Geoenvironmental Engineering 132, 661-665. doi:10.1061/(ASCE)1090-0241(2006)132:5(661)

Koza J (1992) 'Genetic programming: On the programming of computers by means of natural selection.' (The MIT Press: Cambridge, MA)

Koza J (1994) 'Genetic programming II: Automatic discovery of reusable subprograms.' (The MIT Press: Cambridge, MA)

Koza J, Bennett H, Andre D, Keane M (1999) 'Genetic programming III: Darwinian invention and problem solving.' (Morgan Kaufmann Publishers: Burlington, MA)

Makkeasorn A, Chang N, Beaman M, Wyatt C, Slater C (2006) Soil moisture estimation in a semiarid watershed using RADARSAT-1 satellite imagery and genetic programming. Water Resources Research 42, W09401. doi:10.1029/2005WR004033

Marshall T (1947) Mechanical composition of soil in relation to field descriptions of texture. Council for Scientific and Industrial Research, Bulletin No. 224. Melbourne, Australia.

Minasny B, McBratney A (2001) The Australian soil texture boomerang: a comparison of the Australian and USDA/FAO soil particle-size classification systems. Australian Journal of Soil Research 39, 1443-1451. doi:10.1071/SR00065

Minasny B, McBratney A, Bristow K (1999) Comparison of different approaches to the development of pedotransfer functions for water-retention curves. Geoderma 93, 225-253. doi:10.1016/S0016-7061(99)00061-0

Parasuraman K, Elshorbagy A, Carey SK (2007a) Modelling the dynamics of the evapotranspiration process using genetic programming. Hydrological Sciences Journal 52, 563-578. doi:10.1623/hysj.52.3.563

Parasuraman K, Elshorbagy A, Si BC (2007b) Estimating saturated hydraulic conductivity using genetic programming. Soil Science Society of America Journal 71, 1676-1684. doi:10.2136/sssaj2006.0396

Rousseva S (1997) Data transformations between soil texture schemes. European Journal of Soil Science 48, 749-758. doi:10.1046/j.1365-2389.1997.00113.x

Selle B, Muttil N (2011) Testing the structure of a hydrological model using genetic programming. Journal of Hydrology 397, 1-9. doi:10.1016/j.jhydrol.2010.11.009

Sharma DK, Jana R (2009) Fuzzy goal programming based genetic algorithm approach to nutrient management for rice crop planning. International Journal of Production Economics 121, 224-232. doi:10.1016/j.ijpe.2009.05.009

Shirazi M, Boersma L, Hart J (1988) A unifying quantitative analysis of soil texture: improvement of precision and extension of scale. Soil Science Society of America Journal 52, 181-190. doi:10.2136/sssaj1988.03615995005200010032x

Soil Survey Staff (1995) 'Soil characterization and profile description data.' (Soil Survey Laboratory, Natural Resources Conservation Service, USDA: Lincoln, NE)

Tempel P, Batjes N, van Engelen V (1996) 'IGBP-DIS soil data set for pedotransfer function development.' (International Soil Reference and Information Centre: Wageningen, The Netherlands)

Jose Padarian (A,B), Budiman Minasny (A), and Alex McBratney (A)

(A) Faculty of Agriculture and Environment, The University of Sydney, Biomedical Building, 1 Central Avenue, Australian Technology Park, NSW 2015, Australia.

(B) Corresponding author. Email:

Table 1. Datasets used in this work

Dataset              Reference                  No. of samples   Use

USDA/NRCS            Soil Survey Staff (1995)           104864   Calibration
Australian (CSIRO)   --                                    758   Validation
IGBP-DIS             Tempel et al. (1996)                55282   Validation

Table 2. Statistics of datasets by particle fractions
All statistics are percentages on a mass basis

Dataset      Fraction    Mean    s.d.    Min.   Median    Max.

USDA/NRCS       <2       23.14   16.35   0.00   20.60     97.90
               2-20      21.19   12.76   0.00   20.30     93.80
             20-2000     55.64   23.42   0.00   55.70    100.00
CSIRO (A)       <2       31.21   17.34   3.20   27.00     77.70
               2-20      18.01    8.56   0.60   22.00     58.90
             20-2000     50.77   18.87   4.60   53.00     96.20
IGBP-DIS        <2       23.07   16.30   0.00   20.50     95.00
               2-20      20.98   12.60   0.00   20.10     93.80
             20-2000     55.96   22.79   0.30   56.30    100.00

(A) National Soil Database.

Table 3. External validation statistics of prediction quality

Dataset     Model                               R²         RMSE (%)

CSIRO       Minasny and McBratney (Eqn 2)       0.52       10.67
            Without 20-200 µm (Eqn 3)           0.48       11.19
            With 20-200 µm (Eqn 4)              0.72        8.15

IGBP-DIS    Minasny and McBratney (Eqn 2)       0.81        8.69
            Without 20-200 µm (Eqn 3)           0.81        8.66
            With 20-200 µm (Eqn 4)              0.86        7.39
COPYRIGHT 2012 CSIRO Publishing

Article Details
Author:Padarian, Jose; Minasny, Budiman; McBratney, Alex
Publication:Soil Research
Article Type:Report
Geographic Code:8AUST
Date:Sep 1, 2012