Probabilistic models for assessing the impact of salinization and chemical pollutants.
Keywords: Bayesian Reasoning, Dynamic Bayesian Networks, Groundwater Quality Assessment, Classical Time Series Models
Water is essential for irrigated agriculture, for domestic uses including drinking, cooking and sanitation, and as a critical input in industry. Declining surface and groundwater quality is regarded as the most serious and persistent issue, affecting Oman in particular. The Sultanate faces severe challenges as it confronts the rapidly growing and complicated problem of groundwater contamination in and around hazardous waste disposal sites across the nation. In the Salalah area of Oman, groundwater has been an important natural resource and the only available water source other than the seasonal rainfall. The population of Salalah depends entirely on the limited groundwater resources, drawn from open dug wells, to meet all of its water requirements.
Many observable factors contribute to the deterioration of water quality. These factors need to be monitored and their maximum allowable limits need to be determined. Decline in water quality is manifested in a number of ways, for example, elevated nutrient levels, acid from mines, domestic waste and oil spills, wastes from distilleries and factories, saltwater intrusion and temperature. These factors and others will provide the input data for our computer system.
Groundwater quality and pollution are determined and measured by comparing physical, chemical, biological, microbiological, and radiological quantities and parameters to a set of standards and criteria. A criterion is basically a scientific quantity upon which a judgment can be based. In this work, however, we considered only the chemical parameters total dissolved solids (TDS), electrical conductivity (EC) and water pH (Section IV presents more details). This is mainly because these parameters are recommended by the experts and the researchers in the area. In addition, our analysis of data collected from many wells indicated that these chemical parameters are useful indicators of groundwater quality because they account for the majority of the variance in the data scatter.
Various countries have attempted to develop satisfactory procedures for assessing, monitoring and controlling contamination of the groundwater supply in and around hazardous waste disposal sites. These attempts resulted in various environmental regulations that focus attention on the maximum allowable limits of hazardous pollutants in the groundwater supply. However, they pay scant attention to the nature of groundwater data and the development of valid statistical procedures for detecting and monitoring groundwater contamination.
Recent attempts based on Artificial Intelligence (AI) were first applied to the interpretation of biomonitoring data. Other works were based on pattern recognition using artificial neural networks (NNs). A more recent study described a prototype Bayesian belief network for the diagnosis of acidification in Welsh rivers. Hobbs uses Bayesian probabilities to examine the risk of climate change on water resources, but does not extend this to drinking water quality or quantity.
Bayesian methods of statistical inference offer the greatest potential for groundwater monitoring. This is because these methods can be used to recognize the variability arising from three different sources of errors, namely, analytical test errors, sampling errors and time errors, in addition to the variability in the true concentration. The Bayesian methods can also be used to significantly increase the precision and the accuracy of the test methods used in a given environmental laboratory [1, 23]. The mobility of salt and other pollutants in steady state and transient environmental conditions can be predicted by applying Bayesian models to a range of spatial and temporal scales under varying environmental conditions. Bayesian networks use statistical techniques that tolerate subjectivity and small data sets. Furthermore, these methods are simple to apply and have sufficient flexibility to allow reaction to scientific complexity free from impediment from purely technical limitations.
The process of Bayesian analysis begins by postulating a model in light of all available knowledge taken from relevant phenomenon. The previous knowledge as represented by the prior distribution of the model parameters is then combined with the new data through Bayes' theorem to yield the current knowledge (represented by the posterior distribution of model parameters). This process of updating information about the unknown model parameters is then repeated in a sequential manner as more and more new information becomes available.
This work addresses the assessment of groundwater quality in the Sultanate of Oman, especially in the Salalah plain. Its primary aim is to develop a groundwater quality model and computer system prototype to assess and predict the impact of pollutants on the water column.
II. Problem Description
Oman has very substantial groundwater resources on which the country's agriculture depends. The oil boom, the resultant population boom (possibly fivefold since the 1960s) and the new investment have led to a large expansion in irrigated areas. The demand for domestic water supply has also increased as living standards have risen. Oman has to tackle simultaneously, within a compressed timescale, the need to evaluate its groundwater resources and manage them effectively.
The main populated areas are located in the north, along the flanks of the mountains, and in the south, around Oman's second city, Salalah.
The Salalah plain covers an area of 253 km² extending north from the Omani coastline of the Arabian Sea to the Mountains of Dhofar. It is the only region in Oman to benefit from a substantial amount of rainfall from the southern monsoon (Khareef).
The average annual rainfall is about 110 mm but can range from 70 to 360 mm. July-August is normally the "wet" period. Groundwater derived from aquifers in the central part of the plain is of good quality. Some of the spring water is utilized by Falajs (they consist of tunnels dug horizontally to tap and transport underground water to agricultural fields that are often tens of kilometers away) to provide irrigation water to a part of the plain. Recharge is by underflow from mountains and from the springs. Modern irrigation techniques are in operation in large commercial farms mainly for the production of forage crops such as alfalfa and Rhodes grass.
Recent economic development in the country, together with rapid expansion of the population, has not only increased the demand for water but also created many threats to water resources and quality. A number of groundwater pollution incidents have been reported. The extensive utilization of groundwater resources without taking into consideration the safe yield of aquifers is considered the main cause of pollution. Point and non-point source contamination from agricultural, industrial and domestic uses are other sources of groundwater contamination. Seawater intrusion is another problem of concern, since many farms are situated along the coastline.
The Ministry of Water Resources (MWR) in the Sultanate of Oman has been monitoring the groundwater quality since 1994. The regional monitoring networks were completed in 1995. More than 50,000 monitoring wells have been inventoried; in the course of this work, water samples have been collected and analyzed, providing baseline data for environmental monitoring, which is consolidated in the national water quality database.
The MWR has attempted to predict the groundwater quality by using traditional linear regression and non-metric multidimensional scaling models to interpret groundwater data. So far these models have proven unsatisfactory, mainly because they ignore the probabilistic temporal dependencies between water quality constituents, prompting the development of new models based on Bayesian techniques, which are the focus of this work.
This work therefore presents the development and application of Bayesian techniques to forecast groundwater pollution levels in the Salalah plain, in particular in the Taqah area, which is the eastern part of the Salalah plain (see Figures 1 and 2).
[FIGURES 1-2 OMITTED]
III. Data Collection
The Ministry of Water Resources (MWR) maintains data on the concentration of harmful substances in the groundwater at the Taqah monitoring sites, which are located in the south of the Sultanate of Oman, in the Salalah plain (MWR, 2004). We observed that good quality data were obtained from several monitoring wells in this region. Because of the lack of monitoring wells in certain areas in that region, we filled in the missing measurements with data obtained from Oman Mining Company (OMCO) and Ministry of Environmental and Regional Municipalities (MRME).
The MWR identified the datasets collected from these monitoring wells as important in assessing the groundwater quality and in predicting the effect of certain pollutants on drinking water. The period covered in these locations is from 1984 to 2004 [7]. Each site has several monitoring wells; water samples were collected periodically from these wells and the concentration of the pollutants in these samples was recorded. We also collected data for the period 1984-1994 from OMCO and MRME. However, the datasets are not complete. We therefore filled the gaps with data collected by some researchers at the Sultan Qaboos University.
A. Data Pre-processing Using Bayesian Reasoning
Data for water quality assessment are normally collected from various monitoring wells and then analyzed in environmental laboratories in order to measure the concentration of a number of water quality constituents. We found that the methods used by these laboratories do not emphasize accuracy, and that there is a lack of awareness among both laboratory and validation personnel regarding the possibility of false positives in environmental data. To overcome this problem and obtain representative data, we used a modified version of the Bayesian model developed by Banerjee, Plantinga and Ramirez to preprocess the datasets used for the development of the Bayesian Networks.
1) Bayesian Models
The formulation of the model is as follows:
Let S denote a particular hazardous constituent of interest. Since the concentration of the substance may vary from one well to another, it is necessary to consider each well separately. Let $x_t = (x_{t1}, x_{t2}, x_{t3}, \ldots, x_{tm})$ be the vector of $m$ measurements of the concentration of S in $m$ distinct water samples from a given well at a given sampling occasion, where $m \geq 1$ and $t = 1, 2, \ldots$. Each measurement consists of the true concentration of S plus an error.
Let $X_t$ be the true concentration of S in the groundwater at sampling occasion $t$. If we assume that the true concentration $X_t$ is unknown and is a random variable, the model evaluates the posterior distribution of $X_t$ given the sample measurements $x_t$ at sampling occasion $t$.
Under the normality assumption, and given the true concentration $X_t$ and the variance $\delta^2$, the concentration measurements in $x_t$ represent a random sample of size $m$ from a normal distribution with mean $X_t$ and variance $\delta^2$.
Since the concentration of the substance S in water samples obtained at different sampling occasions might vary considerably, we assume that the parameters $X_t$ and $\delta^2$ of the normal distribution are random variables with a certain prior probability distribution. The model for the prior distribution of $X_t$ and $\delta^2$ can therefore be presented as follows:
For $t = 1, 2, \ldots$ and given $\delta^2$, the conditional distribution of $X_t$ at sampling occasion $t$ is a normal distribution with mean $\mu_{t-1}$ and variance $\delta_{t-1}^2 \delta^2$. The marginal distribution of $\delta^2$ is an inverted gamma distribution with parameters $v_{t-1}$ and $\beta_{t-1}$. This model uses the following prior distribution, which represents the knowledge about the concentration before the first sampling.
The pdf of the prior distribution of $X_0$ is:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.1)
which is the pdf of the Student's t-distribution with $2v_0$ degrees of freedom, location parameter $\mu_0$ and variance $\delta_0^2 / v_0$.
Now suppose that observations on the concentration of S are available. Given the sample $x_t$, the posterior marginal distribution of $X_t$ is a Student's t-distribution with $2v_t$ degrees of freedom, location parameter $\mu_t$ and variance $\delta_t^2 \beta_t / v_t$, where the pdf has the form:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.2)
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.3)
The equation for $\mu_t$ makes the sequential nature of this posterior distribution clear. That is, at each sampling occasion $t$, when new information about the concentration of S in the groundwater is received, the posterior distribution is revised, forming a recursive process. This process of updating the posterior distribution may be continued indefinitely as new data $x_t$ become available.
To present the true unknown concentration of the substance S in the well under consideration, it is frequently more convenient to give a range (or interval) that contains most of the posterior probability. Such intervals are called highest posterior density (HPD) intervals. Thus, for a given probability content of $(1-\alpha)$, $0 < \alpha < 1$, a $100(1-\alpha)$ percent HPD interval for $X_t$ is given by:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2.4)
where $t_{2v_t}(\alpha/2)$ is the $100(1-\alpha/2)$ percentile of the Student's t-distribution with $2v_t$ degrees of freedom.
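For orientation, the textbook normal-inverse-gamma conjugate update can be written in the symbols of this section. This is only a sketch: it assumes that the inverted gamma distribution is parameterized by shape $v$ and scale $\beta$, that the "variance" quoted for the t-distribution is its squared scale, and that this standard form matches the one used by Banerjee, Plantinga and Ramirez, which is not confirmed here. The symbol $\bar{x}_t$ (the mean of the $m$ measurements at occasion $t$) is introduced for illustration.

```latex
\begin{aligned}
\frac{1}{\delta_t^2} &= \frac{1}{\delta_{t-1}^2} + m, \\
\mu_t &= \delta_t^2\left(\frac{\mu_{t-1}}{\delta_{t-1}^2} + m\,\bar{x}_t\right), \\
v_t &= v_{t-1} + \frac{m}{2}, \\
\beta_t &= \beta_{t-1} + \frac{1}{2}\sum_{j=1}^{m}\left(x_{tj}-\bar{x}_t\right)^2
          + \frac{m\,(\bar{x}_t-\mu_{t-1})^2}{2\,(1+m\,\delta_{t-1}^2)}, \\
\text{HPD limits:}\quad & \mu_t \pm t_{2v_t}(\alpha/2)\,\sqrt{\frac{\delta_t^2\,\beta_t}{v_t}}.
\end{aligned}
```

Under these assumptions $\mu_t$ is a precision-weighted average of the previous location $\mu_{t-1}$ and the new sample mean, which is one way to see the recursion described above.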
2) The Bayesian Algorithm
In brief, the monitoring algorithm, which is based on the Bayesian model, is as follows:
1. Fix a value of $\alpha$ ($0 < \alpha < 1$) based on the desired confidence level. In this case, we chose $\alpha = 0.01$.
2. Since we do not have enough data to work with, we used the same parameters of the prior distribution as in the model of Banerjee, Plantinga and Ramirez. These parameters are:
$\beta_0 = 0.0073$, $v_0 = 2.336$, $\mu_0 = 9.53$, $\delta_0^2 = 3056.34$.
3. At each sampling occasion $t$ ($t = 1, 2, \ldots$), compute the parameters $\beta_t$, $v_t$, $\mu_t$ and $\delta_t^2$ of the posterior distribution of $X_t$ given the set of observations in $x_t$ on the concentration of S available from a given well in a given site using (2.3). Compute LHPD and UHPD (the lower and upper limits of the HPD interval) using these parameter estimates and (2.4).
4. Plot $\mu_t$, LHPD, and UHPD obtained in step 3 against sampling occasion $t$.
5. For the next sampling occasion, update the values of the parameters $\beta_t$, $v_t$, $\mu_t$ and $\delta_t^2$ using (2.3) and the datasets just obtained. Then recompute LHPD and UHPD using the updated parameter values in (2.4) and repeat step 4.
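The steps above can be sketched in Python as follows. This is an illustration only, not the Visual Basic system described in this paper: the `update` function uses the standard normal-inverse-gamma conjugate formulas, which are assumed (not confirmed) to correspond to (2.3); the names `Params`, `update`, `hpd` and `monitor` are hypothetical; the measurements are invented; and the t critical value is supplied by the caller (for example from a statistical table) rather than computed here.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Params:
    beta: float    # inverted-gamma scale (assumed parameterization)
    v: float       # inverted-gamma shape; the t-distribution has 2v degrees of freedom
    mu: float      # location of the t-distribution
    delta2: float  # variance multiplier delta^2


def update(prior: Params, x: list[float]) -> Params:
    """One conjugate update with the m measurements taken at one sampling occasion."""
    m = len(x)
    xbar = sum(x) / m
    ss = sum((xi - xbar) ** 2 for xi in x)
    delta2 = 1.0 / (1.0 / prior.delta2 + m)
    mu = delta2 * (prior.mu / prior.delta2 + m * xbar)
    v = prior.v + m / 2.0
    beta = (prior.beta + 0.5 * ss
            + m * (xbar - prior.mu) ** 2 / (2.0 * (1.0 + m * prior.delta2)))
    return Params(beta=beta, v=v, mu=mu, delta2=delta2)


def hpd(p: Params, t_crit: float) -> tuple[float, float]:
    """Symmetric interval mu +/- t_crit * sqrt(delta^2 * beta / v)."""
    half = t_crit * math.sqrt(p.delta2 * p.beta / p.v)
    return p.mu - half, p.mu + half


def monitor(prior: Params, occasions: list[list[float]],
            t_crit_for: Callable[[float], float]) -> list[tuple[int, float, float, float]]:
    """Steps 3-5: update at each occasion and collect (t, mu_t, LHPD, UHPD) for plotting."""
    rows = []
    p = prior
    for t, x in enumerate(occasions, start=1):
        p = update(p, x)
        lo, hi = hpd(p, t_crit_for(2 * p.v))
        rows.append((t, p.mu, lo, hi))
    return rows


# Prior parameters quoted in step 2.
PRIOR = Params(beta=0.0073, v=2.336, mu=9.53, delta2=3056.34)

# Hypothetical measurements for two sampling occasions (not data from this study).
occasions = [[10.0, 12.0], [11.0, 13.0, 12.0]]

# Crude stand-in: a fixed normal-approximation critical value for alpha = 0.01.
# A real run would look up the t percentile for the given degrees of freedom.
rows = monitor(PRIOR, occasions, t_crit_for=lambda df: 2.576)
for t, mu, lo, hi in rows:
    print(f"t={t}  mu={mu:.3f}  LHPD={lo:.3f}  UHPD={hi:.3f}")
```

Passing the critical value in as a function keeps the sketch free of third-party dependencies; with very few degrees of freedom the fixed value used here understates the interval width.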
We applied this algorithm to the datasets that were collected from Salalah in the Sultanate of Oman. The dataset from an individual well may not appear normal, but each one is assumed to be drawn from a normal distribution. Some of these datasets needed to be scaled down to simplify the process and to produce a smooth graph that can be studied easily. For this purpose, we used the following normalization technique:
$z_i = \frac{x_i - \bar{x}}{\sigma}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, and
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}}$
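A minimal Python sketch of this normalization follows, assuming it is the usual z-score (each value minus the sample mean, divided by the sample standard deviation with the $n-1$ denominator). The function name `standardize` and the sample values are hypothetical.

```python
import math


def standardize(values):
    """Return z-scores using the sample mean and the (n - 1) sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Computational form of the sample variance: (sum x_i^2 - n * mean^2) / (n - 1)
    variance = (sum(x * x for x in values) - n * mean * mean) / (n - 1)
    if variance <= 0:
        raise ValueError("values have no spread")
    sigma = math.sqrt(variance)
    return [(x - mean) / sigma for x in values]


# Hypothetical concentrations, not data from this study.
z = standardize([2.0, 4.0, 6.0])
print(z)
```

For large values with small spread, the computational form of the variance can lose precision; a two-pass calculation would be the safer choice in that case.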
3) Algorithm Implementation
The pre-processing system is implemented on a PC platform using the Visual Basic programming language.
Table 1 presents the concentration data for TDS (Total Dissolved Solids) for Well 001/577 in the Taqah area. In particular, the table shows the true concentration data for TDS produced by our pre-processing system.
Figure 3, which plots Table 1, shows whether the three quantities (expected true concentration, LHPD, and UHPD) are within the maximum and minimum levels allowed for TDS. This figure provides a rudimentary prediction of the groundwater quality. For example, Figure 3 shows that well MW1 is contaminated because the true concentration of TDS for this well is above the allowed level, and hence the water becomes more corrosive.
[FIGURE 3 OMITTED]
IV. Bayesian Networks
After the pre-processing stage, we constructed and used a Bayesian Network (BN) as an initial building network for the construction of two Dynamic Bayesian Networks in order to predict the impact of pollution on groundwater quality.
A. Bayesian Belief Networks (BBNs)
Bayesian Belief Networks, Bayesian Networks (BN) for short, are effective and practical representations of knowledge for reasoning under uncertainty. There are a number of successful applications of these networks in such domains as diagnosis, prediction, planning, learning, vision, and natural language understanding.
Bayesian Networks (see Figure 4) are graphical structures used for representing expert knowledge, drawing conclusions from input data, and explaining the reasoning process to the user. These networks are also called knowledge maps, probabilistic causal networks, and qualitative probabilistic networks. They have become increasingly popular knowledge representations for reasoning under uncertainty. A BN is a directed acyclic graph (DAG) whose structure corresponds to the dependency relations among the set of variables represented in the network (nodes). Each node in a belief network represents a random variable, or uncertain quantity, that can take two or more possible values. The arcs signify the existence of direct influences between the linked variables, and the strengths of these influences are quantified by conditional probabilities. These links can be said to have a causal meaning.
[FIGURE 4 OMITTED]
The graph in Figure 4 represents the following joint probability distribution of the variables V, Y, U, W, X and T.
$P(U, V, Y, W, X, T) = P(T \mid W)\, P(X \mid W)\, P(W \mid V, Y)\, P(U \mid V)\, P(V \mid Y)\, P(Y)$
This result is obtained by applying the chain rule and using the dependency information represented in the network. $P(Y)$ is called the prior probability, and $P(T \mid W)$, $P(X \mid W)$, $P(W \mid V, Y)$, $P(U \mid V)$, and $P(V \mid Y)$ are called the conditional probabilities. Prior probabilities, which are based on initial information, can be obtained from statistical data using relative frequencies, while conditional probabilities can be elicited from experts or calculated using different types of mathematical models.
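The factorization can be illustrated with a short Python sketch. It assumes, for simplicity, that all six variables are binary, and every probability value below is invented for illustration; only the structure of the product follows the equation above.

```python
from itertools import product

# Hypothetical probabilities; each table stores P(child = True | parents).
P_Y_TRUE = 0.3
P_V_GIVEN_Y = {True: 0.8, False: 0.1}
P_U_GIVEN_V = {True: 0.6, False: 0.2}
P_W_GIVEN_VY = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.05}
P_X_GIVEN_W = {True: 0.7, False: 0.1}
P_T_GIVEN_W = {True: 0.75, False: 0.2}


def bern(p_true, value):
    """Probability of a binary outcome given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(u, v, y, w, x, t):
    """P(U,V,Y,W,X,T) = P(T|W) P(X|W) P(W|V,Y) P(U|V) P(V|Y) P(Y)."""
    return (bern(P_T_GIVEN_W[w], t)
            * bern(P_X_GIVEN_W[w], x)
            * bern(P_W_GIVEN_VY[(v, y)], w)
            * bern(P_U_GIVEN_V[v], u)
            * bern(P_V_GIVEN_Y[y], v)
            * bern(P_Y_TRUE, y))


# Summing the joint over all 2^6 assignments should give 1.
total = sum(joint(*assignment) for assignment in product([True, False], repeat=6))
print(round(total, 12))
```

The network needs only 1 + 2 + 2 + 4 + 2 + 2 = 13 numbers here, instead of the 63 needed to specify an unrestricted joint distribution over six binary variables.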
Within a Bayesian network, the basic computation is to calculate the belief of each node (the node's conditional probability) based on the evidence that has been observed. This consists of instantiating the input variables, and propagating their effect through the network to update the probability of the hypothesis variables. An important purpose of BNs is to facilitate calculation of arbitrary conditional probabilities. Various techniques have been developed for evaluating node beliefs and for performing probabilistic inference. The most popular methods are due to Pearl. Similar techniques have been developed for constraint networks in the Dempster-Shafer formalism.
We observed dependencies within the network dependency model in order to establish weak and strong influences among the variables in the model and to find important variables for water quality. This procedure assists in forming some heuristics that will be cost-effective and useful not only for probabilistic inference but also for automatic construction of a belief network from data.
B. Dynamic Bayesian Networks (DBNs)
The problem of assessing and forecasting water quality requires not only modelling the static probabilistic dependencies between its constituents (variables) but also the dynamic behaviour of these constituents. Dynamic Bayesian Networks (DBNs) can easily capture these static and dynamic behaviours. They extend Bayesian Networks from static domains to dynamic domains. A static Bayesian Network can be extended to a Dynamic Belief Network by introducing relevant temporal dependencies between the representations of the static network at different times. In contrast to the time series models that use regression to represent correlations, DBNs represent the temporal causal relationships between variables. Therefore, DBNs can introduce more general dependency models that capture richer and more realistic models of dynamic dependencies as well as the traditional static-belief network dependencies.
A series of BNs, which act as time slices, can be connected to create a Dynamic Bayesian Network (DBN). As new evidence is added to a DBN, new time slices are added. To reduce computational complexity, old time slices are commonly removed and their information summarized into prior probabilities of following slices. This produces a moving window of slices.
The main characteristic of DBNs is as follows:
Let $X_t$ be the state of the system at time $t$, and assume that
1. The process is Markovian, i.e., $P(X_t \mid X_0, X_1, \ldots, X_{t-1}) = P(X_t \mid X_{t-1})$.
2. The process is stationary or time-invariant, i.e., $P(X_t \mid X_{t-1})$ is the same for every $t$.
Therefore, in order to have a Dynamic Bayesian Network (DBN), we need only $P(X_0)$, which is a static Bayesian network (BN), and $P(X_t \mid X_{t-1})$, which is a network fragment in which the variables in $X_{t-1}$ have no parents.
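Under these two assumptions, predicting forward only requires applying the same transition model repeatedly to the initial distribution. The Python sketch below shows this for a single two-state variable; the state names and all probabilities are hypothetical, and the functions `step` and `predict` are illustrative names rather than part of any tool used in this work.

```python
def step(belief, transition):
    """One time slice: P(X_t = s) = sum over p of P(X_{t-1} = p) * P(X_t = s | X_{t-1} = p)."""
    states = list(belief)
    return {s: sum(belief[p] * transition[p][s] for p in states) for s in states}


def predict(prior, transition, steps):
    """Return the list of marginals P(X_0), P(X_1), ..., P(X_steps)."""
    beliefs = [dict(prior)]
    for _ in range(steps):
        beliefs.append(step(beliefs[-1], transition))
    return beliefs


# Hypothetical initial distribution P(X_0) and stationary transition model P(X_t | X_{t-1}).
prior = {"below_limit": 0.6, "above_limit": 0.4}
transition = {
    "below_limit": {"below_limit": 0.9, "above_limit": 0.1},
    "above_limit": {"below_limit": 0.3, "above_limit": 0.7},
}

beliefs = predict(prior, transition, steps=3)
for t, b in enumerate(beliefs):
    print(t, {k: round(p, 4) for k, p in b.items()})
```

Because the transition model is the same for every $t$, a single table serves all time slices, which is what makes the repeated network fragment sufficient.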
DBNs can be effectively and cheaply used for monitoring and predicting complex situations that change over time, such as the assessment of water quality. For example, they have recently been used for predicting the outcome in critically ill patients. They have also been used for monitoring and controlling highway traffic (Forbes et al., 1995), for identifying gene regulatory networks from microarray data (Zou and Conzen, 2005), and for prediction of river and lake water pollution.
Inference is performed as if the network were a normal BN, although the nature of DBNs usually results in larger and more complex networks, requiring more computation to update. Several researchers have recently developed adaptations of standard belief network representation and inference techniques to support temporal reasoning. Dagum and Galper, for example, introduced additive generalisations of belief-network representation and inference techniques. They integrated these techniques with the fundamental methods of Bayesian time series analysis to generate a dynamic network model. The model is applied to predict the progress of a patient in a surgical intensive care unit.
Other techniques developed by Shortliffe and his colleagues have been applied to the problems of diagnosis in internal medicine, diagnosis of gas turbines in power generation and text retrieval from a large body of writing.
The temporal repetition of identical model structures encourages the integration of object-oriented techniques with Bayesian networks. This modeling technique has received increasing interest in the literature over the past decade. It started with methods for reusing elements of network specifications and dividing large networks into smaller pieces. These and other successful object-oriented Bayesian network (OOBN) models and their applications to real-world problems have greatly encouraged us to develop a model and a computer system based on the OOBN representation to assess and predict the water quality. We therefore used the Hugin and dHugin tools for implementing our Bayesian networks [9, 13]. The Hugin system allows the implementation of an OOBN. The system considers a Bayesian Network (BN) as a special case of an OOBN, namely its initial building network. Other networks in the OOBN are nodes that represent instances of the base network. dHugin, implemented on top of Hugin, is based on message passing in junction trees. Inference over the current time window and the time slices preceding it is performed using message passing between junction trees.
C. Bayesian Networks Development
As is mentioned above, this study covers the Taqah area (see Figure 2), which is the main part of the Salalah plain. This area extends from the foothills of the mountains to the arid desert. The desert here is of two types--the semi-desert (Badiyah) and the arid desert (Al Sahra). Some of the rural areas around Taqah experience a touch of the drizzle that descends on Salalah during the rainy season (Khareef).
Among more than twenty wells in the Taqah area, four wells only were selected to be analyzed. Those four wells have had, to the greatest extent, complete data measurements and provide sufficient information for the assessment of the groundwater quality for this selected basin. Another point worth mentioning here is that all other wells in the Taqah area are close to each other. We, therefore, ignored these wells because they add no additional information.
Identifying the domain variables (pollution constituents) and the causal relationships between these variables constitutes the main part of the development process. In our study, we considered only the dependencies between total dissolved solids (TDS), electrical conductivity (EC) and water pH. In the Sultanate of Oman, these are the main factors that researchers in the area have been dealing with and for which they have therefore maintained good data. We used our literature-based network structure as a starting point for discussion with the researchers, to explain the Bayesian network approach and to get their input. In addition, we analyzed the data collected from many wells, and the results revealed that these chemical parameters are useful indicators of groundwater quality because they account for the majority of the variance in the data scatter.
The electrical conductivity (EC) of the water has been used as a measure for the salinity hazard of the groundwater used for irrigation in the Salalah plain. According to international water-quality standards, irrigation water with EC values up to 1 mS/cm is safe for all crops and between 1 and 3 mS/cm is acceptable, but values higher than 3 mS/cm restrict the use of water for many irrigated crops. Changes in conductivity can be caused by changes in water content of the soil and by soil or groundwater contamination.
The total dissolved solids (TDS) limit is 600 mg/L, which is the objective of the current Plan of the MWR. TDS comprises several dissolved solids, but 90% of its concentration is made up of six constituents: sodium (Na), magnesium (Mg), calcium (Ca), chloride (Cl), bicarbonate (HCO3) and sulfate (SO4). We therefore considered only these elements in the calculation of TDS, which is represented as a node without parents in the network structure. This simplification is necessary to make the problem tractable and to keep it consistent with available data without losing information.
Other factors that are considered less significant to groundwater quality in Oman were not recorded and were therefore neglected in this study.
We also used the following relationship between TDS and EC (Wu-Seng, 1993).
TDS = A * EC; where A is a constant with value between 0.75 and 0.77.
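As a small illustration of this relationship, the Python sketch below converts an EC reading to an estimated TDS value and to the range implied by the stated bounds on A. The units (EC in µS/cm, TDS in mg/L), the default A = 0.76 and the function names are assumptions made for the example; the source states only the range of A.

```python
A_MIN, A_MAX = 0.75, 0.77  # range of the constant A given in the text


def tds_from_ec(ec, a=0.76):
    """Estimate TDS as A * EC; `a` must lie in the stated range."""
    if not A_MIN <= a <= A_MAX:
        raise ValueError("A is outside the range 0.75-0.77")
    if ec < 0:
        raise ValueError("EC cannot be negative")
    return a * ec


def tds_range_from_ec(ec):
    """Lower and upper TDS estimates using the extreme values of A."""
    return tds_from_ec(ec, A_MIN), tds_from_ec(ec, A_MAX)


# Hypothetical EC reading.
low, high = tds_range_from_ec(800.0)
print(f"TDS between {low:.0f} and {high:.0f}")
```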
Both TDS and EC can affect water acidity or water pH. Solute chemical constituents are variable in high concentration at lower pH (higher acidity). On the other hand, acidity allows migration of hydrogen ions (H+), which is an indication of conductivity. Therefore, our work concentrated on the following relations:
TDS → EC, EC → pH, TDS → pH.
Table 2 shows the limits for a number of constituents of drinking water. Knowing that the maximum allowable TDS in the drinking water is 600 mg/l, the data sample is divided into two intervals (categories), considering TDS=550 is the central point. Thus, the first category has TDS < 550 and the second category has TDS >= 550. For EC, we also divide the data sample into two categories: data with EC < 670 and data with EC >= 670. Regarding pH, we also divided the data sample into two categories, data with pH < 7.5 and data with pH >= 7.5. The data table and the probability tables produced by this analysis for two wells are as follows:
Table 3 shows the monitoring measurements of the main components of TDS for Well 001/577. Only data for the constituents Mg, SO4, Na, Ca, K, and Cl of TDS were reported. Differences for other parameters were not significant in the Salalah area. Table 4 shows the measurements of EC and pH for the same well along with the measurements of TDS copied from Table 3.
To analyze the relations mentioned above the following probabilities were calculated.
P (TDS < 550) = 0.556 and P (TDS >= 550) = 0.444.
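These prior probabilities are relative frequencies of the two TDS categories. The Python sketch below shows the calculation; the nine TDS values are hypothetical (they are not the measurements in Table 3) and were chosen only so that five of nine fall below 550, which is one sample composition consistent with the reported 0.556 and 0.444.

```python
def category_probabilities(values, threshold):
    """Relative frequency of values below the threshold and at or above it."""
    n = len(values)
    if n == 0:
        raise ValueError("no data")
    below = sum(1 for v in values if v < threshold)
    return {"below": below / n, "at_or_above": (n - below) / n}


# Hypothetical TDS measurements in mg/L.
tds_samples = [480, 510, 530, 545, 520, 560, 600, 590, 575]

p_tds = category_probabilities(tds_samples, threshold=550)
print({k: round(p, 3) for k, p in p_tds.items()})
```

The same function applies to the EC and pH categories by changing the threshold to 670 or 7.5.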
From the relationship between TDS and EC, the conditional probability presented in Table 6 was produced.
Table 5 gives the conditional probability of pH given TDS and EC. In a similar way, we obtained the conditional probability tables for other wells.
Figure 5 shows the data related to Well 001/580 as processed in HUGIN. It shows a HUGIN window that has two parts: the conditional probability table (CPT) P (EC/pH) and the network that represents the relationships between the variables TDS, EC and pH.
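To show how such tables are used once they are available, the following Python sketch performs inference by enumeration on the three-node structure TDS → EC, EC → pH, TDS → pH. Only the prior P(TDS >= 550) = 0.444 is taken from the text; the conditional probabilities are invented placeholders, not the values in Tables 5 and 6, and the sketch does not reproduce how HUGIN performs inference.

```python
from itertools import product

P_TDS_HIGH = 0.444  # P(TDS >= 550), from the text

# Hypothetical CPT entries (placeholders for the values in the paper's tables).
P_EC_HIGH = {True: 0.9, False: 0.1}              # P(EC >= 670 | TDS high?)
P_PH_HIGH = {(True, True): 0.7, (True, False): 0.5,
             (False, True): 0.4, (False, False): 0.2}  # P(pH >= 7.5 | TDS high?, EC high?)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(tds, ec, ph):
    """P(TDS, EC, pH) = P(TDS) P(EC | TDS) P(pH | TDS, EC)."""
    return (bern(P_TDS_HIGH, tds)
            * bern(P_EC_HIGH[tds], ec)
            * bern(P_PH_HIGH[(tds, ec)], ph))


def query(target, evidence):
    """P(target = True | evidence), summing the joint over all consistent assignments."""
    names = ("tds", "ec", "ph")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=3):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


posterior = query("tds", {"ph": True})
print(round(posterior, 4))
```

With these placeholder numbers, observing a high pH raises the belief that TDS is at or above 550; with the real tables the direction and size of the change could differ.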
[FIGURE 5 OMITTED]
After the prior probabilities and the conditional probability tables were provided, a run session (probability update) was carried out in HUGIN; the results for newly presented data at any selected node are shown in Figures 6 and 7.
[FIGURES 6-7 OMITTED]
We processed the dataset for other wells in the same way to build a static Bayesian network (BN) representing each well. We tested these BNs with different values of TDS, EC and pH taken from the collected data for the four wells that were selected for the development of this predictive model.
Once the static BN model (static model) for each monitoring well was built, parameterized and tested, we used these models as initial building networks in the construction of two OOBNs for groundwater quality prediction. The first OOBN, as shown in Figure 6, models the time slices for each well characterizing the temporal nature of identical model structures, where the initial building network, see Figure 7, describes a generic time-sliced network.
The second OOBN consists of four identical initial BNs (each representing a well), interconnected in order to cover the whole area under study. Figure 8 shows a typical OOBN representing four monitoring wells in the Salalah plain, and Figure 9 shows the initial building network generating this OOBN.
[FIGURE 8 OMITTED]
1) Temporal Networks
As mentioned above, the initial building network of the system is a single time step representing a year. It models the data sampled for each of the four wells selected for this study. This one-time-step network fragment, shown in Figure 9, represents a class in the object-oriented paradigm. Objects (entities with identity, state and behavior) are instances of classes, which correspond to type declarations in traditional programming languages.
[FIGURE 9 OMITTED]
In this context, a class is a description of an object by structure, behavior and attributes. Whenever an object of a class is needed, an instance of that class is created. The initial building network is, therefore, a class containing the following three sets of nodes:
* A set of input nodes: Input nodes act as placeholders for parents of nodes inside instances of the class. They cannot have parents within the class.
* A set of output nodes: They should be connected to the input nodes of the next time slice network; hence they can be parents of nodes outside instances of the class.
* A set of protected nodes: These nodes can only have parents and children inside the class itself.
The input and output nodes are collectively referred to as interface nodes, see  for more details. The final OOBN is constructed by creating instance networks (objects) from the basic building network, spanning a number of time slices. Figure 9 shows a single time slice class for well MW1 and Figure 8 shows an OOBN representing three time-sliced networks.
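The class and instance structure described above can be sketched in a few lines of Python. This is only an illustration, not the Hugin representation used in this work: the names `NetworkClass` and `unroll`, the node name `TDS_prev`, and the temporal link from TDS in one slice to `TDS_prev` in the next are assumptions made for the example. The arcs inside the class follow the conditional tables P(EC / TDS) and P(pH / TDS, EC).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent, child)


@dataclass
class NetworkClass:
    """One time-slice network fragment with interface and protected nodes."""
    name: str
    input_nodes: List[str]
    output_nodes: List[str]
    protected_nodes: List[str]
    edges: List[Edge] = field(default_factory=list)

    def validate(self) -> None:
        inputs = set(self.input_nodes)
        known = inputs | set(self.output_nodes) | set(self.protected_nodes)
        for parent, child in self.edges:
            if parent not in known or child not in known:
                raise ValueError(f"unknown node in edge {parent}->{child}")
            if child in inputs:
                # Input nodes cannot have parents within the class.
                raise ValueError(f"input node {child} has a parent in the class")


def unroll(cls: NetworkClass, n_slices: int, links: Dict[str, str]) -> List[Edge]:
    """Create n_slices instances of cls and connect consecutive slices.

    links maps an output node of slice t to an input node of slice t + 1.
    """
    cls.validate()
    for out, inp in links.items():
        if out not in cls.output_nodes or inp not in cls.input_nodes:
            raise ValueError(f"link {out}->{inp} must go from output to input")
    edges: List[Edge] = []
    for t in range(n_slices):
        this = f"{cls.name}_t{t}"
        # Arcs inside one instance (one time slice).
        edges += [(f"{this}.{p}", f"{this}.{c}") for p, c in cls.edges]
        if t + 1 < n_slices:
            nxt = f"{cls.name}_t{t + 1}"
            # Arcs from the output nodes of this slice to the next slice.
            edges += [(f"{this}.{o}", f"{nxt}.{i}") for o, i in links.items()]
    return edges


# Hypothetical single-slice class for well MW1 (node roles are assumptions).
well = NetworkClass(
    name="MW1",
    input_nodes=["TDS_prev"],
    output_nodes=["TDS"],
    protected_nodes=["EC", "pH"],
    edges=[("TDS_prev", "TDS"), ("TDS", "EC"), ("TDS", "pH"), ("EC", "pH")],
)
dbn_edges = unroll(well, n_slices=3, links={"TDS": "TDS_prev"})
```

With three slices, `dbn_edges` holds the four arcs of each instance plus the two arcs that join consecutive slices. The same class could be instantiated once per well instead of once per year to obtain the basin-wide network.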
2) Basin Monitoring
We model each of the four selected wells in the Taqah area with an initial building network (generic model). These models are, however, identical both qualitatively and quantitatively (structure and conditional probability tables), so they can be modeled by a single class containing input, output and protected nodes. Each object is, therefore, an instance of this class representing one well. The instances are interconnected in order to cover the whole basin. Figure 8 shows an OOBN representing three monitoring wells in Taqah area.
3) Improving the CPT derivation process
One of the problems with developing Bayesian belief networks (BBNs) for groundwater quality is the inherent difficulty of establishing conditional probability tables (CPTs). Methods are therefore needed that make it easier to select and justify values for CPTs. Studies have shown that sensitivity analysis can identify the relative importance of individual parameters for overall BBN performance. Concentrating on gathering accurate data for the most important CPTs reduces the time and effort required. Sensitivity analysis has also been used for validation of BBNs.
In addition to using the pre-processing technique for correcting the laboratory data, we used the SamIam tool, see section V, to analyze the dependencies between variables in our Bayesian networks. SamIam suggests single and multiple parameter changes that satisfy an expert's query constraint. The tool helps the user to make the smallest possible change to a parameter value that satisfies the constraint.
V. Application Results
We found that the developed Bayesian networks provide a useful approach to the currently available datasets maintained by the MWR. The results of validation gave experts a realistic assessment of the chances of achieving desired outcomes. The application was carried out with special emphasis on the advantages of Bayesian over traditional techniques. It involved monitoring the groundwater quality parameters and validating crucial assumptions.
We tested the resulting network models in two phases. In the first phase, we examined data resulting from our preprocessing model, which was organized as yearly measurements covering data from the whole basin. The first task was to identify dependencies between the variables of groundwater quality in order to detect useful information on the process dynamics. The resulting network agreed very closely with the intuition of the experts. In the second phase, the aim was to test the constructed OOBNs for predicting future values of the variables. The resulting networks were investigated using the Hugin Bayesian network inference tool to analyze measurement data from three successive time slices. Using Bayesian networks to predict future values requires the discovery of a dependency model that relates variables from successive time slices, or that embeds temporal features into the model in some other way. In Bayesian reasoning, the marginal probability distribution of any node may be updated upon acquiring evidence for other nodes.
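The updating step mentioned at the end of the paragraph above can be illustrated on the two-node fragment TDS -> EC. The conditional probabilities are taken from Table 5. The prior P(TDS >= 550) = 0.85 is an assumption for this sketch: it is the share of years in Table 4 with TDS at or above 550 mg/L (17 of 20), not a value reported for the networks in this work.

```python
# CPT from Table 5: P(EC state | TDS state).
p_ec_given_tds = {
    "TDS<550": {"EC<670": 0.75, "EC>=670": 0.25},
    "TDS>=550": {"EC<670": 0.20, "EC>=670": 0.80},
}

# Assumed prior over TDS (frequency of TDS >= 550 mg/L in Table 4).
prior_tds = {"TDS<550": 0.15, "TDS>=550": 0.85}


def marginal_ec(ec_state):
    """P(EC = ec_state), obtained by summing over the states of TDS."""
    return sum(prior_tds[t] * p_ec_given_tds[t][ec_state] for t in prior_tds)


def posterior_tds(ec_state):
    """P(TDS | EC = ec_state) by Bayes' rule."""
    evidence = marginal_ec(ec_state)
    return {
        t: prior_tds[t] * p_ec_given_tds[t][ec_state] / evidence
        for t in prior_tds
    }


p_high_ec = marginal_ec("EC>=670")
updated = posterior_tds("EC>=670")
```

Under these assumptions, observing EC >= 670 raises the belief in TDS >= 550 from 0.85 to about 0.95. An inference tool performs the same kind of update across all nodes of the network.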
Since our Bayesian networks are tractable models, we also implemented exact inference for the network described in Figure 8 and compared the results with those produced by the OOBN. Figures 10 and 11 show the KL divergence between the true and the approximate distribution. The KL divergence converges to zero, which indicates the accuracy and reliability of the OOBN.
[FIGURES 10-11 OMITTED]
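For reference, the KL divergence between a true discrete distribution P and an approximation Q is the sum over states of P(x) log(P(x) / Q(x)). A minimal sketch follows; the function name and the two example distributions are hypothetical and are not the distributions plotted in Figures 10 and 11.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions given as lists.

    States with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of states")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                raise ValueError("q must be positive where p is positive")
            total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical two-state distributions used only to exercise the function.
true_dist = [0.5, 0.5]
approx_dist = [0.9, 0.1]
divergence = kl_divergence(true_dist, approx_dist)
```

The divergence is zero only when the two distributions coincide, which is why a curve approaching zero is read as agreement between the two inference results.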
VI. Using Classical Time Series for the Assessment of Groundwater Quality
The purpose of this section is to apply classical time series analysis to groundwater quality data and to compare the results with those obtained by applying Dynamic Bayesian Networks (DBN). The continuous and regular monitoring data for electrical conductivity (EC), total dissolved solids (TDS) and pH measured by the Ministry of Water Resources (MWR) were also used here for the time series analysis.
Time series analyses of water supply wells with respect to the concentration of chemical constituents are presented in Figures 12-17.
[FIGURES 12-17 OMITTED]
Total dissolved solids (TDS) are a measure of the dissolved minerals in water and also a measure of drinking water quality. There is a secondary drinking water standard of 500 milligrams per liter (mg/L) TDS; water exceeding this level tastes salty. Groundwater with TDS levels greater than 1500 mg/L is considered too saline to be a good source of drinking water. Figure 12 shows the concentration of TDS for the well Well001/577 over a period of twenty-one years.
The fluctuation of the concentrations of chloride (Cl), sodium (Na), and calcium (Ca) with respect to time is shown in Figure 14. The values were averaged during the initial analysis as there were no significant differences among the monthly data. Chloride values above 250 mg/L give water a slightly salty taste, which is objectionable to many people.
Relationships between TDS, EC and pH are examined using multiple regression analysis, see Figure 15. Multiple regression analysis is used to explain as much of the variation observed in the response variable as possible, while minimizing unexplained variation from "noise". The results of this analysis are used to produce the moving average chart, Figure 16, and the linear regression chart, Figure 17. We used Excel Business Tools, Microsoft Excel, and Matlab for producing these and other charts.
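As an illustration of the moving average mentioned above, the sketch below smooths the yearly TDS values listed in Table 4 for well 001/577. The window length of three years is an assumption chosen for the example; the window used for Figure 16 is not stated.

```python
def moving_average(values, window):
    """Simple (unweighted) moving average over consecutive windows."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Yearly TDS (mg/L) for well 001/577, 1984 to 2003, from Table 4.
tds = [542.7, 525.5, 565.4, 604.2, 541.8, 565.9, 558.6, 640.4, 754.5, 798.7,
       746.4, 615.8, 737.5, 753.6, 935.6, 1174, 1021, 1067, 1223, 1055]

# Assumed window of 3 years.
smoothed = moving_average(tds, 3)
```

A longer window gives a smoother curve but reacts more slowly to changes such as the rise in TDS after 1997.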
As shown in Figure 15, the trend is as follows: TrendWQ = 19.01*TDS - 5.42*EC - 270.16*pH + 205.14
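The fitted trend can be evaluated directly for any combination of the three parameters. The sketch below only restates the regression equation as a function. The function name is hypothetical, and the input row is the 1985 entry of Table 4; because the scaling of the inputs used for the regression is not stated, the resulting magnitude should be read as illustrative only.

```python
def trend_wq(tds, ec, ph):
    """Evaluate TrendWQ = 19.01*TDS - 5.42*EC - 270.16*pH + 205.14."""
    return 19.01 * tds - 5.42 * ec - 270.16 * ph + 205.14


# 1985 values for well 001/577 from Table 4 (units as listed there).
value_1985 = trend_wq(tds=525.5, ec=548, ph=7.8)
```

The signs of the coefficients show that, in this fit, the trend value rises with TDS and falls with EC and pH.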
Figure 17 shows the groundwater quality trend over time (linear regression). The trend has the following properties:
Linear model Poly1: f(x) = p1*x + p2
Coefficients (with 95% confidence bounds):
p1 = 0.8954 (0.7962, 0.9947)
p2 = 1.332 (0.08589, 2.579)
Goodness of fit:
SSE: 32.91
R-square: 0.9494
Adjusted R-square: 0.9467
RMSE: 1.316
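The quantities listed above can be computed for any series with an ordinary least-squares line. The sketch below is a plain-Python version under the usual definitions, with RMSE = sqrt(SSE / (n - 2)) for a two-parameter line; it is not the Matlab routine used to produce Figure 17, and the four-point series is made up to exercise the function. The reported SSE and RMSE appear consistent with n = 21 yearly points, since sqrt(32.91 / 19) is about 1.316.

```python
import math


def fit_line(x, y):
    """Least-squares fit of f(x) = p1*x + p2 with goodness-of-fit measures."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three (x, y) pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    p1 = sxy / sxx
    p2 = mean_y - p1 * mean_x
    # Sum of squared errors and total sum of squares.
    sse = sum((yi - (p1 * xi + p2)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - sse / sst
    return {
        "p1": p1,
        "p2": p2,
        "sse": sse,
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - 2),
        "rmse": math.sqrt(sse / (n - 2)),
    }


# Made-up data, used only to show the call.
fit = fit_line([0, 1, 2, 3], [1, 3, 5, 8])

# Consistency check on the reported values, assuming 21 data points.
reported_rmse = math.sqrt(32.91 / 19)
```

The 95% confidence bounds on p1 and p2 additionally require a Student t quantile, which is left out of this sketch.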
The classical time series models are used here to assess the presence and strength of temporal patterns in groundwater quality. These models are based on the assumption of stationarity (i.e. time invariance). They have been widely used in many domains such as financial data analysis and weather forecasting. Yet these models do not readily adapt to domains with dynamically changing model characteristics, as is the case with groundwater quality assessment. In addition to the above-mentioned assumption, the classical models are restricted in their ability to represent the general probabilistic dependency among the domain variables, and they fail to incorporate prior knowledge.
The observed groundwater quality data are irregularly spaced rather than sampled at predetermined intervals, as is assumed for ordinary time series. This may make the traditional time series techniques ineffective for prediction, that is, for estimating the value one period ahead. The time series results leave doubt about the positive or negative long-run effect of any chemical constituent on groundwater quality, and are thus not as clear and reliable as those obtained with Dynamic Bayesian techniques. While some groundwater quality constituents, such as chloride and TDS, show an increasing trend, other constituents, such as pH, Mg, and SO4, do not demonstrate obvious trends. Therefore, we cannot draw a reliable conclusion about the cause of the increasing trend in groundwater quality, and we cannot investigate the effect of increases or decreases in other constituents, such as pH and EC. In addition to ignoring these cause-effect relationships, classical time series models assume linearity in the relationships among variables and normality of their probability distributions.
VII. Conclusion and Further Work
This work presents an assessment of groundwater quality. Bayesian methods have been investigated and shown to offer considerable potential for use in groundwater quality prediction. These methods are based on reasoning under conditions of uncertainty, and they represent effectively the relationships between the constituents of groundwater quality. The simple Bayesian networks presented here are a first step towards a comprehensive network that contains the other variables that researchers consider significant for the assessment of groundwater quality, in the Salalah plain in particular. These variables include:
* NO3: Nitrate is an increasingly important indicator of water pollution from animal waste, human waste, fertilizers and solid waste. Nitrate and ammonium are indicators of pollution because both are soluble in water and could penetrate to deeper zones underground.
* Microbiological indicator organisms such as E. coli and fecal coliform bacteria. For the most part, these organisms are not harmful themselves, but they indicate the presence of fecal material, which may contain disease-causing (pathogenic) organisms.
* COD: chemical oxygen demand, a parameter, which reflects the organic and inorganic content of the water, mainly needed for pollution assessment.
Data were collected from many monitoring wells in the Salalah plain, which is located in the south of the Sultanate of Oman. We spent significant time and effort to gather sufficient relevant data for this study. We plan to continue this work by adding these variables to the resulting models in order to improve the models' predictive accuracy.
We also demonstrated the general benefit of using an OOBN, which describes identical structures that can be interconnected to represent a network of successive time slices, i.e. a Dynamic Bayesian Network (DBN).
For reference, a comparison between DBNs and the classical time series models was also conducted. It shows that the DBN outperforms the classical models.
References
 Banerjee A. K. et al. 1985. TR no. 773, Monitoring groundwater quality, Department of Statistics, University of Wisconsin.
Borsuk, M., Stow, C. and Reckhow, K. 2004. A Bayesian network of eutrophication models for synthesis, prediction, and uncertainty analysis, Ecological Modelling, 173, 219-239.
 Brandherm, B. and Jameson, A. 2004. An extension of the differential approach for Bayesian network inference to dynamic Bayesian networks, International Journal of Intelligent Systems, 19(8):727-748.
 Chan, H. and Darwiche, A., 2002. When do Numbers Really Matter? Journal of Artificial Intelligence Research, 17, 265-287.
 Dagum, P. and Galper, A. 1993. Additive Belief-Network Models, UAI, 91-98.
 Dames and Moore. 1992. Investigation of The Quality of Groundwater Abstracted from the Salalah Plain: Dhofar Municipality, Final Report.
 Entec Europe Limited. 1998. Consultancy Services for The Study of Development Activities on Groundwater Quality of Wadi Adai, Al Khawd and Salalah Well field Protection Zones, Contract No 96-2133, Final Report, Volume 4, Hydrogeology and Modeling, Salalah, Ministry of Water Resources.
 Hobbs B. F. 1997. Bayesian methods for analyzing climate change and water resource uncertainties, Journal of Environmental Management, 49, 53-72.
 HUGIN Expert Brochure. 2005. HUGIN Expert A/S, P. O.Box 8201 DK-9220, Aalborg, Denmark, (http://www.hugin.com).
Jensen, F. V. 2001. Bayesian Networks and Decision Graphs, Springer.
Korb, K. B. and Nicholson, A. E. 2004. Bayesian Artificial Intelligence, Chapman & Hall/CRC.
 Kim, S., Imoto, S., and Miyano, S. 2004. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data, Biosystems, 75, 57-65.
 Kjaerulff, U. 1995. dHugin: A computational system for dynamic time-sliced Bayesian Networks, International Journal of Forecasting, 11, 89-111.
 Milligan, H. and Gharbi, A. 1995. Groundwater Management on the Salalah Plain, in Oman Ministry of Water Resources, International Conference on Water Resources Management in Arid Countries, 2, 530-538.
 Ministry of Water Resources (MWR), Sultanate of Oman. 2004. Law on the Protection of Water Resources, promulgated by Decree of the Sultan No. 29 of 2004, and its implementing regulations (Regulations for the organization of wells and aflaj, and Regulations for the use of water desalination units on wells), (in Arabic).
Nefian, A. V., Liang, L., Pi, X., Liu, X. and Murphy, K. 2002. Dynamic Bayesian Networks for Audio-Visual Speech Recognition, EURASIP Journal on Applied Signal Processing, 11, 1-15.
 Nicholson A. E. and Brady J. M. 1994. Dynamic Belief Networks for Discrete Monitoring. IEEE Transactions on Systems, Man, and Cybernetics, 24(11), 1593-1610.
 Russell, S., and Norvig, P. 2003. Artificial Intelligence: A Modern Approach, 2nd Edition, Prentice Hall, Inc.
 Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann.
 Shihab, K. and Al-Chalabi, N. 2004. Treatments of Water Quality Using Bayesian Reasoning, Lecture Notes in Computer Science, 3029, 728-738.
 Shortliffe, E. H. et al (eds). 1990. Medical Informatics: Computer Applications in Health Care, Reading, MA, Addison-Wesley.
 Stow, C. A., Borsuk, M. E., and Reckhow, K. H. 2003. Comparison of estuarine water quality models for TMDL development in Neuse River Estuary, J. Water Res. Plan. Manag. 129, 307-314.
 Varis, O. 1995. Belief networks for modeling and assessment of environmental change, Environmetrics, 6, 439-444.
 Wu-Seng, L. 1993. Water Quality Modeling, CRC Press, Inc.
 Zou, M. and Conzen, S. D. 2005. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data, Bioinformatics, 21(1), 71-79.
 Excel Business Tools, http://www.excelbusinesstools.com/index.htm, the site was accessed on 19/10/06.
Khalil Shihab (1) and Maki Rashid (2)
(1) Department of Computer Science, SQU, Box 36, Al-Khod, 123, Oman
(2) Department of Mechanical & Industrial Engineering, SQU, Oman
Table 1. Concentration Data of TDS for Well001/577 in the Salalah plain, where OC stands for Observed Concentration and ETC stands for Expected True Concentration.

Date   OC     LHPD   ETC    UHPD
84     1.15   0.85   1.15   1.45
85     1.11   1      1.13   1.26
86     1.94   1.12   1.4    1.68
87     2.24   1.33   1.61   1.88
88     3.86   1.6    2.06   2.52
89     3.83   1.91   2.35   2.79
90     3.96   2.18   2.58   2.98
91     3.76   2.38   2.73   3.08
92     4.3    2.58   2.9    3.23
93     3.96   2.72   3.01   3.3
94     1      2.54   2.83   3.11
95     3.71   2.64   2.9    3.16
96     3.65   2.73   2.96   3.19
97     3.38   2.78   2.99   3.2
98     3.4    2.83   3.02   3.2
99     3.48   2.87   3.04   3.22
00     3.5    2.91   3.07   3.23
01     3.23   2.93   3.08   3.23
02     3.24   2.95   3.09   3.22
03     3.27   2.97   3.1    3.22
04     3.3    2.99   3.11   3.22

Table 2. Drinking Water Standard

Element          Limit for Drinking Water
pH               7.0-8.5
Chloride mg/l    250
TDS              500-1000
Sulphate mg/l    200
Copper mg/l      1.3
Iron mg/l        0..5
Sodium           200-400

Table 3. TDS data for the well Well 001/577.

Year   Mg     SO4    Na     Ca     K      Cl     HCO3
       mg/L   mg/L   mg/L   mg/L   mg/L   mg/L   mg/L
84     12     11     21     91     11     172    224.7
85     10     12     20     88     13     148    234.5
86     9      14     18     92     17     140    275.4
87     14     12     43     86     14     148    287.2
88     12     12     20     90     20     132    255.8
89     10     12     17     86     16     148    276.9
90     32     13     15     92     18     164    224.6
91     19     11     45     89     21     168    287.4
92     12     14     152    92     17     176    291.5
93     21     12     165    93     19     192    296.7
94     27     14     88     96     23     204    294.4
95     7      11     65     60     22     140    310.8
96     16     25     64     52     15     244    321.5
97     13     19     83     102    18     204    314.6
98     19     26     97     107    26     248    412.6
99     56     38     217    98     57     220    487.7
00     41     20     201    104    31     236    388.4
01     43     23     204    135    30     244    387.6
02     55     32     210    138    41     308    438.6
03     52     20     147    121    33     272    410
04     48     21     152    130    34     284    405.4

Table 4. TDS, EC, and pH data for the well Well 001/577.

Yr     TDS (mg/L)   EC (µS/cm)   pH
84     542.7        548
85     525.5        548          7.8
86     565.4        579          7.75
87     604.2        588          7.57
88     541.8        601          7.43
89     565.9        625          7.34
90     558.6        638          7.32
91     640.4        798          7.27
92     754.5        739          7.24
93     798.7        758          7.28
94     746.4        799          7.29
95     615.8        514          7.3
96     737.5        619          7.28
97     753.6        869          7.19
98     935.6        558          7.15
99     1174         855          7.15
00     1021         796          7.06
01     1067         855          6.98
02     1223         844          6.94
03     1055         881          6.9

Table 5. P(EC / TDS)

             TDS < 550   TDS >= 550
EC < 670     0.75        0.2
EC >= 670    0.25        0.8

Table 6. P(pH / TDS, EC), Well 577

             TDS < 550                TDS >= 550
             EC < 670    EC >= 670    EC < 670    EC >= 670
pH < 7.5     0.333       0            0           0.667
pH >= 7.5    0.667       1            1           0.333
Author: Shihab, Khalil; Rashid, Maki
Publication: International Journal of Computational Intelligence Research
Date: Jul 1, 2007