# Sample dependence in the maximum entropy solution to the generalized moment problem

1. Introduction and Preliminaries

To state what the generalized moment problem is about, let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(S, \mathcal{B}, m)$ be a measure space, with $m$ a finite or $\sigma$-finite measure. Let $X$ be an $S$-valued random variable whose distribution has a density with respect to the measure $m$. The generalized moment problem consists of determining a density $f(x)$ such that

$\int_S h_k(x) f(x)\, m(dx) = d_k \quad \text{for } k = 0, 1, \ldots, M,$ (1)

where the $h_k : S \to \mathbb{R}$ form a collection of measurable functions and the $d_k$ are given real numbers; we set $h_0 = 1$ and $d_0 = 1$ to take care of the natural normalization requirement on $f$. A typical example is the following. $X$ stands for a positive random variable (a stopping time, or perhaps a total risk severity), and we can compute $E[\exp(-\alpha_k X)] = d_k$ by some Monte Carlo procedure at a finite number of points $\alpha_k$. The problem to be solved then amounts to inverting the Laplace transform from such a finite collection of values of the transform parameter $\alpha$.
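To make the Laplace transform example concrete, here is a small numerical sketch; the choice $X \sim \mathrm{Exp}(1)$ and all the names below are illustrative assumptions, not part of the paper. The moments $d_k = E[\exp(-\alpha_k X)]$ are estimated by Monte Carlo and compared with their known closed form $1/(1+\alpha_k)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): X ~ Exp(1), for which the
# Laplace transform is known exactly: E[exp(-a X)] = 1 / (1 + a).
alphas = np.array([0.5, 1.0, 2.0])
X = rng.exponential(scale=1.0, size=200_000)

# Monte Carlo estimates of the generalized moments d_k = E[exp(-alpha_k X)].
d_hat = np.array([np.exp(-a * X).mean() for a in alphas])
d_exact = 1.0 / (1.0 + alphas)
```

In a realistic application the closed form is unknown, and the estimates $\hat{d}_k$ would be fed to the maxentropic inversion described in Section 3.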

This last problem is of considerable interest in the banking and insurance industries, where the density is needed to compute risk premia and regulatory capital of various types; samples may be small, and the estimation of the $d_k$ reflects that. We direct the reader to Gomez-Goncalves et al. [1], where this issue was addressed in the context of risk modeling and Laplace transform inversion.

Let us denote by $f^*$ the solution to problem (1) obtained by the maximum entropy method, as explained in Section 3 below, when the moments on the right-hand side are known exactly. As in many situations the moments must be estimated empirically, as detailed in (2) below, it is to be expected that the maxentropic density obtained will depend on the sample $X_1, \ldots, X_N$ used to compute the $\hat{d}_k$. We shall denote this maxentropic density by $\hat{f}^*_N$ to emphasize its dependence on the empirical moments $\hat{d}_k$.

The problem that we address in this note is the convergence of $\hat{f}^*_N$ to $f^*$, as well as the fluctuations of mean values computed with respect to $\hat{f}^*_N$ about the mean values computed with respect to $f^*$.

When we have a sample $X_1, \ldots, X_N$, the empirical generalized moments (the sample averages) are given by

$\hat{d}_k = \frac{1}{N} \sum_{n=1}^{N} h_k(X_n),$ (2)

which fluctuate around the exact moments $d_k$; we thus expect the output of the maxentropic procedure to somehow reflect this variability.
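The $1/\sqrt{N}$ scale of these fluctuations can be seen in a toy simulation; the data model ($X \sim \mathrm{Exp}(1)$ with a single moment function $h(x) = e^{-x}$) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustration (assumed toy model): X ~ Exp(1), h(x) = exp(-x).  The sample
# average (2) fluctuates around d = E[h(X)] = 1/2 with a standard deviation
# that shrinks like 1/sqrt(N).
def empirical_moment(N):
    return np.exp(-rng.exponential(size=N)).mean()

reps = 500
sd_small = np.std([empirical_moment(100) for _ in range(reps)])
sd_large = np.std([empirical_moment(10_000) for _ in range(reps)])
ratio = sd_small / sd_large   # should be near sqrt(10_000 / 100) = 10
```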

To that end, in the next section we recall, in a short historical survey, the notion of the entropy of a density, and in the following section we present the basics of the maximum entropy method.

In Sections 4 and 5 we take up the main theme of this work: the variability of $\hat{f}^*_N$ that comes in through the $\hat{d}_k$. There we prove that $\hat{f}^*_N$ converges pointwise and in $L_1$ to the maxentropic density $f^*$ obtained from the exact data, and we examine how $\hat{f}^*_N$ deviates from $f^*$ in terms of the difference between the true and the estimated (sample) moments. We examine as well the deviations of expected values like $\int g(x) \hat{f}^*_N(x)\, dm(x)$ from $\int g(x) f^*(x)\, dm(x)$. That the density reconstructed from empirical moments has to depend on the sample seems intuitive, but neither the behavior of the maxentropic density as the sample size increases nor the fluctuations of the expected values computed with the densities $\hat{f}^*_N$ seems to have been studied before.

2. The Entropy of a Density

As there seem to be several notions of entropy, it is the aim of this section to point out that they are all variations on the theme of a single definition. Let us begin by spelling out what we call the entropy of a density. Let $P$ be a measure on $(S, \mathcal{B})$. Suppose that $P \ll m$ and let $f$ denote its density. The entropy $H_m(P)$ (when we want to emphasize the density we write $H_m(f)$) is defined by

$H_m(P) = -\int_S f \ln(f)\, dm$ (3)

whenever $\ln(f)$ is $P$-integrable, or $-\infty$ if not. $H_m(P)$ is called the entropy of $P$ (with respect to $m$) and $H_m(f)$ is called the entropy of $f$. Actually, we can also define $H_m(f)$ for $f \in \{g \in L_1(dm) : g \ge 0\}$ as follows. When $P$ is not necessarily a probability measure having a density $f$ with respect to $m$, (3) is to be modified to

$H_m(P)\, (= H_m(f)) = -\int_S f \ln(f)\, dm + \left(\int_S f\, dm - 1\right).$ (4)

When $m$ is a probability measure, denote it by $Q$; if both $P$ and $Q$ are equivalent to a measure $n$, with densities given, respectively, by $f = dP/dn$ and $g = dQ/dn$, then (3) can be written as

$H_n(P, Q) = -\int_S \frac{f}{g} \ln\left(\frac{f}{g}\right) g\, dn = -\int_S \ln\left(\frac{f}{g}\right) f\, dn = -\int_S \ln\left(\frac{f}{g}\right) dP,$ (5)

and we call it the entropy of $f$ with respect to $g$.

Comment. For the applications that we shall be dealing with, $S$ will stand for a closed, convex subset of some $\mathbb{R}^n$, and $m$ will be the usual Lebesgue measure. We also mention that when $m$ is a discrete measure, the integrals become sums.
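Definition (3) can be checked numerically against two cases with known closed forms, $H = 1$ for the Exp(1) density and $H = \ln 2$ for the uniform density on $(0, 2)$; the sketch below, with $m$ the Lebesgue measure, rests on these illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

# A small numeric sketch of definition (3) with m = Lebesgue measure.
def entropy(f, a, b):
    """H_m(f) = -integral of f ln f over [a, b]."""
    return -quad(lambda x: f(x) * np.log(f(x)), a, b)[0]

# Known closed forms: H = 1 for the Exp(1) density, H = ln 2 for Uniform(0, 2).
H_exp = entropy(lambda x: np.exp(-x), 0.0, 50.0)
H_unif = entropy(lambda x: 0.5, 0.0, 2.0)
```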

The expression (3) seems to have made its first appearance in the work of Boltzmann in the last quarter of the nineteenth century. There it was defined on $\mathbb{R}^3 \times \mathbb{R}^3$, where $f(x, v)\,dx\,dv$ was interpreted as the number of particles with position within $dx$ and velocity within $dv$. The function happened to be a Lyapunov functional for the dynamics that Boltzmann proposed for the evolution of a gas: it grew as the gas evolved towards equilibrium. Not much later Gibbs used the same function, but now defined on $\mathbb{R}^{3N} \times \mathbb{R}^{3N}$, whose points $(q, p)$ denote the joint positions and momenta of a system of $N$ particles. This time $dm = dq\,dp$, and $f(q, p)\,dq\,dp$ is the probability of finding the system within the specified "volume" element. Motivated by earlier work in thermodynamics, it was postulated that in equilibrium the density of the system yields a maximum value of the entropy $H_m(f)$. These remarks explain the name of the method.

The expression (3) (with a reversed sign) made its appearance in the field of information transmission under the name of the information content of the density $f$; that is why it is sometimes called the Shannon-Boltzmann entropy. Also, expression (5) appeared in the statistical literature under the name of the (Kullback-Leibler) divergence of the density $f$ with respect to the density $g$; it is denoted by $K(f, g)$ and equals $-H_n(P, Q)$. See Cover and Thomas [2] or Kullback [3] for a detailed study of the properties of the entropy functions.

Having made those historical remarks and stated those equivalent definitions, we mention that we shall be working mostly with (3). In what comes below we make use of some interesting and well-known properties of (3) and (5), which we gather in the following theorem.

Theorem 1. With the notation introduced above, one has the following:

(i) The function $f \mapsto H_m(f)$ is strictly concave.

(ii) For any two densities $f$ and $g$, $H_n(f, g) \le 0$, and $H_n(f, g) = 0$ if and only if $f = g$ a.e. $n$.

(iii) For any two densities $f$, $g$ such that $H_n(f, g)$ is finite, one has (Kullback's inequality)

$\frac{1}{2} \|f - g\|_1^2 \le -H_n(f, g).$ (6)

The reader is directed to either Cover and Thomas [2] or Kullback [3] for proofs.
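Statement (iii) is easy to check numerically; the following sketch does so for two illustrative discrete densities on three points, with $n$ the counting measure.

```python
import numpy as np

# Numeric check of statement (iii) for two discrete densities on three points
# (n = counting measure): (1/2)||f - g||_1^2 <= K(f, g) = -H_n(f, g).
f = np.array([0.5, 0.3, 0.2])
g = np.array([0.25, 0.25, 0.5])

K = float(np.sum(f * np.log(f / g)))           # Kullback divergence K(f, g)
lhs = 0.5 * float(np.sum(np.abs(f - g))) ** 2  # left-hand side of (6)
```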

3. The Standard Maximum Entropy Method

Here we recall some well-known results about the standard maximum entropy (SME) method, along with some historical remarks. Even though the core idea seems to have first appeared in the work of Esscher [4], where he introduced what is nowadays called the Esscher transform, it was not until the mid-1950s that it became part of the methods used in statistics, through the work of Kullback [3]. It seems to have been first formulated as a variational procedure by Jaynes [5] to solve the (inverse) problem consisting of finding a probability density $f(x)$ (on the phase space of a mechanical system) satisfying the following integral constraints:

$\int_S h_k(x) f(x)\, dm(x) = d_k \quad \text{for } k = 0, 1, \ldots, M,$ (7)

where the $d_k$ are observed (measured) expected values of some functions ("observables" in the physicist's terminology) of the random variable $X$. That problem appears in many fields; see Kapur [6] and Jaynes [7], for example.

Usually, we set $h_0 \equiv 1$ and $d_0 = 1$ to take care of the natural normalization requirement on $f(x)$. A standard computation shows that when the problem has a solution, it is of the type

$f^*_M(x) = \exp\left(-\sum_{k=0}^{M} \lambda^*_k h_k(x)\right),$ (8)

in which the number of moments $M$ appears explicitly. It is customary to write $e^{-\lambda^*_0} = 1/Z(\lambda^*)$, where $\lambda^* = (\lambda^*_1, \ldots, \lambda^*_M)$ is an $M$-dimensional vector. Clearly, the generic form of the normalization factor is given by

$Z(\lambda) = \int_S \exp\left(-\sum_{k=1}^{M} \lambda_k h_k(x)\right) dm(x) = \int_S e^{-\langle \lambda, h(x) \rangle}\, dm(x).$ (9)

With this notation the generic form of the solution can be rewritten as

$f^*(x) = \frac{e^{-\langle \lambda^*, h(x) \rangle}}{Z(\lambda^*)}.$ (10)

Here $\langle a, b \rangle$ denotes the standard Euclidean scalar product in $\mathbb{R}^M$, and $h(x)$ is the vector with components $h_k(x)$. At this point we mention that the simple-minded proof appearing in many applied mathematics or physics textbooks is not really correct, because the set of densities is not open in $L_1(dm)$. There are many alternative proofs; see, for example, the work of Csiszár [8] and of Cherny and Maslov [9].

The heuristics behind (10) and what comes next are the following. If in statement (ii) of Theorem 1 we take $g(x) = e^{-\langle \lambda, h(x) \rangle} / Z(\lambda)$, any member of the exponential family, the inequality becomes

$H_m(f) \le \ln Z(\lambda) + \langle \lambda, d \rangle,$ (11)

which suggests that if we find a minimizer $\lambda^*$ for which the inequality becomes an equality, then by Theorem 1 we may conclude that (10) is the desired solution. This dualization argument seems to have been first proposed by Mead and Papanicolaou [10] and is expounded in full rigor by Borwein and Lewis [11]. The vector $\lambda^*$ can be found by minimizing the dual entropy

$\Sigma(\lambda, d) = \ln Z(\lambda) + \langle \lambda, d \rangle,$ (12)

where $d$ is the $M$-vector with components $d_k$; obviously the dependence of $f^*$ on $d$ is through the minimizer $\lambda^*$. We add that, technically speaking, the minimization of $\Sigma(\lambda, d)$ is over the domain of $Z(\lambda)$, which is a convex set with a nonempty interior, and usually the minimum is achieved in its interior. In many applications the domain of $Z(\lambda)$ is all of $\mathbb{R}^M$. For the record, we state the result of the duality argument as follows.

Lemma 2. With the notations introduced above, if the minimizer $\lambda^*$ of (12) lies in the interior of the domain of $Z(\lambda)$, then

$H_m(f^*) = \Sigma(\lambda^*, d) = \ln Z(\lambda^*) + \langle \lambda^*, d \rangle.$ (13)

The proof goes as follows. If $\lambda^*$ is a minimizer of (12), the first-order condition is $-\nabla_\lambda \ln Z(\lambda^*) = d$, which, written explicitly, states that (10) satisfies the constraints (7). Since the entropy of this density is given by the right-hand side of (13), it must be the density that maximizes the entropy.
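The dual minimization (12) can be sketched numerically in the simplest one-constraint case. All choices below ($S = (0, \infty)$, $m$ Lebesgue, $h_1(x) = x$, $d_1 = 1$) are illustrative assumptions; the known answer is $\lambda^* = 1$, which makes (10) the Exp(1) density.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

# Sketch of the dual problem (12) for a single constraint: S = (0, inf),
# m = Lebesgue, h_1(x) = x, d_1 = 1.  Here Z(lambda) = 1/lambda, so
# Sigma(lambda, d) = ln Z(lambda) + lambda * d is minimized at lambda* = 1.
d = np.array([1.0])

def log_Z(lam):
    val, _ = quad(lambda x: np.exp(-lam[0] * x), 0.0, np.inf)
    return np.log(val)

def dual(lam):
    return log_Z(lam) + float(lam @ d)

res = minimize(dual, x0=np.array([0.5]), method="L-BFGS-B",
               bounds=[(1e-3, 10.0)])
lam_star = float(res.x[0])
```

In practice $d$ would be replaced by the empirical moments $\hat{d}_N$ of (2); nothing in the sketch changes.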

4. Mathematical Complement

In this section we gather some results about $Z(\lambda)$ that we shall need below.

Proposition 3. With the notations introduced above, suppose that the matrix $C$, which denotes the covariance matrix of $h$ computed with respect to the density $f^*$, is strictly positive definite. Suppose as well that the set $\{\lambda \in \mathbb{R}^M \mid Z(\lambda) < \infty\}$ is open. Then one has the following:

(1) The function $Z(\lambda)$ defined above is log-convex; that is, $\ln Z(\lambda)$ is convex.

(2) $Z(\lambda)$ is continuously differentiable as many times as needed.

(3) If one sets $\varphi(\lambda) = -\nabla_\lambda \ln Z(\lambda)$, then $\lambda = \varphi^{-1}(d)$ is continuously differentiable in $d$.

(4) The Jacobian $D$ of $\varphi^{-1}$ at $d$ equals $-C^{-1}$, the negative of the inverse of the covariance matrix of $h$ computed with respect to $f^*(x)$.

The first two assertions are proved in Kullback's book. Actually, the log-convexity of $Z(\lambda)$ is a consequence of Hölder's inequality, and the analyticity of $Z(\lambda)$ involves a systematic estimation procedure. The third follows from the inverse function theorem of calculus; see Fleming [12]. The last one follows from the fact that the Jacobian of $\varphi^{-1}$ equals the inverse of the Jacobian of $\varphi$, which is the negative of the Hessian matrix of $\ln Z(\lambda)$, that is, minus the covariance matrix $C$; hence $D = -C^{-1}$. As a simple consequence of item (4) in Proposition 3 we have the following result, which is relevant for the arguments in the next section.
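In the illustrative one-constraint example ($h(x) = x$ on $(0, \infty)$, an assumption carried over from the sketch of the dual problem), items (2)-(4) can be verified numerically: $\ln Z(\lambda) = -\ln \lambda$, its second derivative equals the variance $1/\lambda^2$ of $h$ under $\mathrm{Exp}(\lambda)$, and $D = -C^{-1}$.

```python
import numpy as np
from scipy.integrate import quad

# Numeric check in the one-constraint example h(x) = x on (0, inf):
# ln Z(lambda) = -ln(lambda), its second derivative is 1/lambda^2 = Var(X)
# under Exp(lambda), and the Jacobian of phi^{-1} is D = -1/C.
lam, eps = 2.0, 1e-3

def log_Z(l):
    return np.log(quad(lambda x: np.exp(-l * x), 0.0, np.inf)[0])

hess = (log_Z(lam + eps) - 2.0 * log_Z(lam) + log_Z(lam - eps)) / eps**2
C = 1.0 / lam**2          # variance of h under f* = Exp(lambda)
D = -1.0 / hess           # Jacobian of phi^{-1}: minus the inverse Hessian
```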

Theorem 4. With the notations introduced above, set $\delta d = \hat{d}_N - d$. Then the following assertions hold. The change in $\lambda^*$ as $d \to d + \delta d$ is, up to terms $o(\delta d)$, given by

$\delta \lambda = D\, \delta d,$ (14)

and, more importantly, using (10) and, again, up to terms $o(\delta d)$,

$\hat{f}^*_N(x) - f^*(x) = -f^*(x) \langle h(x) - d, D\, \delta d \rangle.$ (15)

To sketch a proof of (15) we proceed as follows. Let $\delta \lambda = \hat{\lambda}^*_N - \lambda^*$; then

$\hat{f}^*_N(x) - f^*(x) = \frac{e^{-\langle \lambda^* + \delta \lambda, h(x) \rangle}}{Z(\lambda^* + \delta \lambda)} - \frac{e^{-\langle \lambda^*, h(x) \rangle}}{Z(\lambda^*)}.$ (16)

Now, neglecting terms of second order in $\delta \lambda$, we approximate the numerator by

$e^{-\langle \lambda^* + \delta \lambda, h(x) \rangle} = e^{-\langle \lambda^*, h(x) \rangle} \left(1 - \langle \delta \lambda, h(x) \rangle\right)$ (17)

and the denominator by

$Z(\lambda^* + \delta \lambda) = Z(\lambda^*) \left(1 + \langle \nabla_\lambda \ln Z(\lambda^*), \delta \lambda \rangle\right) = Z(\lambda^*) \left(1 - \langle d, \delta \lambda \rangle\right),$ (18)

where we used the fact that at the minimum

$-\nabla_\lambda \ln Z(\lambda^*) = d.$ (19)

Therefore

$\hat{f}^*_N(x) = \frac{e^{-\langle \lambda^*, h(x) \rangle} \left(1 - \langle \delta \lambda, h(x) \rangle\right)}{Z(\lambda^*) \left(1 - \langle d, \delta \lambda \rangle\right)},$ (20)

from which the desired result readily drops out after neglecting terms of second order in $\delta \lambda$.
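Formula (15) can be tested numerically in the illustrative one-constraint example ($h(x) = x$ on $(0, \infty)$, $d = 1$), where everything is available in closed form: the multiplier is $\lambda = 1/d$, $f^*$ is the Exp(1) density, and $D = -1$ (minus the inverse variance).

```python
import numpy as np

# Numeric check of the first-order formula (15) in the one-constraint example.
delta = 1e-3
x = np.linspace(0.1, 5.0, 50)

f_star = np.exp(-x)                          # maxent density at d = 1
lam_hat = 1.0 / (1.0 + delta)                # multiplier at d = 1 + delta
f_hat = lam_hat * np.exp(-lam_hat * x)       # maxent density at d = 1 + delta

rhs_15 = -f_star * (x - 1.0) * (-1.0 * delta)   # -f*(x) <h(x) - d, D delta_d>
err = float(np.max(np.abs((f_hat - f_star) - rhs_15)))
first_order = float(np.max(np.abs(f_hat - f_star)))
```

The residual `err` is of second order in `delta`, an order of magnitude below the first-order change itself.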

5. Sample Dependence

Throughout this section we shall consider a sample $\{X_1, \ldots, X_N\}$ of size $N$ of the random variable $X$. Here we shall relate the fluctuations of $\hat{d}_k = (1/N) \sum_{n=1}^{N} h_k(X_n)$ around its mean to the fluctuations of the density. The following is obtained from an application of the strong law of large numbers.

Theorem 5. Suppose that $h$ is integrable, with mean $d$ and covariance matrix $\Sigma(h)$. Then, for each $N$, (2) is an unbiased estimator of $d$, and

$\hat{d}_k = \frac{1}{N} \sum_{n=1}^{N} h_k(X_n) \to d_k \quad \text{a.s. } P, \text{ as } N \to \infty.$ (21)

Consider now the following.

Proposition 6. Define the empirical moments as in (2). Denote by $\hat{\lambda}^*_N$ the Lagrange multiplier determined as explained in Section 3. Then, as $N \to \infty$, $\hat{d}_N \to d$ and therefore $\hat{\lambda}^*_N \to \lambda^*$ a.s. $P$.

If $\hat{f}^*_N$ and $f^*$ are the maxentropic densities given by (10), corresponding, respectively, to $\hat{d}_N$ and $d$, then $\hat{f}^*_N \to f^*$ pointwise for every $x$, almost surely $P$.

The proof hinges on the following arguments. From Theorems 5 and 4 we obtain the first assertion. The rest follows from the continuous dependence of the densities on the parameter $\lambda^*$. Also, taking limits as $N \to \infty$ in (15), we obtain another proof of the convergence of $\hat{f}^*_N$ to $f^*$.
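A small simulation shows the convergence of Proposition 6 at work; as before, the one-constraint $\mathrm{Exp}(1)$ model, in which $\hat{\lambda}_N = 1/\hat{d}_N$ in closed form, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation sketch of Proposition 6 in the one-constraint example: data from
# Exp(1), empirical moment d_N = sample mean, multiplier lambda_N = 1/d_N -> 1.
def multiplier_error(N, reps=25):
    errs = [abs(1.0 / rng.exponential(size=N).mean() - 1.0) for _ in range(reps)]
    return float(np.median(errs))

err_small = multiplier_error(100)
err_large = multiplier_error(10_000)
```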

The next result concerns the convergence of $\hat{f}^*_N$ to $f^*$ in $L_1(dm)$.

Theorem 7. With the notations introduced above, one has

$\|\hat{f}^*_N - f^*\|_1 \to 0 \quad \text{a.s. } P, \text{ as } N \to \infty.$ (22)

Proof. The proof is a consequence of the continuous dependence of $\Sigma(\lambda, d)$ on its arguments, of the identity (13), and of item (iii) in Theorem 1, with $\hat{f}^*_N$ playing the role of $f$ and $f^*$ playing the role of $g$. In this case $-H_m(\hat{f}^*_N, f^*)$ happens to be

$-H_m(\hat{f}^*_N, f^*) = \Sigma(\lambda^*, \hat{d}_N) - \Sigma(\hat{\lambda}^*_N, \hat{d}_N),$ (23)

which, as mentioned, tends to 0 as $N \to \infty$.
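The mechanism of this proof can be seen numerically in the illustrative one-constraint example: the divergence $K(\mathrm{Exp}(\lambda), \mathrm{Exp}(1)) = \ln \lambda + 1/\lambda - 1$ tends to $0$ as $\lambda \to 1$, and by inequality (6) it dominates half the squared $L_1$ distance throughout.

```python
import numpy as np
from scipy.integrate import quad

# As lam -> 1 the L1 distance between Exp(lam) and Exp(1) goes to 0,
# and (6) holds along the way: (1/2) L1^2 <= K(Exp(lam), Exp(1)).
def l1_dist(lam):
    integrand = lambda x: abs(lam * np.exp(-lam * x) - np.exp(-x))
    return quad(integrand, 0.0, 60.0, limit=200)[0]

def kl(lam):
    return np.log(lam) + 1.0 / lam - 1.0    # closed-form Kullback divergence

lams = [1.5, 1.1, 1.01]
dists = [l1_dist(l) for l in lams]
```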

To continue, consider the following.

Theorem 8. With the notations introduced above, one has the following:

(1) $\hat{f}^*_N$ is an unbiased estimator of $f^*$ (up to terms $o(\delta d)$).

(2) For any bounded, Borel measurable function $g(x)$,

$\left| \int_S g(x) \hat{f}^*_N(x)\, dm(x) - \int_S g(x) f^*(x)\, dm(x) \right| \le \|C(g)\| \|D(\hat{d}_N - d)\|.$ (24)

The proof follows from (15). Multiply both sides of that identity by $g(x)$, integrate with respect to $dm(x)$, and then invoke the Cauchy-Schwarz inequality in $\mathbb{R}^M$ to obtain the inequality.

What is interesting about (2) in Theorem 8 is the possibility of combining it with Chebyshev's inequality to obtain rates of convergence. It is not hard to verify that

$P\left(\left\|\hat{d}_N - d\right\| > \epsilon\right) \le \frac{\operatorname{tr} \Sigma(h)}{N \epsilon^2},$ (25)

where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^M$ and $C(g)$ is the vector

$C(g) = \int_S g(x) \left(h(x) - d\right) f^*(x)\, dm(x).$ (26)

Corollary 9. With the notations introduced in Theorem 8 and the two lines above, and with $\|D\|$ denoting the operator norm of $D$, for every $\epsilon > 0$,

$P\left( \left| \int_S g \hat{f}^*_N\, dm - \int_S g f^*\, dm \right| > \epsilon \right) \le \frac{\|C(g)\|^2 \|D\|^2 \operatorname{tr} \Sigma(h)}{N \epsilon^2}.$ (27)

Comment. If we take $g = I_A$, we obtain a simple estimate of the speed of decay of $|\int_A \hat{f}^*_N(x)\, dm(x) - \int_A f^*(x)\, dm(x)|$ to zero, or, if you prefer, of the speed of convergence of $\int_A \hat{f}^*_N(x)\, dm(x)$ to $\int_A f^*(x)\, dm(x)$. Regarding the fluctuations around the mean, consider the following two possibilities.

Theorem 10. With the notations introduced in Theorem 4 and in the identity (15), one has

$\sqrt{N} \left( \hat{f}^*_N(x) - f^*(x) \right) \to -f^*(x) \langle h(x) - d, D W \rangle$ (28)

in law as $N \to \infty$. Also, for any bounded, Borel measurable $g(x)$,

$\sqrt{N} \left( \int_S g(x) \hat{f}^*_N(x)\, dm(x) - \int_S g(x) f^*(x)\, dm(x) \right) \to -\langle C(g), D W \rangle$ (29)

in law. Above, $W$ denotes a centered Gaussian random vector with covariance matrix $\Sigma(h)$, and $C(g)$ is given by (26).

The proof of the assertions involves applying the central limit theorem to the vector variable $\sqrt{N}(\hat{d}_N - d)$.
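The central limit behavior invoked here can be illustrated with a quick simulation; the toy data ($X \sim \mathrm{Exp}(1)$, $h(x) = x$, so $d = 1$ and $\Sigma(h) = 1$) are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# CLT sketch: for h(x) = x and X ~ Exp(1), sqrt(N) * (d_N - d) is
# approximately N(0, 1) for large N.
N, reps = 1_000, 2_000
d_N = rng.exponential(size=(reps, N)).mean(axis=1)
z = np.sqrt(N) * (d_N - 1.0)

mean_z = float(z.mean())
var_z = float(z.var())
```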

Final Comments. Some numerical results illustrating the results presented here appear in the paper by Gomez-Goncalves et al. [1]. There, the authors display graphically a "cloud of densities" corresponding to samples of various sizes and show how it shrinks to the plot of the true (or exact) density as the sample size increases.

http://dx.doi.org/ 10.1155/2016/8629049

Competing Interests

The author declares that they have no competing interests.

References

[1] E. Gomez-Goncalves, H. Gzyl, and S. Mayoral, "Loss data analysis: analysis of the sample dependence in density reconstruction by maximum entropy methods," Insurance: Mathematics and Economics, vol. 71, pp. 145-153, 2016.

[2] T. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 2nd edition, 2003.

[3] S. Kullback, Information Theory and Statistics, Dover Publications, New York, NY, USA, 2nd edition, 1968.

[4] F. Esscher, "On the probability function in the collective theory of risk," Skandinavisk Aktuarietidskrift, vol. 15, pp. 175-195, 1932.

[5] E. T. Jaynes, "Information theory and statistical mechanics," Physical Review, vol. 106, pp. 620-630, 1957.

[6] J. N. Kapur, Maximum Entropy Models in Science and Engineering, Wiley Eastern, New Delhi, India, 1998.

[7] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, Cambridge, UK, 2003.

[8] I. Csiszár, "I-divergence geometry of probability distributions and minimization problems," Annals of Probability, vol. 3, no. 1, pp. 146-158, 1975.

[9] A. S. Cherny and V. P. Maslov, "On maximization and minimization of entropy functionals in various disciplines," Theory of Probability and Its Applications, vol. 48, no. 3, pp. 447-464, 2003.

[10] L. R. Mead and N. Papanicolaou, "Maximum entropy in the problem of moments," Journal of Mathematical Physics, vol. 25, no. 8, pp. 2404-2417, 1984.

[11] J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization, CMS Books in Mathematics, Springer, New York, NY, USA, 2000.

[12] W. Fleming, Functions of Several Variables, Springer, Berlin, Germany, 1987.

Henryk Gzyl

Centro de Finanzas, IESA, Caracas, Capital District, Venezuela

Correspondence should be addressed to Henryk Gzyl; henryk.gzyl@iesa.edu.ve

Received 7 July 2016; Accepted 11 October 2016