# Undecidability of Context-Sensitive Data-Dependence Analysis.

1. INTRODUCTIONA number of program-analysis problems can be tackled by transforming them into certain kinds of graph-reachability problems in labeled directed graphs [Callahan 1988; Cooper and Kennedy 1988; 1989; Hecht 1977; Horwitz et al. 1990; 1995; Khedker and Dhamdhere 1994; Kou 1977; Melski and Reps 2000; Reps 1995; 1998; Reps et al. 1994; 1995]. It is useful to consider not just ordinary reachability (e.g., transitive closure), but a generalization in which the edge labels are used, in effect, to filter out paths that are not of interest [Callahan 1988; Horwitz et al. 1990; 1995; Melski and Reps 2000; Reps 1995; 1998; Reps et al. 1994; 1995]: a path P from vertex s to vertex t only counts as a "valid connection" between s and t if the word spelled out by P (i.e., the concatenation, in order, of the labels on the edges of P) is in a certain language.

Definition 1.1. Let L be a language over alphabet [[Sigma].sub.L], and let G be a graph whose edges are labeled with members of [[Sigma].sub.L]. Each path in G defines a word over [[Sigma].sub.L], namely, the word obtained by concatenating, in order, the labels of the edges on the path. A path in G is an L-path if its word is a member of L. An instance of the (single-source/single-target) L-path problem asks whether there exists an L-path in G from a given source vertex s to a given target vertex t.

Let L be a family of languages, and let L [element of] L be a language over alphabet [[Sigma].sub.L]. An instance of the L-reachability problem is an L-path problem instance (L, [[Sigma].sub.L], G, s, t).

Example. When the family of languages L in Definition 1.1 is the context-free languages, we have the CFL-reachability problem [Yannakakis 1990]. Consider the graph and the context-free grammar shown in Figure 1. Note that L(matched) is the context-free language that consists of strings of matched parentheses and square brackets, with zero or more e's interspersed.(1) In this graph, there is exactly one L(matched)-path from s to t: the path goes exactly once around the cycle and generates the word "[(e[])eee[e]]".

[Figure 1 ILLUSTRATION OMITTED]

A number of program-analysis problems can be viewed as instances of the CFL-reachability problem [Reps 1998]. In program-analysis problems, the languages used for such filtering purposes are often languages of matching parentheses. In some cases, the matched-parenthesis condition is used to filter out paths with mismatched calls and returns in order to implement so-called "context-sensitive" program analyses.

Example 1.2. The use of CFL-reachability in context-sensitive program-analysis, as opposed to ordinary graph-reachability, is illustrated in Figure 2.(2) The diagram on the right in Figure 2 shows the program's data-dependence graph. (Strictly, neither the two large ovoid shapes nor the rectangular boxes labeled Call, Enter, and Exit are part of the data-dependence graph. The two ovoids indicate which elements belong to which procedure; the rectangular boxes provide some context about the control points in the program to which the various vertices are associated.) Each directed edge in the graph represents a data-dependence (also known as a flow dependence [Kuck 1978; Kuck et al. 1981]): an edge from vertex [v.sub.1] to vertex [v.sub.2] indicates that the value produced at [v.sub.1] may be used at vertex [v.sub.2].

[Figure 2 ILLUSTRATION OMITTED]

For instance, the edge

0 [right arrow] x = 0

in the dependence graph for procedure f indicates that the value of x after the execution of the statement x = 0 could be (and, in this case, must be) 0.

In the above program, procedure g has no parameters. However, our data-dependence graphs reflect a somewhat nonstandard treatment of global variables: a global variable such as x is treated as if it were a "hidden" value-result parameter whose value (and subsequent return value) is passed from one scope to another via the special scope-transfer variables x_in and x_out. For example, the interprocedural data-dependence edge

[{.sub.1] x_in = x [right arrow] x = x_in

represents the passing of x from f's scope to g's scope at the first call on g. Note that data-dependence edges for dependences transmitted from the caller to the callee (i.e., from f to g) are labeled by the symbols "[{.sub.1]" and "[{.sub.2]" whereas data-dependence edges for dependences transmitted from the callee back to the caller are labeled by the symbols "[}.sub.1]" and "[}.sub.2]. In particular, the data-dependence edges that represent how x and y are passed from f's scope to g's scope at the first call on g are labeled with [{.sub.1]; the data-dependence edges that represent how x and y are passed from f's scope to g's scope at the second call on g are labeled with [{.sub.2]. Likewise, the data-dependence edges that represent how x and y are passed back from g's scope to f's scope after the two calls finish are labeled with [}.sub.1] and [}.sub.2], respectively.

Our data-dependence graphs also have vertices that correspond to various annotations in the program; annotations, denoted here as comments, indicate that we are interested in a given variable at a given point in the program. In the example above, the comments at p1 and p2 give rise to the vertices p1:y and p2:y.

A context-insensitive analysis that tracks dependences between constants and variables in the program will report that y depends on 0 and 1 at both p1 and p2. The reason is that a context-insensitive analysis ignores the fact that the only paths that can possibly be feasible execution paths are those in which returns are matched with corresponding calls. For instance, the existence of the (mismatched) path

(1.3) [{.sub.1] 0 [right arrow] x = 0 [right arrow] x_in = x [right arrow] x = x_in [right arrow] y = x

[}.sub.2] [right arrow] y_out = y [right arrow] y = y_out [right arrow] p2:y

in the graph shown above serves as "evidence" that p2:y depends on 0 (i.e., that the value of y at p2 could be 0).

In contrast, a context-sensitive analysis only reports possible transmissions of values along paths in which returns are matched with corresponding calls. For this example, it reports that p1:y depends on 0 but not on 1 and that p2:y depends on 1 but not on 0 (i.e., y could have only the value 0 at p1 and 1 at p2). The context-sensitive analysis can be expressed as a CFL-reachability problem with respect to a language of matched indexed curly braces. It would say that path (1.3) is not a valid connection between 0 and p2:y, because the label [{.sub.1] on the interprocedural data-dependence edge

[{.sub.1] x_in = x [right arrow] x = x_in

(which represents the transfer of x's value from f to g as g is entered at the first call site) does not match the label [{.sub.2] on the edge

[}.sub.2] y_out = y [right arrow] y = y_out

(which represents the transfer of y's value back to f when a return is performed in g from a call on g that originated at the second call site). On the other hand, because [{.sub.2] matches with [}.sub.2], the (matched) path

(1.4) [{.sub.2] 1 [right arrow] x = 1 [right arrow] x_in = x [right arrow] x = x_in [right arrow] y = x

[}.sub.2] [right arrow] y_out = y [right arrow] y = y_out [right arrow] p2:y

does count as a valid connection between 1 and p2:y, and this path serves as evidence that the value of y at p2 could be 1.

Problems in which CFL-reachability has been used to devise context-sensitive program analyses include interprocedural slicing [Horwitz et al. 1990; Reps et al. 1994] and interprocedural dataflow analysis [Horwitz et al. 1995; Reps et al. 1995].(3)

CFL-reachability--and, in particular, a matched-parenthesis constraint--has also been used to capture a graph-theoretic analog [Melski and Reps 2000; Reps 1995] of McCarthy's [1963] rules (i.e., "car(cons(x,y)) = x" and "cdr(cons(x,y)) = y"), as illustrated in the following example.

Example 1.5. The program in Figure 3 illustrates the use of CFL-reachability in (context-insensitive) structure-transmitted data-dependence analysis. In the graph shown in Figure 3, the labels on the data-dependence edges serve a different purpose than the labels used in Example 1.2: here an edge labeled "(" corresponds to a data construction in which the value is placed in the first position of a cons; an edge labeled "[" corresponds to a data construction in which the value is placed in the second position of a cons; an edge labeled ")" corresponds to a selection via car; an edge labeled "]" corresponds to a selection via cdr.

[Figure 3 ILLUSTRATION OMITTED]

The matched-parenthesis path

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

from NULL to p3:y serves as evidence that the value of y at p3 could be NULL. The path

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

does not serve as evidence that the value of y at p2 could be NULL, because the first occurrence of "(" has no matching ")". In fact, there is no matched-parenthesis path from NULL to p2:y, and a (context-insensitive) structure-transmitted data-dependence analysis would conclude that the value of y at p2 cannot be NULL.

The structure-transmitted data-dependence analyses given in Melski and Reps [2000] and Reps [1995] are context-insensitive, because there are no constraints that filter out paths with mismatched calls and returns. Thus, a natural question is whether the two kinds of uses of matching delimiters illustrated in Examples 1.2 and 1.5 can be combined to create a context-sensitive analysis for structure-transmitted data-dependences. The following interprocedural variation on Example 1.5 illustrates what we would hope to gain from a context-sensitive, structure-transmitted data-dependence analysis.

Example 1.6. Consider the following example program:

List *x, *y; void g() { y = car(x); } void f() { x = cons(NULL,NULL); g(); p1: /* Could y be NULL here? */ x = cons(x,NULL); g(); p2: /* Could y be NULL here? */ x = y; g(); p3: /* Could y be NULL here? */ }

The relevant portions of this program's data-dependence graph are shown in Figure 4. For this example, a context-sensitive analysis should report that variable y can have the value NULL at p1 and p3, but not at p2. The following path from NULL to program point p3 serves as evidence that the value of y at p3 could be NULL:

(1.7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

[Figure 4 ILLUSTRATION OMITTED]

In contrast, a context-insensitive analysis would also use the following path from NULL to program point p2 as evidence that the value of y at p2 could be NULL:

(1.8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

The problem with this path is that there is a mismatch between the labels on the edge

[{.sub.3] x_in = x [right arrow] x = x_in

and the subsequent edge

[}.sub.2] y_out = y [right arrow] y = y_out.

Consequently, this path would be excluded from consideration by a context-sensitive analysis.

The other fact to note about Example 1.6 is that in path (1.7), although "(" symbols match with ")" symbols, and "[{.sub.i]" symbols match with "[}.sub.i]" symbols, the two patterns of matched symbols are interleaved.(4) This observation serves to motivate the study of interleaved matched-parenthesis languages carried out in Sections 2 and 3.

This article shows that it is impossible to create an algorithm that captures all, and only, interleaved matched-parenthesis paths of the kind illustrated in Example 1.6: in general, the problem of context-sensitive, structure-transmitted data-dependence analysis is undecidable. In other words, you can capture either (1) the matching of calls and returns or (2) "car(cons(x,y)) = x cancellation," but not both simultaneously, in any amount of time. (Of course, there may be useful algorithms that compute approximate, but safe, solutions to this problem, cf. Heintze and McAllester [1997].)

In terms of the programming-language features needed for this result to apply, higher-order functions are not required: the main result of this article implies that context-sensitive, structure-transmitted data-dependence analysis is undecidable for first-order languages (both functional and imperative). This result applies to such languages as C, C++, Java, ML, and Scheme, as well as to many others.

It should be noted that questions of the kind posed in Example 1.6 (i.e., "Does a given variable have a given value at a particular point in a program?") often turn out to be undecidable in their most general form, and often there are several independent reasons why the problem is undecidable (e.g., it is undecidable whether a given statement is ever executed; it is undecidable whether a given path is ever executed; etc.). This article shows that context-sensitive, structure-transmitted data-dependence analysis is undecidable even if a conservative approximation is made that, for many other program-analysis problems, overcomes other sources of undecidability. (In particular, we assume that all paths in a procedure's control-flow graph are executable.)

The remainder of the article is organized into four sections. The undecidability result is shown by a reduction from a variant of Post's Correspondence Problem (PCP). Section 2 defines PCP and discusses a variant of it that is particularly suited to our needs. Example 1.6 motivates our interest in languages with interleaved patterns of matching delimiters; Section 3 shows that a certain set of L-path problem instances--where L is a language of strings formed by interleaving two languages of matching parentheses--is undecidable. Section 4 relates the result from Section 3 to the undecidability of context-sensitive, structure-transmitted data-dependence analysis. Section 5 discusses what our results imply about other program-analysis problems.

2. A VARIANT OF POST'S CORRESPONDENCE PROBLEM

Our undecidability result is shown by a reduction from a variant of Post's Correspondence Problem:

Definition 2.1. An instance of Post's Correspondence Problem, or PCP, consists of two lists of strings, X and Y, where X and Y each consist of k strings in [{0, 1}.sup.+]:

X = [x.sub.1], [x.sub.2], ..., [x.sub.k]

Y = [y.sub.1], [y.sub.2], ..., [y.sub.k]

The instance of PCP has a solution if there exists a nonempty sequence of integers [i.sub.1], [i.sub.2], ..., [i.sub.j], ..., [i.sub.m] such that (1) for all 1 [is less than or equal to] j [is less than or equal to] m, we have 1 [is less than or equal to] [i.sub.j] [is less than or equal to] k, and (2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Example 2.2. Consider the following instance of PCP, where k is 3:

X = 0101, 101, 111

Y = 01, 011, 0111101

This instance of PCP has the solution 1, 2, 3, 1 because

[x.sub.1][x.sub.2][x.sub.3][x.sub.1] = 0101 101 111 0101 = 01 011 0111101 01 = [y.sub.1][y.sub.2][y.sub.3][y.sub.1].

PCP is known to be undecidable; for proofs, see Hopcroft and Ullman [1979, pp. 193-198], Lewis and Papadimitriou [1981, pp. 289-293], or Harrison [1978, pp. 249-256].

For our purposes, it is more convenient to work with the following variant of PCP:

Definition 2.3 (Parenthesis-PCP). Given an instance of PCP,

X = [x.sub.1], [x.sub.2], ..., [x.sub.k]

Y = [y.sub.1], [y.sub.2], ..., [y.sub.k]

we define the corresponding instance of parenthesis-PCP, or P-PCP, as

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

where, for 1 [is less than or equal to] i [is less than or equal to] k,

--[[bar]x.sub.i] is the string in [{(, [}.sup.+] equal to [x.sub.i] with 0 replaced by "(" and 1 replaced by "[".

--[[bar]y.sub.i] is the string in [{), ]}.sup.+] equal to [y.sub.i] with 0 replaced by ")" and 1 replaced by "]".

--The superscript "R" on a string denotes the reversed string.

A solution to an instance of P-PCP is defined with the aid of the following linear context-free grammar:(5)

balanced [right arrow] ( balanced ) | [ balanced ] | ( # ) | [ # ]

An instance of P-PCP has a solution if there exists a nonempty sequence of integers [i.sub.1], [i.sub.2], ..., [i.sub.j], ..., [i.sub.m] such that (1) for all 1 [is less than or equal to] j [is less than or equal to] m, we have 1 [is less than or equal to] [i.sub.j] [is less than or equal to] k, and (2) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Example 2.4. The instance of P-PCP that corresponds to Example 2.2 is

[bar]X = ([([, [([, [[[

[[bar]Y.sup.R] = ]), ]]), ])]]]])

This instance of P-PCP has the solution 1, 2, 3, 1 because

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

Clearly an instance of PCP has a solution if and only if the corresponding instance of P-PCP has a solution.

For a given instance of P-PCP,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

every solution (if one exists) corresponds to a string in the language generated by the following linear context-free grammar:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

While not every string in L([S.sub.0]) corresponds to a solution, all strings in L([S.sub.0]) that are also in L(balanced) correspond to a solution. That is, the given instance of P-PCP has a solution exactly when the language L([S.sub.0]) [intersection] L(balanced) is nonempty. This observation implies the following theorem (cf. [Lewis and Papadimitriou 1981, pp. 293-294]):

THEOREM 2.5. It is undecidable for arbitrary linear context-free grammars [G.sub.1] and [G.sub.2] whether L([G.sub.1]) [intersection] L([G.sub.2]) is empty.

The fact that the existence of a solution to a given instance of P-PCP can be characterized by the nonemptiness of the intersection of two linear context-free grammars underlies our result on the undecidability of context-sensitive data-dependence analysis. However, for the purpose of investigating the latter problem, it is useful to develop a slightly more elaborate way of characterizing the solutions to an instance of P-PCP:

Definition 2.6. If an instance of P-PCP

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

has a solution [i.sub.1], [i.sub.2], ..., [i.sub.m], we say that the following string exhibits the solution in tagged form:(6)

(2.7) MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

In general, suppose that [i.sub.1], [i.sub.2], ..., [i.sub.j], ..., [i.sub.p] is some nonempty sequence of integers such that, for all 1 [is less than or equal to] j [is less than or equal to] p, we have 1 [is less than or equal to] [i.sub.j] [is less than or equal to] k. Regardless of whether [i.sub.1], [i.sub.2], ..., [i.sub.j], ..., [i.sub.p] is actually a solution to the instance of P-PCP, we say that a string of the form

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

exhibits a candidate solution in tagged form.

For instance, because the sequence 2, 1, 2, 3, 1 is not a solution to Example 2.4, the following string exhibits a candidate solution in tagged form (but not a solution in tagged form):

(2.8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

In contrast, the following string exhibits a candidate solution to Example 2.4 in tagged form that is also a solution in tagged form:

(2.9) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

3. AN UNDECIDABLE FAMILY OF L-PATH PROBLEM INSTANCES

In this section, we show how to formulate P-PCP in graph-theoretic terms. In particular, we construct an undecidable family of L-path problem instances, where each problem instance corresponds to an instance of P-PCP. Throughout the remainder of the article, we assume that we have been given a fixed, but arbitrary, instance of P-PCP consisting of the k pairs of strings

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Our interest in P-PCP is motivated by the fact that a string that exhibits a P-PCP solution in tagged form has two interleaved patterns of matched delimiters:

(1) The string [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is in L(balanced).

(2) The string [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is in L(balanced'), the language of balanced strings made up of [{.sub.i] and [}.sub.i], for 1 [is less than or equal to] i [is less than or equal to] k, defined by the following linear context-free grammar:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

It is important to note that, in general, in a string that exhibits a P-PCP solution in tagged form the two balancing processes can be out of sync. For instance, in the following prefix of string (2.9)

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

the last symbol, namely "[}.sub.1]", does not match with the seventh-from-last symbol, namely "[". However, we can capture the structure of P-PCP solutions in tagged form via the intersection of two linear context-free languages:

(1) The language L([S.sub.1]) consists of strings of L(balanced) with an arbitrary number of symbols of the form "[{.sub.i]" interspersed among the open-parenthesis and open-bracket symbols, and an arbitrary number of symbols of the form "[}.sub.i]" interspersed among the close-parenthesis and close-bracket symbols:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

(2) The language L([S.sub.2]) consists of all possible candidate solutions, in tagged form, to the given instance of P-PCP:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Languages L([S.sub.1]) and L([S.sub.2]) capture the two interleaved patterns of matched delimiters noted above: L([S.sub.2) consists of all possible candidate solutions, in tagged form, to the given instance of P-PCP; L([S.sub.1]) consists of strings such that when all "[{.sub.i]" and "[}.sub.i]" symbols are excluded, we are left with a string in L(balanced). Furthermore, the language L([S.sub.1]) [intersection] L([S.sub.2]) consists of exactly the solutions, in tagged form, to the given instance of P-PCP. For instance, for Example 2.4, strings (2.8) and (2.9) are both in L([S.sub.2]), but only string (2.9) is in L([S.sub.1]); that is,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Similar to what we observed for the language L([S.sub.0]) [intersection] L(balanced), the given instance of P-PCP has a solution exactly when the language L([S.sub.1]) [intersection] L([S.sub.2]) is nonempty.

We now show how to construct a graph (with two distinguished vertices, s and t) such that there is an L([S.sub.1]) [intersection] L([S.sub.2])-path from s to t if and only if the given instance of P-PCP has a solution. Figure 6 shows a schematic diagram that illustrates the construction. For an instance of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], the graph contains k regions of the form shown in Figure 5. Call the left part of such a region an x-string segment, and the right part a [[bar]y.sup.R]-string segment. Note that the jth [bar]x-string segment begins with the sequence "e[{.sub.j]", whereas the jth [[bar]y.sup.R]-string segment ends with the sequence "[}.sub.j]e". The dotted edges labeled e around the outsides of Figure 6 serve to connect each [bar]x-string segment to all of the other [bar]x-string segments, and each [[bar]y.sup.R]-string segment to all of the other [[bar]y.sup.R]-string segments. Any number of [bar]x-string segments can be concatenated together to form a path in any order; however, each such segment is labeled in the word of the path by the appropriate "{" symbol. Similarly, any number of [[bar]y.sup.R]-string segments can be concatenated together to form a path in any order; however, each such segment is labeled in the word of the path by the appropriate "}" symbol.

[Figures 5-6 ILLUSTRATION OMITTED]

The fact that certain edges in Figure 6 are dotted has no special significance; they are displayed in this way to highlight the fact that these edges correspond to interprocedural data-dependence edges in dependence graphs. This correspondence will be made clear in Section 4 (cf. Figure 9).

[Figure 9 ILLUSTRATION OMITTED]

By following one of the edges that is labeled with "#", a path can pass from an [bar]x-string segment to a [[bar]y.sup.R]-string segment. However, once a #-edge is taken, the path can only be extended with [[bar]y.sup.R]-string segments. Consequently, all paths from s to t are of the form

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] some number of [bar]x-string segments

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] some number of [[bar]y.sup.R] string segments [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Here we see the reason for the remark made in footnote 6: in the word of a path from s to t, each occurrence of "{" is immediately preceded by two occurrences of the symbol e, and each occurrence of "}" is immediately followed by two occurrences of e. Technically, the definition of a (candidate) P-PCP solution in tagged form should be changed accordingly, and each occurrence of "[{.sub.i]" in grammars [S.sub.1] and [S.sub.2] should be replaced with "ee[{.sub.i]", and each occurrence of "[}.sub.i]" should be replaced with "[}.sub.i]ee".

We now observe the following:

-- If there is an L([S.sub.1]) [intersection] L([S.sub.2])-path P from s to t, a solution to the instance of P-PCP can be read off from the word of P by reading it as a P-PCP solution in tagged form.

-- If the instance of P-PCP has the solution [i.sub.1], [i.sub.2], ..., [i.sub.j], ..., [i.sub.m], then we can find an L([S.sub.1]) [intersection] L([S.sub.2])-path P from s to t by

(1) following the e edge from s,

(2) choosing [bar]x-string segments in the order [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], thereby generating a subpath whose word is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

(3) following the #-edge from the [i.sub.m]th [bar]x-string segment to the [i.sub.m]th [[bar]y.sup.R]-string segment,

(4) choosing [[bar]y.sup.R]-string segments in the order [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], thereby generating a subpath whose word is [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

(5) following the e edge to t.

The word of this path exhibits the solution to the instance of P-PCP in tagged form (modulo the extra occurrences of e). As we observed earlier, L([S.sub.1]) [intersection] L([S.sub.2]) consists of exactly the solutions, in tagged form, to the given instance of P-PCP. Hence, the path constructed above is an L([S.sub.1]) [intersection] L([S.sub.2])-path from s to t.

We shall call a graph constructed in the manner described above a P-PCP graph. For the particular case of the instance of P-PCP introduced in Example 2.4, the corresponding P-PCP graph is shown in Figure 7. The above observations prove the following lemma:

LEMMA 3.1. Given an instance of P-PCP (with corresponding grammars [S.sub.1] and [S.sub.2], and P-PCP graph G), there is an L([S.sub.1]) [intersection] L([S.sub.2])-path from s to t in G if and only if the instance of P-PCP has a solution.

[Figure 7 ILLUSTRATION OMITTED]

4. UNDECIDABILITY OF CONTEXT-SENSITIVE, STRUCTURE-TRANSMITTED DATA-DEPENDENCE ANALYSIS

In this section, we show how a slight modification of the construction presented in the previous section implies that it would be impossible to create a precise algorithm for context-sensitive, structure-transmitted data-dependence analysis. In particular, we construct a family of programs whose data-dependence graphs encode the P-PCP graphs.

For instance, Figure 8 shows a C program fragment whose data-dependence graph (see Figure 9) corresponds to the instance of P-PCP given in Example 2.4. A context-sensitive, structure-transmitted data-dependence analysis should report that variable x may have the value atom("A") at program point t (which corresponds to the fact that this instance of P-PCP has the solution 1, 2, 3, 1). In Figure 9, the symbols "[{.sub.j]" and "[}.sub.j]" correspond to data-dependences associated with the call from procedure f to procedure fj and the return from fj to f, respectively; the symbol "(" corresponds to a data construction in which the value is placed in the first position of a cons; the symbol "[" corresponds to a data construction in which the value is placed in the second position of a cons; the symbol ")" corresponds to a selection via car; the symbol "]" corresponds to a selection via cdr. Data dependences associated with calls to procedure f are labeled by symbols of the form "[<.sub.i]"; data-dependences associated with corresponding returns from f are labeled by "[>.sub.i]"; the symbol "#" corresponds to a data-dependence that occurs when a (recursive) call on f is finally bypassed.(7)

Fig. 8. The C program scheme that would be constructed for the instance of P-PCP given in Example 2.4 .The relevant part of this program's data-dependence graph is shown in Figure 9.

List *x; void f1() { x = cons(NULL, cons(cons(NULL, cons(x, NULL)), NULL)); /* Encodes ([([ */ if (. . .) { f(); } x = car(cdr(x)); /* Encodes ]) */ } void f2() { x = cons(NULL, cons(cons(NULL, x), NULL)); /* Encodes [([ */ if (. . .) { f(); } x = car(cdr(cdr(x))); /* Encodes ]]) */ } void f3() { x = cons(NULL, cons(NULL, cons(NULL, x))); /* Encodes [[[ */ if (. . .) { f(); } x = car(cdr(cdr(cdr(cdr(car(cdr(x))))))); /* Encodes ])]]]]) */ } void f() { if (. . .) f1(); else if (. . .) f2(); else f3(); } void main() { s: x = atom("A"); /* A special value used nowhere else in the program */ f(); t: /* Could x be atom("A") here? */ }

When inspecting Figure 8, the reader should keep in mind that the left-to-right encoding of a string--where a string consists of either all open parentheses or all closed parentheses--corresponds to working one's way out from the inner occurrence of x in the expression that encodes the string.

The idea illustrated in Figures 8 and 9 carries over to all instances of P-PCP: in general, the pair of strings [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] is encoded by a procedure of the form

void fj(){ x = cons(...x...); /* Construction expression encoding [[bar]x.sub.j] */ if (...){ f(); } x = ...car(...cdr(...x...)...)...; /* Selection expression encoding [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] }

Procedure f has the form

void f(){ if (...) f1(); else if (...) f2(); . . . else if (...) fk-1(); else fk(); }

Procedure main is the same as in Figure 8.

The undecidability of context-sensitive, structure-transmitted data-dependence analysis follows from two properties: (1) the data-dependence graph for a program of the form given above is very much like the P-PCP graph for the given instance of P-PCP (see Figure 6) and (2) variable x can have the value atom("A") at program point t if and only if there is a path from s to t whose word is in a language very similar to L([S.sub.1]) [intersection] L([S.sub.2]) (except that <'s and >'s must also be balanced).

For instance, the portion of the data-dependence graph shown in Figure 9 (for the program given in Figure 8) is identical to the P-PCP graph shown in Figure 7, except that certain dotted edges, corresponding to calls to f and returns from f, now have labels of the form [<.sub.i] or [>.sub.i].(8) Therefore, the path language of interest for identifying context-sensitive, structure-transmitted data-dependences must now incorporate the labels [<.sub.i] and [>.sub.i] (for 0 [is less than or equal to] i [is less than or equal to] k). Formally, this is accomplished by considering L([S'.sub.1]) [intersection] L([S'.sub.2])-paths, where the grammars [S'.sub.1] and [S'.sub.2] are defined as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

We also change the notion of a P-PCP solution in tagged form to one in which each occurrence of [[bar]x.sub.i] (except for the innermost one) is immediately followed by an occurrence of [<.sub.i], and each occurrence of [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (except for the innermost one) is immediately preceded by an occurrence of [>.sub.i].

By the same argument used to prove Lemma 3.1, there is an L([S'.sub.1]) [intersection] L([S'.sub.2])-path from s to t in the data-dependence graph if and only if the given instance of P-PCP has a solution. Therefore, a context-sensitive, structure-transmitted data-dependence analysis would determine that x could have the value atom("A") at program point t if and only if the given instance of P-PCP has a solution. Consequently, context-sensitive, structure-transmitted data-dependence analysis is undecidable (i.e., an algorithm for context-sensitive, structure-transmitted data-dependence analysis cannot exist).

5. CONCLUSIONS AND IMPLICATIONS FOR OTHER PROGRAM-ANALYSIS FRAMEWORKS

Earlier work by the author and his colleagues has demonstrated the usefulness of formulating program-analysis problems in terms of graph-reachability questions [Reps 1998]. This approach has been used to obtain a number of positive results about program-analysis problems (specifically, polynomial-time algorithms for solving a variety of different problems [Horwitz et al. 1990; 1995; Melski and Reps 2000; Reps 1995; Reps et al. 1994; 1995; Sagiv et al. 1996]). The present article demonstrates that this viewpoint is also valuable from the standpoint of obtaining negative results about program-analysis problems (see also Reps [1996]).

The undecidability result in this article concerns a situation in which there are two interleaved patterns of matching "events." More broadly, the notions of "interleaved matched-parenthesis paths" and "P-PCP solutions in tagged form" are two concepts that can provide insight into whether other program-analysis problems are undecidable. For instance, Ramalingam [1999] showed recently that synchronization-sensitive, context-sensitive interprocedural analysis of multitasking concurrent programs is undecidable. His result was inspired by the one given in the present article, using the insight that synchronization-sensitive, context-sensitive interprocedural analysis also involves two interleaved patterns of matching events.

5.1 Set Constraints and Set-Based Analysis

Following earlier work by Reynolds [1968] and Jones and Muchnick [1981], a number of people in recent years have explored the use of set constraints for analyzing programs. Set constraints are typically used to collect a superset of the set of values that the program's variables may hold during execution. Typically, a set variable is created for each program variable at each program point; set constraints are generated that approximate the program's behavior; program-analysis then becomes a problem of finding the least solution of the set-constraint problem. Set constraints have been used both for program-analysis [Aiken and Murphy 1991a; Heintze 1993; Heintze and Jaffar 1994; Jones and Muchnick 1981; Reynolds 1968], and for type inference [Aiken and Murphy 1991b; Aiken and Wimmers 1993].

Melski and Reps [2000] have obtained a number of results on the relationship between certain classes of set constraints and CFL-reachability. Their results establish relationships in both directions: they showed that CFL-reachability problems and a subclass of what have been called definite set constraints [Heintze and Jaffar 1991] are equivalent. That is, given a CFL-reachability problem, it is possible to construct a set-constraint problem whose answer gives the solution to the CFL-reachability problem; likewise, given a set-constraint problem, it is possible to construct a CFL-reachability problem whose answer gives the solution to the set-constraint problem. Melski and Reps [2000] also show that CFL-reachability is equivalent to a class of set constraints that was designed to be useful for (context-insensitive) analysis of programs written in a higher-order language--so-called "set-based analysis" [Heintze 1993]. The results of Sections 3 and 4 imply that if you start with a version of context-insensitive set-based analysis that is at least as precise as the context-insensitive structure-transmitted data-dependence analysis illustrated in Example 1.5, then it is impossible to create an algorithm for the context-sensitive version of your set-based analysis, even for a first-order language.

5.2 Control-Flow Analysis

Sharir and Pnueli [1981] defined, two methods for carrying out interprocedural dataflow analysis so as to ensure that the propagation of dataflow information respects the fact that when a procedure finishes it returns to the site of the most recent call. In one of their methods, the so-called "call-strings approach," each piece of dataflow information is tagged with a call string that records the history of uncompleted procedure calls along which that information has propagated. The call string on a piece of information is updated whenever a propagation step associated with a call statement or return statement is performed. The information that would be obtained, in principle, if call strings were allowed to grow arbitrarily long is called the call-strings-[infinity] solution.

Sharir and Pnueli show that, for distributive dataflow-analysis problems over a finite semilattice, it is possible to restrict the length of call strings to some fixed length (where the bound on the length required is quadratic in the size of the lattice and linear in the number of call sites in the program) and yet still obtain a result that is equivalent in precision to the call-strings-[infinity] solution. By suitable means, approximate (but safe) solutions can also be obtained using shorter call strings; limiting call strings to length k defines the call-strings-k solution.

In considering algorithms for interprocedural dataflow analysis, one should be careful not to confuse two separate issues:

(1) Whether an algorithm computes a solution equal in precision to the call-strings-[infinity] solution.

(2) Whether an algorithm computes its solution by actually tracking entities labeled by call strings (e.g., of some length k).

A type-(2) algorithm typically has worst-case running time that is exponential in k. However, for suitably restricted classes of interprocedural dataflow analysis problems, there are algorithms with property (1), yet their worst-case running times are polynomial in the size of the program;(9) these algorithms use dynamic programming, rather than utilizing entities labeled with explicit call strings [Reps et al. 1995; Sagiv et al. 1996]. For the same class of problems, a type-(2) algorithm will, in general, have exponential running time.

These results provide an interesting contrast with those that have been obtained on a program-analysis problem of interest to the functional programming community: the problem of "k-CFA," or "control-flow analysis" (for higher-order programming languages) using call strings of length k [Heintze and McAllester 1997; Jagannathan and Weeks 1995; Nielson and Nielson 1997; Shivers 1988]. The goal of control-flow analysis is to track data and control flow in the presence of first-class (anonymous) functions, data constructors, and selectors. Many of the algorithms that have been given for k-CFA are type-(2) algorithms (in the sense mentioned above), in that they actually track entities labeled by call strings of length [is less than or equal to] k. In general, the running time of these algorithms is exponential in k.

Similar to the concept of the call-strings-[infinity] solution to an interprocedural dataflow-analysis problem, the [infinity]-CFA solution is what would be obtained, in principle, if call strings were allowed to grow arbitrarily long. The results of Sections 3 and 4 imply, in general, that when data constructors and selectors are to be taken into account, [infinity]-CFA is undecidable. That is, in the presence of data constructors and selectors, the [infinity]-CFA solution cannot be computed.

ACKNOWLEDGMENTS

I am grateful for discussions that I had about the problem with F. Nielson, G. Ramalingam, S. Horwitz, and D. Melski.

(1) In this article, the word "parentheses" is used in both the generic sense--to mean any kind of matching delimiter (e.g., round parentheses, square brackets, curly braces, angle brackets, etc.)--as well as in the specific sense of round parentheses. It should always be clear from the context which of these two meanings is intended.

(2) In this example, we use C syntax. In later examples, we use C augmented with the operator cons, which denotes a pairing constructor; the operators car and cdr, which select the first and second components of a pair, respectively; and the operator atom, which constructs an atomic object (different from NULL) from a given string. This notation is used to simplify the way storage-allocation operations are expressed in our examples. The use of these operators does not imply that our results apply only to the analysis of LISP programs.

(3) There is an unfortunate clash in terminology that the reader should be aware of. The term "context-sensitive analysis" is standard in the programming-languages community, where it means a static-analysis method in which the analysis of a called procedure is "sensitive" to the context in which it is called: a context-sensitive analysis captures the fact that different call sites that call the same procedure may have different effects on a program's possible execution states. Context-sensitive analysis should not be confused with the "context-sensitive languages' of formal-language theory. Unfortunately, the principle that context-free-language reachability is useful in formalizing approaches to context-sensitive analysis was fully articulated [Reps 1998] only after the term "context-sensitive analysis" had been adopted by the programming languages community [Emami et al. 1994; Wilson and Lam 1995].

(4) In this article, the term "interleaved" is used in a somewhat restricted sense, compared to the standard usage in formal-language theory (cf. Hopcroft and Ullman [1979, pp. 282]). The exact nature in which patterns of matched symbols are allowed to be woven together is defined precisely in Sections 3 and 4.

(5) A linear context-free grammar is one in which at most one nonterminal appears on the right-hand side of each production.

(6) For technical reasons having to do with the details of the constructions given in Sections 3 and 4, we will also work with some slight variants of (2.7). For instance, in Section 3 we use a version of (2.7) in which each occurrence of "{" is immediately preceded by two occurrences of the symbol e, and each occurrence of "}" is immediately followed by two occurrences of e.

(7) The role of the labels "<" and ">" is similar to that of"{" and "}", respectively, in that both kinds of parenthesis pairs encode procedure call/return. However, < and > have been introduced as separate symbols to emphasize the fact that the calls to procedure f play a different role in the construction than the calls to the fj procedures.

(8) The only elements omitted from the data-dependence graph shown in Figure 4 concern dependences on NULL. These are irrelevant to the question of how atom("A") flows through the program--and to our embedding of P-PCP graphs in data-dependence graphs.

(9) In their Theorem 7-3.4, Sharir and Pnueli [1981] establish that the greatest fixed point of a certain set of equations equals the meet-over-all-valid-paths (MOVP) solution. In their Theorem 7-4.6, they establish that the MOVP solution equals the call-strings-[infinity] solution. As for algorithms, Sharir and Pnueli give a worklist algorithm for finding the greatest fixed point of the set of equations. However, Section 7-3 of their 1981 article presents a fairly general framework: in particular, the only assumption about the semilattice is that it is finite. Because of the way their dynamic-programming algorithm tabulates information, when the size of the semilattice is exponential in the size of the program, the algorithm may use time exponential in the size of the program. To achieve a polynomial time bound, the tricks are (1) to restrict attention to a certain class of semilattices; however, this class can include semilattices whose size is exponential in the size of the program (e.g., the powerset of the program points, the powerset of the program's variables, etc.); and (2) to tabulate the Sharir-Pnueli [Phi] functions pointwise (e.g., on singletons, rather than entire sets). This is essentially what is done in Reps et al. [1995] for the class of finite-distributive-subset problems, and in Sagiv et al. [1996] for the larger class of distributive environment problems.

REFERENCES

AIKEN, A. AND MURPHY, B. R. 1991a. Implementing regular tree expressions. In Proceedings of the 5th ACM Conference on Functional Programming and Computer Architecture. ACM, 427-447.

AIKEN, A. AND MURPHY, B. R. 1991b. Static type inference in a dynamically typed language. In Proceedings of the 18th ACM Symposium on the Principles of Programming Languages. ACM, 279-290.

AIKEN, A. AND WIMMERS, E. L. 1993. Type inclusion constraints and type inference. In Proceedings of the 6th Conference on Functional Programming and Computer Architecture. 31-41.

CALLAHAN, D. 1988. The program summary graph and flow-sensitive interprocedural data flow analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 47-56.

COOPER, K. D. AND KENNEDY, K. 1988. Interprocedural side-effect analysis in linear time. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 57-66.

COOPER, K. D. AND KENNEDY, K. 1989. Fast interprocedural alias analysis. In Proceedings of the 16th ACM Symposium on the Principles of Programming Languages. ACM, 49-59.

EMAMI, M., GHIYA, R., AND HENDREN, L. J. 1994. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 242-256.

HARRISON, M. A. 1978. Introduction to Formal Language Theory. Addison-Wesley, Reading, MA.

HECHT, M. S. 1977. Flow Analysis of Computer Programs. North-Holland, New York.

HEINTZE, N. 1993. Set-based analysis of ML programs. Tech. Rep. CMU-CS-93-193. School of Computer Science, Carnegie-Mellon Univ., Pittsburgh, PA.

HEINTZE, N. AND JAFFAR, J. 1991. A decision procedure for a class of set constraints. Tech. Rep. CMU-CS-91-110. School of Computer Science, Carnegie-Mellon Univ., Pittsburgh, PA.

HEINTZE, N. AND MCALLESTER, D. 1997. Linear-time subtransitive control flow analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 261-272.

HOPCROFT, J. E. AND ULLMAN, J. D. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA.

HORWITZ, S., REPS, T., AND BINKLEY, D. 1990. Interprocedural slicing using dependence graphs. ACM Trans. Program. Lang. Syst. 12, 1 (Jan.), 22-60.

HORWITZ, S., REPS, T., AND SAGIV, M. 1995. Demand interprocedural dataflow analysis. In Proceedings of the 3rd ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, 104-115.

JAGANNATHAN, S. AND WEEKS, S. 1995. A unified treatment of flow analysis in higher-order languages. In Proceedings of the 22nd ACM Symposium on the Principles of Programming Languages. ACM, 393-406.

JONES, N. D. AND MUCHNICK, S. S. 1981. Flow analysis and optimization of Lisp-like structures. In Program Flow Analysis: Theory and Applications, S. S. Muchnick and N. D. Jones, Eds. Prentice-Hall, 102-131.

KHEDKER, U. P. AND DHAMDHERE, D. M. 1994. A generalized theory of bit vector data flow analysis. ACM Trans. Program. Lang. Syst. 16, 5 (Sept.), 1472-1511.

KOU, L. T. 1977. On live-dead analysis for global data flow problems. J. ACM 24, 3 (July), 473-483.

KUCK, D. J. 1978. The Structure of Computers and Computations. Vol. 1. John Wiley and Sons.

KUCK, D. J., KUHN, R. H., LEASURE, B., PADUA, D. A., AND WOLFE, M. 1981. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM Symposium on the Principles of Programming Languages ACM, 207-218.

LEWIS, H. R. AND PAPADIMITRIOU, C. H. 1981. Elements of the Theory of Computation. Prentice-Hall.

MCCARTHY, J. 1963. A basis for a mathematical theory of computation. In Computer Programming and Formal Systems. North-Holland, 33-70.

MELSKI, D. AND REPS. 2000. Interconvertibility of a class of set constraints and context-free language reachability. Theor. Comput. Sci. 248, 1-2. To be published.

NIELSON, F. AND NIELSON, H. R. 1997. Infinitary control flow analysis: A collecting semantics for closure analysis. In Proceedings of the 24th ACM Symposium on the Principles of Programming Languages. ACM, 332-345.

RAMALINGAM, G. 1999. Context-sensitive synchronization-sensitive analysis is undecidable. Res. Rep. TC 21493. IBM T. J. Watson Research Center, Yorktown Heights, NY.

REPS, T. 1995. Shape analysis as a generalized path problem. In Proceedings of the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation (PEPM '95). ACM, 1-11.

REPS, T. 1996. On the sequential nature of interprocedural program-analysis problems. Acta Inf. 33, 739-757.

REPS, T. 1998. Program analysis via graph-reachability. Inf. Softw. Tech. 40, 11-12, 701-726.

REPS, T., HORWITZ, S., AND SAGIV, M. 1995. Precise interprocedural dataflow analysis via graph-reachability. In Proceedings of the 22nd ACM Symposium on the Principles of Programming Languages. ACM, 49-61.

REPS, T., HORWITZ, S., SAGIV, M., AND ROSAY, G. 1994. Speeding up slicing. In Proceedings of the 2nd ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, 11-20.

REYNOLDS, J. C. 1968. Automatic computation of data set definitions. In Information Processing 68: Proceedings of the IFIP Congress 68. North-Holland, 456-461.

SAGIV, M., REPS, T., AND HORWITZ, S. 1996. Precise interprocedural dataflow analysis with applications to constant propagation. Theor. Comput. Sci. 167, 131-170.

SHARIR, M. AND PNUELI, A. 1981. Two approaches to interprocedural dataflow analysis. In Program Flow Analysis: Theory and Applications, S. S. Muchnick and N. D. Jones, Eds. Prentice-Hall.

SHIVERS, O. 1988. Control flow analysis in Scheme. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Languages, Design and Implementation. ACM, 164-174.

WILSON, R. P. AND LAM, M. S. 1995. Efficient context-sensitive pointer analysis for C programs. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation. ACM, 1-12.

YANNAKAKIS, M. 1990. Graph-theoretic methods in database theory. In Proceedings of the 9th ACM Symposium on the Principles of Database Systems. ACM, 230-242.

Received March 1999; revised October 1999; accepted November 1999

This work was supported in part by the National Science Foundation under grants CCR-9625667 and CCR-9619219, by the United States-Israel Binational Science Foundation under grant 96-00337, by a grant from IBM, and by a Vilas Associate Award from the University of Wisconsin. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notices affixed thereon. The views and conclusions contained herein are those of the authors, and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the above government agencies or the U.S. Government.

Author's address: Computer Sciences Department, University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706; email: reps@cs.wisc.edu.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

THOMAS REPS University of Wisconsin

Printer friendly Cite/link Email Feedback | |

Author: | REPS, THOMAS |
---|---|

Publication: | ACM Transactions on Programming Languages & Systems |

Geographic Code: | 1USA |

Date: | Jan 1, 2000 |

Words: | 8432 |

Previous Article: | Efficient and Safe-for-Space Closure Conversion. |

Next Article: | Incremental Analysis of Constraint Logic Programs. |

Topics: |