Sa 28S Beyond Heuristic DENDRAL by Bruce G. Buchanan Edward A. Feigenbaum Joshua Lederberg I. INTRODUCTION Theory formation in science embodies many elements of creativity which make it both an interesting and challenging task for artificial intelligence research. One of the goals of the Heuristic DENDRAL project has long been the study of processes underlying theory formation. This paper presents the first steps we have taken to achieve that goal, in a program called Meta~DENDRAL. The Heuristic DENDRAL preject has concentrated its efforts on the inductive analysis of empirical data for the formaticn of explanatory hypotheses. This is the type of inference task that calls for the use of a scientific theory by a performance program, but not for the formation of that theory. When we started on Heuristic DENDRAL we did not have the insight, understanding, and daring to tackle ab initio the problem of theory formation, But now we feel the time is ripe for us to turn our attention to the problem of theory formation. Our understanding and our technical tcols have matured along with the Heuristic DENDRAL program to the point where we now see clear ways to proceed, As always, the proper choice of task environment is crucial, but for us the choice was absolutely clear. The theory formation task most accessible to us is the task of forming mass spectral theory because the Heuristic DENDRAL performance program uses such a theory to elucidate molecular structures of organic chemical molecules. Hence, the notion of building a level of programs “meta" to the DENDRAL performance prograa. The theory of mass spectrometry contains numerous statements about the fragmentation patterns of chemical molecules upon electron impact in a mass spectrometer. Because this is a new science, the theory is still expanding rapidly. In other words, there exists a possibility that a theory formation programs could discover genuine extensions of the theory. The Heuristic DENDRAL system already contains an excellent mass spectral theory. We, therefore, have a clear idea of what a “correct answer" is like. DENDRAL's theory is represented in at least two different forms at present, so that we have a fair idea of the issues involved in representing mass spectral theory for a program. A theory language of notations, data structures, and primitive concepts (with which we are intimately familiar because we developed it) is available. People who are expert in mass spectral theory are members of the DENDRAL research team. Many programs for manipulating mass spectral data have already been developed and are ready to be exploited as 4Yeta-DENDRAL tools. The goal of the Meta-DENDRAL program is to infer the theory that the performance program (Heuristic DENDRAL) needs in order to solve problems of mass spectral analysis. The following table attempts to sketch some differences between the programs at the performance level and the meta-level. Input Output Example Heuristic DENDRAL The mass spectrum of a molecule whose struc- ture is not known (except, cf course, in our test cases). A molecular structure inferred from the data. Uses alpha-carbon frag- mentation theory rules in planning and in validation. Meta-DENDRAL A large number of recorded maSs spectra and the associated (known) molecular structures. A set of cleavage and rearrangement rules con- stituting a subset of the theory of mass spectrometry. Discovers (and validates) alpha-carbon fragmentation Cules in a space of possible patterns of cleavage. Uses set of primitive concepts but does not invent new primitives. In our view, the continuity evident in this table reflects a continuity in the processes of inductive explanation in science. Moves toward meta-levels of scientific inference are moves toward encompassing broader data bases and constructing more general rules for describing regularities in the data. Beyond this level of Meta-DENDRAL there are still higher levels. Not all theory formation is as simple as the program described here assumes it is. For example, the representation of chemical molecules and the list of basic mass spectral processes are koth fixed for this progran, higher level pregram shculd be expected to discover. yet these are concepts which a Also, there is no postulation of new theoretical entities in this progran. But, again, higher levels of theory formation certainly do include this process. The task of theory formation certainly can be and has heen discussed out of the context of any particular theory. However, writing a computer program to perform the general task is more difficult than working within the context of one particular scientific discipline. While it is not clear how science proceeds in general, it may be possible to describe in detail how the scientists in one particular discipline perform their work. From there, it is not a large step to writing the computer program. Thus this paper attacks the general problems of theory formation by discussing the problems of designing a computer program to formulate a theory in a specific branch of science. The general strategy of Meta-DENDRAL is to reason from data to Flausible generalizations and then to integrate the generalizations into a unified theory. The input to the Meta-DENDRAL system is a set of structure-mass spectrum pairs. It receives essentially the same data as a chemist might choose when he attempts to elucidate the processes underlying the mass Spectrometric behavior of a class of molecules. When the pioneers of the field turn their attention to a class of chemical ccmpounds whose mass spectrometric behavior is not well understood, they must collect mass spectra for a number of the compounds and look for generalizations in the data. The generalizations have to be tested against new data and against the established theory. If new data provide counterexamples, the generalizations are changed. If the generalizations are not compatible with the old theory either the old theory or the generalizations are changed, depending on the Seriousness of the discrepancy, the nature cf the statements involved, the scientist's commitment to the old theory, and so on. This paper is organized by the three main subprobleas around which the program is also organized. The first is to explain each individual spectrus, given the molecular structure associated with it. That is, determine the processes (or alternative sets of processes) which account for the experimental data. The second subproblem is to generalize the results from each spectrum to all spectra, In other words, find the common processes and sets of processes which can explain several spectra. The last is to integrate the generalizations into the existing theory in such a way that the theory is consistent and economical. Within each of the three main sections, the subsections indicate further subproblems which the program must solve. IT. FIRST SUBPROBLEMS EXPLAINING EACH SPECTRUM The so-called "method of hypothesis" in science is sometimes proposed as the essence of scientific work. Restating it, ina deliberately imprecise way, the method is roughly to formulate a hypothesis to account for some of the observed data and make successively finer adjustments to it as more observations are made, Very little is said about the details of a scientist's intellectual processes as he goes through the method. Thinking of hypotheses, for example, iS an unsystematic and mysterious task which must be elucidated before the method can he programmed. That is the task we have designated as the first sukproblen. The program starts with individual spectrum- structure pairs as separate from one another. Tt constructs alternative explanations for each spectrum and then considers the spectra all together. An explanation, for the program, as for the chemist, is a plausible account of the mass spectrometric mechanisms which produced the peaks in the spectrum, The explanation is something like a story of the molecule's adventures in the mass Spectrometer: certain data points appear as a result of cleavage, others appear as a result of more complex processes. At this stage of development of the theory, the chemist's story does not account for every data point because of the complexities of the instrument and the vast amount of missing information about mass epectrometry. A. REPRESENTATION The well-known problem of choosing a representation for the Statements of a scientific theory and the objects mentioned by the theory is ccmmon to all sciences. In computer science it is recognized as a crucial problem for the efficient solution (or for any solution) to every problem. Improper choices of representation can forestall the solution for both humans and machines, At this stage there are no computer programs which successfully chocse the representation of objects in a problem domain. Therefore we, the designers of the Meta~DENDRAL systen, have chosen representations with which we have some experience and for which programmed subroutines have already been written in the Heuristic DENDRAL performance systen. It was natural to use these representations since the meta-program itself will not only interface with the Heuristic DENDRAL performance program, hut is built up from the LISP functions of the performance program. Except for some supervisory routines, all of the routines used for solving the first subproblem (reasoning from data to plausible individual explanations) had already been written for use in predicting mass spectra. For example, the routine to break a bond is used in both the meta-program and the predictor phase of the Heuristic DENDRAL system. These common features will be described in more detail btelow. Specifically, for this program, the input data are chemical structures paired with their experimental mass spectra: {(Structure-1 . Spectrum-1) ... (Structure-n . Spectrum-n)). The representation of chemical structures is just the DENDRAL representation used in the Heuristic DENDRAL system, It has been described in detail elsewhere <1,2,4,7>: essentially it is a linear string which uniquely encodes the graph structure of the molecule. The mass spectra, also, are represented in the same way as for the Heuristic DENDRAL performance system Fach spectrum is a LISP list of dotted x-y pairs, where the x-points are masses of fragments and the y-points are the relative abundances of fragments of those masses. Internally, a molecular structure is stored as a list of named atoms in the structure (e.9g., C1, 02, C3) and the connections among the atoms. By putting the connections of the atoms on their property lists (along with other properties such as the type of atom (carbon, oxygen, etc.) and the number of hydrogens), manipulating chemical structures in LISP becomes a matter of changing values on property lists. This representation has heen used in the mass spectrum predictor, for which the functions for Manifulating structures were also developed. The representation of the statements in the program's mass Spectrometry theory is also the one used in the predictor phase cf the Heuristic DENDRAL system. It is discussed in detail in Section IV of this paper. B. SEARCH It is not clear what a scientist does when he "casts about" for a good hypothesis, Intuition, genius, insight, creativity and other faculties have heen invoked to explain how a scientist arrives at the hypothesis which he later rejects or comes to believe or modifies in light of new ohservations. From an information processing pcint of view it makes sense to view the hypothesis formation problem as a problem of searching a space of possible hypotheses for the most plausible ones. This presupposes a generator cf the search space which, admittedly, cremains undiscovered for most scientific problems. In the Heuristic DENDRAL performance system the "legal move generator" is the DENDRAL algorithm for constructing a complete and irredundant set of mclecular models from any specified collection of chemical atoms. Each complete molecular structure is the terminus cf some branch in the tree generated by the program. Intermediate nodes in the tree represent structures partially determined but with some chemical atoms as yet unallocated to the emerging structure. The primitive concepts of this generator are chemical atoms and bonds of mclecules (or nodes and edges of graphs) and valence of atoms (numter of allowable edges emanating from any node). The problem of finding sets of mass spectrometric processes to explain each spectrum is also conceived as a heuristic search problem. As such, the structure of the computer program is easily described, In broad terms, the program contains (1) a generator of the search space, (2) heuristics for pruning the tree, and (3) evaluation criteria for guiding the search. Except for problems inherent in the task, then, the problems of such a program are reasonably well understood. These three main components of the heuristic search program are considered one at a time in the immediate discussion. 1. GENERATOR Writing a computer pregram which solves a scientific reasoning problem is facilitated by seeing the problem as one of heuristic search. This is as true of the meta-program which reasons fron collections of data te generalizations as for the performance system which reasons from one set of data to an explanation. For this reason we have called the process of induction “a process of efficient selection from the dcmain of all possible structures, "<17> For this part of the Meta-DENDRAL system, the generator is a procedure for systematically breaking apart chemical molecules to represent all possible processes occurring in a mass spectrometer. In addition to single cleavages, the generator must be capable cf producing all possible pairs of cleavages, all possible triples, and so forth. And, for each cleavage or set of cleavages it must be able to reproduce the result of atoms or groups of atoms migrating from one fragment to another. For example, after the single break labeled (a) in Pigure 1 below, subsequent cleavage (b) may also occur. The result of (a) + (b) is the simple fragment CH3. 0 u CA3 Cc CH2 - CH2 - CH3 (b) (a) FIGURE 1 Or, for the same molecule cleavage (c) may be followed by pigration of one hydrogen atom from the gamma position (marked with an asterisk) to the oxygen, as shown in Figure 2: bo 0 os \ \ CH3 - C ~ CH2 ' CH2 - CH3 (c) * FIGURE 2 The resulting fragment, then, contains three carbon atoms, six - 12 - hydrogens and an oxygen (C3H60). Since no such fragment results from simple cleavage cr any combination of simple cleavages, it is necessary to rostulate cleavage plus hydrogen migration as the origin of the mass spectral peak corresponding to this fragment and peaks corresponding to similar fragments for the mass Spectra of other molecules. Historically, this double process was postulated for just these reasons and was subsequently confirmed experimentally. The generator of the search space will postulate these processes as possible explanations of the mass spectral peaks at masses 15 (CH3) and 58 (C3H60) for this particular molecule. But it will also postulate the simple cleavage (b) in Figure 1 as the explanation of the peak at mass 15. And For the peak at mass 58 from the process in Figure 2 it will postulate the alternative migration of a hydrogen atom from the position labeled beta. From the generator's point of view these processes are at least as good as the more or less accurate processes shown in Figures 1 and 2. Another device is used by mass spectroscopists to explain their preference for the double cleavage (a) + (b) over the simple cleavage (b) for describing the mechanism resulting in the peak at mass 15 (CH3) for the molecule shown. Mass spectroscopists often appeal to the localization of the positive charge in the charged mclecule to explain why one peak appears in a spectrum but another does not. It is known that only the charged fragments are recorded by the mass Spectrometer. [In this particular case (Figure 1) the positive charge initially resides on the C=0 group of the molecule. For this reason, the simple cleavage (b) will produce a peak corresponding to the fragment C4uH7O, the right-hand fragment resulting from that break. But since there is no positive charge on the CH3 fragment after (b) above, the peak at mass 15 in the spectrum could not have resulted from this process, Because of the particular stability of carbon monoxide, however, the experts are willing to postulate that cleavage {a) is followed by loss of the uncharged C=0 piece from cleavage (b), thus leaving a positive charge on the only other atom in the original fragment, The generator program must also manipulate charges then. [t does this particular piece of reasoning in several steps, as shown in the steps of Figure 3, not just by the double process described above. 0 it 0. CH3 - C - CH2 - CH2 - CH3 O+ tf 1. cH3 - C - CH2 CH2 ~ CH3 (a) } O+ fo 2- CH3 ; Cc (b) Ae cae cee ee SR ee cee ee a we a ee ae ee ee Place the plus charge on the 0 of c=0.* Make cleavage (a) leaving left-hand side. Migrate the charge from C=0 to CH2. Final result, mass=15, after cleavage ({(b). *The correct notation from a chemist's point of view is 0+. instead of O# -- the dot representing a free electron on the oxygen. The Meta-DENDREAL program ignores this subtlety, although the Heuristic DENDRAL program gets this straight. ay A A ae SR A me Vln SOD SOY CO EE A na FIGURE 3 The primitive mechanisms of the generator are charge localization, cleavage, and group migration (where a group can be a positive charge, a single atcm, or a set of connected atoms). The generator is a procedure for producing all possible charged fragments, not just all possible fragments, in other words, Putting these mechanisms together in all possible ways leads to an extremely large space of possible explanations for the peaks in the mass spectrum of a molecule. The pruning heuristics discussed in the next section alleviate that problem somewhat; for the moment let us turn to the actual design of the generator, There are clearly several alternative ways of putting the three primitive mechanisms together into a generator of charged fragments, For example, chemists consider the initial localization of the charge before the possibilities for simple cleavage. But an alternative is to consider the possibility of Cleavages first and then ask about the location of the positive charge, The generator actually programmed allows for changing the canons of generation to experiment with these alternatives, but only the set of cancns described below has actually been used thus far. At the first level of branching in the tree all possible single cleavages are performed on the original molecular structure resulting in all possible primary fragments. At the next level, the positive charge is assigned to all possible atoms in the fragments, (Switching these two steps gives the same results and is closer to the conceptualization used by the chemist; it results in a less efficient program, however.) Starting with level 3 the procedure for generating successive levels is recursive: For each charged fragment at level n (n > 2) produce the charged fragments resulting from {i) cleavage of each bond in the fragment and (ii) migration of each group from its origin to each other atcm in the fragment, where ‘group’ currently means ‘positive charge or hydrogen atom’. In general a group can be any radical in the fragment, and the generatcr can handle the general case. Limiting group migrations is fone by having the program consult a list of items it will consider as candidates for group migrations. That list now contains only '+* and 'H*', which reflects both the designers’ and the chemists' preference for simple migrations. The generation tree is schematized in Figure 4. level oO 0 9. level 1 Oo QO .., O 1. / / level 2 6 0 «.. 0 ta 2. level 3 Ch... 3. 1 level 4 0 >» O Ge--O 4, FIGURE 4 Initial molecular structure Fragments resulting from simple cleavag Charged primary fragments Secondary fragments charged primary fra ments with {i) succ Sive cleavage (circ (ii) successive gro migration (square) Tertiary fragments secondary fragments with (i) successive cleavage (circle) (ii) successive gro migration (square) It is desirable to avoid the possibility of successive levels of some branch in this tree containing all the same atoms and differing only in the migration of atoms or of the charge. In this case duplications would result and the tree would be infinite. Thus the generating algorithm applies the group migration mechanisa only to nodes in the tree which resulted fron cleavage.* The generating process terminates when there are no more bonds to break in a fragment. SU ee a ee ce Oe EE ee AD ak OO es ls a SR SD SA aD ae *In principle it is desirable to allow fragments resulting fron n group migrations and only a single cleavage. Thus the algorithm is constructed to allow up to n group migrations after a cleavage. Currently n=1, so the procedure works as described in Figure 4, 2 ES UF DS ED SE AD a OS SD FO AS na td AD ED SO OAS SD OEE ED OO Sa a A a SO oe 2. PRUNING HEURISTICS The size of the search tree may become quite large, for there are many combinations of mechanisms tc consider. For example, if only bond cleavages are considered, the nusber of possible explanations for data pcints in the spectrum of a molecule is the huaber of ways of breaking combinations of the bonds in the molecule, The number of combinations of b bonds in a molecule is 2**b- 1. This number represents a lower bound on the size of the search space, since secondary processes such as group migrations will be considered in conjunction with hond cleavages. As with any large search tree, then, pruning heuristics are essential for reducing the search. Three simple pruning techniques are currently used by the progran, {1) Since the result of breaking a pair of bonds (or n bonds) is independent of the order in which the bonds are broken, allow cnly one occurrence of each bond set; {2) Since mass spectrometric processes tend to follow favorable pathways, prune any -ranch in the tree which is no longer favorable, as evidenced by failure of a fragment's mass to appear in the mass spectrum; (3) Limit the number of allowable group migrations after each cleavage. The first pruning technique is hardly a heuristic, in the sense of a risky technique, but its pruning effectiveness should be obvious. Duplications of nodes in the search space are unnecessary in this case and can be avoided by removing a bond from consideration after all possible results of breaking it have been explored. The second technique does carry an element of tisk, because mass spectremetry theory includes no guarantee that every fragment in a decomposition pathway will produce a peak in the mass spectrum. In fact, the pruning can only be done after a complete cycle of cleavage plus migration hecause these processes ~ 20 - occur together in the mass spectrometer -- without the appearance of the intermediate fragments, The third technique also is truly heuristic since there are no theoretical reasons why group migrations might not occur in complex and exotic patterns between cleavages. It has been mentioned earlier that the current limit is one group migration after any cleavage. Migrations of two and three hydrogens are known in mass spectrometry, so these will not now be found by the program, Also, limiting allowable migrations to hydrogen atoms and the plus charge is somewhat risky since there are known cases of migration of larger groups, such as methyl (CH3). The bias of mass spectroscopists toward simple mechanisms, however, leads us to telieve that they would place little faith in exotic mechanisms as explanations of mass spectral peaks, at least not without other corroborating evidence. 3. EVALUATION The search tree can be pruned effectively using the techniques just mentioned, with a substantial but not breath-taking reduction in search. Evaluation of alternative mechanisms is still necessary, either during generation or after it is completed, in order to distinguish the highly attractive explanatory mechanisms from those which are merely possible. Evaluaticn routines applied during generation can also be used to limit search it threshold values can be established with confidence, This has not been done because of our lack of experience with the program so far. But even when the mechanisms are evaluated individually during generation, the program must be prepared to evaluate sets of alternative mechanisms later to find consistent and simple explanations, Evaluation of individual mechanisms and of sets cf mechanisms will be considered separately in the following discussion, although it is clear even to the designers that these evaluation criteria are not independent. We would be very happy to have the program rediscover just those explanatcry mechanisms which exfert mass spectroscopists have postulated, and even happier if the program postulated a set which mass spectroscopists accepted as a viable alternative to their cwn. So far, however, our evaluation criteria have not been strong enough to pick out mechanisms which experts think are plausible, while giving a low ranking to the others. Without building in the biases of experts toward their current theory it is difficult to evaluate mechanisms at all. The concept of cleavage adjacent to the C=0 (alpha-cleavage) in molecules like that of Figure 1 is well-established in the mass spectrometry literature, Without putting in criteria relating score to distance from non-carbon atoms, it is difficult to discover the attractiveness of alpha-cleavage. The criteria must, it seems, be independent of any particular theory; they must be criteria for evaluating the worth of generalizations in any theoretical domain. The prograa's evaluation routine presently contains only one a priori principle. In an attempt to measure the simplicity of the statements describing mechanisms, the program counts the number of primitive mechanisms necessary to explain a peak. Thus when there are alternative explanations of the same data point, the program chooses the simplest one, that is, the one with the fewest steps. Simple cleavage is preferred to cleavage plus migration plus cleavage, for example. It is not desirable to use this ranking principle for pruning during generation, however, because there are cases where fragments can only come from complex pathways. Por example, a sisple mechanism and a complex mechanism may both explain the same peak, but a daughter peak can only be derived from the fragment along the sore ccmplex branch of the tree. The reason for this is simple: although different fragments may have the same mass {and thus explain the same peak), daughters of those fragments will not always have identical masses. So, the program must save the complex pathways as long as they may lead to fragments whose mass is not the same as the mass of fragments arrived at by a simple pathway. - 23 <- The result of the generation process as described so far, with pruning and evaluation, is a set of candidate mass Spectrometric processes for each structure which provides alternative explanations for data points in the associated mass Spectrum. For instance, the program hreaks the molecular structure shown in Pigure & at individual bonds or pairs of bonds to give the following information (atoms in the structure are numbered from left to right): MASS EXPLATNED PRCCESS 103 cleavage: C2-C1 cleavage: C6-C7 89 cleavage: s3-C2 cleavage: C5-C6 75 cleavage: Cu-CS 61 cleavage: $3-C4 60 Cleavage: C4-C5 & C2-C1 57 cleavage: C4-53 46 cleavage: S3-C4 & C2-C1 43 Cleavage: C5-c4u 42 cleavage: C42-C7 & C4-53 29 Cleavage: C2-53 28 Cleavage: C5-C6 & C4-5s3 CH3 - CH2 ~ SH ~- CH2 - CH2 - CH2 - cCH3 FIGURE 64 In this example, the program used no migrations or charge localization information, for purposes of simplicity. The program explored all simple cleavages and found peaks corresponding to every resulting fragment but two.* For each of the successful fragments, the program broke each of the remaining bonds. From all the secondary breaks considered, the resulting fragments corresponded to only four additional peaks in the Spectrum, So these four branches of the search tree were each expanded by one more siaple cleavage. None of the tertiary fragments were found in the spectrum so the program terminated. It explained 11 data points in this way, the unfragmented molecule accounted for another one, and the program left 24 data points unexplained. fight of these are explainable hy hydrogen migration in conjunction with one or two cleavages, so just this additional capability adds substantial power to the program. ee ee ee ee es ee ee ee *The CH3 fragment was produced twice but there was no peak at mass 15 in the spectrum. —_ ~— —_ — ITI. SECOND SUBPROBLEM: GENERALIZING TO ALL STRUCTURES The method of hypothesis, mentioned earlier as a vague description of scientific work, suggests that a plausible hypothesis can he successively modified in light of new experience to bring a scientist closer and closer to satisfactory - 25 - explanations of data. Apart from the problem of formulating a Starting hypothesis discussed above and the problem of terminating the rfrocedure, it is not at all clear how the adjustments are to be made nor how to select the new experiences SO aS to make the precedure relatively efficient, or at least workable, These are well-known problems in the methodology of science. In other terms, the problem of successive modifications can be viewed as a preblem of generalizing a hypothesis from one set of observations to a larger set. The Meta-DENDRAL system does not yet contain a programmed procedure which solves this seccnd main subproblem. The design of the procedure will be described only briefly in this section for this reason. The task is to generalize the explanations for each of the molecules into a ccnsistent and Simple set of explanations for the whole set of molecules. The input to this phase of the program is the set of explanatory mechanisms for each of the molecules in the original class. The output is one set of mechanisms which can explain every peak explained by the individual sets of explanations, If the program is successful, the outrut set of explanations will be a unified "theory" of the mass spectrometric behavior of all the molecules in the class. In operational terms, this means, at least, that the final set of explanations will be smaller than the union of all the individual sets of explanations. - 26 - A. REPRESENTATION Perhaps the most pressing problem is to represent the candidate explanations in such a way that they can he compared and unified, Without building in concepts which would beg the theory-formation question, The bonds in each mclecule are now named only by the arbitrary numbering of atoms, the C1-C2 bond, for example. This is a convenient representation, for bonds in a Single molecule but does not allcw comparisons of bonds between molecules. What does the C1-C2 bond in one molecule have in common with the C2-c3 bond in ancther? An attractive representation of a bond is a description of its environment: the types of atoms it links: the number of hydrogens, oxygens, nitregens and so forth on each of the linked atoms; the types of bonds emanating from the linked atoms (Single, double, triple); and so on for successive atoms and bonds away from the bond in question. This is attractive because it is easy to give the program just the right information for efficient compariscns. Rut this is the danger, too, for omitting "superfluous" information gives the program much too great a head start on the problem. It might discover what we want it to discover--the old principles-- but it would never discover anything new. -~27 - The most neutral representation for bonds we have found which still allows comparison is the pair of graph structures of the pieces of the molecule joined by the bond. Breaks (a) and (b) of Figure 1, then become: 0 il CH3 - Cc + CH2 - CH2 - CH3 (a) oad cua tc (b) Group migrations present a difficult problem. The most attractive representation ncw seems tc be representing each migration as "before and after" graph structure pairs. Graphs carry all the information about bonds and migrations; their main drawback is that they carry too much information. The ptogram which looks for common features in subgraphs must be able to Limit the amount of information it considers relevant. Some heuristics will be necessary--to limit the size of the environment that will be considered, for example. Again, at this point, there is a danger of telling the program too much about what we want it to discover, or dc not want it to discover. B. SEARCH As in any learning problem there will need to be considerable readjustment of the learned generalizations as new data are considered. [It should also be expected that learning rates will be highly dependent upon the crder of presentation of the data. It is anticipated that the program will search for generalizations in roughly the same way as our mass spectroscopist friends say they do. The program will first choose one member from the class of compounds which is judged to be "typical", but "simple". By that, a chemist means a structure which is not so small as to present an inaccurate picture of the bond environments, and not so large and highly substituted as to bring in too many special problems. The program should allow these working definitions to be easily changed. The fragmentation and migration patterns already observed for this typical member of the class will then be postulated to be typical for all mesbers cf the class. The rest of the program must look at patterns for additional solecules and adjust the initial hypothesis accordingly. - 29 - Again it should be noted that this design is certainly not the only alternative. It happens to be the one which aprears most promising, in part hecause of its closeness to the procedure acually used by experts. Thus, it is the design we are now actively considering. IV. THIRD SUBPROBLEM: INTEGRATING NEW STATEMENTS INTO EXISTING THEORY Supposing the method of hypothesis to work, the scientist's problem does not necessarily end with the satisfactory formulaticn of a general statement explaining all the observed data. If he is working in a discipline for which there is no existing theory, he might stop here. But it is rare to be out of any theoretical context. Typically, the hypotheses are formulated as extensions of some existing theory. In the course of formulating and modifying hypotheses, the existing theory usually serves as a guide for pursuing some hypotheses and not others. However, for reasons of simplicity the Meta-DENDRAL system currently ignores the existing theory until the end of its chain of reasoning. We are assuming here that the classes of problems given to the system--the data to be explained--cannot ke solved using the existing theory and that they do not conflict with the theory. These simplifying assumpticns mean that the theory formulated by the program can he built up stepwise, class by class. New statements which are formulated are assumed to be consistent with the theory. Then the main problem remaining is to fit the new statements in with the old ones in an econcnical way. One of the reasons we have rewritten the DENDRAL system's mass spectrum predictor was to separate the mass spectrometry theory from the LISP functions it drives. Making changes to the theory, then, does not require reprogramming, in the usual sense, Consequently, writing a program which updates the theory no longer seems to be an insurmountable task. The problems of integrating new statements into the old theory acre independent of the source of those statements. In order to study these problems we have written a program which {a) accepts new cules frcem human chemists and (b) updates the theory table of the program, The program for doing (a), called the dialog program, is not central tc this paper, thus this section will focus on the work to accomplish (b), updating the theory. The dialog program represents one way of obtaining new chemical information to enter into the system. The automated data~analysis and theory-formation program incorporated in - 31 - Meta-DENDRAL represents another approach. The reason for discussing the former is that it serves as a Simulation of the latter. Thus the preblems of putting new chemical information into the system can he discussed from the standpoint of the dialog program, with which we have more experience. A. REPRESENTATION The mass spectral theory is represented in the program as a table of situation-action rules (S~A rules), patterned after Waterman's table of heuristics for good poker play.* A A ee A Cl ED “me A ee ee tee ee ln *D.A, Waterman, "Generalization Learning Techniques for Automating the Learning of Heuristics", Artificial Intelligence, 1:121 (1970). Situations are predicates which are either true or false of a particular molecule or molecular fragment. (This is equivalent to Waterman's representation of situations as state vectors, but seems more economical for this program, hecause of the large number of terms which wculd have to be included in the state vector.) Actions are sequences of primitive mass spectral processes constituting rewrite rules for transforming one structural fragment intc another. In this system, an action can also be another S-A rule, allowing nesting of rules in a manner quite natural to the current textbook descriptions of mass spectral theory. ~ 32 - The chemist interacting with the dialog program in time-sharing has already been given the explanation of primitive concepts known to the system which is given in Appendix A. For example, the chemist is told that he can use the functions BREAKBOND and ADDH, and is given some explanation of them. Mostly, the names of the LISP functions which are available suggest the effects of the function to the chemist. All new rules must be expressed in terms of these concepts. This is a reasonable constraint in the dialog phase of this progran because we have been willing to add new primitives whenever a chemist found some rule he could not express otherwise. For example, the concept of n-cleavage was added as a primitive to allow expression of a gamma-cleavage rule (NCLEAVAGE (QUOTE GANMA)). This constraint also insures that the output from the dialog phase and the output of the automatic generalization program will contain the same terns. As an additional means of maintaining parallelism between the dialog program and the automatic generalization program, the dialog program writes S-A rules on the basis of information obtained from the chemist, but does not allow him to write his own rules, The dialog program prompts the chemist for the information it needs in crder to write the rules, and allows no opportunity for him to deviate from its prompting format. For each S-A rule, accompanying LISP code is written from the ~ 33 - informaticn supplied ty the chemist to define new situations and new actions, unless the chemist has merely put together old Situations and old actions in new ways. B. INTEGRATION The cutput from the dialcg phase is the same as the anticipated output from the generalization program: a set of S-A rules with accompanying definitions of situations and actions. The only expected difference is that the chemist can assign names to Situations and actions during the dialog which are neaningful to him, whereas the generalization program must create its own symbols, Inserting a new rule inte the theory table requires an initial scan of the theory to find out whether the new rule describes (1) an entirely new situation, (2) a situation which subsumes an existing one, or (3) a Situation which can be subsumed under an existing one. For cases (2) and (3), where there is some overlap between the new rule and some existing one (assuming for the moment there is cnly one), the actions of the rules must also be considered in deciding where in the table the new rule must go. After deciding where the rule must be inserted, the program adds the definitions cf the new situation name and action name to the systen. Case (1), in which the situation of the new rule does not overlap with any exisiting situation, makes the insertion of new rules a Siaple matter of appending them to the theory table. The names of the situation and action are then given to the system with definitions, and the updating is completed. Cases (2) and (3) are much more general and common, for there will often be scme overlap between new situations and existing ones. The program must decide whether the overlap is interesting or substantial enough to warrant the extra work of reshuffling all the rules in order to insert the new one. It may decide that the situation is essentially a new one, in spite of some pieces of the situation appearing elsewhere, In this case it treats the rules as a case (1) rule and merely appends it to the table. This saving in processing at update time carries a corresponding cost in duplication of some parts of the cule. But the results of executing the rules will be the same in either case, with very little difference in execution tine. It is desirable to have the pregram nest rules some of the time, however, for reasons of conceptual clarity. If a new and an old rule describe exactly the same situation, for example, it inflates the theory needlessly to separate them as (i) S --> Al, and (ii) S --> A2. What we want is one rule of the form $ --> A1GA2. Duplication of situations represents an extreme case of overlap which is also easy for the program to handle, once it - 35 - recognizes the duplication. It merely puts the second action into the action list of the old situation. Care is necessary, of course, to maintain independence of the actions so that they can be executed in any order with the same result. Work on updating the thecry has progressed this far. New rules can be added if the overlap with situations of existing rules either is null or the pregram decides it is insignificant. Or, a new rule can be aided if the new situation is a duplicate of an old one. For the other cases, the general forms of (2) and (3), there is both the difficulty of finding the appropriate place to insert the non-overlapping pieces of the situation and the non-overlapping pieces of the action, and the the difficulty of deciding which of several rules should be updated to incorporate the new one. Ve. CONCLUSION The Meta~DENDRAL program descrited here is a vehicle for studying problems of theory formation in science, It is huilt upon the concepts and programmed routines already available in the Heuristic DENDRAL performance program, which uses a scientific theory to explain analytical data in organic chemistry. The Meta-~DENDRAL system goes beyond the performance program, however, in attempting to formulate the theory which the performance program will use. The Meta~DENDRAL program works much Like a chemist who is extending his theory of mass spectrometry by looking at collections of experimental results. The data, for both the chemist and program, are mass spectra and the associated molecular structures, By selecting some "typical" examples, first-order general hypctheses abcut the whole collection of data can be proposed. Then, Ly subsequent adjustments, the generalizations are modified tc explain all the data. The new generalizations are then integrated into the existing corpus of theoretical statements in ways dictated by considerations of Simplicity and personal rfreference. The first version of the Feta-program, which is described here, suggests that the design is workable. But it accentuates the arbitrariness of our design decisions and raises the questions of what alternative designs would look like and how good they would be. It also raises a number of issues important to understanding scientific methodclogy in general. The design question is certainly one such issue. Others are questions concerning the criteria of acceptable generalizations, criteria of good scientific theories, and criteria for deciding on a set of primitive concerts for a theory. None of these general issues will he resolved satisfactorily in the context of this progran. Yet none can be resolved for this program without saying scmething about the general soluticns, APPENDIX A. PRIMITIVE CONCEPTS OF MASS SPECTROMETRY KNOWN TC THE DENDRAL PROGRAM This list is taken from an outline given to chemists who define new mass spectrometry rules for the system. The functions at the front of the list are most primitive, those at the end are more cceplex, and in fact are built cut of the simpler ones. The starred functions perform "housekeeping" on electrons and charge, without the chemist having to handle these details explicitly. to the chemist this list serves as a reminder of the names and associated syntax of the "building blocks" available to him for defining new rules. To the present reader it is meant to illustrate the concepts already programmed into the system. FONCTION ARGU MENTS DESCRIPTION > een eae OD ee eh ee ee -_ ae _ ee eS EE ele ee a ee ee ee . HOUSEKEEPING PUNCTIONS: ADDCHARGE ata Assign a positive charge to atm. ADDDOT ata Assign a free electron to ata. IONIZE atn Assign a dot and a charge to ata, PAIRELECTRONS list;nolist Look among the atoms of LIST for adjac atoms with free electrons. Pair up thi electrons to make an explicit bond unl. the pair is named in NOLIST. REMOVECHARGE atm Take away the positive charge from atm REMOVEDOT atn Remove the dot (if present) from atm FUNCTIONS FOR MANIPULATING STRUCTURE WITHOUT HOUSEKEEPING: ~ 34- ADDH CHANGEBOND JOINATOM REMOVEBOND REMOVEH atm atmisatm23n Put a hydrogen on atm. Add n (pos. or neg.) to the order of th atmt-atm2 bond. oldatmzatm;sbond;atontype;nodenun; atmtisatm2 atm Bring atm into the structure -~ attach to oldatm with bond order BOND. Give a the atom type and node number specified Remove the bond hetween atm1 and atm2. - Take a hydrogen off atm. STRUCTURAL MANIPULATICN FUNCTIONS WITH HOUSEKEEPING: ER EAKBOND BREAKRING ELIMIWNATER LOSEALPHARAD LOSENEXTRAD MAKER ING MIGRATEH NCLEAVAGE NEWBOND x * + hk & * atmtsatm2 atmisatm2 atm atp atm atmi3;atm2sbend atmisatm2 n,pct atmlsatm2 Replace the atmi-atm2 bond with a pair of electrons. Try to pair any other free electron with one of the new free electrons. To the same as BREAKBOND when it is certain that the atmi-atm2 bond is in a ring. Fliminate a hydrogen from atm, leaving a free electron. Lose the largest radical alpha to atm. Lose the largest radical adjacent to at Jcein atm? & atm2 with bond to form a ri Move a hydrogen from atm? to atm2, leav a free electron on atm1 (unless atml = ANYATOM, in which case the H comes fron Ereak the nth bonds away fron the heteroatoms in the molecule and assign intensity=pct*oldintst100. If n is 0 or (quote adjacent), the adjacent bonds are hroken, 1=({quote alg 2=({quote beta), 3=({quote gamma). Replace adjacent free electrons on atm with an explicit bond. FUNCTIONS FOR CARRYING OUT COMPLEX REARRANGEMENTS: AMINERR COELIM CO2 EL IM ELI" ELIM1 MRRFGT * tte He & atmist atmist lst Ist lst lst Migrate an H to node 2; Change bond het nodes 16 2 {+#1)3 Break bond hetween 2- Ereakbond between 3 & 2. Breakbond between 4 & 3. Remove an H from node 1; breakbond bets Fliminate an # from node 2. Change bond between nodes 16 2 {-1)3 Migrate an H from node 6 to node 2; Change bond between nodes 16 4 (#1); Break bond between nodes 4&5 5. —HO'- FUNCTIONS FOR HIGHER-LEVEL MASS SPECTROMETRIC PROCESSES: ALLCLEAVAGES ¥* nil LOSENEXTRAD * ata MAJORALPHACLEAVAGE * nil MCLAFPFERTY * nil MINORALPHACLEAVAGES * nil - 41 Break each bond in the molecule (MOLAT and return a list of masses of fragmen each consed to intensities calculated MSTHEORY. (This is the default functi to be called when no functional groups be identified.) Lose the largest radical next to atm. Consider all bonds alpha to the first - in ATMLST and break the one which lose largest radical. Find which McLafferty Rearrangements will occur (single, dbl, two dbl) and do then, Consider all bonds alpha to the first heteroatom in ATMLST and break each on except the major. BIBLIOGRAPHY (1) J. Lederberg, "DENDRAL-64 - A System for Computer Construction, Enumeration and Notation of Organic Molecules as Tree Structures and Cyclic Graphs", (technical reports to NASA, also available from the author and summarized in (12)). (la) Part I. Notational algorithm for tree structures (1964) CR.57029 (1b) Part II. Topology of cyclic graphs (1965) CR.68898 (1c) Part IYI. Complete chemical graphs; embedding rings in trees (19639) (2) J. Lederberg, "Computation of Molecular Formulas for Mass Spectrometry", Hclden-Day, Inc. (1964). (3) J. Lederberg, "Topclogical Mapping of Organic Molecules", Proc. Nat. Acad. Sci., £3:1, January 1965, pp. 134-139, (4) J. Lederberg, "Systematics of organic molecules, graph topology and Hamilton circuits. A general outline of the DENDRAL System." NASA CR-~48899 (1965) (5) J. Lederberg, "Hamilton Circuits cf Convex Trivalent Polyhedra (up to 18 vertices), Am. Math. Monthly, May 1967. (6) G. L. Sutherland, "DENDRAL - A Computer Program for Generating and Filtering Chemical Structures", Stanford Artificial Intelligence Project Memo No. 49, February 1967. (7) J. Lederberg and E. A. Feigenbaum, "Mechanization of Inductive Inference in Organic Chemistry", in B. Kleinmuntz (ed) Formal Representations for Human Judgment, (Wiley, 1968) (also Stanford Artificial Intelligence Project Memo No. 54, August 1967). (8) J. Lederberg, “Online computation of molecular formulas from mass number." NASA CR-94977 (1968) (9) BE. A. Feigenbaum and 8. G. Buchanan, "Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry", in Proceedings, Hawaii International Conference on System Sciences, B. K. Kinariwala and F. F. Kuo (eds), University of Hawaii Press, 1968. (10) B. G. Buchanan, G. L. Sutherland, and FE. A. Feigenhbaun, “Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry". In Machine Intelligence 4 (B. Meltzer and D. Michie, eds) Edinburgh University Press (1969), (also Stanford Artificial Intelligence Project Memo No. 62, July 1968). (11) E. A. Feigenhaun, “Artificial Intelligence: Themes in the Second Decade", In Final Supplement to Proceedings of the IFIP68 International Congress, Edinburgh, August 1968 (also Stanford Artificial Intelligence Project Memo No. 67, August 1968). (12) J. Lederberg, “Topology cf Molecules", in The Mathematical Sciences - A Collection of Essays, (ed.) Committee on Support of Research in the Mathematical Sciences (COSRIMS), National Academy of Sciences - National Research Council, M.I.T. Press, (1969), pp. 37-51 * (13) G. Sutherland, “Heuristic DENDRAL: A Family of LISP Programs", to appear in D. Bobrow (ed), LISP Applications (also Stanford Artificial Intelligence Project Memo No. 80, March 1969). (14 J. Lederberg, G. L. Sutherland, B. G. Buchanan, Ff. A. Feigenbaum, A. V. Robertson, A. M. Duffield, and C. Djerassi, “Applications of Artificial Intelligence for Chemical Inference I. The Number of Possible Organic Compounds: Acyclic Structures Containing C, H, O and A", Journal of the American Chemical Society, 91:11 (May 21, 1969). (15) A. M. Duffield, A. V. Robertson, C. Djerassi, B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, and J. Lederberg, "Application of Artificial Intelligence for Chemical Inference IT. - 44 - Interpretation of Low Resclution Mass Spectra of Ketones". Journal of the American Chemical Society, 91:11 (May 21, 1969). (16) B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, "Toward an Understanding of Infcrmation Processes of Scientific Inference in the Context of Organic Chemistry", in Machine Intelligence 5, (B. Meltzer and D. Michie, eds) Edinburgh "niversity Press (1969), (also Stanford Artificial Intelligence Project Memo No. 99, September 1969). (17) J. Lederberg, G. L. Sutherland, B. G. Buchanan, and E. A. Feigenbaum, "A Heuristic Program for Solving a Scientific Inference Problem: Sugmary of Motivation and Implementation", In Theoretical Approaches to Non-Numerical Problem Solving {R. Banerji and M.D. Mesarovic, eds.) Springer-Verlag, New York (1970). (Also Stanford Artificial Intelligence Project Memo No. 104, November 1969.) (18) C. W. Churchman and B. G. Buchanan, “On the Design of Inductive Systems: Some Philosophical Problems". British Journal for the Philosophy of Science, 20 (1969), pp. 311-323. (19) G. Schroll, A. M. Duffield, C. Djerassi, B. G. Buchanan, G. L. Sutherland, E. A. Feigenbaum, and J, Lederberg, "Application of Artificial Intelligence for Chemical Inference ITI. Aliphatic Ethers Diagnosed by Their Low Resclution Mass Spectra and NMR - 45 - Data". Journal of the American Chemical Society, 91:26 (December 17, 1969). (20) A. Buchs, A. Me. Duffield, G. Schroll, C. Djerassi, A. B. Delfino, 8. G. Buchanan, G. L. Sutherland, FE. A. Feigenhkaum, and J. Lederberg, “Applications of Artificial Intelligence For Chemical Inference IV. Saturated Amines Diagnosed hy Their Low Resclution Mass Spectra and Nuclear Magnetic Resonance Spectra", Journal of the American Chemical Society, 92:23 (Nov. 18, 1970). 1970. (21) Y¥. M. Sheikh, A. Buchs, A. B. Delfino, B. G. Buchanan, G. L. Sutherland, FE. A. Feigentaum, and J. Lederberg, "Appli- cations of Artificial Intelligence for Chemical Inference V. An Approach to the Computer Generation of Cyclic Structures. Differentiation Between All the Possible Isomeric Ketones of Composition C6H100". Organic Mass Spectrometry (in press). (22) A. Buchs, A. B. Delfino, A. M. Duffield, C. Djerassi, B. G. Buchanan, EK. A. Feigenbaum, and J. Lederberg, "Appli- cations of Artificial Intelligence for Chemical Inference VI. Approach to a General Method cf Interpreting Low Resolution Mass Spectra with a Computer", Helvetica Chemica Acta, %33:6 (1970). (23) E. A. Feigenbaum, B. G. Buchanan, and J. Lederberg, "On Generality and Problem Solving: A Case Study Using the DENDRAL Program", In Machine Intelligence 6 (B. Meltzer and D. Michie, eds.) Edinburgh University Press {in press). (Also, Stanford Artificial Intelligence EFroject Memo No. 131.) (24) B. G. Buchanan and T. §. Headrick, "Some Speculation about Artificial Intelligence andi Legal Reasoning". Stanford Law Review, November, 1970. (Also, Stanford Artificial Intelligence Project Memo No. 123.) (25) B. G. Buchanan, A. M. Duffield and A. V. Rokertson, "An Application cf Artificial Intelligence to the Interpretation of Mass Spectra. In MasS Spectrometry, 1970 (G.W. Milne, ed.) Wiley (in press). - ay-