i 12 HEURISTIC DENDRAL : a Program for Generating Explanatory Hypotheses in Organic Chemistry B. Buchanan Georgia Sutherland and E. A. Feigenbaum Computer Science Department, Stanford University A computer program has been written which can formulate hypotheses from a given set of scientific data. The data consist of the mass spectrum and the empirical formula of an organic chemical compound. The hypotheses which are produced describe molecular structures which are plausible explanations of the data. The hypotheses are generated systematically within the program’s theory of chemical stability and within limiting constraints which are inferred from the data by heuristic rules, The program excludes hypotheses inconsist- ent with the data and lists its candidate explanatory hypotheses in order of decreasing plausibility. The computer program is heuristic in that it searches for plausible hypotheses in a small subset of the total hypothesis space according to heuristic rules learned from chemists. INTRODUCTION The computer program described below resulted from an interest in studying scientific hypothesis formation as a decision-making process. To make pro- gress on this broad and general problem, it seemed useful to choose a particu- lar scientific task involving inductive behaviour and to explore it in as much detail as possible. The task chosen is in a well defined but relatively new and complex area of organic chemistry, namely the analysis of mass spectra of organic molecules. HEURISTIC DENDRAL is a computer program which generates molecular ‘graphs’ (i.e., structures) as hypotheses to explain the data produced by a mass spectrometer. The data produced when a mass spectrometer fragments molecules of a chemical sample can be interpreted as a list of masses of fragments paired P 209 MACHINE LEARNING AND HEURISTIC PROGRAMMING 1007 80T _ i li ‘| | I to ] 40 80 mie 120 Composition: CsH16O Molecular Structure: CH3—-CH2—-C—-CH2—-CH2--CH2—CH2—CH3 Computer Representation of the Mass Spectrum: (41. (53. (57. (72. (86. 18)(42 . 7)(43 . 100)(44 . 3) 3)(54 .1)(55 . 11)(56. 3) 80}(58 . 2)(70. 1)(71 . 36) 44)(73 . 5)(81 . 1)(85 . 6) 2)(99 . 31)(100 . 2)(128 . 5)) Figure 1. The mass spectrum for 3-OCTANONE with their relative abundances. An example of a mass spectrum is shown in figure 1. By studying the resulting list of number pairs, chemists can infer the molecular structure of the chemical sample. Some of the decision processes which chemists use in making such inferences are incorporated into a com- puter program along with a structure-generating algorithm which provides a systematic approach to the problem of deducing the structure of a chemical sample. The computer program is HEURISTIC DENDRAL; and it is now 210 BUCHANAN, SUTHERLAND AND FEIGENBAUM capable of making inferences from mass spectra to molecular structures in a restricted domain. The foundation for the HEURISTIC DENDRAL program is Lederberg’s (1964) DENDRALAlgorithm (section 7 contains a summary of this algorithm). The algorithm gives a way of representing and ordering chemical molecules uniquely; thus it gives a method for generating all topologically possible molecules of a given composition without redundancy. It is a systematic and exhaustive topologist which can generate all non-cyclical graphs that can be made with the atoms of the composition, knowing no chemistry other than the valences of these atoms. The DENDRAL algorithm defines the hypothesis space in much the same way as a legal move generator for a chess-playing program defines the total move space within which good chess moves will be sought. The computer program is written in the Lisp language on the PDP-6 computer at the Stanford University Artificial Intelligence Laboratory. It occupies approximately eighty thousand words of memory. Working with Professor Joshua Lederberg at Stanford, William Weiher and William] White developed the basic representation and wrote the initial program. The program has also benefited greatly from the attention of members| of the Stanford Mass Spectrometry Laboratory: Professors Carl Djerass{, Alex Robertson, Jerry Meinwald, and especially Dr Alan Duffield. The program itself is segmented into five subprograms: the PRELIMINARY INFERENCE MAKER, the DATA ADJUSTOR, the STRUCTURE GENERIATOR, the PREDICTOR, and the EVALUATION FUNCTION. The interrelation of these subprograms is shown in figure 2. The PRELIMINARY INFERENCE MAKER (described in section 1) examines a spectrum and determines what general classes of chemical substrpctures are confirmed or disconfirmed by the data. All hypothesized structures generated later by HEURISTIC DENDRAL must contain all the inflicated substructures (all of which are put on a list called GOODLIST); structure may contain any substructure which is disconfirmed by t trum. (All the forbidden substructures are put on a list called BADLIST.) The DATA ADJUSTOR subprogram (described in section 2) chooses signi- ficant spectral peaks for the STRUCTURE GENERATOR to use for its Zero Order Theory. At present there are four independent ways of interpreting the spectrum. The STRUCTURE GENERATOR (see section 3) uses the information de- duced by the PRELIMINARY INFERENCE MAKER and the DATA ADJUSTOR to produce a list of all topologically possible chemical structures which are consistent with the spectrum. The consistency criteria are the lists of] ‘good’ and ‘bad’ substructures and the Zero-Order Spectral Theory, described in detail in section 3.2. The PREDICTOR subprogram is a rough model of a mass spectrometer (see section 4). It is used to predict significant features of the mass spectrum 211 MACHINE LEARNING AND HEURISTIC PROGRAMMING Mass Spectrum -+ Composition -- DATA ADJUSTOR PREDICTOR (applied to each candidate) PRELIMINARY INFERENCE MAKER List of Predicted Mass Spectra Functional Functional Groups Present Groups Absent (on GOODLIST) (on BADLIST) Consistency Check WL STRUCTURE GENERATOR List. of Candidates whose predicted spectra are consistent with the original spectrum List of Plausible Candidates EVALUATION FUNCTION List of Candidate Structures ranked from most preferred to least preferred Figure 2, General design of HEURISTIC DENDRAL 212 BUCHANAN, SUTHERLAND AND FEIGENBAUM corresponding to each candidate structure output by the STRUCTURE GENERATOR. The EVALUATION FUNCTION (see section 5) compares each predicted spectrum against the original spectral data and assigns a score representing similarity of the two spectra. This enables the candidate hypotheses output by the STRUCTURE GENERATOR to be listed in order of their ‘plausibility’ or estimated degree of confirmation. An ideal program for deducing the structure of a chemical sample would output exactly one structure as the explanation for the spectral data. Up to now the usual case has been that several different structures are suggested as plausible explanations for the data. However, even a short list is a far better result than was obtained by the original program, which listed all the topologically possible structures and made no use of any real data at all. Because the constraints which have been included in the program to limit the search space are heuristic, nothing guarantees that the correct structure will not be bypassed. When a test run does fail, however, the program is modified after our chemist-informants study the output and analyze their own decision procedures. The purpose of this report is merely to describe the current state of HEURISTIC DENDRAL and to sketch some of our plans for future program developments. 1. THE PRELIMINARY INFERENCE MAKER The PRELIMINARY INFERENCE MAKER is conceptually very simple: it looks for the presence or absence of sets of peaks in a mass spectrum and updates GOODLIST or BADLIST, thus constraining HEURISTIC DENDRAL from generating large numbers of molecular structures as possible explana- tions of a given mass spectrum. By looking for patterns of peaks in the spec- trum which are characteristic of some structural fragment, such as the keto group, this preliminary program can tell the STRUCTURE GENERATOR to concentrate on some fragments and to avoid others. It does this by temporarily putting desirable structures on GOODLIST (see section 3.5) and undesirable ones on BADLIST (see Section 3.3). The program has access to translations of Tables 1 and 2. As Table 1 indicates implicitly, this program knows the name, structure, valence, valence locations, empirical formula, and symmetrical atoms, as well as some char- acteristic peaks for several functional groups. It also recognizes priorities of groups, as Table 2 indicates. Addition of new information is simplified by a short routine (QUEST) which asks the chemist at the console for the essential information — and explains what it wants if he does not understand. An example is shown in Table 3. In this example the function QUEST is called to prompt information about identifying a new group, in this case the ester group. The lines preceded by asterisks are messages from the machine. Lines following a machine prompt (i.e., after a colon) were typed in from the console. This dialogue is much 213 MACHINE LEARNING AND HEURISTIC Functional group and characteristic subgraph A. KETONE Oo | —C— B. METHYL-KETONE3 oO H I L | CH3—C—CH2—C—C— | Cc. ETHYL-KETONE3 Oo | | i CH3—CH2—C—CH2—C—c-— | | D. N-PROPYL-KETONE3 Oo / H CH3—CH2-—CH2,—_C—CH;,— C—C— | | E, ISO-PROPYL-KETONE3 CH3 O | | | CH—C—CH3—C—C— | CH3 F. ALDEHYDE oO | | —C—CH G. ETHER | | —C—9—C— | | H, ETHER2 —CH2,—O—-CH2— I. METHYL-ETHER2 —CH2—O—-CH3 J. ETHYL-ETHER2 —CH»—O—CH2—CH3 K. PRIMARY-AMINE2 —CH2—NH3 BYNES _ PUNT URW WRU PROGRAMMING identifying conditions . There are 2 peaks at mass units xl & x2 such that a. x1 +x2=M-+28 b. x1—28 is a high peak c. x2—28 is a high peak d. At least one of x1 or x2 is high Ketone conditions are satisfied . 43 is a high peak . 58 is a high peak M — 43 is a low peak M-—15 is low or possibly zero Ketone conditions are satisfied . 57 is a high peak . 72 is a high peak M— 29 is a high peak M-—57 is a high peak 71 is a high peak 43 is a high peak 86 is a high peak 58 appears with any intensity 71 is a high peak 43 is a high peak 86 is a high peak There is no peak at 58 . M—44 is a high peak . 44 is a high peak 1. M—17 is absent Ne we BWNe . M—18 is absent . Ether conditions are satisfied . There are 2 peaks at x] & x2 such that a. xl—x2=M-+ 44 b. At least one of x1 or x2 is high . Ether2 conditions are satisfied . 45 is a high peak M-—15 is low or possibly zero M-—1 appears (any intensity) . Ether2 conditions are satisfied . 59 is a high peak M-—15 appears (any intensity) . 30 is a high peak . No other peak is high Table 1. Important chemical groups and their identifying conditions 214 functional group and characteristic subgraph L. SECONDARY-AMINE2 —CH2—N—CH2— H M. TERTIARY-AMINE2 —CH2—N—CH2— | CH3 N. ISOPROPYL-2ARY-AMINE2 CH; —CH»—N—CH f H CH; O. ISOPROPYL-3ARY-AMINE2 CH3 —CH;—N—CH | \ CH3 CH; P. ALCOHOL ——C—OH | Q. PRIMARY-ALCOHOL -—CH2--OH R. C-2-ALCOHOL OH | —-CH—CH3 Table !. (contd.) BUCHANAN, SUTHERLAND AND FEIGENBAUM identifying conditions . There are 2 peaks at x1 & x2 such that a. x1+x2=M+43 b. At least one of x1 or x2 is high . 30 is a high peak 1, There are 2 peaks at x1 & x2 such why WN Nv N—- NS Ne WwW 215 that a. xl4+x2=:M+71 b. At least one of x1 or x2 is high . 44 is a high peak . 44 is a high peak . 72 is a high peak . M—15 isa high peak . 58 is a high peak . 86 is a high peak . M—15 isa high peak . M is low or possibly zero . Either 4—18 or M—17 appears (any intensity) . M—46 appears (any intensity) . Alcohol conditions are satisfied . The 31 peak is approximately 10% . Alcohol conditions are satisfied . 45 is a high peak MACHINE LEARNING AND HEURISTIC PROGRAMMING A. Family Trees: priorities within families 1. Ketone | | | Methyl-Ketones Ethyl-Ketones Propyl-Ketones 2. Ether | Ether 2 | | | | Methyl-Ether 2. Ethyl-Ether 2 Ether 5 3. Alcohol es ee | | Primary-Alcohol C-2-Alcohol B. Hierarchies of families Ketone > Ether (i.e., check Ethers only if the Ketone tests fail). Table 2. Family priorities (QUEST) * THIS PGM REQUESTS INFORMATION TO ALLOW THE INFERENCE MAKER TO PUT RADICALS ON GOODLIST OR BADLIST ON THE STRENGTH OF THE APPEARANCE OR NON-APPEARANCE OF A FEW SPECTRAL LINES. IF YOU DO NOT KNOW THE PROPER FORM FOR YOUR ANSWER TO ANY REQUEST TYPE A QUESTION MARK. IF YOU MAKE A MISTAKE IN ALINE YOU CAN CORRECT IT WHEN QUEST IS DONE BY CALLING THE FUNCTION ‘‘CHGPROPS’’ OF ONE ARGUMENT, THE GROUP NAME. E.G., (CHGPROPS (QUOTE KETONE)) *NAME OF FUNCTIONAL GROUP OR RADICAL: 9 * ANY ATOMIC NAME WILL SUFFICE: ESTER * PEAKS WHOSE ABSENCE [INDICATES THE ABSENCE OF THIS GROUP: 2?* THE PGM WANTSA LIST OF DOTTED PAIRS INDICATING MASS-INTENSITY PAIRS IT SHOULD LOOK FOR. MASS UNITS MAY BE SPECIFIED AS 1. ANUMBER, 2. M (THE MOLECULAR WT), OR 3. A LIST OF THREE ELEMENTS: 3.1 THE LETTERM 3.2 SLASH AND MINUS (OR PLUS) SIGN 3.3 A NUMBER (TO BE SUBTRACTED OR ADDED TO THE M). JTable 3. Conveying information to the PRELIMINARY INFERENCE MAKER 216 BUCHANAN, SUTHERLAND AND FEIGENBAUM INTENSITY UNITS MAY BE SPECIFIED AS 1. A NUMBER BETWEEN © AND 100 INCLUSIVE, 2. THE WORD ‘‘ANY’’ (ANY INTENSITY ABOVE 0), 3. THE WORD “LOW”’ (INTEN BETWEEN # AND 5 INCL), 4. THE WORD ‘“‘HIGH’’ (INTEN BTW 11 AND 190 INCL), OR 5. THE WORD ‘‘Poss@ ’’ (INTEN LOW OR ZERO) FOR EXAMPLE THIS LIST WOULD BE ACCEPTABLE: ((45.. HIGH) ((m / —45).9) (m. Low)) *PEAKS: ((45 . 9) (66 . 8)) *#pEAKS SUFFICIENT TO INDICATE THIS GROUP: 2* TYPE ‘‘SAME’” IF THE NECESSARY CONDITIONS ARE ALSO SUFFICIENT, “*NIL’’ IF THERE ARE NO SUFF CONDITNS, OR °°???” IF YOU WANT THE CONDITION LIST EXPLAINED: SAME * STRUCTURE IN LIST NOTATION: ? *TYPE A LIST WHOSE FIRST ELMT Is 1 AND INDICATE ALL HYDROGEN ATOMS. E.G., (Ic (1 4) (1 8) (1 B)) FOR THE METHYL RADICAL. * STRUCTURE: (1 c(20) (1 0(1 c))) *VALENCE!? *IF ALL FREE BONDS ARE ON ONE ATOM, TYPE THIS NUMBER. OTHER- WISE TYPE A LIST OF NUMBERS INDICATING HOW THE FREE BONDS ARE SPLIT UP, READING THE STRUCTURE YOU TYPED FROM LEFT TO RIGHT, IGNORING ATOMS WITH NO FREE VALENCE: (1 2) *LIST OF SYMMETRIES: ? *TYPE A LIST OF THE FORM ((3 2. 1)) TO INDICATE THAT THE FIRST & THIRD ATOMS WITH FREE VALENCE ARE SYMMETRICAL :() * PLACE OF GROUP IN ITS FAMILY: ? *TYPE A LIST CONSISTING OF (1) THE NAME OF THE NEXT HIGHER FAMILY MEMBER OR NIL (2) THIS GROUP NAME (3), (4), ... CN) NAMES OF ALL NEXT LOWER MEMBERS. E.G., FOR ETHER2: (ETHER ETHER2 METHYL-ETHER2 ETHYL-ETHER2) *PLACE OF GROUP IN ITS FAMILY !NIL *THANKS CALL AGAIN (CHGROPS (QUOTE ESTER)) *pROPERTY TO CHANGE:? *PROPERTY NAME * DESCRIPTION (TYPE ONE) NESS LIST OF PEAKS WHOSE ABSENCE INDICATES ABSENCE OF GROUP SUFF LIST OF PEAKS INDICATING THE PRESENCE OF THIS GROUP FORM EMPIRICAL FORMULA AS A LIST OF DOTTED PAIRS VALENCE A SINGLE NUMBER OR A VECTOR STRUCT STRUCTUREIN LIST NOTATION SYM LIST OF SYMMETRIES *PROPERTY: VALENCE *VALUE: (1 3) *REPLACE (R) OR ADD (A)?R (1 3) * PROPERTY TO CHANGE: NIL DONE Table 3. (contd.) MACHINE LEARNING AND HEURISTIC PROGRAMMING shorter when the user knows the correct form for his response (and does not type a question mark). If the user calls QUEST | instead of QUEST, the machine begins with the first prompt, bypassing the initial descriptive sentences. If the user wishes to change any information typed in previously, he calls the function CHGPROPS. The PRELIMINARY INFERENCE MAKER is given as input the spectrum, the empirical formula of the molecule, and a noise threshold to apply to the spectrum.! The Lisp function INFER accepts this information and controls the subsequent inferences about the presence or absence of the structural groups. The program performs the following three tests for each structure: 1. Is the empirical formula of the structure compatible with the empirical formula of the molecule? If not, get the next structure. 2. Is any necessary condition falsified by the spectrum? Jf so, put this structure on BADLIST and get the next structure. 3. Are all sufficient conditions satisfied by the data? If so, put this structure on GOODLIST and get the next structure. Note: at present all sets of conditions are both necessary and sufficient. The Family Trees shown in Table 2 reduce the effort of the PRELIMINARY INFERENCE MAKER and eliminate redundant effort in the STRUCTURE GENERATOR. When the spectral data indicate that a group is absent from the structure (resulting in the addition of this group to BADLIST), no lower members of the same family are even checked. On the other hand, if both a higher and a lower member of a family are indicated by the data (resulting in the addition of both groups to GOODLIST), only the lower, more specific, group is used. For example, if both of the subgraphs named ETHER and ETHER2 are on GOODLIST, the program deletes the more general one, ETHER, sifice ETHER2 constrains structure generation more; that is, there are fewer isomers of a given composition containing —CH»,—O—CH?z— than there are which contain bot | | The family hierarchy list also reduces the effort of this program. If any member of the first family is on GOODLIST, no members of the second family are even checked. In the cases of the general Ketone, Ether2, Secondary-Amine2, and Tertiary-Amine2 subgraphs, the preliminary inference maker can, in fact, isolate the position of the functional group as well as determine which 1 Spectral peaks are deleted if their amplitudes are lower than the threshold. This option has not yet been exercised since it may confuse the inference maker. A threshold value of 1 bypasses this option. 218 BUCHANAN, SUTHERLAND AND FEIGENBAUM functional group is present. It cando this because of highly favourable alpha- cleavage (cleavage of the bond between the carbon atom attached to the heteroatom and the rest of the molecule) which is an identifying condition for each of these subgraphs. For example, in the ketone shown below, the program can tell that the keto radical (C=O) is between some C3H7 structure and some C4Ho structure, even though it cannot specify terminal radicals uniquely. Cc 1 Cc H H cH—C—cH¢ ° CH; CH»—CH3 This positional information is passed to the STRUCTURE GENERATOR’S partitioning routine which is discussed in section 3.4. The effect in this case is that the only ketones which will be generated are those with the keto group bounded by three carbon atoms on one side and four carbon atoms on the other. (INFER (QUOTE C8H1 60) §:99846 1) * GOODLIST= (*ETHYL-KETONE3*) *BADLIST=(*C-2-ALCOHOL* *PRIMARY-ALCOHOL* *ETHYL-ETHER2* *METHYL- ETHER 2* *ETHER2* *ALDEHYDE* *ALCOHOL* *ISO-PROPYL-KETONE3* *N-PROPYL- KETONE3* *METHYL-KETONE3*) (JULY-4-1968 VERSION) C2* ETHYL-KETONE3*H8 MOLECULES NO DOUBLE BOND EQUIVS 1. cH2.. cCH2.c3H7 c=—.0oCcC2HS, 2. cH2.. cH..cH3C2H5 c=.0C2H5, 3. CH2.. CH2.CH..CH3 CH3 CGC §.0C2H5, DONE s:99046 ((41.. 18.) (42...7.) (43... 190.) (44... 3.) (53... 3.) (54... 1.) (55..11.) (56... 3.) (57. . 80.) (58... 2.) (70..1.)(71.. 36.) (72... 44.) (73... 5.) (81... 1.) (85... 6.) (86. .2.) (99... 31.) (106. . 2.) (128... 5.)) Table 4. Example from the PRELIMINARY INFERENCE MAKER Although the chemical heuristics used in this program are more like suggestions than rules, they have demonstrated their usefulness in a number of trials. The results of one of these trials appear in Table 4. The dashed line separates the lines printed by the PRELIMINARY INFERENCE MAKER from the lines printed by the sTRUCTURE GENERATOR. The complete output for this example is discussed in detail in section 6. In this case, total output is reduced from 698 isomers! to 3 isomers as a result of applying the PRELIM- INARY INFERENCE MAKER. 1The number of chemically stable acyclic structures with empirical formula CgHisO is 698; the total number of topologically possible graphs which satisfy just the valence restrictions is 790. Section 3 discusses the program which generates these structures. 219 MACHINE LEARNING AND HEURISTIC PROGRAMMING The program was given the mass spectrum and empirical formula of 3- octanone. $:09046 is the mass spectrum for the structure: oO | C—C—C—C—-C—C—C—-O The first output structure is the correct structure for this spectrum. The rest of the output structures are other ethyl-ketones, because of GOODLIST. 2. THE DATA ADJUSTOR The DATA ADJUSTOR subprogram determines which mass points of a real spectrum are significant enough to be used by later programs. This process is separable into three steps: 1. Determine the mass of the molecular ion (4). If this number is not in the real spectrum, insert it with large amplitude. 2. Delete peaks at impossible mass points. Specifically, delete peaks at 1,2,..., 10, 11, 19,..., 23, M—1, M—2,..., M—23. 3. Delete all but the most significant peaks. Significance has to be decided without knowledge of the molecular structure of the sample producing the spectrum. Four methods of determining significance are included at present, with the choice of method being left to the program user. (i) The Threshold Method selects those mass points which have amplitudes higher than a certain number. (ii) The Biemann Method selects the two mass numbers with highest amplitudes in each interval of 14 mass numbers. (iti) The Lederberg Method selects the » mass numbers with highest amplitudes. The number » depends upon the number of atoms in the chemical composition. n= 1e(count -1)_ 1 (iv) The fourth method allows the user to specify the number of mass points to be used (the # highest peaks). Each of these four methods reduces the real spectrum to a set of mass numbers judged to be the ‘significant peaks’ in the data. This revised spectrum is then given to the STRUCTURE GENERATOR which treats it as the data to guide the process of generating structures. The data-adjusting routine is invoked by calling the function REALSPEC with three arguments. The first two arguments specify the composition and the spectrum. The third argument indicates which method to use. The four possibilities for the third argument are: (i) —m-—use threshold m (ii) 7 —use Bieman’s method (iii) NIL — use Lederberg’s method (iv) -take the n highest peaks. 220 BUCHANAN, SUTHERLAND AND FEIGENBAUM The relative merit of these methods has not been determined. In the examples processed so far, it appears that the sTRUCTURE GENERATOR needs only a few of the mass points in a typical spectrum. 3. THE STRUCTURE GENERATOR The DENDRAL Algorithm described in section 7 is a procedure for generating all of the topologically possible acyclic structures (isomers) of a chemical composition. This algorithm is based on a canonical notation for chemical structures and an ordering procedure which determines which of two canon- ical structures is ‘higher’. The STRUCTURE GENERATOR is a computer program implementing the DENDRAL algorithm but with the inclusion of heuristic constraints to pre- vent the program from generating structures which are incompatible with chemical theory or mass spectral data. Applying these constraints in the course of structure generation greatly increases the efficiency of the program, de- creasing amount of output and total run time by several orders of magnitude. The STRUCTURE GENERATOR is designed to solve the following problem: GIVEN: a list of defined atoms with their valences and weights a composition (empirical formula) a spectrum (mass numbers only) a list of likely substructures a list of impossible substructures TASK: generate all structures compatible with the given data. If there are no data-oriented lists of likely or impossible substructures and no spectral data, the program generates all structural variants (iso- mers ) of the given composition. INSTRUCTIONS: 1. Make certain that the composition is compatible with the spectrum, if there is one. 2. Consider only those structures which have exactly the types and amounts of atoms specified by the composition. 3. If certain substructures are required, remove their atoms from the composition and insert the name of a ‘superatom’ to repre- sent that substructure. Be sure the superatom substructure is compatible with the spectrum. After generating a structure containing superatoms, translate superatoms into the original substructures before printing the output. 4. If the partitioning option is to be exercised, consider all sub- groupings (partitions) of the composition. Determine whether a given partition is ‘plausible’. Generate only those structures which come from plausible partitions, 5. Generate substructures to combine into isomers. The isomers must contain no forbidden substructures; and each substructure must be compatible with the spectrum, if there is one. 221 MACHINE LEARNING AND HEURISTIC PROGRAMMING 6. Provide the user of the program periodic opportunities to observe and change the direction of structure generation. (Optional) 7. Remember past work. (Optional) The mechanisms for following these instructions are described in the following sections. 3.1. Brief description of the structure-generating algorithm The basic steps for generating chemical structures are to generate radicals (structures with a free bond) and to connect radicals to make larger struc- tures. Radicals are generated recursively from a composition list of atoms by deciding on the first atom (apical node) and free bond (afferent link) and then making one or more radicals out of the remaining composition. The function GENRAD! constructs a single radical by this method; MAKERADS constructs two, three, or four radicals from a single composition; and GENMOL determines the center of a molecular structure and causes two or more radicals to be constructed to attach to the center. The function UPRAD takes a radical and returns the next higher radical which can be made from the same elements. UPMOL does the same for mole- cules. The function ISOMERS causes all the structures for a given composition to be generated and printed in ascending canonical order. The program’s constraints are controlled by a number of switches (global variables ) which are pre-set before calling soMERS. The switches are named: SPECTRUM, GOODLIST, BADLIST, NOPARTS, DIALOG, DICTSWITCH, and OUTCONTROL. Individual constraints may be bypassed at the discretion of the user of the program. When all constraints are turned off, the sTRUCTURE GENERATOR becomes a routine graph maker, generating an exhaustive list of all possible acyclic graph structures of m nodes, where different nodes may have different numbers of links (valence). The switch settings for uncon- strained program operation are: (SETQ SPECTRUM NIL) (SETQ GOODLIST NIL) (SETQ BADLIST NIL) (SETQ NOPARTS T) (SETQ DIALOG (QUOTE OFF)) (SETQ DICTSWITCH (QUOTE OFF)) (SETQ OUTCONTROL (QUOTE OFF)) 1 The Lisp functions which perform certain operations will be identified in this report. To simplify the discussion, however, their arguments and operation will not be discussed. A separate paper lists all L1sp functions contained in the STRUCTURE GENERATOR and outlines their use. 222 BUCHANAN, SUTHERLAND AND FEIGENBAUM 3.2. The SPECTRUM and the Zero-Order Theory of Mass Spectrometry The SPECTRUM of the STRUCTURE GENERATOR is a Single list of numbers, corresponding to significant mass numbers in the real spectrum. The DATA ADJUSTOR sub-program provides the STRUCTURE GENERATOR with this list of numbers, all of which have equal importance as far as the STRUCTURE GENERATOR is concerned. The SPECTRUM is consulted to confirm the presence of compositions and radicals. The first reference to SPECTRUM is by the function ISOMERS which must make certain that the mass of the input composition is present. If it is not, then no structures can be generated for that composition. Any smaller composition can be made into structures if it is not inconsistent with the Zero- Order Theory described below. This constrains the program to consider only those sub-compositions which have some promise of leading to structures compatible with the SPECTRUM. The Zero-Order Theory assumes that every bond of a structure to which it applies will break (one bond at a time) and that at least one of each pair of substructures obtained from a single break will contribute its mass to the spectrum. The Zero-Order Theory does not apply to double bonds, triple bonds, or bonds leading to certain small structures. That is, in order for a structure to be consistent with the Zero-Order Theory, at least one of the following conditions must be met; The structure contains exactly one atom other than hydrogen. The afferent link is greater than 1. The mass of the structure is less than 30. The mass of the structure is in the SPECTRUM list. . The complement mass of the structure is in the SPECTRUM list. AR YN = This Zero-Order Theory of Mass Spectrometry is crude but easily imple- mented. A more elaborate spectral theory is contained in the PREDICTOR (section 4), but obtaining such a spectrum for an arbitrary structure con- sumes more computer time than would be practical in a program such as the STRUCTURE GENERATOR. The Zero-Order Theory is sufficient to limit the output of the STRUCTURE GENERATOR to a small class of hypotheses, but it will need major revisions before it can be classed as a ‘smart’ limiting heuristic. To make use of spectral information in the STRUCTURE GENERATOR it is merely necessary to execute (SETQ SPECTRUM L) where L is a list of integers, corresponding to the desired mass numbers. To terminate use of spectral information, execute (SETQ SPECTRUM NIL). When structure generation is proceeding in the presence of a SPECTRUM, the work that is remembered for future reference (see section 3.7 describing the dictionary) is independent of the spectral data, so it is permissible to use several different spectra in succession in the same program core image. 223 MACHINE LEARNING AND HEURISTIC PROGRAMMING 3.3. Preventing the generation of forbidden substructures Some chemical structures are so implausible (unstable) that they would never exist, either alone or imbedded within any larger structure. The STRUCTURE GENERATOR has a list (BADLIST) of these implausible struc- tures; and no output of the sTRUCTURE GENERATOR will contain any substructure on BADLIST.! The STRUCTURE GENERATOR avoids generating structures containing forbidden substructures by checking rigorously before attaching new atoms to a piece of structure. At every step in generating a radical, the program knows the partially built structure and can determine whether the atom and bond which are about to be attached to it will include one of the forbidden substructures. The following process insures that no forbidden structures will be formed: 1. The partial structure is guaranteed to be plausible because of previous checking. 2. Form the new partial structure by adding the next bond and atom. 3. Consider all elements of BADLIST which have a top atom identical to the atom just added to the partial structure. 4. For each such element of BADLIST, compare the radicals which are attached to the top atom of the new structure with the radicals attached to the top atom of the BADLIST structure. 5. If every radical on the BADLIST structure is found in the list of radicals on the new structure, then the new structure must be rejected. 6. Rejecting a structure means that it is necessary to change either the added bond or the additional atom (or both) in order to generate an allowable structure. Note that this process prevents the STRUCTURE GENERATOR from creating many implausible molecules, since the addition of each new node causes a check to be made for forbidden substructures including that node. Usually only part of the structure has been generated because unallocated atoms are added only to stable pieces. Each forbidden substructure appears on BADLIST several times, once for each possible top (apical) node. Structures are added to or removed from BADLIST by the function FIXBADLIST, which first generates all the forms of the forbidden substructure, and then adds to or deletes from BADLIST according to the desire of the user. Naturally if there are no structures on BADLIST then there are no constraints on the output of implausible structures. 1 As described in section 1, BADLIST may be expanded according to given spectral data. The permanent part of BADLIST belongs to the program’s theory of chemical instability. But substructures of both the theoretical and the context dependent parts of BADLIST are treated alike. 224 BUCHANAN, SUTHERLAND AND FEIGENBAUM The current form of BADLIST has been suggested after several iterations of the following loop: suggest forbidden substructures. generate output. inspect output. The currently forbidden substructures are listed in Table 5. Hydrogen atoms must be specified explicitly, and lists of atoms enclosed in parentheses indicate that any member of the list may be used in that position on the substructure. 1. c=c—(N,0)—H 2. c==c—(N,0)—H 3. N==N—(N,0,H) 4. H—-O—C—(N,0)—H 5. H—C—N=0 6. N==C—O—H 7. O—O 8. (N,o)—(N,0)—(n,0) 9. o-s 10. s—s—s H ll. H—N—c—NZH NANY 12. (N,o)—C—Oo—H oO Table 5. Forbidden substructures comprising BADLIST 3.4. Partitioning a composition into plausible sub-compositions The unconstrained structure-generating algorithm produces molecules by first determining the center of the structure (bond or particular atom) and then generating all possible radicals out of the remaining composition and attaching them to the center in all possible combinations. When all possible centers have been considered, the process of structure generation is complete. The task of generating a set of n radicals from a single composition requires that the composition be divided (partitioned) into n subcompositions. Then a radical is generated from each smaller composition. All possible partitions are considered in the unconstrained program, regardless of whether the sub-compositions are ‘plausible’ or compatible with a spectrum. The lowest partition is considered first, where ‘lowest’ means that it has two sub-compositions, of equal size if possible, and with the lowest ranked atoms all in one of the compositions (where the arbitrary ranking Q 225 MACHINE LEARNING AND HEURISTIC PROGRAMMING from carbon to sulfur is: C< NO or set ULIM< 10. If ULIM=10 and LLIM=0O, all parti- tions will be plausible. 2. To activate the partition list, first set the switch NOPARTS=NIL. Then, before generating any molecules, the program will ask the user if he wishes the partition list to be constructed. 3.5. Specifying required substructures The basic components used by the STRUCTURE GENERATOR are chemical atoms which possess two properties, valence and atomic weight. These atoms are connected by bonds to form radicals (structures with a free bond) which may in turn be connected with other atoms and radicals to form larger radicals and molecules. The STRUCTURE GENERATOR can also treat complex structural fragments as atoms. Structures so treated have come to be known as ‘superatoms’. The STRUCTURE GENERATOR replaces a group of atoms in the given composi- tion by the name of a corresponding superatom, and generates structures with the revised composition (including the superatom name). Only at output time (if then) do the constituent atoms of the superatoms re-appear. Two obvious benefits arise in the use of superatoms: 1. The generation of isomers of a composition is faster because there are fewer atoms in the composition. 2. Structural fragments essential to an explanation of a mass spectrum may be made into superatoms. All isomers will contain the selected substructure, thus the output list is more relevant to the data. A third benefit was realized as a result of the use of superatoms: 3. Ring structures can now be generated by the previously acyclic STRUCTURE GENERATOR by specifying each different ring as a superatom. Normal atoms are known to the STRUCTURE GENERATOR because their names are on a global list called ORDERLIST. The usual value of ORDERLIST is ‘(C NOPS)’ and the position on this list defines the ‘DENDRAL Order’ of the atom: C