1977-78 Annual Report RR-00612 Section 2.5

working space: rule R1
:= ?
one of the following: CLEAR FETCH NAME GRAPH BREAK PEAKGROUP DRAW SHOW ADD QUIT
:= GRAPH
entering editstruc. (NEW STRUCTURE)
>RING 5
>BRANCH 51111121
>NDRAW
(structure drawing of R1)
>DONE
(R1 DEFINED)
:= BREAK (1 2) (5 4)
:= PEAKGROUP 235
? one of the following: DELETE CLEAR FETCH NAME TRANSFERS SIGNIFICANCE INTENSITY SHOW ADD QUIT
:= NAME P1
:= T H -1
:= SHOW
working space: peak group P1 (TRANSFERS -1)
:= INTENSITY 80
:= ADD
P1 included in PEAKGROUPS
next:
:= NAME P2
:= T H -1 H2O -1
working space: peak group P2 (TRANSFERS -H -H2O)
:= I 50
:= S 60
:= SHOW
working space: peak group P2 (TRANSFERS -H -H2O) (INTENSITY 50) (SIGNIFICANCE 60)
:= ADD
P2 included in PEAKGROUPS
next:
:= Q
:= SHOW
working space: rule R1
show subgraph drawing? Y/N.
show connection table? Y/N.
(BREAK (1 . 2) (5 . 4))
peak group P1 (TRANSFERS -1) (INTENSITY 80) (SIGNIFICANCE 90)
peak group P2 (TRANSFERS -H -H2O) (INTENSITY 50) (SIGNIFICANCE 60)
:= ADD
R1 included in RULES
next:
:= N R2
:= G

M+ present and listed        :   7      5       1       -
M+ present and not identified:   1 (a)  4 (b)   2 (c)   -
M+ absent but identified     :   -      6      18       7
M+ absent but listed         :   -      1       -       8
M+ absent and not identified :   -      -       2 (c)   4 (b,d)

Notes on errors:
(a) errors due to ions recorded at masses considerably above M+.
(b) errors due to impurities.
(c) errors due to simple parity tests failing to detect the presence of nitrogen.
(d) errors due to mass differences of more than 115 amu between the highest mass ion recorded and the true molecular weight.

Table III. MOLION Results with Four Classes of Compounds.

2.6.2 Proposed Additional Work on the MOLION Program

The MOLION program is currently being implemented on the PDP11/45 based computerized GC/MS systems in the Departments of Chemistry and Genetics. The program will be evaluated through tests on the analysis of urine samples and other body fluids.
Further developments of the system will be made in accord with the results of these tests.

2.7 CONGEN Improvements

During the past year many improvements have been made in the version of CONGEN available for outside use. These improvements give the user more flexibility and range in the use of existing commands. Further, some new commands have been created which increase the power and utility of CONGEN. The program has become easier to use and more robust. Finally, in almost every subsection of the program the user can now inspect the computation as it proceeds, which means that fewer long, wasteful computations will be performed.

2.7.1 Error Detection in Substructure Definition

We now differentiate between two types of substructures, called patterns and superatoms. This simplifies the chemist's interaction with the program in that it makes explicit the two different ways in which substructural information is used in the current version of CONGEN. Further, this distinction helped us write very complete error-detecting routines in a relatively small number of lines of code. When the chemist indicates that he (she) is finished defining a substructure by typing "done" in EDITSTRUC, we check the substructure for errors. If we find errors, we ask the chemist whether or not he chooses to fix them. If the chemist tries to enter this substructure on the composition list or on the constraints list without fixing it, we indicate that the errors have not been fixed and again ask him if he would like to fix them. If he says yes, we put him inside EDITSTRUC, where a new command called ERRORS will print out all of the errors made in the definition. If the chemist does not choose to fix the errors, we warn very clearly that the results will be unpredictable or erroneous. We allow the chemist to go ahead on the philosophy that he may have a perfectly good reason for doing what seems to us to be nonsense.
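The checking flow just described can be sketched as follows. This is a hypothetical illustration in Python: the function names, the dictionary representation of a substructure, and the callback arguments are all our own assumptions, not CONGEN's actual Interlisp code.

```python
# Hypothetical sketch of the checking flow described above: when the chemist
# types "done" the definition is checked, errors are recorded so the ERRORS
# command can reprint them, and a chemist who declines to fix them gets a
# clear warning.  All names here are illustrative, not CONGEN's actual code.

def finish_definition(substructure, find_errors, ask_yes_no, warn):
    """Check a substructure when the chemist types "done" in EDITSTRUC."""
    errors = find_errors(substructure)
    substructure["errors"] = errors        # saved for the ERRORS command
    if errors and not ask_yes_no("fix errors now?"):
        # The chemist may have a good reason to proceed anyway.
        warn("results will be unpredictable or erroneous")
    return substructure

def show_errors(substructure):
    """The ERRORS command: print every recorded definition error."""
    for e in substructure.get("errors", []):
        print(e)
```

The point of recording the errors on the substructure itself is that the ERRORS command can reprint them later, when the chemist tries to use the unfixed definition.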
Some examples of the types of errors detected are: a) x names and polynames in superatoms; b) undefined atom names in patterns or superatoms; c) free valences on patterns; d) no free valences on superatoms; e) an atom with too many attached bonds, or a conflict between the hydrogen range specified and the number of free valences specified; f) the lack of exactly one tag in a proton constraint; and g) problems connected with the use of multiple link nodes. With link nodes we detect when one link node is illegally bonded to another link node, when a link node wrongly has more than two neighbors, when a link node is monovalently bonded, and finally when a link node has a tag. Much remains to be done in extending and integrating the link node concept throughout the rest of CONGEN. There also needs to be further error checking to warn users who change a superatom and then call the generate routines without redefining the composition. We have concentrated our efforts on mistakes which we have observed frequently when other chemists use CONGEN. This error checking should substantially reduce the errors made by chemists using the program.

2.7.2 Depth-First Imbedder

The IMBEDDER program was completely rewritten. Four major improvements were implemented, and the efficiency of almost all of the different subsections was improved. First, the method of computation was changed from breadth-first (all structures delivered at the same time) to depth-first (structures delivered one at a time as they are created). The chemist can now check the computation as it proceeds by using the cntrl-S and cntrl-I features. Use of the cntrl-S feature often will allow the chemist to see that a certain computation is much larger than he anticipated and to stop it before computer time is wasted. Use of cntrl-I allows the chemist to see whether the imbedding is producing the kind of structures he anticipated.
If not, the computation can be stopped and the problem redefined with only minimal loss of human and computer time. Previously the chemist would have had to wait an excessive amount of time before seeing any results, only to find, for example, that another constraint should have been used or that more pruning should have been done before imbedding. This improvement will save large amounts of computer time.

Second, all the constraints testing during imbedding is now done in the SAIL portion of CONGEN, and structures violating the constraints are not returned. Previously all structures were returned to the LISP portion of CONGEN before any constraints checking and subsequent pruning were done. This new approach represents a real gain in efficiency because these programs run much faster in SAIL than they do in LISP.

Third, the canonicalization routines were rewritten. When they were first written there was the prospect that our system might be interfaced with Chemical Abstracts. With this prospect in mind, the canonicalization was done using a modified form of Morgan's algorithm, which was easy for people to understand and closely related to the canonicalization algorithms used at Chemical Abstracts. Recently, the procedures were redesigned with efficiency as the only criterion. New algorithms were found, and the process of assigning a canonical number to a structure is now much less costly in terms of time. Further, two structures which are aromatically equivalent are now given the same key (related to the canonical number). Previously they were given different keys, and special methods in LISP were needed to ensure their equality. Since many different parts of CONGEN use the canonicalizer, this resulted in a gain in efficiency for all of them.

Fourth, a change was made so that any number of superatoms can be imbedded at one time.
This means that when large numbers of superatoms need to be imbedded, the chemist can perform the entire task in one set of commands, rather than by the more time-consuming approach of one at a time. However, the chemist can still choose, for reasons of efficiency, to imbed a single superatom when special tests on the environment of that superatom are required. This also provides the opportunity for large, multiple imbeddings to be done in batch mode. (The BATCH command was rewritten so that it would accept multiple superatoms.) The large batch job is then run after midnight when the load average is low, leaving more computer time for other uses.

2.7.3 EDITSTRUC Changes

The RENUMBER command was added to give the chemist flexibility in choosing schemes for numbering the atoms in a structure. Internal changes have been made to the EDITSTRUC commands BRANCH, LINK, CHAIN, and DELATS; all involve the method of numbering atoms. It is now possible to create a substructure which has "gaps" (missing numbers) in its numbering of atoms. These changes necessitated some further changes in the routines which prepare and send structures to the IMBEDDER in CONGEN.

2.7.4 BATCH

The BATCH command was rewritten to take advantage of the fact that the new lower-fork programs can accept any number of superatoms to be imbedded. As the system load continues to increase, BATCH will become a more attractive option.

2.7.5 RESTORE

The RESTORE command was changed so that files written by REACT can be restored as well as files written by CONGEN. REACT users make use of the new EXAMINE subcommand (discussed elsewhere in this report) as well as the mass spectral ranking program MSRANK. Therefore it is very natural for a REACT user to save results using the SAVE subcommand in REACT and later to restore these structures in CONGEN in order to examine or rank them.
The reason for the incompatibility between REACT format and CONGEN format is that REACT save files often contain many different structure lists, whereas CONGEN save files contain only one. Thus the RESTORE command in CONGEN must ask the REACT user which structure list to restore.

2.7.6 TREEGEN

The TREEGEN command was rewritten to ensure that no duplicates are produced. Duplicates arise if there is a superatom on the composition list and, from the remaining atoms, a copy of that superatom or a portion of that superatom can be constructed. Another way of saying this is that duplicates arise if there is a pattern in the superatom and that pattern can be built from the remaining atoms in the composition list. The new routines check for a superatom on the composition list; if there is one, the canonicalization routines are called for each structure to compute its key. This key is used to determine whether the structure has already been added to the structure list. Several functions had to be rewritten to ensure that this was done efficiently.

2.7.7 SURVEY AND EXAMINE

The command SURVEY was added to the experimental CONGEN, and routines were written which allowed the chemist to look over the structure list for certain functional groups and other features defined by the user. SURVEY has since been incorporated into a command called EXAMINE (see Section 2.3).

2.7.8 DRAW

The DRAW command was extensively rewritten to make it more flexible. The user can either draw the whole structure list or give the command an argument specifying the range of structures to be drawn. The range given may or may not use the number associated with the structure as a key. The user can also give ranges such as 1-3, which means draw the first through the third structures on the list.

2.8 CONGEN Efficiency

We have a continuing long-term commitment to improve the efficiency and the reliability of CONGEN.
Algorithms under development are written quite differently from the way they should be rewritten to execute efficiently and reliably. For example, it is very natural to use free variables in the development of new code, but eventually, when the function is assumed or proven to be working correctly, these free variables should be eliminated to buy efficiency and modularity. LISP's inherent inefficiencies can often be circumvented by careful reprogramming. The project of block compiling CONGEN described below is a major step forward in providing an efficient and robust base for future development. At the outset we had hoped that block compiling would speed up constrained structure generation by a factor of at least four. After a great deal of fine tuning we ended up with a speed-up factor of about two, but with much higher factors in other parts of the code.

At the beginning of this project we used the new LISP subsystem Masterscope to analyze each of CONGEN's twenty-one files. For each file we prepared a database containing information about its functions and their variables. These databases are important for maintenance and for ease of learning CONGEN, as well as for their short-term purpose of facilitating block compilation. For any function in a file which we have analyzed, we can ask the database which variables that function uses freely and which other functions it calls. When all the files have been analyzed we can find out which functions call a particular function. Further, we developed automatic batch programs to make and test new versions of CONGEN and to update the supporting databases. These programs can be run at night when the system is lightly loaded. This helps spread out the load on our heavily used system and leads to improved quality control in the version we supply to users, since our testing can afford to be much more extensive and thorough.
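The kind of cross-reference database described above can be pictured with a small sketch. This is an illustrative reconstruction in Python of the queries mentioned in the text (free variables of a function, functions it calls, and the inverted "who calls F?" query); it is not Masterscope's actual interface, and the function names are our own.

```python
# Illustrative reconstruction of the per-file analysis database described
# above: each analyzed function is recorded with the variables it uses
# freely and the functions it calls; inverting the call map answers
# "which functions call F?".  Not Masterscope's real interface.

def build_database(analyzed):
    """analyzed: function name -> {"free": [vars], "calls": [functions]}."""
    callers = {}
    for fn, info in analyzed.items():
        for callee in info["calls"]:
            callers.setdefault(callee, []).append(fn)
    return {"info": analyzed, "callers": callers}

def free_variables(db, fn):
    """Which variables does fn use freely?"""
    return db["info"][fn]["free"]

def callers_of(db, fn):
    """Which analyzed functions call fn?  (Complete only once every
    file has been analyzed.)"""
    return sorted(db["callers"].get(fn, []))
```

The inverted callers map is why the "which functions call X" query only becomes reliable after all files have been analyzed, exactly as the text notes.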
Now that we have finished the work of block compiling CONGEN, much testing remains to be done to ensure that no new errors have been introduced. Once we are satisfied that the block compiled version is robust and reliable, it will replace the version of CONGEN which we distribute to outside users. When we have reached this stage the new block compiled version will be used for development as well. Blocks which are under rapid development can be substituted into the block compiled version in their interpreted or normally compiled form. New blocks can be added without disruption of the existing program organization. Hence in gaining efficiency we have not lost flexibility. The work on block compiling was done with the help and direction of Larry Masinter, a former member of the DENDRAL project, now with Xerox Palo Alto Research Center.

2.9 CONGEN Reprogramming

We have been investigating the reprogramming of CONGEN into an Algol-like language. The goals of reprogramming are threefold: first, to unify the program into a single language which can be used on a variety of computer systems; second, to begin to compact the program into a manageable, cost-effective size for current time-sharing systems; and third, to improve typical runtimes for CONGEN so that it becomes a more attractive means for scientists to solve structure elucidation problems. A version of CONGEN which fulfills these goals would be useful on a variety of computer systems and could be exported to many different chemical and biochemical laboratories.

2.9.1 Unification Into a Single Language

CONGEN is currently coded in three different programming languages. The constrained cyclic structure generator (the basic algorithm responsible for the generation of structures), the entire user interface, and a number of control routines necessary to support communication between the three languages are all coded in Interlisp.
Interlisp is a list-processing language, and the sections of CONGEN written in this language are heavily oriented toward the use of lists as data structures. The part of the program which deals with the drawing of structures, either on a teletype or on a graphics terminal, is coded in FORTRAN. The remaining parts of CONGEN, including those parts of the program responsible for obtaining final structures from intermediate representations and various routines to support these functions, are all written in SAIL, an ALGOL-60 variant designed for the PDP-10 computer.

Although it would be possible to emulate Interlisp's list processing facilities using, for example, the REFERENCE and RECORD capabilities of SAIL, initial timing tests have shown that no significant increase in speed could be obtained by such a move. We feel that this reflects the fact that Interlisp is tuned for list processing, whereas SAIL merely provides list processing as an "add-on" feature to ALGOL. We believe that it will be possible to unify CONGEN into one ALGOL-like language by utilizing data structures more suitable to such a language. It would be desirable to use a language which adds as little overhead as possible to the size of a running program. Although the mathematics of the algorithms in CONGEN is quite complex, the algorithms themselves make no complex demands on a programming language. Our initial choice of language which we are exploring as a vehicle for reprogramming, BCPL, is discussed in more detail in a subsequent section.

2.9.2 Compaction Into a Manageable Size

The SUMEX computer system facilitates rapid development of complex programs: a virtual memory, paging environment with a full 256K core image available to each of CONGEN's different language segments.
We estimate that a fairly direct translation (i.e., with a minimum of redesign of the algorithms beyond replacement of lists with arrays) would likely result in a program requiring approximately 300K words of memory in which to run. This is too large, even with extensive use of overlays. With a certain amount of theoretical work, we can develop an algorithm related to one discussed by Sasaki. We have made significant progress on this problem (see below). The algorithm will need to be mathematically proven and made suitable to handle the majority of problems with which CONGEN is typically presented.

Using an adaptation of the Sasaki algorithm, and redesigning the current SAIL and FORTRAN portions of CONGEN, we expect that an overlaid version of the new CONGEN would need on the order of 20-30K words of core to run on a PDP-10. Our preliminary estimates are of course subject to uncertainty. They presume an external device (random access disk) for storage of structures. Large problems (1-2000 structures) would require temporary files totaling about 100 pages (512 PDP-10 words per page). The whole package in one core image would require 51K words. Any mechanism for overlays would make the largest core image required about 21K.

2.9.3 Improvement in Runtime

In addition to decreasing the program size, it is also necessary to minimize execution time. An experimental version of the algorithm mentioned above has been written in BCPL. We have obtained initial timing information for this algorithm and compared results with the current CONGEN. The test cases used were representative of the types of problems with which the program will be confronted. The new generating algorithm is designed to replace the Interlisp structure generator. For the typical problem involving real structures, the generation problem deals with superatoms. Thus, the empirical formulas in Table IV represent a whole class of problems.
For example, the time to perform C2H2N2O2 represents the time not only for isomers of this formula, but also for any problem with two tetravalent, two trivalent and two bivalent superatoms together with two hydrogen atoms.

Table IV. Preliminary Timing Estimates* for Isomer Generation

                                Time in seconds for generation
Empirical formula   Isomers   Interlisp CONGEN   BCPL algorithm
C2H2N2O2              506           233               10.0
C6H6                  217           113                9.5
N16                    91            52               70.5

* The times were obtained by running the respective test cases on lightly loaded DEC KI-10's. The values for the CONGEN in Interlisp were obtained on a system operating under Tenex. The values for the new BCPL algorithm were obtained on a system operating under DEC's TOPS-10 operating system. The values presented are the average of three runs, but must be viewed only as approximate because of variations in system overhead, e.g., paging, expected during normal use.

The timing values in Table IV are what we expected given our knowledge of the two algorithms. It is characteristic of a Sasaki-like algorithm that as the number of atoms (or superatoms in our implementation) of the same type increases, execution time increases exponentially. The considerations of symmetry built into the Interlisp version treat such problems more efficiently, so the increase in execution time is somewhat less than exponential. There is a point where the efficiency of both algorithms would be the same. This is illustrated by the data in Table IV. With diverse atom types the BCPL algorithm is 23 times faster for the example case C2H2N2O2. With six tetravalent atoms or superatoms of the same type, the BCPL version is only about ten times faster. For N16, where all atoms or superatoms are of the same type and the same degree (same number of non-hydrogen neighbors), the Interlisp version is faster.
It is characteristic of most problems with which CONGEN is confronted that there is a diversity of types of superatoms. In our opinion, the factor of 10-25 in increased speed for such problems justifies using the new algorithm. Work is also currently under way on a separate imbedding package, similar to the package in the current CONGEN, but restructured for efficiency. This, together with the structure drawing program, will place all of CONGEN in a common language. We can report that as of Feb. 5, 1978, the first version of the imbedding algorithm in BCPL is running. Work is now under way to compare results with the production version of CONGEN to ensure the accuracy and reliability of the new imbedder.

2.9.4 Choice of Language for Reprogramming

Recursion seems essential to the clear phrasing of most of the algorithms, both in future development work and in the initial reprogramming effort. Since provision for recursion in FORTRAN is usually by an add-on package or other such assembly of special routines, and since this facility is not available on the machine on which CONGEN is being developed, FORTRAN is not viewed as a likely candidate for the language choice. In a similar vein, many ALGOL implementations are quite inefficient in handling recursion. In many of these ALGOL implementations, because any function is allowed to be recursive, one must pay the price of recursion even for non-recursive portions of the program. Two ALGOL-based languages which are exceptions to this generalized method for handling recursive routines are SAIL and BCPL. SAIL requires one to explicitly declare a procedure as recursive, and then to use a different calling technique than is used for non-recursive procedures. SAIL also provides more explicit control over allocation of recursive variables. BCPL is a "mini-ALGOL" with the same block structure and looping statements as ALGOL but with much more limited semantics.
BCPL compilers are generally small and simple, and are available on a variety of machines. We feel that there are three advantages to BCPL. First, because its structure is simple, one is to some extent insulated from collapse of or changes in the supported compilers. It would probably not take over a month for someone experienced in compiler implementation to construct a basic BCPL compiler from scratch. Thus, if programs are restricted to a fairly pure form of the language, even a total removal of support for the language by outside agencies would not be fatal to those programs. Secondly, the BCPL run-time system is generally quite small, and adds little overhead to the size of a running program (on the order of a few thousand words of core storage, compared to a few tens of thousands of words of core storage for SAIL). Lastly, because BCPL is closely related to both SAIL and ALGOL, it would be fairly simple and largely mechanical to convert the BCPL code into its SAIL or ALGOL equivalent, if such a translation proved necessary or desirable.

Largely as a result of the effort required to analyze adequately the questions posed by the conversion study, work has already begun on the initial stages of CONGEN reorganization and translation. In an effort to study the effect of BCPL, the target language, on program efficiency, as well as to study the speed of the generation algorithm described above, a modified implementation of the algorithm was coded in BCPL. This code formed the basis for the estimates of speed and size improvements described previously.

2.9.5 A Version of CONGEN for the Chemical Information System

We have recently been discussing the prospects of a version of CONGEN available to the public on a fee-for-service basis as part of the NIH/EPA supported Chemical Information System (CIS). Discussions with Dr. Steve Heller, EPA, and Dr.
William Milne, NIH, several months ago resulted in a contract with Stanford University to investigate the feasibility of translating CONGEN into a language which could eventually be supported by the CIS. The outcome of this study was that such a translation is possible given the limited goals of the task. The previous section on reprogramming languages and progress was taken in part from the results of that study. We are currently drafting a detailed contract proposal to carry out the translation.

The limited goals include providing the CIS with a version of the CONGEN program including some (but not all) of its current capabilities. This version is to be written in an Algol-like language and must run on the Division of Computer Resources and Technology's (DCRT's) DEC PDP-10 at NIH. It is important to contrast these limited goals with the more ambitious objectives of reprogramming as discussed in the previous section: 1) The translation for the CIS will produce a version which will run only on the DCRT/NIH system. It will presumably also run on similar PDP-10's using the TOPS-10 operating system. This, however, represents only a very small subset of the computer systems available to the chemical community. Thus the CIS version will not meet our expressed objective of a version of CONGEN which is exportable to a large number of research groups; 2) the translation for the CIS will utilize the Basic Combined Programming Language (BCPL). This is an Algol-like language which will suffice for the CIS version. It is not clear that this language is optimum for a program which is to be widely distributed.
The development of a more machine-independent language, such as MAINSAIL at SUMEX, may provide a much better vehicle for wide distribution, in which case our efforts under the current DENDRAL grant would be directed toward that language; and 3) the contract with the CIS will have very limited provision for introducing new improvements in CONGEN and related programs (see Sections 2.2 and 2.4) to the DCRT/NIH version of CONGEN in BCPL. It is obviously essential to provide the chemical community with the most up-to-date version of CONGEN and related programs. Work under the present grant is directed at this broader goal.

At the same time there are obvious similarities between the CONGEN translation effort supported by the CIS and that supported by the NIH under the current DENDRAL grant. We would be foolhardy to view these efforts as mutually exclusive. The major similarity is that the choice of language is subject to similar restrictions, meaning that some ALGOL-like language would be used for the exportable version for wider distribution. We feel that a translation of parts of the program, for example from BCPL into MAINSAIL, is a relatively simple task given the similarities of the languages. The translation efforts would, we feel, be synergistic, providing more rapid access of the chemical community to CONGEN and other programs.

3 THEORY FORMATION PROGRAMS - Meta-DENDRAL

3.1 Incremental Learning

In order to allow application of the Meta-DENDRAL program [3] to a wider range of chemically interesting problems, we have begun to remove one of the most important current program limitations: its inability to add piecewise to what it has learned. Meta-DENDRAL must currently process all training data at once, producing a set of rules which cover that data. The amount of training data which the program may examine when forming rules is therefore limited by available computer memory.
We aim to give the program the ability to learn incrementally, resulting in the following benefits:

1) The chemist will be able to generate rules from one set of training data, examine the rules, and if necessary obtain additional data for modifying and adding to the rule set. By examining the partial results produced at each step, the chemist may determine which additional training molecules are most appropriate for the next learning cycle. We expect the program to aid in making this decision by suggesting new training molecules whose spectra will resolve among rules which represent alternate plausible explanations of the observed data.

2) Since the amount of training data processed strongly influences the reliability of the learned rules, training on arbitrarily large data sets will allow Meta-DENDRAL to form more accurate rule sets than is currently feasible.

3.1.1 The Approach

The proposed approach to incremental learning involves hypothesizing a set of rules on the basis of existing training data, then updating the rule set when new training data is provided. When existing rules apply incorrectly to new training molecules, these rules are modified. New rules are added to the rule set by applying the current one-pass Meta-DENDRAL program to the portion of the new training data which cannot be explained by existing rules. The figure below summarizes the proposed approach.

                     New data
                        |
                        v
 +-----------+    +-------------------------------+
 |  Current  |--->| Modify current rules, and     |
 |  rule set |<---| filter out "explained" data   |
 +-----------+    +-------------------------------+
       ^                        |
       | rules covering         | unexplained
       | new data               | data
       |                        v
 +-------------------------------+
 |     Current Meta-DENDRAL      |
 +-------------------------------+

Figure 8. Approach to Incremental Learning

3.1.2 Modifying Existing Rules

Rules must be modified so that they become consistent with the new data while remaining consistent with previous data as well.
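The control flow of the update cycle summarized in Figure 8 can be sketched as follows. Only the loop structure is meant to be faithful to the proposed approach; the components passed in (modify_rules, explains, one_pass_metadendral) are placeholders of our own for the routines the text describes.

```python
# Control-flow sketch of the incremental learning cycle of Figure 8.
# The three callbacks are stand-ins for the components described in the
# text; only the loop structure is meant to be faithful.

def incremental_update(rule_set, new_data, modify_rules, explains,
                       one_pass_metadendral):
    # 1. Make existing rules consistent with the new training data.
    rule_set = modify_rules(rule_set, new_data)
    # 2. Filter out the data the current rules already explain.
    unexplained = [d for d in new_data
                   if not any(explains(r, d) for r in rule_set)]
    # 3. Run the one-pass learner on the unexplained remainder and add
    #    the resulting rules to the rule set.
    if unexplained:
        rule_set = rule_set + one_pass_metadendral(unexplained)
    return rule_set
```

Because only unexplained data reaches the one-pass learner, the amount of data it must hold at once no longer grows with the total training set.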
In short, the method involves storing along with each rule a summary of the alternate acceptable versions of the rule (those with the same evidential support in the observed training data). The summary of all acceptable versions of a given rule, referred to as the version space [12] of the rule, is useful for a number of tasks associated with rule learning, including incremental learning. Version spaces provide an explicit representation of the space of all alternate versions of a given rule, i.e., those which cannot be disambiguated by the currently observed training data. As such, version spaces will allow Meta-DENDRAL to reason more thoroughly about the choice among alternate rule versions. Some expected benefits and uses of this increased ability are listed below.

Modifying current rules using new training data. Since version spaces provide a summary of all alternate versions of a given rule which are consistent with previous evidence associated with the rule, they delimit the range of allowed future modifications to the rule. Thus, those modifications to the rule which are consistent with past data are exactly those which yield versions of the rule contained in the version space. At any point, the version space contains all rule versions which are consistent with a given set of evidence. The "best" rule version can then be chosen (e.g., on the basis of simplicity or chemical plausibility) from the entire space of rule versions consistent with the data.

More complete exploration of alternate rules. The current RULEMOD portion of Meta-DENDRAL tries to make rules more general or more specific in order to improve their performance on the training data. RULEMOD tries many such ways of modifying rules, but it cannot afford to try all ways.
This portion of RULEMOD will be replaced by a new routine which will use version spaces to explore all ways of generalizing or further specializing rules in order to eliminate negative evidence or add additional positive evidence.

Intelligent selection of training instances. Since version spaces represent the range of plausible alternate versions of a rule, they contain the information needed to select new training instances designed to discriminate among competing rule versions. For instance, by examining a given version space the program ought to be able to suggest a set of compounds whose spectra would allow ruling out many of the plausible rule versions while strengthening the evidence associated with other versions.

3.1.3 Version Spaces

This section presents a sample version space generated by Meta-DENDRAL, and discusses how version spaces may be updated to take into account new training data. Notice that the program is dealing not with a single rule which will be later modified, but with the space of all plausible rule versions. The algorithm for updating version spaces using new training data is a candidate elimination algorithm: candidate rule versions are eliminated from the version space as they are found to perform incorrectly on new training data. The candidate elimination algorithm is assured of finding all rule versions consistent with a given set of positive and negative training instances. This is accomplished without backtracking and independent of the order of presentation of the training instances.

3.1.3.1 Definition and Representation

The key to an efficient representation of version spaces lies in observing that a general-to-specific ordering is defined on the space of rule subgraphs. The version space may be represented in terms of its maximal and minimal elements according to this ordering.

To see exactly how the general-to-specific ordering comes about, consider an example.
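Before turning to the example, the candidate elimination update described above can be sketched concretely. The following hypothetical Python sketch works over conjunctive attribute-tuple hypotheses, with '?' meaning "any value", rather than the chemical subgraphs Meta-DENDRAL actually uses; it maintains a single maximally specific version S and a set of maximally general versions G, and assumes a consistent data set whose first instance is positive.

```python
def matches(h, x):
    """A hypothesis matches an instance if every constrained attribute agrees."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    """Minimal generalization of s to cover positive instance x."""
    return tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))

def specialize(g, s, x):
    """Minimal specializations of g that exclude negative instance x,
    guided by the specific boundary s."""
    out = []
    for i, gv in enumerate(g):
        if gv == '?' and s[i] != '?' and s[i] != x[i]:
            out.append(g[:i] + (s[i],) + g[i + 1:])
    return out

def candidate_elimination(examples):
    """examples: list of (attribute_tuple, is_positive) pairs.
    Returns the extremal versions (S, G) of the version space.
    No backtracking is needed, and the result is independent of
    the order in which consistent instances arrive."""
    n = len(examples[0][0])
    S = None
    G = [tuple('?' for _ in range(n))]
    for x, positive in examples:
        if positive:
            # Generalize S minimally; drop G members that fail to cover x.
            S = x if S is None else generalize(S, x)
            G = [g for g in G if matches(g, x)]
        else:
            # Specialize each G member minimally to exclude x.
            G = [h for g in G for h in specialize(g, S, x)]
    return S, G
```

For instance, over (shape, size) attributes, the positives ('round','small') and ('round','large') with the negative ('square','small') leave S = ('round', '?') and G = [('round', '?')]: the space has converged to a single version.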
Suppose that R1 and R2 are two rules which predict the same action. Then R1 is said to be more specific (or, equivalently, less general) than R2 if and only if it will apply to a proper subset of the instances in which R2 will apply. This definition is simply a formalization of the intuitive ideas of "more specific" and "less general". The general-to-specific ordering will in general be a partial ordering; that is, given any two rules we cannot always say that one is more general than the other. Therefore, when all elements of the version space are ordered according to generality, there may be several maximally general and maximally specific versions. Version spaces can be represented by these sets of maximally general versions, MGV, and maximally specific versions, MSV.

Given such a representation it is quite easy to determine whether a given rule belongs to a given version space. A rule statement belongs to the version space of a given set of positive training instances (I+) and negative training instances (I-) if and only if it is (1) less general than or equal to one of the maximally general versions, and (2) less specific than or equal to one of the maximally specific versions. Condition (1) assures that the rule cannot match any training instance in I-, while condition (2) assures that it will match every training instance in I+. Since the sets MGV and MSV are by definition complete, (1) and (2) will be necessary as well as sufficient conditions for membership of a rule statement in the version space.

3.1.3.2 Example From C13 NMR Rule Formation

Meta-DENDRAL has been used to determine rules associating substructures of molecules with data peaks in a carbon-13 nuclear magnetic resonance spectrum [11]. Figure 9 shows a version space represented by the program in terms of the sets of maximally specific rule versions (rule MSV1) and maximally general rule versions (rules MGV1 and MGV2). This version space contains all rules which predict a CMR shift of 14.0 to 14.7 ppm
downfield from TMS and which are consistent with a set of paraffin and acyclic amine data presented to the program. The rule pattern which expresses the conditions for application of each rule is stated in the language of chemical subgraphs. Each node in the subgraph represents an atom in a molecular structure. Each subgraph node has the four attributes shown, with values constrained as shown in Figure 9.

  Rule Subgraph    Constraints on Subgraph Node Attributes

                   subgraph  atom    number of     number of  number of
                   node      type    non-hydrogen  hydrogen   unsaturated
                   name              neighbors     neighbors  electrons

  MSV1:
  v-w-x-y-z        v         carbon  1             3          0
                   w         carbon  2             2          0
                   x         carbon  2             2          0
                   y         carbon  2             2          0
                   z         carbon  >=1           any        0

  MGV1:
  v-w-x            v         carbon  1             any        any
                   w         any     2             any        any
                   x         any     >=1           2          any

  MGV2:
  v-w-x            v         carbon  1             any        any
                   w         any     2             any        any
                   x         any     2             any        any

Figure 9. A Version Space Represented by Its Extremal Sets
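Given an extremal-set representation like that of Figure 9, the membership test of section 3.1.3.1 (conditions (1) and (2)) is direct to state. The following hypothetical Python sketch represents a rule's constraints as a tuple of attribute values with '?' standing for "any", a simplification of the subgraph language above; the function names are illustrative assumptions.

```python
def more_general_or_equal(h1, h2):
    """h1 is more general than or equal to h2 if every attribute h1
    constrains is constrained identically by h2."""
    return all(v1 == '?' or v1 == v2 for v1, v2 in zip(h1, h2))

def in_version_space(rule, MGV, MSV):
    """Membership test for the version space represented by its extremal sets:
    condition (1): rule is less general than or equal to some maximally
    general version, so it cannot match any negative instance;
    condition (2): rule is less specific than (i.e., at least as general as)
    some maximally specific version, so it matches every positive instance."""
    return (any(more_general_or_equal(g, rule) for g in MGV) and
            any(more_general_or_equal(rule, s) for s in MSV))

# Toy two-attribute version space: one maximally general and one
# maximally specific version.
MGV = [('round', '?')]
MSV = [('round', 'small')]
print(in_version_space(('round', '?'), MGV, MSV))      # → True
print(in_version_space(('square', 'small'), MGV, MSV)) # → False (fails (1))
```

Because MGV and MSV are complete by definition, the two `any(...)` checks are both necessary and sufficient, which is what makes the extremal-set representation an efficient substitute for enumerating the whole space.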