g) SURVEY (46 K) - Exemination of large structure lists for frequency of occurrence of standard Structural features; and h) STEREO (24 K) - Generation of stereoisomers. 2.1.1.3 Export status We are concentrating our export effort on machines which have a significant number of users in the chemical community. We decided to pay special attention to machines which our CONGEN users have access to now and to those which they have indicated to us that they will have access to in the near future. We are also strongly guided by the Persons attending the workshops (see Section ???) and the machines to which they have ready access. Since many of our users have Digital Equipment Corporation PDP-19 computers this was our first priority. The program was designed to run on the Tops-19 operating system since there is a compatibility package which allows programs which run under Tops-1@ to also run without change on Tops-2% and the Tenex operating systems. We got a version running on the Tops-19 and exported it to Rutgers where it ran on the Tops-28 system, and to two different Tops-1@ sites: Smith,Kline Research and Ely Lilly Research. Since we have a Tenex operating system at Stanford we have now verified that CONGEN does run on all three and that the compatibility package is robust. We are continuing negotiations for a contract, separate from this proposal, to provide a version of CONGEN accessible through the NIH/EPA Chemical Information System (CIS). That system is currently operating on DEC equipment so that direct export of the current version of the program to CIS will be simple. Complete integration of the program into the CIS framework of programs and their intercommunication is, however, a much more difficult task. This task will be pursued and funded separately from the current grant because it is essentially a mechanical programming and documentation effort with little research content. However, the resulting documentation will be available for all persons to who we export the program, thus benefitting our DENDRAL work. We decided on two other machines with their associated operating systems for reasons that will be discussed briefly below. The two machines are the IBM 379 running the new Virtual Machine Conversational Monitor System Operating System (VM/CMS) ( CONGEN running on this system will also run on the next generation of IBM machines the 3190 and the 3380 because they will run an identical (to the user) version of VM/CMS }, and the Control Data Corporation 6600 computer at the National Resource for Computation in Chemistry at the Lawrence Berkeley Laboratory, ~- 101 A study was made of all of the different operating systems for IBM 37@ series computers. We looked carefully at those operating systems with virtual memory. Of these, only VM/CMS seemed to be a reasonable short range possibility since its interactive, time sharing system is very similar to the PDP-1d@ Tenex system. We applied and were admitted as a project to a Stanford/IBM Joint Study project. This. project is slated to provide us with access to a IBM 370 computer some time in the spring. After we get CONGEN running on VM/CMS we plan to investigate in more detail the other 372 virtual memory operating systems and also to investigate the IBM series 1 mini-computer which has a virtual memory Operating system. We estimate that 3 man months will be spent on this project. The NRCC currently has a computer complex consisting of a CDC 7600 ,6688, and 649@ together with ea PDP-8E mini-computer. The 669@ has been dedicated to interactive computing consisting of both interactive programs and an interactive system for preparing batch programs for the 7608. The 6400 serves as an input / output machine and the PDP-8E manages the terminals and teletypes for the 660@. A project at LBL called the real time systems group has brought up a version of BCPL on the 6608 and they have expressed interest in helping us bring CONGEN. , We did a detailed calculation and determined that an average CONGEN session on the 668@ conducted over the Tymnet network would cost about 196 dollars at normal priority at week day rates. This seemed reasonable to us and we will be applying to NRCC for a grant for ‘the computer time necessary to bring up CONGEN. We estimate the task to be about one and a half man months. During our discussion and research we considered a significant number of other machines and other programming languages. We have so far been unable to find any solution to the problem of a version of CONGEN for a mini-computer system such as the DEC PDP-1l series machines. Address space limitations of 32K words make the task prohibitive in terms of effort. Even with systems with memory management, the job of rewriting CCNGEN to fit into 64K 16 bit words is probably beyond our present means in terms of programming time required. We are discussing with Varian Associates, Palo Alto, the prospects for 4 mini-computer version of CONGEN in the PASCAL language. If they decide to undertake such an effort we may have access to a mini- computer version in a language which is rapidly gaining popularity and already enjoys significant transportability among machines. Several other computer systems and languages have been explored for suitability for CONGEN. So far all have suffered from language deficiencies which do not allow the heavy recursion required for CONGENs basic algorithms (e.g. FORTRAN), from lack of transportability (the BLISS and C languages), or from being implemented on a machine 102 which is not widely available in the chemical and biochemical community (e.g., Honeywell, Hewlett Packard). These investigations will continue because new machines like the DEC VAX-11/72 will have an increasing number of users in the future and versions of the BCPL compiler will be available for popular systems. 2.2 CONGEN Developments The reprogramming effort has been far from a transliteration of existing algorithms into BCP, In many portions, the basic algorithmic approach taken in the previous version was reformulated to allow for a more effective representation and solution of the problem. Aside from the development of and proof of correctness for a new structure—- generation technique (related to that of Sasaki) which we discussed in last year’s report, and aside from the work described elsewhere in this report on stereochemistry (Section 2.4) and the SURVEY function (Section 2.5), the major milestones in CONGEN development which have paralleled the reprogramming are as follows: 1) Imbedder. The mathematical technique for expanding superatoms in intermediate structures developed by Brown was reexamined and reformulated to allow for a more compact representation. ‘The primary difference in our new approach is that the topological symmetry group of the atoms, rather than the free valences, is used in the computation. For example, the Superatom A below, with twelve free valences, has twelve topological symmetry operations \/ Cc \/\ 7 < Cc | | ~< Ce /\/ C /\ A interchanging its atoms, but because of the pairwise interchanges between free valences on each atom, the free-valence group has 64x12=768 symmetry elements. The BCPL version of the imbedder carries the symmetry information as 11 permutations of 6 objects (the identity permutation is not explicitly represented) 103 requiring 66 words of memory, rather than as 768 permutations of 12 objects requiring 9216 words of memory. By implicitly representing interchange symmetry among free valences, among the termini of internal bonds being allocated to the superatom ard among monovalent atoms being attached to the superatom, the new version is able to use a drastically smaller amount of space for the storage of symmetry information. Neither of these approaches to imbedding can perceive all possible sources of duplicate structures, so it was necessary also to develop a final filter package to canonicalize the imbedded structures and compare them for duplicates. However, the new version stores the structure representations on an external random-access file rather than in the computer “s memory as was done before, and only a list of pointers to these filed structures is stored internally. Asa result, the new imbedder. can deal with thousands rather than hundreds of imbedded structures using only a modest amount of memory. 2) Constraints. The basic structure generation and imbedding algorithms are of little practical use without the ability to constrain their output based on the presence or absence of structural features. The graph matcher and cycle finder, which accomplish constraint testing, were translated with little change from their INTERLISP counterparts. Inclusion of constraints in the imbedder, where they serve only as a filter on the final output structures, was straightforward. In the structure generator, however , the constraint-testing mechanism was merged much more intimately with the generation process. The main aspects of this merging are as follows: a) As soon as hydrogen atoms are distributed among the non-hydrogen atoms (the first activity of the generator), the distributions are checked against the constraint substructures to determine which distributions can be ruled out a priori. .I£ a substructure is required to be present and contains three methine carbons (CH), for example, the generator will immediately discard hydrogen distributions which do not 104 3) tools. contain at least three such carbons. Many constraints supplied to the generator place restrictions on the possible distributions of hydrogen atoms, and by this mechanism such constraints are tested most efficiently. b) The order in which the generator assembles its atoms is influenced by which atoms appear in the constraints. If a substructure forbidding the construction of peroxides (0-0) is present, the generator will be encouraged to consider possible interconnections among oxygen atoms first so that the presence of peroxides can be avoided early in the computation. Because different constraints may encourage different starting atoms, a. scoring scheme has been developed which is used to establish the overall order of atom assembly, taking all constraints into account. Interactive aids. Much effort has been directed toward the development of a robust and helpful interactive system to allow a user easily to define a CONGEN problem and to make use of the basic algorithmic The primary accomplishments in this direction have been as follows: a) The development of LINSTR, a package of BCPL functions for interactive input from the user, accessed by all of the interactive CONGEN modules. The line~input and prompting functions in LINSTR provide for three levels of help information which can easily be passed from the main program. The first level consists of prompts which are typed to the user when information is required by the program. The novice may step through the prompting sequences supplying one piece of information at a time in response to these prompts, while the expert user may anticipate the prompts 105 and type ahead his responses on the line to avoid the prompts. This, together with the ability of the LINSTR functions to accept unambiguous abbreviations for keywords, allows a great deal of flexibility in the form of the input. For example, the following two sequences accomplish the same effect in the program (user’s inputs are underlined): Step-by-step input; DEFINE DEFINITION TYPE: SUBSTRUCTURE NAME: R6 . (NEW SUBSTRUCTURE) >RING 6 >DONE ~ R6 DEFINED Condensed input; “DE S R6;R 6;D0 (NEW SUBSTRUCTURE) R6 DEFINED A second level of help is provided by the °?° facility which can can be evoked at any prompt in the program. At these points, the °?° input will cause helpful information vassed by the main program to LINSTR to be typed to the user. The third level of help is provided by a similar °??° facility, which will cause the program to refer to a much more extensive on-line help document to give a full description of the expected information, and the context in which it will be used. This third level is still under development; the basic mechanism has been developed but we have not yet constructed the on- line documentation, b) The simplification and extension of the basic commands. The 106 number of basic CONGEN commands has been reduced from 29. to 14 by the consolidation of commands with similar function (e.g., SHOW is now a general- purpose method of obtaining information about the session and replaces six previous commands) and eliminating little-used options (e.g., TREEGEN). The number of EDITSTRUC commands has likewise been reduced from 23 to 17. Also, previous concepts which were somewhat artificial have been removed. For example, a user does not now need to distinguish between superatoms and patterns when he defines a substructure. The representations for these two types of substructure have been consolidated and a defined substructure can be used in either context. As another example, the user does not need to place substructures on BADLIST any more - the new input sequence allows him to express the Presence or absence of substructural features in a natural statement such as “exactly 3° or ‘at most 1° or ‘none’. The new command structure seems easier for users to remember and work with. 2.3 RESCURCE SHARING 2.3.1 CONGEN Workshops In early December, 1978, we held at Stanford a series of mini- workshops on the use of an exportable version of the CONGEN program. Invitees included members of the chemical and biochemical community who are actively engaged in solving the structures of unknown chemical compounds encountered in research in industrial, academic and government research laboratories. The primary purpose of these workshops was to introduce experts in the field of structure elucidation to the first version of the exportable program. These persons were chosen for . their chemical and biochemical expertise; few had significant experience with computers previously. Thus, they represented what we think is a good cross-section of the community of potential users of CONGEN. We held three three-day sessions of the workshop so that we could offer access to a computer terminal for all 107 the persons at one session and so that we could provide close supervision and assistance as they began to learn and use CONGEN. We also implemented a recording scheme so that an interactive session at the terminal could be recorded as a text file and available after the problem was completed for close scrutiny for the chemist and for ourselves. Such scrutiny reveals, for example, common difficulties in certain portions of the user interaction thereby pointing out areas for improving the interaction. The persons who attended the workshops, their affiliation and a summary of their reactions to the program are summarized in Appendix I. We also include in that Appendix persons who were not able to attend the workshops but desire, on the basis of our contacts with them with regards to the workshops, a copy of the exportable CONGEN. A copy of the original letter sent to one of the invited persons is included as Appendix II to describe our purposes in more detail. Although the version of CONGEN used in the workshops was not complete, enough of the program existed in close to final form to allow us to fulfill our other purposes. We wanted to ensure that any remaining program errors could be detected and fixed prior to making the program more widely available. The best way we have found to do this once a program is essentially debugged is to confront the program with a wide variety of problems from many different users. We also wanted to determine if there were major deficiencies in any part of the program which made it difficult to understand or use. Eliminating such deficiencies would ensure that an exported version would meet the needs of the persons attending the workshop, i.e., that some minimum standards of acceptability could be determined and met. Finally, we needed to determine the computing facilities available to this group and in detailed discussions to explore opportunities for export to their own laboratories. This allows us to set some priorities on developing versions for various makes of computers. The facilities of each attendee and the current and future state of export to each laboratory are summarized in Appendix I. 2.3.2 Conclusions from the Workshop There are several conclusions which can be drawn from the workshop experience. The reaction of all persons attending the workshop was very positive, not only concerning organization and intellectual stimulation, but also with the problem-solving capabilities of the program. The following are major positive aspects of the workshop experience: a) we were able to meet our goal of demonstration of exportability by utilizing CONGEN on two different computers during the workshop; 108 _ b) every participant found the program of sufficient utility to express an interest in obtaining a version in some way for his or her own laboratory; c) the interface to CONGEN, extensively modified based on experience with the old version of the program, proved much simpler to use, much more chemically logical and consistent and much more helpful to the user in providing guidance and error checking; d) several new problems were analyzed successfully at the workshops, either by verification of the unambiguous nature of the structural assignment or by obtaining a list of candidate solutions to guide further experimentation; e) installation of the exportable version nas been completed successfully at two different sites, Lilly Research and Smith, Kline and French Research, and several more will follow in the next two months. There are some common criticisms expressed by the persons attending the workshop which, in our opinion, represent points of focus for the remainder of the grant period and for a renewal application. Briefly, the major deficiencies were as follows: a) The requirement of specifying non-overlapping structural units is non-intuitive and thus unnatural. Other programs, like CONGEN, share this difficulty, but we are in a position to remedy it based on recent research so that future versions may be easier to use; b) The program is very complex and lacks sufficient documentation or internal “help” facilities. We recognize this and to some extent it is a reflection of the lack of maturity of the new version. We plan to provide better on-line help facilities accessible from within the program and a much more comprehensive program guide with examples. c) The teletype oriented drawing program produces some drawings which are difficult (if not impossible) to interpret. Providing the chemist with a connection table of such drawings, as we can do currently, is no long-term solution. Here we face the problem of diminishing the exportability of the program if we restrict its use to certain types of graphics terminals (there are many types, each requiring different programs to operate). Currently there is no graphics terminal which is competitive in price to character- 109 oriented terminals. One way to solve this problem is to encourage collaborators to provide their own graphics packages which we can then in turn offer to others. 2.4 Stereochemistry 2.4.1 SAIL Program The stereoisomer generator program written in SAIL and discussed in last year’s annual report has been improved in several ways. The program has been modified to-process lists of structures to count and/or generate the possible stereoisomers. Thus with the existing CONGEN structure generator it is now possible for the first time to generate all the possible stereoisomers for a given empirical formula completely and irredundantly. These stereoisomers are represented in a compact canonical form and are written onto a disk file by the program along with other information about the structure. Three additional features which were proposed in the last annual report have been added to this program. First, at the user’s discretion, the program will compute cis and trans double bond designations for the stereoisomers and write these on the file. Second R and S designations for tetravalent stereocenters based on the Cahn-Ingold-Prelog conventions are computed for stereocenters which are not fixed by any nontrivial symmetry element. These designations were thought to be the most useful and most stable with respect to future changes of the R/S .homenclature system. Third, the ability to handle stereochemistry of common heteroatoms with valence less than 5 has been added. A small interactive package has been added for deciding whether trivalent nitrogen atoms are free to invert. The user is given a choice for each such nitrogen atom. This program has been included with the current LISP version of CONGEN (it rums as a separate fork) and is available to all users who can access SUMEX. It has been extensively tested on well over 1900 structures. Further details can be found in the publications cited in this report. (HPP-78~-8 , HPP-78-9) 2.4.2 BCPL program Since the CONGEN program has been recently reprogrammed into BCPL to create an exportable version, it was decided to also reprogram the STEREO program into BCPL and carry on further developments in that language to ensure compatibility and exportability. With the exception of the parts of the program which compute R/S symbols and handle 110 heteratoms interactively, this reprogramming has been accomplished. Further developments on this program include a fairly extensive interactive package which allows the user to obtain information about the generated stereoisomers. The user may obtain drawings of projected stereocenters showing absolute configurations of stereocenters (e.g., Fischer projections, Newman projections, double bonds) or obtain drawings of linear segments of .structures showing all the configurations of the included stereocenters. The user may also obtain information about the symmetry and equivalent atoms in any stereoisomer. This program is currently rumning with the BCPL version of CONGEN and was available and tested during the recent series of workshops. This program has been exported with this version of CONGEN. The experimental version of the BCPL program has been modified to allow for some constrained generation of stereoisomers as proposed in the last annual report. The algorithm and program for exhaustive generation were written with this eventuality in mind. An additional interactive session has been added to the stereoisomer generator which allows the user to add constraints before generating the stereoisomers. At present, the user may input constraints on the absolute or relative stereochemistry of any stereocenters. Thus if part of the stereochemistry of a structure is known, it is possible to constrain the stereoisamer generator to produce just those isomers consistent with the known stereochemistry. This parallels the procedure in the structure generator of CONGEN. 2.4.3 European speaking trip At the invitation (and expense) of the Center for Interdisciplinary Research at the University of Bielefeld in West Germany, one of our group, J. G. Nourse, talked about recent developments in the CONGEN program. Besides the lecture at Bielefeld at aconference on the applications of permutation group theory to Chemistry, Physics, and Biology, a lecture was given at the University of Bremen (also W. Germany) to a conference on applications of graph theory to Chemistry. In addition invited lectures were given in Berlin (Free University) and twice in Zurich (ETH and University). A great deal was learned about current efforts by others in both the US and Europe on computer applications to chemical structure elucidation, synthesis, and data bases. Considerable interest in our programs resulted. Besides continuing correspondence, this is evidenced in part by the presence of Prof. Andre Dreiding of the University of Zurich at one of our recent CONGEN workshops. The contents to some of these lectures are included in references 14 and 26. lil 2.5 Structure checkina functions for CCNGEN 2.5.1 Introduction A program, "STRUCC", has been developed to provide functions for checking sets of structures for desired substructural features or for compatibility with recorded mass-spectral or nmr data. While primarily devised for processing sets of structural isomers produced by means of CONGEN , STRUCC can also take as input sets of structures created through the REACT program or defined though an extension of CONGEN’s EDITSTRUC function. The main structure checking functions currently available through STRUCC are: : 1) EXAMINE: This EXAMINE function is an extended version Of that available in standard CONGEN. Amongst other extensions are facilities for checking for Specified ring-fusions or spiro-junctions within structures. 2) MSA: The MSA ("Mass Spectral Analysis”) functions provide a means for using mass spectral data to rank candidate structures. The MSA functions can employ either ordinary “half-order theory", or a model’ of fragmentation in which bond break plausibilities are related to specified substructural features. 3) LOOK: The LOOK (1) functions are intended to assist a user in investigating the utility of proposed experiments for differentiating between candidate structures. LOOK provides a mechanism for determining the various different ways in which particular Superatom parts are. incorporated into candidate structures, 4) TSYM: The TSYM function allows some simple forms of symmetry constraint to be defined. These constraints use only topological symmetry. 5) RESONANCECHECK: The RESONANCECHECK function is intended for checking that all constraints have been given to the structure generator. The function can identify differences in candidate structures that would be associated with features in the lHmmr or 13Cnmr that (1) (The LOOK functions incorporate some of the features of the PLAN functions described in last year’s report). 112 one might reasonably expect to be fairly obvious (e.g. different numbers of hydroxy protons, different numbers of carbonyl carbons etc). Generally, such differences are found in cases where the user has forgotten to specify substructural features incompatible with the observed data, or has misapplied the constraints so that not all instances of wumwanted features are eliminated. 6) NMRFLT: The NMRFLT functions represent a first attempt at developing a system for predicting proton resonance spectra of candidate structures, and for using differences between predicted and observed spectra as a basis for pruning the structure list. The STRUCC system is also used as a test-bed for new structure evaluation functions. When functions are considered to be sufficiently developed to be of use, top-level calls to those functions are added to STRUCC “s repertoire of commands. STRUCC has a user-interface similar to that of CONGEN and incorporates many of the same subsystems (e.g. EDITSTRUC and DRAW). 2.5.2 ‘The Form of the STRUCC Program: The following diagram indicates schematically the overall form of STRUCC: - | Executive | |TSYM| / fF - \ \ \ J £f | NN = \ \ \ _OoO / f{f = | , o\ \ — = —_— | Status | / / |ESI | fo \ \ INMRI {RC} | XMN| | Commandsi~—-/ / — | | \ \- — / | DR| | IMSA| | LOOK| | File | —_ | Manipulation| IDS | 1) Status commands: 1 AR?:; Lists the "aromatics templates" used in CONGEN’s last GENERATE or IMBED step. 113 2) 3) ii CLEAR: Restarts the program. iii CM?: Gives the structure composition. iv CN?: Lists the contents of the "Global Constraint list" used in CONGEN’s last GENERATE, IMBED or PRUNE step. v CY?: Gives the current number of structures. vi EF?: Gives the empirical formula (if defined). vil EXIT: Ends the program. viii UA?: Displays all _suser defined superatoms and patterns. File Manipulation: i RESTORE: Reads in a CONGEN-file (or a REACT-file) containing defined superatoms, composition, constraints and structures. ii BCPL: Reads in a file of structures created through the new BCPL version of the CONGEN program in order that they may be analyzed through MSA, EXAMINE etc. iii APPEND: Adds all = structures from a CONGEN save-file to the set currently in memory and then eliminates any duplicates. This option is useful for combining results from problems where the structure generation process was performed several times with different assumptions about starting superatoms etc, iv .SAVE: Creates a CONGEN-file containing current superatoms, structures etc. EDITSTRUC (ES): CONGEN’s standard EDITSTRUC 114 function is available. See the CONGEN users manual for details of this function. 4) DRAW (DR): CONGEN’s standard DRAW function is available. See the CONGEN user’s manual for details of this function. 5) DEFINE-STRUCTURES (DS): The DS_ function provides various extensions to standard EDITSTRUC that are useful when creating sets of related structures. The DS function is used when the structures to be processed by one of the analysis functions (e.g. MSA) have not been created by REACT or CONGEN. 6) MSA: Mass spectral analysis. 7) LOOK: (for assistance in experiment planning etc). 8) RESONANCECHECK (RC): (Simple checks for omitted constraints) . 9) EXAMINE (XMN): Extended version of EXAMINE. 18) NMRFLT (NMR): Prediction and checks on proton resonance Spectra of candidate structures. 11) ‘TOPSYM (TSYM): TSYM will prime the structure list to retain only those structures in which some, user defined, substructure has a given minimum number of symmetrically equivalent images. On starting, or on restarting subsequent to a "CLEAR" command, STRUCC first lists any news bulletins about new options/bugs etc and then asks whether the structures might vary in composition. Many of the processing functions use checks on composition and have to be informed as to whether these checks have to be performed just once, or, for each structure being processed. If the structures were generated by CONGEN then all will have the same composition but structures produced by REACT or entered manually through DEFINE-STRUCTURES may vary in composition. 2.5.3 STRUCC’s HELP System STRUCC has a primitive on-line documentation system. This subsystem is invoked by giving the command "HELP" in reply to a prompt from the program. If the command "HELP" is used alone, then the program retrieves information supposedly useful within the current context. 115 Arguments can be used with the "HELP" command, e.g. "HELP TAG UNTAG" would result in HELP trying to find information on the EDITSTRUC TAG and UNTAG commands. The HELP files do contain some commented examples of the more complex functions. 2.5.4 DEFINE-STRUCTURES The DEFINE-STRUCTURES (DS) command allows you to define complete structures by means of an extended EDITSTRUC system. Typically, the DS- command would be used to enter a set of structures that are to be processed by one of the analysis routines —- such as MSA —= but which have not been generated by CONGEN. Generally, sets of structures that are being entered by means of the OS-system will share common substructures. For example, the structures might consist of steroidal compounds based on one or two nuclear skeletons and half a dozen sidechains. The DS-system allows you to use substructures, (previously defined as Pattern-type Superatoms in EDITSTRUC) when creating new structures. 2.5.5 MSA, The Mass Spectral Analysis Functions. The MSA functions utilize an extended version of DENDRAL’s “half- order theory of mass spectrometry", (fescribed in previous reports), and can provide the following forms of mass-spectral analysis: 1) PREDICTION: prediction of spectra on the basis "half order theory". The program has to be given: i Parameters controlling the fragmentation process ii A minimum plausibility value for ions to be Listed iii The minimum mass of interest. “All structures on the structure list are processed and their spectra are listed at the terminal. 2) ANALYSIS: In this mode, MSA can be used to list all possible rationalisations for observed ions. The program lists, for each ion, the breaks, neutral losses and H~transfers:necessary for it to be generated from a given structure. In general, this is a large amount of data; consequently, the program only processes a user- defined subset of the structures. Each structure in the subset is processed in turn with the fragmentation 116 analysis being listed at the terminal. The program has to be given: i) The index numbers of the Structures to be processed ii) The observed spectrum iii) Fragmentation control parameters. iv) A minimum plausibility value on processes that are to be reported. 3) RANKING: For ranking structures, the has to be given: _i) The observed spectrum. ii) Pragmentation control parameters. iii) The form of the scoring function. The contribution to -a structures’ score from a recorded ion being predicted is given as the product of the predicted plausibility and one of: a) l (presence/absence of ion is all = that matters) b) The ion’s mass ¢c) The ion’s - observed intensity dad) The product of the ion’s mass and intensity. program All structures are processed; optionally, their scores can be listed as they are processed. Once all have been processed, the program produces a ranked listing of the structures. It is then possible either to simply prune away those with inadequate scores, or to enter EXAMINE with these scores. Within EXAMINE, the results of MSA- scoring can be combined with substructural features to 117 form selection criterion based on overall agreement with the spectrum and presence of desired features. 4) EXAMINE: In this mode, the program identifies all structures for which the observed ions are predicted. The information is converted into a form that can be used by EXAMINE. The observed ions can then be used as EXAMINE~selection keys, just like substructural features; so, one can select structures with >C8H601 AND C6H19N103 AND C4H8N102 The number of ions that can be rationalized in terms of a given structure is used as a score for that structure. This score is available in EXAMINE. So, as well as checking for structures that can explain particular ions, it is possible to request those which can explain a given number of the observed ions. In EXAMINE mode, MSA requires the same data as when in RANKING mode. The basic set of parameters which may have plausibilities adjusted in the "half order theory" are: 1) the plausibility of single bond breaks, (e.g. 1) 2) the plausibility of aromatic bond breaks, (e.g. i) 3) the plausibility of double bond breaks, (e.g. 3) 4) the plausibility of bonds of higher order breaking, (e.g. 9) , 5) the plausibility of adjacent breaks, (e.g. 6.25) 6) the plausibility of the molecular ion being observed, (? class dependent) 7) if multi-step processes are permitted, then taking the plausibility of single step processes as l, values must be. given for relative plausibilities of more complex processes @.g, two step processes (e.g. 8.7) three step processes (e.g. 0.4) 8) if H-transfers or neutral losses are specified 118 then plausibility values must be given for each transfer/loss, MSA functions allow substructural patterns (created using EDITSTRUC) to be used to define bord environments to which special break plausibilities are to be assigned. The program works by checking whether any of these substructural patterns match a structure, and if so which bonds in the structure correspond to those for which special break plausibilities have been designated. Then, when the program is fragmenting that structure to predict ions, it can check if any of the bonds it has broken are in the list of those having special break plausibilities, As well as allowing these more general mechanisms for defining the plausibility of bond breaks, the MSA functions let the plausibility assigned to a predicted ion to be adjusted according to how well it is likely to localize charge. The basic "half order theory" does not make allowance for factors such as Nitrogen being able to better stabilize a charge than Carbon and, consequently, Nitrogen-containing ions being more plausible than those without Nitrogen. In MSA, relative charge- localization plausibilities may be defined for different atom—types. The plausibility assigned to a predicted ion is then modified by the maximum charge-localization plausibility of any of its constituent atoms. 2.5.6 LOOK Frequently, a chemist can conceive of additional experiments that could serve to probe the structural environment of one of the superatom parts that he has used in defining a CONGEN problem. Such experiments might involve a reaction at the site of the superatom part or a series of proton decoupling measurements for "walking along alkyl chains" from some identifiable starting point. Generally, the utility of such experiments depends on there being some significant structural difference between candidates within some relatively small radius of the already known superatom part. The LOOK functions are intended to assist the chemist in finding such differences. Basically, LOOK takes the starting superatom (or any other substructural pattern that the user may wish to define), maps it into each structure, expands it by including neighboring atoms, creates a canonical representation of the expanded part and groups candidates according to these canonical representations. LOOK then reports on the different expanded features that have been identified and allows the user to further inspect these larger features. The user can choose for a part to be further expanded to achieve some finer discrimination or can investigate differences relating to ring-systems involving the new feature etc. In LOOK, the substructure expansion process is controlled through user specified options. 119 2.5.7 The Proton NMR Functions . Some simple functions are now available that can be used to specify features in the proton resonance spectrum and prune the structure list to obtain only those candidates that appear to provide a rational for the selected features. These functions use an "additivity of shifts" model for predicting the proton resonance spectrum of a candidate structure. This model ignores all steric effects; including such important influences as shielding/deshielding. through close proximity to an unsaturated system. Further, as shift values in reference tables represent averages over many different types of (usually acyclic) compounds, they can provide but a poor model for any given structure. One can hope that the predicted resonances of methylene groups will generally be within about §.6pom of the observed values while methines should be within 1.5ppm. The formulae used are: Deltagyy = 8.2 + Cl + C2 Deltan, = 6.2 + Cl +C2+ C3 8 gem + 2cis t trans where the Ci and Zi values are supposedly additive constants. The resonances of methyl, aromatic, alkyne, aldehydic, hydroxy and some other classes of protons are not computed but taken from standard tables. For some of these classes, e.g. hydroxy and aromatic protons, the resonance values are given as a range rather than any typical value. If the approximate prediction methods appear tolerably accurate for a given class of candidate structures, then the functions can be used for pruning the structure list by tests that predicted spectra satisfy user-defined constraints. These constraints take the form of requirements for specified (minimum) numbers of protons resonating in (possibly overlapping) regions of the spectrum. 2.3.8 BCPL versions of STRUCC The more useful _components of the STRUCC system are being converted to BCPL so that they may be available to future users of the BCPL~CONGEN system. 120 2.6 Meta-DENDRAL 2.6.1 META-DENDRAL PROGRESS 2.6.1.1 INTSUM The INTSUM program for the analysis of spectra has been improved by using confidence factors in the place of many of the original program constraints. This feature allows association of liklihoods with fragmentations. It thus allows consideration of a much wider range of possible processes while limiting the final explanations for spectrum peaks to the most plausible explanations. Additional improvement of the program allows logical separation of the concepts of H-transfers and neutral composition transfers. This provides a better correlation between the explanations provided by the program and those expected by the chemist. 2.6.1.2 RULEGEN A significant problem in generalizing the INTSUM explanations has always been reducing the size of the search space so as to be able to produce interestffg rules in a reasonable amount of time. In addition to the constraints already provided, the RULEGEN program now allows use of existing rules to filter the peak explanations to be considered. This is an important step in allowing the program to focus on rules which account for peak explanations not yet encompassed by existing rules. As an aid in better umderstanding the process of rule formation, the program is now capable of generating additional information about the search space. This information serves as data for other programs which can then analyze and present to the user compact descriptions of the rule search done by RULEGEN. 2.6.1.3 EDITSTRUC INTERFACE The latest versions of the Structure editor, EDITSTRUC, and the structure drawing programs have been interfaced to allow their use in all appropriate places in INTSUM and RULEGEN. The newest programs for conversion of EDITSTRUC structures recognize a larger subset of the structural features which may be specified within EDITSTRUC. This allows the user greater flexibility in the specification of substructures in user-created rules. 2.6.1.4 PREDICTION and RANKING The programs allowing the entry and use of user-defined rules have 121 peen extended to allow prediction of the molecular ion and inclusion of confidence factors in the rules. The process of spectrum prediction from Meta-DENDRAL rules has previously involved the matching of rules against only those sites in the molecules considered as possible breaks. With the use of user- entered rules, and program developed rules containing greater structural detail, the program was generalized to allow prediction based on graph matching alone, without the prior generation of possible break sites. 2.6.1.5 HUMAN ENGINEERING Many minor improvements have been made in the program’s interaction with the user. In general, these improvements have been designed according to the following criteria: 1. Messages should be informative yet not excessively long or wordy; 2. User typing should be kept to a minimum; 3. Programs should behave in ways which people find understandable: 4, During execution, programs should provide occasional information concerning their progress. 2.6.2 RESULTS The practical value and capability of new programs are best evaluated by applying them to real, non-trivial problems. In our case, we have chosen the biologically important marine sterol compounds. Their mass spectra are predominant in. the structure elucidation of new compounds in spite of the fact that relatively few of the fragmentation mechanisms are known. Often very similar spectra are recorded due to the great similarity of common skeletons. Our study involves the comparison of predicted spectra of known structures with the observed spectra of unknown compounds. We want to compare the usefulness of different methods of forming the rules used for spectrum prediction. We distinguish 3 methods: 1) dalf-order theory (can be supplemented by functional group rules). 2) Class~-specific rules (selected by the chemist) 3) Computer-generated rules Our results were obtained using nine selected 4-demethylsterols (six isomers of composition C29H480, two C28H460 and one C27H440). Each spectrum of the nine selected marine sterols was considered to be the observed spectrum and ranked against 23 candidate structures (the 23 candidates contained 17 different C7 - Cll sidechains and three 4~demethylsterol skeletons). For the half- order theory an overall average performance of (2.4 8.9) was obtained. The first number gives the number of candidates ranked better than the correct one, the second represents the number of candidates ranked equally with the correct one. In this case the average value is not very representative, as its value is strongly reduced by a compound which was ranked in 17th place. This compound, the 23- demethylgorgosterol, contains a cyclopropane in the sidechain for which no special fragmentation processes are considered in the simple half- 122 order theory. The ranking can be greatly improved by providing fragmentation rules for cyclopropane rings. The results of the second method (class specific rules), depends on the quality and number of selected rules. For this study we selected about 17 skeleton breaks (observed in more then 7@ percent of the structures) from the INTSUM results of 23 marine sterols to which we added 13 known fragmentation processes. These processes (associated with neutral transfers, intensity range, and a confidence factor) were entered using the new rule editor program. The overall performance of these rules was (0.3 8) which means that, with the exception of three compounds, which were ranked in the second position, the correct structure was always ranked first. A further improvement is seen when the distribution of the scoring values is considered. For these rules, much better- separations were observed than with the half-order theory. Also, the quality of the predicted spectra are sufficient to consider the creation of a library which could be visually compared without the need of a computer. For the third method no results can be summarized here, as the computer generated rules are still being developed. The improvement of this last step will be a main goal of the next year. 2.7 REACT and MAXSUB Programs 2.7.1 REACT During the last year there have been no additional deVelopments in the REACT program. Rather, it has been used extensively in apolications to both structure elucidation problems, and, more effectively, in mechanistic studies involving plausible biochemical cyclization and rearrangement pathways. A major paper describing the REACT program and the underlying algorithms which allow it to interpret automatically structural constraints applied to reaction products appeared this year (9). This paper was concerned more with describing the program for interested persons, but did include a simple example application involving the Structure elucidation of a sesquiterpenoid alcohol isolated froma marine organism. The program was also described in more "chemical” terms for a general audience in a review paper which will appear shortly (15). In conjunction with the work on Meta-DENDRAL and spectrum prediction and ranking applied to analysis of marine sterols (see Section 2.f), we have employed the REACT program to generate biochemically plausible sterol side chains. As we described in the previous annual report, reaction mechanisms thought to be applicable to side chain medification, including cyclizations, rearrangements and 123 degradations, were supplied to REACT as, effectively, constraints on the variety of side chains which are theoretically possible. for example, CONGEN can be used to generate isomeric C-1ll sterol side chains possessing one double bond. There are 7769 (!) of them. Using REACT, however, only 76, less than one percent, are predicted as plausible. Recent papers have illustrated this approach for both extended (7) and shortened (19) side chains. Recently we showed (15) that seven new structures were all predicted by the program, adding © some support to the hypotheses of biochemical transformations. 2.7.2 MAXSUB Program The function of the MAXSUB program is to detect common structural features in a potentially diverse but related set of compounds. This problem is one faced by chemists engaged in structure/activity studies, particularly in design of new, biologically active compounds based on known compounds with known activities. However, any problem involving an “activity” related to structure, including spectral signatures, is in principle amenable to analysis by MAXSUB. MAXSUB, by determining common features of structures displaying common activities, is presumably focussing on those aspect of the structures which are related to the activity. However, in its current state, the program is only experimental. Many types of activity are intimately connected with stereochemical aspects of structure and MAXSUB does not include any stereochemistry. It does represent a foundation for further study of the problem because the algorithms can in principle deal with three- dimensional descriptors of atoms and bonds. Some work may be done on this program in the next grant period. The existing program will te described in detail in a publication which will appear.soon (18). 2.8 High Resolution GC/MS System For the current grant period we deemphasized further development of our GC/MS and GC/HRMS system as requested by the study section and focussed our attention on maintenance of the existing system and applications of the system to a variety of mass spectral and structural problems of ourselves and our collaborators. In addition we have in press a major paper describing in detail our approach to both GC/low resolution mass Spectrometry and GC/high resolution mass spectrometry (17). In this paper we describe methods of data acquisition, reduction and preliminary analysis, a description which includes all major elements of our mass spectrometer/computer systems and a variety of computational details where our approaches differ from those of other workers in the field. In a companion paper we describe how resulting masS spectral data have been utilized in computer-assisted structure 124 elucidation (16). The following list summarizes the samples we have analyzed in various operating modes of the instrument during the past year: 1) High Resolution analyses: a) DENDRAL-related 134 b) Outside collaborators 45 2) High Resolution GC/MS a) DENDRAL~related 86 b) Outside collaborators 13 3) Low resolution GC/MS 45 (these samples were primarily marine sterol mixtures from our laboratory and from several other groups, which did not require HRMS analysis) . 2.9 References ‘ ‘ In this section we summarize recent publications supported wholly in part by the current grant. This list includes a few publications published at the end of 1977 to help set the context for more recent publications which build on the previous results. (1) T.H. Varkony, R.E. Carhart, and D.H. Smith, “Computer-Assisted Structure Elucidation. Modelling Chemical Reaction Sequences Used in Molecular Structure Problems," in "Computer-Assisted Organic Synthesis," W.T. Wipke, Ed., American Chemical Society, Washington, D.C., 1977, p. 188. (2) "“Computer~Assisted Structure Elucidation," D.H. Smith, Ed., American Chemical Society, Washington, D.C., 1977. (3) R.E. Carhart, T.8. Varkony, and D.H. mith, "Computer Assistance for the Structural Chemist," in "Computer~Assisted Structure Elucidation," D.H. Smith, Ed., American Chemical Society, Washington, D.C., 1977, p. 126. (4) D.H. Smith, M. Achenbach, W.J. Yeager, P.J. Anderson, W.L. Fitch, and T.C. Rindfleisch, "Quantitative Comparison of Combined Gas Chromatographic/Mass Spectrometric Profiles of Complex Mixtures," Anal. Chem., 49, 1623 (1977). (5) B.G. Buchanan and D.H. Gnith, "Computer Assisted Chemical . 125