INTRODUCTION

This proposal seeks a three-year extension of our existing grant for Resource Related Research - Computers and Chemistry (RR-00612). Over the two years we have been supported by this grant we have made significant progress in all of the areas we initially proposed, including clinical applications of body fluid analysis by gas chromatography/mass spectrometry (GC/MS), extensions to automate our GC/MS instrumentation and data systems, and the development of programs which, in specific areas, match human performance in interpreting mass spectra from first principles as well as extend mass spectral theory to new classes of compounds. Our success to date reinforces our expectation that this research will have a significant and useful impact on medical research involving studies of human biochemistry. As discussed in section B(ii) of this proposal, we have bolstered contact with real clinical problems through the Department of Pediatrics (Professor Howard Cann). We have recently encountered preliminary correlations between the amount of beta-aminoisobutyric acid present in the urine of children with lymphoblastic leukemia and the state of their disease, and also between a defect in phenylalanine-tyrosine metabolism and late metabolic acidosis in premature infants.

This project is highly interdisciplinary, merging the interests of Professors Lederberg (Genetics), Djerassi (Chemistry), and Feigenbaum (Computer Science) in evolving and applying mass spectrometry as an analytical tool in medicine and in modeling aspects of scientific problem solving processes. Mass spectrometry is an ideal domain for this collaboration. On the one hand it has special importance to medical science and organic chemistry as a remarkably sensitive and analytically precise physical method for studying human biochemistry at the molecular level.
On the other hand, the problems of mass spectrum interpretation are at once sufficiently complex to challenge the human intellect and sufficiently structured to be dealt with by current computer programming concepts. It is thus a rich, real-world problem domain in which to study the emulation of lower level cognitive functions, knowledge representations, and theory formation processes.

This combination of interdisciplinary interests promises both near and long term returns for the research investment. As indicated above, even with relatively crudely automated systems, a significant impact can be made on relevant medical problems. In the longer term the increasing load of body fluid analyses, which will have to be performed to be responsive to clinical needs, will require unburdening chemists from the laborious processes of reducing and interpreting the large volumes of data involved. These problems are squarely addressed by the proposed use of stored libraries of solved spectra, augmented by computer programs to extend such catalogs by "cognitive" insight.

This proposal is organized in a manner similar to the original in that the overall goals are divided into a number of subtasks. These comprise the original subtask definitions as well as one additional task proposed to explore the use of Carbon(13) nuclear magnetic resonance information as a potentially useful adjunct to mass spectral information to limit the space of candidate molecular structures.
The respective proposal subtasks elaborated upon in subsequent sections include:

Part A: Applications of Artificial Intelligence to Mass Spectrometry
Part B(i): Mass Spectrometer Data System Development
Part B(ii): Analysis of the Chemical Constituents of Body Fluids
Part C: Extending the Theory of Mass Spectrometry by Computer
Part D: Applications of Carbon(13) Nuclear Magnetic Resonance Spectrometry to Assist Chemical Structure Determination

This proposal is related to several others pending, in progress, or terminating:

1) SUMEX (NIH: RR-00785, pending - Principal Investigator, J. Lederberg)-- This proposal seeks to establish a computer resource for the application of artificial intelligence in medicine as well as for the exploration of GC/MS as a tool for biomolecular characterization. The present renewal application is subsumed in the SUMEX application but is submitted independently to meet NIH renewal application deadlines which predate National Advisory Research Resources Council consideration of the SUMEX proposal. Should SUMEX be approved, this proposal will be withdrawn. Should SUMEX not be approved, this proposal seeks to continue support of our current mass spectrometry research efforts.

2) Genetics Research Center (NIH: pending - Principal Investigator, J. Lederberg)-- This proposal seeks to establish a Genetics Research Center at Stanford for research in medical genetics and the application of such research to clinical aspects of medical genetics. It incorporates a significant level of cooperation between the Departments of Genetics and Pediatrics at Stanford, including clinical applications of GC/MS. The Genetics Center proposal complements the present renewal application in that it concentrates on research aspects of genetic disease, whereas this proposal attacks basic problems of methodology as well as developmental aspects of applying GC/MS analyses of metabolic disorders as indicators of disease states in a broader context.
3) ACME (NIH: RR-00311, terminating July 1973 - Principal Investigator, J. Lederberg)-- The ACME computing resource has been our major source of computing support for the reduction and analysis of mass spectral data. This support has been provided as a part of the ACME core research program without an explicit transfer of funds from the DENDRAL project. With the termination of NIH support, the ACME facility will be combined with other Medical Center computing functions on a fee-for-service basis, thereby introducing a new specific item in our budget to cover these computer costs.

4) Heuristic Programming Research in Artificial Intelligence (Advanced Research Projects Agency (ARPA): SD-183, in progress - Co-Principal Investigators, E. Feigenbaum and J. Lederberg)-- This ongoing research effort complements the present proposal by supporting those aspects of artificial intelligence concept and program development not directly related to medical problem areas. The present NIH-supported project benefits from this research and acts to enable the transfer of these ideas into a medically relevant context.

The current resource grant is headed by Professor E. Feigenbaum as Principal Investigator. He will shortly take a leave of absence for two years to accept the post of Deputy Director of the Information Processing Techniques Office of ARPA. During his absence, Professor Lederberg will act as Principal Investigator of the research project. Although Professor Feigenbaum will formally not be a member of the project during his tenure with ARPA, he will maintain his office locally, enabling him to maintain close intellectual contact with our research effort.

PART A: APPLICATIONS OF ARTIFICIAL INTELLIGENCE TO MASS SPECTROMETRY

OBJECTIVES:

The overall objective of Part A of this proposal is to extend the reasoning power of Heuristic DENDRAL.
Mass spectrometry was initially chosen as the task area in which to explore the techniques of heuristic programming for molecular structure elucidation. Much of the past and proposed future effort will remain directed strongly to analysis of mass spectra because of the sensitivity and specificity of the technique. It is clear, however, that information available from other spectroscopic techniques, utilized routinely by chemists when sample quantities are sufficient, can and should be used where appropriate to obtain structural information which cannot be provided by mass spectrometry alone. This point is elaborated in the subsequent discussion of progress and plans.

A corollary of the overall objective is to tie the Heuristic DENDRAL program very closely to the requirements of the chemical studies outlined below (analysis of steroids from body fluids) and in Part B of the proposal (analysis of chemical constituents of urine, blood, and other body fluids). We have previously directed and will continue to direct our studies toward classes of biologically relevant molecules. Thus we have the capability of providing significant support to the chemically oriented activities as the capabilities of Heuristic DENDRAL are extended.

The overall objective encompasses several sub-tasks, outlined below, all of which represent critical steps in building a powerful program in an incremental fashion. This approach provides an operational program which can be used by chemists in a routine production mode, while extensions of the program are under development. The sub-tasks are the following:

A) Extend Heuristic DENDRAL to analysis of the mass spectra of complex molecules. This includes the assessment of the capabilities and limitations of the program in analysis of unknown compounds or mixtures of compounds.
It also includes refinement of planning rules which infer compound class or molecular substructure, both being extremely important in subsequent analysis of a mass spectrum.

B) Develop the Cyclic Structure Generator to provide DENDRAL with the capabilities for generation of all isomers of a given empirical formula. Define and incorporate constraints on the generator to exclude implausible isomers. Enlarge the capacity of the cyclic generator to accept constraints of demanded or forbidden substructures (GOODLIST, BADLIST).

C) Develop the ability to incorporate information available from ancillary mass spectrometric techniques (e.g., metastable ion data, low ionizing voltage data, isotopic labelling) and other spectroscopic data (e.g., substructures from NMR) into the existing Heuristic DENDRAL program.

D) Extend the Predictor, now capable of prediction of mass spectra for limited classes of molecules, to the design of experimental strategies. Given a set of data, and partial or ambiguous structural information based on these data, specify additional experiments which may be done to effect a unique solution or minimize ambiguities.

PROGRESS:

We have, in the past two years of the existing DENDRAL grant, made significant progress in each of the areas outlined above. We feel that in some areas the progress has been particularly exciting, for example, the completion of the program for analysis of the mass spectra of complex molecules, and completion of the cyclic structure generator (unconstrained). The following represents a brief outline of accomplishments to date, keyed to the objectives A-D above.

A) Extension of Heuristic DENDRAL

Extension of Heuristic DENDRAL to the mass spectra of complex molecules dictated two important modifications in the approach used successfully for saturated, aliphatic, monofunctional (SAM) compounds.
To reduce ambiguities of elemental composition inherent in low resolution mass spectra, the decision was made to extend the program to handle high resolution mass spectral data, which specify the empirical composition of every ion. Although the basic strategy of Heuristic DENDRAL (plan, generate and test) was maintained, the absence of a cyclic structure generator at the time the program was written dictated that the basic skeleton, common to the class of molecules analyzed, be specified.

The techniques of artificial intelligence have now been applied successfully to a problem of direct biological relevance, namely, the analysis of the high resolution mass spectra of estrogenic steroids. The performance of this program has been shown to compare favorably with the performance of trained mass spectroscopists (see Smith et al., 1972). The operation of the program is detailed in that publication, a copy of which is attached. Briefly, the program was designed to emulate the thought processes of an expert as far as possible. High resolution mass spectral data are searched for evidence indicating possible substituent placements about the estrogen skeleton. Molecular structures allowed by the mass spectral data are tested against chemical constraints, and candidate solutions are proposed. Further details of the performance in analysis of more than thirty estrogen-related derivatives are presented in the above publication.

Of particular significance in this effort were, in addition to exceptional performance, the potential for analysis of mixtures of estrogens WITHOUT PRIOR SEPARATION, and for generalization of the programming approach to other classes of molecules. Because of the structure of the Heuristic DENDRAL program it is immaterial whether the spectrum to be analyzed is derived from a single compound or a mixture of compounds. Each component is analyzed in turn, in terms of molecular structure, independently of the other components.
This facility, if successful in practice, would represent a significant advance of the technique of mass spectrometry. Many problem areas, because of physical characteristics of samples or limited sample quantities, could be successfully approached utilizing the spectra of the unseparated mixtures. Even in combined gas chromatography/mass spectrometry (GC/MS), many overlapping peaks will be unresolved and an analysis program must be capable of dealing with these mixtures.

In collaboration with Prof. H. Adlercreutz of the University of Helsinki, we have recently completed a series of analyses of various fractions of estrogens extracted from body fluids. These fractions (analyzed by us as unknowns) were found to contain between one and four major components, and structural analysis of each major component was carried out successfully by the above program. These mixtures were analyzed as unseparated, underivatized compounds.

The implications of this success are considerable. Many compounds isolated from body fluids are present in very small amounts, and complete separation of the compounds of interest from the many hundreds of other compounds is difficult, time-consuming and prone to result in sample loss and contamination. We have found in this study that mixtures of limited complexity, which are difficult to analyze by conventional GC/MS techniques without derivatization (which frequently makes structural analysis more difficult), can be rationalized even in the presence of significant amounts of impurities. A manuscript on this study has been submitted to the Journal of the American Chemical Society.

In the past year we have extended our library of high resolution mass spectra of estrogens to include 67 compounds. These data represent an important resource and have been included (as low resolution spectra for the moment) in a collection of mass spectra of biologically important molecules being organized by Prof. S. Markey at the University of Colorado.
These data have been used extensively in developing the program strategies for Meta-DENDRAL (see Part C, below).

The Heuristic DENDRAL program for complex molecules has received considerable attention during the last year in order to generalize it from its previous emphasis on specific classes of compounds and program strategies. By removing information which is specific to estrogens, the program has become much more general. This effort has resulted in a production version of the program which is designed to allow the chemist to apply the program to the analysis of the high resolution mass spectrum of any molecule with a minimum of effort. Given the spectrum of a known or unknown compound, the chemist can supply the following kinds of information to guide analysis of the mass spectrum:

a) Specification of the basic structure (superatom) common to the class of molecules.

b) Specification of the fragmentation rules to be applied to the superatom, in the form of bond cleavages, hydrogen transfers and charge placement.

c) Special rules on the relative importance of the various fragments resulting from the above fragmentations.

d) Threshold settings to prevent consideration of low intensity ions.

e) Available metastable ion data and the way these data are subsequently used -- to establish definitive relationships between fragment ions and their respective molecular ions.

f) Available low ionizing voltage data -- to aid the search for molecular ions.

g) Results of deuterium exchange of labile hydrogens -- to specify the number of, e.g., -OH groups.

We have been very successful in testing the generality of the program, with particular emphasis on other classes of biologically important molecules. We have used the program in analysis of high resolution mass spectra of progesterone and some methylated analogs, a small number of androstane/testosterone related compounds, steroidal sapogenins and n-butyl-trifluoroacetyl derivatives of amino acids.
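The kinds of chemist-supplied information listed in a) through g) above can be pictured as a single record handed to the program together with the spectrum. The following is only an illustrative sketch in Python: the names AnalysisGuidance and apply_threshold, the field names, and the numeric values are all assumptions made for this example and do not describe the actual input format of Heuristic DENDRAL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class AnalysisGuidance:
    """Hypothetical container for items a) through g); not the program's real input format."""
    superatom: str                                                  # a) basic structure common to the class
    fragmentation_rules: List[str] = field(default_factory=list)   # b) cleavages, H transfers, charge placement
    fragment_importance: Dict[str, float] = field(default_factory=dict)  # c) relative importance of fragments
    intensity_threshold: float = 0.0                                # d) ignore ions below this intensity
    metastable_pairs: List[Tuple[float, float]] = field(default_factory=list)  # e) (parent mass, daughter mass)
    low_voltage_masses: List[float] = field(default_factory=list)  # f) ions seen at low ionizing voltage
    exchangeable_hydrogens: Optional[int] = None                    # g) count from deuterium exchange, if known


def apply_threshold(peaks: List[Tuple[float, float]],
                    guidance: AnalysisGuidance) -> List[Tuple[float, float]]:
    """Keep only (mass, intensity) peaks at or above the threshold of item d)."""
    return [(mass, intensity) for mass, intensity in peaks
            if intensity >= guidance.intensity_threshold]


# Illustrative values only, not measured data.
guidance = AnalysisGuidance(superatom="estrogen skeleton", intensity_threshold=5.0)
kept = apply_threshold([(270.162, 100.0), (185.1, 2.0)], guidance)
```

Only the threshold of item d) is exercised here; the other fields simply show where each kind of information would be carried.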
B) Cyclic Structure Generator

The cyclic structure generator has been completed after several years of effort under the continuing guidance of Professor Lederberg. The boundaries, scope and limitations of chemical structure can now be specified. The cyclic structure generator now rests on a firm mathematical foundation such that we are confident of its thoroughness and ability to generate structures, prospectively avoiding duplicate structures. The prospective nature of the generator is a necessity for efficient implementation, as retrospective checking of each generated structure to eliminate redundancies is too time consuming.

The necessary concepts have recently been transformed into an operating program. A manuscript describing the mathematical theory of the heart of the generator, the labelling algorithm, has been accepted by Discrete Mathematics (H. Brown et al., 1973). A companion manuscript describing the mathematical theory of the complete generator has been submitted (H. Brown and L. Masinter, 1973, submitted). The cyclic structure generator in its entirety (encompassing acyclic and wholly cyclic structures and combinations thereof) will be described for chemists (L. Masinter et al., in preparation). Apart from the labelling algorithm, the remainder of the problem involves, first, the combinatorics of assignment of atoms to cycles or chains, and second, construction of acyclic radicals to attach to the rings using the well known principles of acyclic DENDRAL.

A companion manuscript will soon be submitted describing for chemists the core of the cyclic structure generator, the labelling algorithm. This algorithm is capable of construction of all isomers of wholly cyclic graphs which may be formed by labelling the nodes of a cyclic skeleton with atoms (e.g., C, N, O) or labelling the atoms of the skeleton with substituents (e.g., -CH3, -OH).
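To make the labelling problem concrete, the sketch below enumerates the distinct ways of labelling the nodes of a single ring with a given collection of atoms, treating two labellings as the same when a rotation or reflection of the ring carries one into the other. It is an illustration only and rests on several assumptions: the function names are invented for this example, the skeleton is restricted to one simple ring, and duplicates are removed retrospectively by reducing each labelling to a canonical form, which is simpler than, and different from, the prospective method of the labelling algorithm itself.

```python
from itertools import permutations


def ring_symmetries(n):
    """Return the 2n node permutations (rotations and reflections) of an n-membered ring."""
    symmetries = []
    for shift in range(n):
        symmetries.append(tuple((i + shift) % n for i in range(n)))  # rotation
        symmetries.append(tuple((shift - i) % n for i in range(n)))  # reflection
    return symmetries


def distinct_labellings(n, labels):
    """List one representative of each symmetry class of labellings of an n-ring.

    `labels` is a sequence of n atom symbols, e.g. "CCCCNN".
    """
    symmetries = ring_symmetries(n)
    seen = set()
    representatives = []
    for assignment in sorted(set(permutations(labels))):
        # Canonical form: the lexicographically smallest image under any symmetry.
        canonical = min(tuple(assignment[p[i]] for i in range(n)) for p in symmetries)
        if canonical not in seen:
            seen.add(canonical)
            representatives.append(canonical)
    return representatives


two_nitrogens = distinct_labellings(6, "CCCCNN")
```

For a six-membered ring with two nitrogens and four carbons this gives three labellings, corresponding to the familiar ortho, meta and para arrangements.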
Through the use of graph theory and the symmetry-group properties of cyclic graphs, the labelling algorithm avoids construction of redundant isomers. It identifies equivalent node positions prospectively, before labelling takes place. It is indicative of the precarious communication between chemists and mathematicians that this problem had remained unsolved (except for trivially simple cases) despite attention for over 100 years.

As an indication of the complexity of chemistry in terms of numbers of possible structures, take the example of C6H6. The most familiar molecule with this molecular formula is benzene. Yet there are 217 topological isomers for C6H6 (with valence constraints), of which only 15 are pure trees. The simple addition of one oxygen atom to the empirical formula of benzene, giving C6H6O, yields 2237 isomers, of which the most familiar representative is phenol.

The first exercise of the generator has been to create a dictionary of carbocyclic skeletons. This time-consuming task would otherwise have to be done each time a new molecular formula is presented. The dictionary is structured to contain keys as to type of skeleton, number of rings, ring fusion, and so forth. The constraints which we wish to implement are then simple to exercise in the context of the dictionary.

C) Analysis Using Additional Data Sources

Several additional techniques are available to the mass spectroscopist other than recording the conventional mass spectrum. They provide complementary data which frequently are of great assistance in rationalization of the conventional spectrum, either in terms of structure or fragmentation mechanisms. We have designed the Heuristic DENDRAL program for complex molecules to use data from these additional techniques in much the same way as a chemist does. The following three types of data can now be used:

I) Metastable Ion (MI) Data. Metastable ions provide a means for relating fragment ions to molecular ions in a mass spectrum.
This is important in two contexts. In examination of the spectrum of a known compound, the existence of a metastable ion provides strong evidence that a given fragment ion arises, at least in part, in a single decomposition process from an ion of higher mass (not necessarily the molecular ion). Investigations of this type are necessary to validate the fragmentation rules which guide the Heuristic DENDRAL program (e.g., investigations of metastable ions of estrogens, Smith, Duffield and Djerassi, 1972).

The second context is the analysis of mixtures of compounds to determine which fragment ions in a very complex spectrum are descended from which molecular parents. We have explored the analysis time and specificity of results as a function of the amount of metastable ion data available on a mixture. With such data we observe a 10- to 100-fold reduction in the computer time needed to arrive at single, correct solutions for various mixture components (rather than the 5-20 possible solutions to which the conventional mass spectrum alone limits the analysis). These results are reported in detail in the description of the analysis of the estrogen mixtures (Smith et al., 1973 F-IC (submitted)).

Metastable ions are those which are formed by fragmentation processes occurring during the flight of an ion after formation and acceleration. These fragmentation processes may occur at any point along the flight path of ions through the mass spectrometer. Because of the complex behavior of metastable ions formed in magnetic or electric fields, they are usually studied in field-free regions. A conventional double focussing mass spectrometer possesses two field-free regions where metastable ions may be studied. One region lies between the electric sector and the magnetic sector.
This region can be used to study so-called "normal" metastable ions, i.e., those metastable ions which are observed superimposed on the peaks in the conventional mass spectrum and which follow the relationship:

    observed mass of metastable ion = (mass of daughter)**2 / (mass of parent)

The other field-free region lies between the ion source and the electric sector. Metastable ions formed in this region can be examined by de-tuning one analyzer of the instrument (defocussing). This procedure allows establishment of specific relationships between ions involved in a metastable decomposition, so that the parent ion and its decomposition product can both be identified. This technique has led to much more useful information for the Heuristic DENDRAL program, as illustrated earlier in this section.

II) Low Ionizing Voltage (LV) Data. The key to successful operation of the Heuristic DENDRAL program is correct inference of the molecular ion(s) and molecular formula(e) in a given mass spectrum. In the past, metastable ion data were used to assist the program in correct identification of molecular ions. This procedure has now been supplemented by making the program cognizant of LV data. At lower ionizing voltages, molecular ions are formed with lesser amounts of excess internal energy. Most classes of molecules (those that display significant molecular ions) can be analyzed at a sufficiently low ionizing voltage such that only molecular ions are observed, as the internal energy is not sufficient to allow fragmentation. This technique was used extensively in the analysis of estrogen mixtures, and the resulting data simplify the program's task of determining molecular ions.

III) Isotopic Labeling. We have previously described how isotopic labeling of labile hydrogens with deuterium aids analysis. For example, the last phase of the analysis of spectra of complex molecules involves several "chemical" checks on the validity of proposed structures.
The knowledge of the number of hydroxyl groups can be a powerful filter to reject certain candidate structures (Smith et al., 1972).

There are many other kinds of data available to chemists engaged in structure elucidation. The details of chemical isolation and derivatization procedures may dictate that only certain types of functional groups are plausible. Spectroscopic data from other techniques (e.g., proton or C13 NMR, IR, UV) may be available for a particular unknown. We have designed the Heuristic DENDRAL program for complex molecules with these additional data in mind. Specific plans for implementation of these data as constraints on Heuristic DENDRAL are described in the Plans section below. Certain chemical information, for example, the knowledge that aromatic hydroxy functionalities have been methylated, can already be included as a constraint.

D) Extension of the Predictor Programs

The function of the Predictor in Heuristic DENDRAL has been to evaluate candidate solutions (structures) by prediction of their mass spectra, based on empirical fragmentation rules, and comparison of predicted versus observed spectra. This has been extended to high resolution mass spectra of complex molecules. Performance has been tested on estrogenic steroids and steroidal sapogenins.

There are other aspects of prediction of behavior that we have incorporated, or plan to incorporate, in the Predictor. We can now predict a minimum series of metastable defocussing experiments necessary to differentiate among candidate structures resulting from analysis of a mass spectrum. Other efforts are discussed in the Plans section, below. This approach amounts to design of optimum experimental strategies to effect a solution or minimize ambiguities.

We have begun to explore ways in which to predict the mass spectral behavior of molecules without the need to resort to the classical method of determining many mass spectra followed by empirical generalization. Dr.
Gilda Loew has been investigating extended Huckel molecular orbital theory in an attempt at qualitative prediction of bond strength. A report of the initial efforts on estrone will appear shortly (G. Loew et al., 1973). Briefly, calculated net atomic charges appear to have little bearing on subsequent fragmentation of the molecule. Bond densities (which are related to bond strengths), however, provide some indication of which bonds are likely to undergo scission in the first step of a fragmentation process.

PLANS:

As in the previous section, research plans are keyed to the objectives A-D.

A) Extension of Heuristic DENDRAL

I) We will continue use of the present program in collaborative studies with Prof. Adlercreutz concerning estrogenic steroids from, e.g., pregnancy urines. Work to date has inspired a synthetic program at Stanford University to verify conclusions of the program with regard to new estrogen metabolites. The planning program will be used extensively in analysis of the synthetic products also. As the capability for analysis of the mass spectra of other classes of steroids is developed, we hope to extend this collaboration.

II) We feel we have achieved a high level of compound-class independence in our present program. As more classes are analyzed we expect that further "cleanup" may be necessary, but easy to carry out.

III) We are presently accumulating a large number of high resolution mass spectra of pregnanes and androstanes. For example, the first step away from estrogen analysis was initially going to be the analysis of pregnanes, another biologically important class of steroids.