Reprinted from Advances in Mass Spectrometry, Volume V, The Institute of Petroleum

The Application of Artificial Intelligence in the Interpretation of Low-Resolution Mass Spectra

By ARMAND BUCHS, ALLAN B. DELFINO, CARL DJERASSI, A. M. DUFFIELD, B. G. BUCHANAN, E. A. FEIGENBAUM, J. LEDERBERG, GUSTAV SCHROLL,* and G. L. SUTHERLAND
(Departments of Chemistry, Computer Science, and Genetics, Stanford University, Stanford, California 94305, U.S.A.)

INTRODUCTION

The application of high-speed digital computers to organic mass spectrometry has become important to many laboratories, especially in connection with problems of data acquisition in high resolution mass spectrometry. The use of computers for the structural identification of organic compounds from their mass spectra by data retrieval is valuable when the solution space is limited to a series of compounds with known mass spectra. In general, this is not the case; hence attempts to use digital computers in the interpretation of unknown mass spectra are of vital importance for any future automatic reduction of experimental data. We wish to describe one approach to a general computer interpretation of low-resolution mass spectra of organic compounds and to present the results obtained with one special class, aliphatic amines.

THE GENERAL APPROACH

The molecular composition of an unknown compound defines the size of the solution space as the complete set of isomers with that empirical formula. The size of the search space increases drastically with the number of atoms. Whereas C3H9N yields four isomers, there exist 14,715,813 isomers with the composition C20H43N. The first step in an identification process will therefore be to reduce the solution space to a size where the generation of candidate structures is a feasible process in view of the cost involved in computer time.
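The growth of the solution space can be illustrated with a short counting sketch. It is not part of Heuristic DENDRAL; it assumes that "isomers" means constitutional isomers of saturated acyclic amines CnH2n+3N (stereoisomers ignored), and the function names are hypothetical. An amine is treated as a nitrogen atom carrying an unordered set of three substituents, each either hydrogen or an alkyl radical, and alkyl radicals are counted by the same recursion.

```python
from functools import lru_cache
from math import comb


def multisets_of_three(count, total):
    """Count unordered triples of substituents whose carbon counts sum to `total`.

    `count(k)` is the number of distinct substituents with k carbon atoms
    (k = 0 stands for a hydrogen atom).
    """
    ways = 0
    for i in range(total // 3 + 1):
        for j in range(i, (total - i) // 2 + 1):
            k = total - i - j  # the loop bounds guarantee i <= j <= k
            if i == j == k:
                ways += comb(count(i) + 2, 3)
            elif i == j:
                ways += comb(count(i) + 1, 2) * count(k)
            elif j == k:
                ways += count(i) * comb(count(j) + 1, 2)
            else:
                ways += count(i) * count(j) * count(k)
    return ways


@lru_cache(maxsize=None)
def alkyl_radicals(n):
    """Number of constitutionally distinct alkyl radicals CnH2n+1 (n = 0 is H)."""
    if n == 0:
        return 1
    # The radical carbon carries three substituents holding the other n - 1 carbons.
    return multisets_of_three(alkyl_radicals, n - 1)


def amine_isomers(n):
    """Number of constitutional isomers of the saturated acyclic amine CnH2n+3N."""
    if n < 1:
        raise ValueError("need at least one carbon atom")
    return multisets_of_three(alkyl_radicals, n)


for n in (3, 4, 5):
    print(n, amine_isomers(n))
```

For C3H9N this gives the four isomers mentioned above; `amine_isomers(20)` can be compared with the figure quoted for C20H43N, under the stated assumption about what is being counted.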
The programme (called Heuristic DENDRAL) takes the molecular composition and the mass spectrum as input and returns a list of acceptable candidate structures. Optionally, other physical data (e.g. NMR) can be introduced and this will further truncate the list of candidate structures. Heuristic DENDRAL originally contained five sub-routines,† Preliminary Inference Maker, Structure Generator, Predictor, Consistency Check, and Scoring Function. For the problem under consideration, aliphatic amines, only the first two sub-routines were used, since all the mass spectrometry theory was placed in the Preliminary Inference Maker and no further heuristics existed for use in other phases of the original programme. The present programme was deliberately designed to achieve maximum truncation of the search space within the Preliminary Inference Maker, since saturated amines yield many more isomers than aliphatic ethers of equal carbon content and it was desirable (in view of the operational time factor) to reduce to a minimum the number of possible solutions presented by the Preliminary Inference Maker.

* Present address: Chemical Laboratory II, University of Copenhagen, The H. C. Ørsted Institute, DK-2100, Copenhagen, Denmark.
† Programme modules are written in upper case.

In order to unambiguously define all possible amine sub-graphs (i.e. structural units containing the hetero-atom and having at least one free valence) the symbols T (tertiary), S (secondary), P (primary), and M (methyl) are used to delineate the degree of substitution on the α-carbon atoms of any saturated amine. Thus P refers to -CH2-NH2; SM to -CH(-)-NHCH3; and TS to -CH(-)-NH-C(-)(-)- (a dash with no atom attached denotes a free valence). The canonical order of the symbols is T > S > P > M and, using this convention, there exist 31 possible amine super-atoms.* Initially, all 31 possible amine super-atoms are placed on GOODLIST and each is removed when it fails a heuristic decision.
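The count of 31 super-atoms can be reproduced with a small enumeration. This is an illustrative sketch, not the original LISP code. It assumes that the nitrogen atom carries one, two, or three α-carbons, that each symbol leaves T = 3, S = 2, P = 1, and M = 0 free valences (as inferred from the examples above), and that a super-atom needs at least one free valence. The names `FREE_VALENCES` and `amine_superatoms` are hypothetical.

```python
from itertools import combinations_with_replacement

# Free valences left on an alpha-carbon of each kind (assumed from the examples):
# T (tertiary) 3, S (secondary) 2, P (primary) 1, M (methyl) 0.
FREE_VALENCES = {"T": 3, "S": 2, "P": 1, "M": 0}
CANONICAL_ORDER = "TSPM"  # T > S > P > M


def amine_superatoms():
    """List every super-atom name in canonical order, e.g. 'P', 'SM', 'TS'."""
    names = []
    for n_alpha_carbons in (1, 2, 3):
        for combo in combinations_with_replacement(CANONICAL_ORDER, n_alpha_carbons):
            # M, MM and MMM have no free valence and are therefore excluded.
            if sum(FREE_VALENCES[symbol] for symbol in combo) > 0:
                names.append("".join(combo))
    return names


goodlist = amine_superatoms()
print(len(goodlist))
print(goodlist)
```

Of the 34 possible combinations, only M, MM, and MMM have no free valence, which leaves the 31 entries initially placed on GOODLIST.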
All the programme's knowledge of the theory of mass spectrometry and nuclear magnetic resonance spectroscopy is stored within the Preliminary Inference Maker. It should be noted that an NMR spectrum, if available, can be successfully used, but the programme performs in an efficient manner using only mass spectrometric input.

If the empirical composition of an unknown's molecular ion corresponds to CnH2n+3N, the programme locates the amine rules (Fig 1), places the 31 possible super-atoms on GOODLIST, then checks to see whether sufficient carbon atoms are contained in the composition to build each of the separate super-atoms (Fig 1, decision SIZE). An NMR spectrum, if available, serves as the next input datum for the programme to scrutinize. Heuristic DENDRAL determines the total number of carbon-bound methyl groups, the number of N-methyl groups, and, if this latter parameter is zero and an integral curve exists, the number of protons attached to the α-carbon atoms of the amine. Those super-atoms which are incompatible with the NMR spectrum are then deleted from GOODLIST.

[Fig 1. Flow chart of the amine rules in the Preliminary Inference Maker. Boxes recoverable from the figure: RULES FOR ETHERS, KETONES, AMINES; AMINE RULES for the 31 super-atoms on GOODLIST; COMPOSITION (CnH2n+3N); SIZE; low-resolution MASS SPECTRUM; ALPHA CLEAVAGE; NOHIGH; NTUPLES; MAXINTENS; MIDINTENS; NONEHIGHER; BRANCHING; ALL (M-15); SOME (M-15); REARRANGEMENT; optional NMR SPECTRUM; TOTALMETHYLCOUNT; NMETHYLCOUNT; HYDCOUNT; GOODLIST; ISOMERCOUNTER. The legend marks inputs and outputs.]

The first mass spectrometric condition (ALPHA CLEAVAGE, Fig 1) programmed into Heuristic DENDRAL is related to the well-known propensity of aliphatic amines to fragment by α-cleavage. For those super-atoms with only one free valence (i.e. P (-CH2-NH2), PM (-CH2-NHCH3) and PMM (-CH2-N(CH3)2)) the first condition is that the α-fission peak (m/e 30, 44, and 58, respectively) must be the base peak. A second condition (NOHIGH, Fig 1) states that there should be no other peaks with an intensity higher than 10 per cent relative abundance above the mass value of one half the molecular weight. This latter rule was introduced to take care of special cases which arose when some of the smaller amine molecules were used as examples.

* Three others, M, MM, and MMM, exist but they translate to the special cases of methyl amine, dimethyl amine, and trimethyl amine respectively.

In the case of any other super-atom there is a definite number of α-cleavage fragments which must be located within the unknown mass spectrum (decision NTUPLES, Fig 1). Thus, for every free valence present in a super-atom, the programme has to calculate the mass of an equivalent number of alkyl radicals (referred to as NTUPLES). Certain super-atoms must yield α-cleavage peaks in their mass spectra which exceed an empirically determined value of 70 per cent relative abundance. If more than one set exceeds this limit, the largest intensity sum is accepted (decision MAXINTENS, Fig 1). Certain super-atoms (those secondary or tertiary α-mono- or α-disubstituted amines) must yield an NTUPLE set that exceeds only 30 per cent relative abundance, since these compounds are known to yield very abundant rearrangement ions. The programme decisions (Fig 1) labelled NONEHIGHER, BRANCHING, ALL (M-15), and SOME (M-15) relate to various facets of the α-cleavage process of aliphatic amines.

Super-atoms which either cannot yield rearrangement ions or only ions of moderate intensity are accepted at this stage as viable candidates provided they contain one surviving ntuple. Those super-atoms which can produce rearrangement ions of major intensity are further tested (decision REARRANGEMENT, Fig 1).
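The two conditions applied to super-atoms with a single free valence (the α-fission base peak and NOHIGH) can be sketched as follows. This is a hedged illustration, not the programme's code. The spectrum is represented as a mapping from m/e to relative abundance in per cent, the function names are hypothetical, and "other peaks" is taken to mean every peak except the α-fission peak itself, which is an assumption. The example spectra are invented numbers for demonstration only, not measured data.

```python
# m/e of the alpha-fission ion for each single-valence super-atom (from the text).
ALPHA_FISSION_PEAK = {"P": 30, "PM": 44, "PMM": 58}


def molecular_weight(n_carbons):
    """Nominal mass of CnH2n+3N: 12n + (2n + 3) + 14."""
    return 14 * n_carbons + 17


def passes_single_valence_rules(superatom, spectrum, n_carbons, nohigh_limit=10.0):
    """Apply the base-peak condition, then NOHIGH, to one candidate super-atom.

    `spectrum` maps m/e to relative abundance in per cent (base peak = 100).
    """
    alpha_peak = ALPHA_FISSION_PEAK[superatom]
    base_peak = max(spectrum, key=spectrum.get)
    if base_peak != alpha_peak:
        return False  # the alpha-fission peak is not the base peak
    half_mass = molecular_weight(n_carbons) / 2
    # NOHIGH: no other peak above half the molecular weight may exceed the limit.
    return all(
        intensity <= nohigh_limit
        for mass, intensity in spectrum.items()
        if mass != alpha_peak and mass > half_mass
    )


# Invented toy spectra for a C4H11N unknown (molecular weight 73).
toy_spectrum = {30: 100.0, 41: 6.0, 44: 5.0, 73: 8.0}
toy_with_high_peak = {30: 100.0, 58: 40.0, 73: 5.0}

print(passes_single_valence_rules("P", toy_spectrum, 4))
print(passes_single_valence_rules("PM", toy_spectrum, 4))
print(passes_single_valence_rules("P", toy_with_high_peak, 4))
```

Keeping the NOHIGH threshold as a parameter reflects that the limits in these rules were set empirically.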
Super-atoms tested in this way must have at least one intense ion in their mass spectrum originating from the amine rearrangement process (see a and a' → b).

Each candidate super-atom, together with the masses of the alkyl fragments which must be attached to its free valences (termed PARTITIONS in the programme), is sent to a sub-routine (ISOMERCOUNTER) which calculates the number of isomers compatible with that super-atom and its partition list.

Heuristic DENDRAL has been tested with 91 amine mass spectra; in 37 instances these were supplemented by NMR spectra. The programme always included the correct answer in its final output. A tremendous truncation of the hypothesis space (the number of possible isomers) was achieved in most instances. For example, there exist 2,156,010 isomers of tri-n-hexylamine and with mass spectrometry alone this figure was reduced to 240 acceptable candidates. With the addition of NMR spectroscopy, however, this list was further reduced to only one entry, the correct answer.

[Scheme: the molecular ion R-CH2-N-CH2-R' undergoes α-cleavage to the ions a (CH2=N-CH2R') and a' (R-CH2-N=CH2), which rearrange to the ion b. Charges and hydrogen atoms are not recoverable from the original scheme.]

EXPERIMENTAL

The programme described is written in the LISP programming language and runs on the IBM 360/67 computer at the Stanford University Computation Center. Without NMR data, the programme required 4.26 minutes to interpret 91 mass spectra. When NMR data are also used, the process is approximately 30 per cent faster.

ACKNOWLEDGMENTS

Financial assistance from the Advanced Research Projects Agency (Contract SD-183), the National Aeronautics and Space Administration (Grant NGR-05-020-004), the National Institutes of Health (AM 04257), and the allotment of a Fulbright travel award to G.S. is gratefully acknowledged.

Discussion

E. Kendrick (Esso Research Centre, Abingdon, Berkshire, U.K.): You use the programming language "LISP" in the work you have described.
Does this language have particular advantages for this type of work and what are these advantages?

B. G. Buchanan: The objects in LISP programmes are the so-called atoms, i.e. numbers or strings of letters and numbers. The atoms can be collected in lists and pairs, but the elements of a list or a pair can be lists and pairs as well as atoms. This allows the programmer to operate with complex structures in a much more efficient way than by using the arrays in FORTRAN or ALGOL. LISP differs in another important way from the ordinary programming languages. In LISP all functions are found in list-structures, and this means that it is possible to write LISP programmes which write new LISP programmes. Finally, the possibility of using recursive functions is very helpful in the programming.

G. Schomburg (Max Planck Institute, Mülheim, Germany): Most mass spectra of the components of complex isomer mixtures will be obtained by GC-MS work with high resolution GC (capillary). NMR-spectra are mostly not obtainable from species separated by capillary GC because of low sample load. How do you solve this problem?

B. G. Buchanan: The NMR-spectrum is not required as input to the programme. Hence, there will be no difficulties in applying DENDRAL to GC-MS work. However, when available, the NMR-spectra will lead to a decrease in execution time as well as in the number of candidates.