b) GAS CHROMATOGRAPHY/HIGH RESOLUTION MASS SPECTROMETRY

We will complete the intermediate disk buffer in conjunction with the ACME follow-on system transition to allow routine collection and filing of sequential spectra. We will exercise the system on body fluid samples in support of our clinical applications and the development of interpretation programs. As developments occur which improve sensitivity, we will incorporate these to extend the power of the system.

c) AUTOMATED GC/MS DATA REDUCTION

The approach described above is still in the formative stage. We will complete the development and implementation of these ideas, test them in the clinical application domain, and produce an automated system suitable for routine use by the biochemist.

d) CLOSED-LOOP INSTRUMENT CONTROL

With the development of a more automated method for acquiring metastable information under subtask (a) plans, we will develop and exercise the strategy planning aspects of the Heuristic DENDRAL programs in connection with managing a urine analysis GC/MS run. This will be a simulation of closed-loop operation intended to demonstrate the feasibility of, and need for, an actual implementation of these ideas. In support of these closed-loop simulations we will investigate the feasibility of instrument mode switching and of simple control functions such as ion source and electrostatic analyzer potentials and magnet scan.

REFERENCE - PART B(i)

1) Lederberg, Joshua, "Rapid Calculation of Molecular Formulas from Mass Values," Journal of Chemical Education, Vol. 49, Page 613, September, 1972.

Rapid Calculation of Molecular Formulas from Mass Values

Joshua Lederberg
School of Medicine
Stanford University
Stanford, California 94305

The calculation of molecular compositions consistent with a given range of mass values arises particularly in mass spectrometry. Although this can be a trivial exercise on the computer, it has been vexing to do by hand.
Published tables, e.g., Beynon and Williams,¹ are bulky, and nevertheless cover a limited range of atom values. The values are also awkward to search, not having been sorted.

The following approach was designed for a desk calculator that ought to be available to any student. As it involves only a few additions and subtractions, it can, horribile dictu, even be done by hand. Furthermore, it lends itself to real time implementation on small computers that lack high precision "divide" instructions in their hardware.

The basis of the calculation is the table, which is an ordered list of the mass numbers of the formulas for H from 0 to 10, N from 0 to 5, and O from 0 to 11. It contains only those compositions whose masses are an integral multiple of 12. Any number of C's may then be added as required.

The use of the table is best explained by a specific example, say m = 259.09 ± 0.001.

Step 1. Since 259 ≡ 7 (modulo 12), 5 H's (5.03913) will be borrowed to give m' = m + 5H = 264.129. This is then divided into m' = m_i + m_f; m_i = 264 (= 22 × 12); m_f = 0.129 ± 0.001.

Step 2. The table is searched for entries that correspond to m_f and whose mass does not exceed m_i. (m_i is expressed as m_i/12 = C-equivalent.) We find none in this cycle.

Step 3. We therefore remove 12 H's (12.0939) to give m'' = m' - 12H = 252.035 ± 0.001. The table now has entries at 0.034 (H8N4O8), 0.035 (H10NO9) and 0.036 (H6N5O5). These will be completed in Step 4. 12 H's are again removed until m_f falls below -0.0498, the bottom of the table. In our example, this occurs at the next cycle.

Step 4. The table entries are now completed as follows:

Index   m_f        Table entry   ≡C    Add C's to     Adjust          Check mass (compare
                                       make up m_i    borrowed H's    259.0900 ± 0.0010)
34      0.034216   H8N4O8        C16   C5             C5H15N4O8       259.089
35      0.035559   H10NO9        C14   C7             C7H17NO9        259.090
36      0.036895   H6N5O5        C13   C8             C8H13N5O5       259.092

Step 5.
Various criteria of chemical plausibility can be used to filter the list. Since the valence rules allow H's to a maximum of 2 + 2C + N, none of these compositions is oversaturated. C5H15N4O8, however, has an odd number of H's and may therefore represent a free radical.

If wider ranges of hetero atoms are contemplated, adjustments of blocks of 6 N (84.01844) and 12 O (191.9389) can be applied repetitively in a fashion similar to Step 3 so long as the adjusted mass allows. In fact m'' = m - 6N - 7H = 168.017 ± 0.001 leads to C6H11N8O4, m = 259.090. Further, m - 12N - 7H = 83.999 ± 0.001. We read this as m_i = 84; m_f = -0.001 and find two entries in the table: -0.000826 (H6NO10) and 0.000510 (H2N5O6), whose m_i, however, is > 84.

The table is arranged so as to illustrate its use in a fast computer program. A linear array with 138 cells, indexed as shown, has entries that never slip more than one position away from the value of the index. The composition values can therefore be accessed by direct lookup, obviating a table search. A card deck version of the table is available on request from the author. This compilation is a greatly shortened form of some tables that were published some time ago.²

This work has been supported in part by the Advanced Research Projects Agency (contract SD-183), the National Aeronautics and Space Administration (grant NGR-05-020-004), and the National Institutes of Health (grant GM-00612-01).

¹ Beynon, J. H., and Williams, A. E., "Mass and Abundance Tables for use in Mass Spectrometry," Elsevier, Amsterdam, 1963.
² Lederberg, J., "Computation of Molecular Formulas for Mass Spectrometry," Holden-Day, San Francisco, 1964.
Table of Mass Fractions for all Combinations* of H, N, O (H ≤ 10, N ≤ 5, O ≤ 11)

Index m_f x 10^6   H  N   O  ≡C   Index m_f x 10^6   H  N   O  ≡C   Index m_f x 10^6   H  N   O  ≡C
  -49     -49787   0  2  11  17       0          0   0  0   0   0      31      31537  10  3  11  19
  -45     -45765   0  0   9  12       1        510   2  5   6  14      32      32363   4  2   1   4
  -38     -38554   0  4  10  18       2       1853   4  2   7  12      34      34216   8  4   8  16
  -37     -37211   2  1  11  16       4       4532   2  3   4   9      35      35559  10  1   9  14
  -34     -34532   0  2   8  13       5       5875   4  0   5   7      36      36895   6  5   5  13
  -30     -30510   0  0   6   8       6       6385   6  5  11  21      38      38238   8  2   6  11
  -25     -25978   2  3  10  17       7       7211   0  4   1   6      40      40917   6  3   3   8
  -24     -24635   4  0  11  15       8       8554   2  1   2   4      41      42260   8  0   4   6
  -23     -23299   0  4   7  14      10      10407   6  3   9  16      42      42770  10  5  10  20
  -21     -21956   2  1   8  12      11      11750   8  0  10  14      43      43596   4  4   0   5
  -19     -19277   0  2   5   9      13      13086   4  4   6  13      44      44939   6  1   1   3
  -15     -15255   0  0   3   4      14      14429   6  1   7  11      46      46792  10  3   8  15
  -14     -14745   2  5   9  18      15      15765   2  5   3  10      49      49471   8  4   5  12
  -13     -13402   4  2  10  16      17      17108   4  2   4   8      50      50814  10  1   6  10
  -10     -10723   2  3   7  13      18      18961   8  4  11  20      52      52150   6  5   2   9
   -9      -9380   4  0   8  11      19      19787   2  3   1   5      53      53493   8  2   3   7
   -8      -8044   0  4   4  10      20      21130   4  0   2   3      56      56172   6  3   0   4
   -6      -6701   2  1   5   8      21      21640   6  5   8  17      57      57515   8  0   1   2
   -4      -4022   0  2   2   5      22      22983   8  2   9  15      58      58025  10  5   7  16
   -2      -2169   4  4   9  17      25      25662   6  3   6  12      62      62047  10  3   5  11
   -1       -826   6  1  10  15      27      27005   8  0   7  10      64      64726   8  4   2   8
                                     28      28341   4  4   3   9      66      66069  10  1   3   6
                                     29      29684   6  1   4   7      68      68748   8  2   0   3
                                     30      31020   2  5   0   6      73      73280  10  5   4  12
                                                                       77      77302  10  3   2   7
                                                                       81      81324  10  1   0   2
                                                                       88      88535  10  5   1   8
(-0.049 to -0.0008)                  (0 to 0.03)                       (0.03 to 0.088)

* Arranged so that the index for each entry agrees with 1000 × m_f ± 1.9.

[Reprinted from Journal of Chemical Education, Vol. 49, Page 613, September, 1972.] Copyright 1972, by Division of Chemical Education, American Chemical Society, and reprinted by permission of the copyright owner.
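The procedure in Steps 1 to 5 of the reprint can be sketched in Python as follows. This is a minimal illustration rather than the author's program: it regenerates the table by enumeration instead of using the indexed 138-cell array, the function names are hypothetical, and the atomic masses are assumed values chosen to agree with the block masses quoted in the text. The blockwise N and O adjustments for wider hetero atom ranges are not included.

```python
# Monoisotopic masses on the C = 12 scale. These values are assumed here; they
# agree with the block masses quoted in the text (12 H = 12.0939, 6 N = 84.01844).
H, N, O = 1.007825, 14.003074, 15.994915


def build_table(max_h=10, max_n=5, max_o=11):
    """List (m_f, H, N, O, C-equivalent) for every H/N/O composition whose
    nominal mass is a multiple of 12, sorted by mass fraction m_f."""
    rows = []
    for h in range(max_h + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                nominal = h + 14 * n + 16 * o
                if nominal % 12 == 0:
                    m_f = h * H + n * N + o * O - nominal
                    rows.append((m_f, h, n, o, nominal // 12))
    return sorted(rows)


def candidate_formulas(m, tol, table):
    """Return (C, H, N, O, mass) tuples within tol of m, following Steps 1-4."""
    nominal = round(m)
    borrowed = (-nominal) % 12          # Step 1: borrow H's up to a multiple of 12
    m_i = nominal + borrowed
    m_f = m + borrowed * H - m_i
    h_adjust = -borrowed                # H's to add back to a table entry
    found = []
    while m_i >= 0 and m_f >= table[0][0] - tol:
        for frac, h, n, o, c_eq in table:       # Step 2: scan the table
            c = m_i // 12 - c_eq                # Step 4: C's that make up m_i
            h_total = h + h_adjust
            if abs(frac - m_f) <= tol and c >= 0 and h_total >= 0:
                mass = 12 * c + h_total * H + n * N + o * O
                found.append((c, h_total, n, o, round(mass, 3)))
        m_i -= 12                               # Step 3: remove 12 H's
        m_f -= 12 * H - 12
        h_adjust += 12
    return found


def oversaturated(c, h, n):
    """Step 5 valence check: more H's than 2 + 2C + N."""
    return h > 2 + 2 * c + n


def odd_electron(h, n):
    """Parity check (an assumption beyond the text): H + N odd suggests a radical."""
    return (h + n) % 2 == 1


table = build_table()
hits = candidate_formulas(259.09, 0.002, table)
```

With the worked example m = 259.09, a tolerance of 0.002 is used here instead of the 0.001 quoted in the text, because the indexed lookup in the reprint also admits neighbouring entries; under that assumption the sketch returns the same three compositions as Step 4.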
PART B(ii): ANALYSIS OF THE CHEMICAL CONSTITUENTS OF BODY FLUIDS

OBJECTIVES:

The overall objectives of this part of the proposal are to develop the uses of gas chromatography (GC) and mass spectrometry (MS), under "intelligent" computer management, for the clinical screening, diagnosis, and study of errors of metabolism. The efficacy of these analytical tools has been demonstrated when applied to limited populations of urine samples in the research laboratory environment. We propose to enlarge the clinical investigative applications of GC/MS technology and to demonstrate its utility for the diagnosis and screening of disease states. Specifically, we will apply our GC/MS analysis capabilities to larger and more diversified populations to establish better defined norms, deviations related to identifiable disease states, and control parameters required to remove ambiguities from results.

BACKGROUND AND PROGRESS:

For some time we have focussed a substantial part of our effort on exploiting the use of the mass spectrometer as an analytical instrument for biochemical purposes. Our central approach has been to integrate the mass spectrometer with the gas chromatograph on the one hand and with "intelligent" computer management on the other. Gas chromatography is a versatile and broadly applicable method for the separation of biochemical specimens into a large number of distinct but unnamed fractions. The mass spectrometer has unique power to analyze such fractions and give information relevant to their molecular structure. The computer becomes indispensable for the overall management of the system and for the reduction and interpretation of the large volume of data emanating from the analytical instruments.
Our effort in instrumentation, therefore, is an integral part of this research and comprises a good deal of computational software embracing both real time instrument and data management as well as artificial intelligence. It also requires considerable effort in electronic and vacuum technology for the instrumentation hardware, and a coherent system approach for the overall integration of these components. These aspects of the effort are described in section B(i) of this proposal.

The routine screening of normal and abnormal body metabolites, as well as drugs and their metabolites, in human body fluids (ref 1) is currently the object of several research programs. Various non-specific methods, including thin layer (ref 2, 3), ion exchange (ref 4, 6), liquid (ref 5), and gas chromatography (ref 7-10), are used primarily with the goal of separating a large number of unnamed constituent materials. When used in conjunction with mass spectrometry, these methods become specific and provide a powerful means of positive identification of metabolites in human body fluids (ref 11-13). Of these techniques, gas chromatography is the most convenient to interface to the mass spectrometer because the carrier gas can easily be removed as the analysis proceeds on a continuous flow. Based upon the references cited, as well as our own on-going programs, the capability of the GC/MS technique for the analysis of body fluids is well established.

We have drawn upon the published literature in helping to design our experimental protocols. Standard chemical procedures for extracting, derivatizing, and hydrolyzing urine and plasma are used for the GC/MS analysis (ref 13). These procedures permit separation of the following classes of substances: acids, phenols, amino acids, and carbohydrates. It is possible to detect free or conjugated compounds within these classes. The gas chromatographic analysis of each class of compounds presents a metabolic profile.
Abnormal profiles (containing either excessively large peaks from one or more components or peaks which do not correspond to metabolites usually encountered) are then assayed by mass spectrometry. The mass spectra recorded during the elution of each gas chromatographic peak then serve to identify the constituents present in that peak.

Most medical centers have access to amino acid analyzers in order to screen patients for metabolic abnormalities of the principal amino acids, but unless a special research interest exists, other errors of metabolism cannot easily be studied. At this institution the GC/MS system provides us the opportunity to detect a wide variety of errors which show accumulation of novel amino acids, fatty acids, and many other metabolites in urine, blood, and other biological fluids and tissues.

Urine is known to contain several hundred organic compounds. The separation (gas chromatography) and hence identification (mass spectrometry) of these components would be an extremely difficult task. To simplify the separation problem the urine is chemically separated into four fractions as illustrated in the following diagram.

URINE (pH = 1, internal standards added)
   |
   +-- ether phase: free acids (A)
   |
   +-- aqueous phase
          |
          +-- amino acids (B)
          +-- carbohydrates (C)
          +-- hydrolysis
                 |
                 +-- ether phase: hydrolyzed acids (D)
                 +-- aqueous phase: amino acids (E)

The experimental procedure used for working with a urine sample is as follows. To an aliquot (2.5 ml) of a 24-hour urine sample is added 6N hydrochloric acid until the pH is 1. Two internal standards, n-tetracosane and 2-amino octanoic acid, are then added. Ether extraction isolates the free acids (fraction A), which are then methylated and analyzed by gas chromatography-mass spectrometry. An aliquot of the aqueous phase (0.5 ml)
is concentrated to dryness, reacted with n-butanol/hydrochloric acid followed by methylene chloride containing trifluoroacetic anhydride. This procedure derivatizes any amino acids (or water soluble amines), which are then subjected to GC/MS analysis (fraction B). Another aliquot (0.5 ml) of the aqueous phase can be derivatized for the detection of carbohydrates (fraction C). Concentrated hydrochloric acid (0.15 ml) is added to the urine (1.5 ml) after ether extraction and the mixture hydrolyzed for 4 hours under reflux. Ether extraction separates the hydrolyzed acid fraction (D), which is then methylated and analyzed by GC/MS. A portion of the aqueous phase (0.5 ml) from hydrolysis of the urine is concentrated to dryness and derivatized and analyzed for amino acids (fraction E).

As an example of the application of these methods to biomedical problems, we can use some recent studies we have undertaken on the urine of a patient suffering from acute lymphoblastic leukemia. The gas chromatographic profile (Figure 1) of the amino acid fraction of his urine showed the presence of an abnormal peak (A). The mass spectra (Figure 2) recorded during the lifetime of this chromatographic peak identified this component as beta-amino isobutyric acid from a comparison with a literature (ref 19) spectrum of authentic material. Quantitation showed that this patient was excreting 1.2 grams per day of beta-amino isobutyric acid. After medical treatment this metabolite was no longer detected in the patient's urine, thereby raising the question of whether beta-amino isobutyric acid can be used as a metabolic signature for the recognition of lymphoblastic leukemia and for the status of the disease in the course of the treatment cycle. Beta-amino isobutyric acid has been observed in the urine of 5 patients suffering from leukemia and in all instances it disappeared immediately following drug therapy.
We are continuing our study of this relationship in view of the recognized excretion of elevated amounts of beta-amino isobutyric acid as the result of a genetic trait. For instance, Harris et al. (ref 14) observed daily urinary excretions of 70-300 mg of beta-amino isobutyric acid and noted that histories of high excretion levels tended to exist in particular families.

As a second example of the application of GC/MS to biomedical problems we can cite preliminary studies on approximately 80 urine samples from a total of 11 premature or "small for gestational age" infants. This project was undertaken to investigate the phenomenon of late metabolic acidosis. This condition is characterised by low blood pH levels, poor weight gain, and, as distinct from respiratory acidosis, onset after the second day of life. Its incidence is higher in infants whose birthweight is less than 1750 g (one study shows 92% incidence for these children) than in infants with birthweight greater than 1750 g (26%).

Of the 11 patients studied we were able to observe 6 closely and continuously for periods ranging from 6 to 8 weeks from day 3 of life. Three of these infants had birthweights below 1000 g and the other three were born weighing less than 1500 g. Of the 6, five showed symptoms corresponding to late metabolic acidosis and the other showed normal and even development. The five infants showing the acidosis all excreted very large amounts of p-hydroxyphenyllactic acid together with smaller amounts of p-hydroxyphenylpyruvic acid and p-hydroxyphenylacetic acid. After reaching a peak, the presence of these compounds in the urine gradually diminished and almost completely disappeared at the time blood pH and weight gain had returned to normal. The infant who did not show symptoms of acidosis excreted only minute amounts of these compounds during the period of observation.
The occurrence of large amounts of these compounds in the urine indicates a temporary defect in phenylalanine-tyrosine metabolism, and dietary factors such as protein and vitamin intake can be shown to affect the incidence and the severity of the condition. It is hoped that further studies will result in a clearer picture of relationships between the condition and diet and hence lead to a reduction in its occurrence.

In the course of these studies, we have recognized two areas where computer analysis of the data is important in order to handle the volume of data involved and to standardize the analyses performed. At present these operations, GC profile analysis and mass spectrum identification, are largely manual. In the case of GC profile analysis, approximately 40 peaks for each profile must be analyzed in terms of their positions, sizes, etc. relative to other peaks in the profile and instrument parameters to evaluate the presence or absence of abnormalities. For each abnormal peak, a number of mass spectra (5 to 10), each containing ion abundance measurements at approximately 500 masses, must be compared against catalogued known materials for identification. If the material is not in the catalog, the mass spectrum must be interpreted from basic principles, using high resolution spectrometry and other data sources as appropriate. These are very tedious operations requiring automation for even the proposed limited screening volume. The developmental aspects of these computer-related portions of the research program are discussed in the other sections of this proposal.

FUTURE PLANS

In the next grant period we plan to extend our efforts in applying GC/MS techniques to clinical problems both in terms of defining norms and in terms of studying identifiable disease states in collaboration with clinical investigators.
The most appropriate target material for this developmental effort is the metabolic output of NORMAL subjects under controlled conditions of diet and other intakes. The eventual application of this kind of analytical methodology to the diagnosis of disease obviously depends on the establishment of normal baselines, and much experience already tells us how important nutrient and medication intake can be in influencing the composition of urine, body fluids, and breath.

Among the most attractive subjects for such a baseline investigation are newborn infants already under close scrutiny in the Premature Research Center and the Clinical Research Center of the Department of Pediatrics at this institution. Such patients are currently, for valid medical reasons, under a degree of dietary control difficult to match under any other circumstance. Many other features of their physiological condition are being carefully monitored for other purposes as well. The examination of their urine and other effluents is therefore accompanied by the most economical context of other information and requires the least disturbance of these subjects.

Two obvious factors which could profoundly influence the excretion of metabolites detected by GC/MS are maturity and diet. We have already initiated a program for serial screening of urinary metabolite excretion in premature infants of various gestational ages and determination of changes in the pattern of excretion of various metabolites as a function of age following birth.
These studies are being performed on infants admitted to the Center for Premature Infants and the Intensive Care Nursery at Stanford, a source of some 500 premature infants per year. In addition, in conjunction with an independent study on the effects of both quality and quantity of oral protein intake on the incidence and pathogenesis of late metabolic acidosis of prematurity, we plan to measure the urinary excretion patterns of various metabolites and thereby partially assess the effect of diet on this screening method.

We shall use the analyses on blood and urine specimens from normal individuals in the final development of rapid, automated identification of compounds described by mass spectrometry. The computer will be used to match an unknown mass spectrum with reference spectra contained in computer files. Programs are also being developed which will provide the strategy for the computer to interpret an unknown mass spectrum (not contained in the library) and directly identify the compound (see Parts A and C). Limited libraries exist for urine and plasma GC/MS analyses and will require progressive compilation (assisted by the DENDRAL interpretation programs) as our clinical sampling proceeds. This will in turn speed the throughput of the system by allowing the simple identification of materials by computer library search procedures. This library will be shared freely with other investigators.

Given our ability to identify various constituents of urine and plasma and to understand normal variation, we shall apply the GC/MS system to pathology, making use of patients with already identified metabolic defects for control purposes. The main application will, of course, be diagnostic, and patients with suggestive clinical manifestations, such as psychomotor retardation and progressive neurologic disease, as well as suggestive pedigrees (e.g. affected offspring of consanguineous parents or multiplex sibships) will be investigated.
These patients are seen relatively frequently at any university hospital, and their presence in the various in-patient and out-patient services of the Stanford Department of Pediatrics is well documented. The GC/MS system will be helpful in diagnosing not only errors of amino acid metabolism, but also many other metabolic disorders, some of which are lactic acidemia (ref 15), Refsum's disease (a defect in the oxygenation of phytanic acid (ref 16)), methylmalonic acidemia (ref 17) and orotic aciduria (ref 18). We also recognize the potential of this methodology to define new errors of metabolism.

We will collaborate with Professor Howard Cann of the Department of Pediatrics and derive much of the clinically significant material for analysis from patients in the Premature Research Center and the Clinical Research Center of the Department of Pediatrics and the Stanford University Children's Hospital. Analyses will be performed on existing GC and MS equipment in the Departments of Genetics and Chemistry.

REFERENCES

1) Schwartz, M.K., "Biochemical Analysis," Anal. Chem., 44, p. 9R (1972).
2) Heathcote, J.G., Davies, D.W., and Haworth, C., "The Effect of Desalting on the Determination of Amino Acids in Urine by Thin Layer Chromatography," Clin. Chim. Acta, 32, p. 457 (1971).
3) Davidow, B., Petri, N.L., and Quame, B., "A Thin Layer Chromatographic Screening Procedure for Detecting Drug Abuse," Amer. J. Clin. Pathol., 54, p. 714 (1968).
4) Efron, K. and Wolf, B.B., "Accelerated Single-column Procedure for Automated Measurement of Amino Acids in Physiological Fluids," Clin. Chem., 16, p. [illegible] (1972).
5) Burtis, C.A., "The Separation of the Ultraviolet-absorbing Constituents of Urine by High Pressure Liquid Chromatography," J. Chromatog., 52, p. 97 (1970).
6) Wilson-Pitt, W., Scott, C.D., Johnson, W.F., and Jones, G., "A Bench-top, Automated, High-resolution Analyzer for Ultraviolet Absorbing Constituents of Body Fluids," Clin. Chem., 16, p. 657 (1970).
7) Dalgliesh, C.E., Horning, E.C., Horning, M.G., Knox, K.L., and Yarger, K., "A Gas-liquid Chromatographic Procedure for Separating a Wide Range of Metabolites Occurring in Urine or Tissue Extracts," Biochem. J., 101, p. 792 (1966).
8) Teranishi, R., Mon, T.R., Robinson, A.B., Cary, P., and Pauling, L., "Gas Chromatography of Volatiles from Breath and Urine," Anal. Chem., 44, p. 168 (1972).
9) Pauling, L., Robinson, A.B., Teranishi, R., and Cary, P., "Quantitative Analysis of Urine Vapor and Breath by Gas-liquid Partition Chromatography," Proc. Nat. Acad. Sci. USA, 68, p. 2374 (1971).
10) Zlatkis, A. and Liebich, H.M., "Profile of Volatile Metabolites in Human Urine," Clin. Chem., 17, p. 592 (1971).
11) Mrochek, J.E., Butts, W.C., Rainey, W.T., and Burtis, C.A., "Separation and Identification of Urinary Constituents by Use of Multiple-analytical Techniques," Clin. Chem., 17, p. [illegible] (1971).
12) Horning, E.C. and Horning, M.G., "Human Metabolic Profiles Obtained by GC and GC/MS," J. Chromatog. Sci., 9, p. 129 (1971).
13) Jellum, E., Stokke, O., and Eldjarn, L., "Combined Use of Gas Chromatography, Mass Spectrometry, and Computer in Diagnosis and Studies of Metabolic Disorders," Clin. Chem., 18, p. 801 (1972).
14) Harris, H., "Family Studies on the Urinary Excretion of Beta-Amino Isobutyric Acid," Ann. Eugenics, Vol. 14, Page 43 (1953).
15) Haworth, J.C., Ford, J.L., and Younoszai, M.K., "Familial Chronic Acidosis due to an Error in Lactate and Pyruvate Metabolism," Canad. Med. Ass. J., 79, p. 773 (1967).
16) Herndon, J.H., Steinberg, D., and Uhlendorf, B.W., "Refsum's Disease: Defective Oxidation of Phytanic Acid in Tissue Cultures Derived from Homozygotes and Heterozygotes," New England J. of Med., 281, p. 1023 (1969).
17) Morrow, G., Schwartz, R.H., Hallock, J.A., and Barness, L.A., "Prenatal Detection of Methylmalonic Acidemia," J. Pediatrics, 77, p. 126 (1970).
18) Fallon, J.H., Smith, L.H., Graham, J.H., and Burnett, C.H., "A Genetic Study of Hereditary Orotic Aciduria," New England J. of Med., 270, p. 878 (1964).
19) Lawless, J.G. and Chadha, M.S., "Mass Spectral Analysis of C(3) and C(4) Aliphatic Amino Acid Derivatives," Anal. Biochem., 44, p. 473 (1971).
20) Reynolds, W.E., Bacon, V.A., Bridges, J.C., Coburn, T.C., Halpern, B., Lederberg, J., Levinthal, E.C., Steed, E., and Tucker, R.B., "A Computer Operated Mass Spectrometer System," Anal. Chem., 42, p. 1122 (1970).

[FIGURE 1. Gas Chromatogram of the Amino Acid Fraction of Urine]

[FIGURE 2. Mass Spectrum of Beta-Amino Isobutyric Acid]

PART C: EXTENSION OF THE THEORY OF MASS SPECTROMETRY BY COMPUTER (Meta-DENDRAL)

OBJECTIVES:

The Heuristic DENDRAL performance program described in Part A is an automated hypothesis formation program which models "routine", day-to-day work in science. In particular, it models the inferential procedures of scientists identifying components, such as those found in human body fluids. The power of this program clearly lies in its knowledge about various classes of compounds normally found in body fluids, which knowledge allows identification of the compounds.

The Meta-DENDRAL program described in this part is a critical adjunct to the performance program because it is designed to supply the knowledge which the performance program uses. Theory formation is essential in order to carry out the routine analyses, either by hand or by computer. However, the staggering amount of effort required to build a working theory (even for a single class of compounds) holds back the routine analyses.
The goal of the Meta-DENDRAL program is to form working theories automatically (from collections of experimental data) and thus reduce the human effort required at this stage. By shortening the time between collecting data for a class of compounds and understanding the rules underlying the data, the Meta-DENDRAL program will speed the development of diagnostic procedures.

Theory formation in science is both an intriguing problem for artificial intelligence research and a problem area in which scientists can benefit greatly from any help the computer can give. While the ill-structured nature of the theory formation problem makes it more a research task than an application, we have already provided computer programs which are of definite help to the theory-forming scientist.

Mass spectrometry is the task domain for the theory formation program as it is for the Heuristic DENDRAL program. It is a natural choice for us because we have developed a large number of computer programs for manipulating molecular structures and mass spectra in the course of Heuristic DENDRAL research and because of the interest in mass spectrometry among collaborative researchers already associated with the project. This is also a good task area because it is difficult, but not impossible, for human scientists to develop fragmentation rules to explain the mass spectrometric behavior of a class of molecules. Mass spectrometry has not been completely formalized, and there still remain gaps in the theory.

Understanding theory formation enough to automate substantial parts of it will benefit all of the biomedical sciences. More directly, building a computer program which forms a theory of mass spectrometry will greatly enhance the power of mass spectrometry as a diagnostic instrument.
Detailed accounts of this research are available in the DENDRAL Project annual report to the National Institutes of Health, in several research papers already published, and in manuscripts submitted for publication.

PROGRESS:

In the period covered by the initial NIH grant the Meta-DENDRAL program has moved from a set of ideas to a set of working computer programs. The first three segments of Meta-DENDRAL have been programmed and can be used with new experimental data. These segments are first summarized and then described in more detail in subsequent sections.

We described the initial design of the Meta-DENDRAL program in a paper presented to the 2nd International Joint Conference on Artificial Intelligence (London, August, 1971). Further design details and partial implementation of programs were described in a paper presented at the 7th Machine Intelligence Workshop (Machine Intelligence 7, B. Meltzer & D. Michie, eds., 1972).

Summary of Segment 1

The data interpretation and summary program (INTSUM) defines the space of mass spectrometric processes, interprets all the data in terms of these processes, and summarizes them process by process. This program is capable of a much more thorough analysis of the data than a human can perform.

Summary of Segment 2

The rule formation program starts with the interpreted and summarized results of the data. It searches the set of processes for those that meet the criteria for rules, and attempts to resolve ambiguities when several processes explain many of the same data points. The resulting rules are characteristic processes for the whole class of molecules.

Summary of Segment 3

The class separation program is an extension of the simple rule formation program just mentioned. Because the initial set of molecules may not all behave alike in the mass spectrometer, it is necessary to separate the important subclasses and formulate characteristic rules for each subclass.

SEGMENT 1.
The initial segment of the theory formation program is data interpretation. After the experimental data have been collected for a large number of compounds, the program re-interprets all the data points in terms of its internal model of the experimental instrument. This part of the program has already proved useful to chemists studying the mass spectrometry of new classes of compounds. It has been described in a paper recently submitted for publication (Applications of Artificial Intelligence for Chemical Inference X. INTSUM. A Data Interpretation Program as Applied to the Collected Mass Spectra of Estrogenic Steroids, submitted to Tetrahedron).

The computer program for data interpretation and summary has been well developed. While it is never safe to call a program "finished", this program has reached the stage where we have turned it over to the chemists who want to look at explanatory mechanisms for the mass spectra of many compounds. Ordinarily, this is such a tedious task that chemists are forced to limit their analysis to a very few out of a total space of potentially interesting mechanisms. The computer program, on the other hand, systematically explores the space of possible mechanisms and collects evidence for each. This program is described in the Machine Intelligence 7 paper, and the results obtained by running it with many estrogen spectra are discussed in the manuscript submitted to Tetrahedron.

Mr. William C. White has been largely responsible for coding the program in LISP. The program runs in the overnight LISP system at the Medical School's ACME facility, and on the Stanford Computation Center IBM 360/67. It is currently being used by Dr. Steen Hammerum, a post-doctoral fellow in chemistry from the University of Copenhagen, to summarize the fragmentations found in the spectra of substituted progesterones, and by Dr. Dennis Smith to interpret data from other classes of steroids.

SEGMENT 2.
The second segment of Meta-DENDRAL produces reasonable rules of mass spectrometry. The rule formation segment starts with the interpreted and summarized data from the first segment. It looks for the processes which are most frequent, which explain highly significant data points, and which are least ambiguous with other processes. After applying these criteria, it selects a set of processes which appear to be characteristic of the whole set of molecules initially given.

Planning before rule formation is necessary because there is so much information in the summary of possible fragmentations found in the data. It is desirable to collect all the information to avoid missing unanticipated mechanisms which occur frequently throughout the compounds in the data. But even the summary of the mechanisms is voluminous enough to obscure the "obvious" rules waiting to be found.

In a planning program implemented by Mr. Steven Reiss, the computer peruses the summary looking for mechanisms with "strong enough" evidence to call them first-order rules of mass spectrometry. Our criteria for strong evidence may well change as we gain more experience. For the moment, the program looks for mechanisms which (a) appear in almost all the compounds (80%) and (b) have no viable alternatives (where "viable alternatives" are those alternative explanations which are frequently occurring and cannot be distinguished unambiguously).

The output of this program, even though crude in many senses, is useful to chemists who first want to see the highly reliable, unambiguous rules which can be formulated. If there are none, of course, there is little point in pressing ahead blindly. This is an indication that some modifications need to be made, for example, splitting up the original set of compounds into more homogeneous subgroups.
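The two planning criteria stated above, (a) and (b), can be illustrated with a short sketch. This is not the LISP planning program itself; it is a minimal Python illustration, assuming a hypothetical record layout (MechanismSummary) and hypothetical mechanism names, with the 80% threshold taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class MechanismSummary:
    """Evidence for one candidate mechanism (hypothetical record layout)."""
    name: str
    compounds_explained: int  # number of compounds in which the mechanism appears
    viable_alternatives: list = field(default_factory=list)  # competing explanations


def first_order_rules(summaries, total_compounds, min_coverage=0.80):
    """Return the names of mechanisms that satisfy both planning criteria.

    (a) the mechanism appears in at least min_coverage of the compounds;
    (b) the mechanism has no viable alternative explanations.
    """
    if total_compounds <= 0:
        raise ValueError("total_compounds must be positive")
    selected = []
    for s in summaries:
        coverage = s.compounds_explained / total_compounds
        if coverage >= min_coverage and not s.viable_alternatives:
            selected.append(s.name)
    return selected


# Hypothetical summary for a set of 20 compounds.
summaries = [
    MechanismSummary("M1", 19),          # frequent and unambiguous
    MechanismSummary("M2", 18, ["M5"]),  # frequent, but has a viable alternative
    MechanismSummary("M3", 9),           # unambiguous, but too infrequent
]
print(first_order_rules(summaries, total_compounds=20))
```

In this made-up example only M1 passes both tests. An empty result corresponds to the case described above, in which the original set of compounds may need to be split into more homogeneous subgroups.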
On the other hand, if some likely rules can be found, these will serve as "anchor points" for resolving ambiguities with other sets of mechanisms and also serve as a "core" of rules to be extended and modified in the course of detailed rule formation.

SEGMENT 3.

As mentioned above, class separation is important because the initial collection of compounds may not be known to behave alike in the instrument. The rule formation program must be prepared to retract its assumption of homogeneity. Mr. Steven Reiss, working with Dr. Buchanan, has written a first extension of the rule formation program which allows class separation on the basis of characteristic rules found for the subclasses. A paper describing segments 2 and 3 - rule formation with subclass separation - has been submitted to the 3rd International Joint Conference on Artificial Intelligence.

The computer programs produced to date have already proved useful for helping to formulate mass spectrometry theory for classes of biologically relevant molecules. Chemists have used these programs as tools for rule formation. They have examined the estrogenic steroids this way, including separate studies on some equilenins, acetates and benzoates. Also, they have used the program to interpret data from several classes of pregnanes.

PLANS:

In the coming period we propose to focus on three aspects of theory formation. We plan to (1) extend the capabilities of the programs, (2) make our rule formation programs more usable by chemists, and (3) continue our exploration of the more theoretical aspects of rule formation.

1. We anticipate new difficulties as the classes of molecules under study become more complex, either with respect to structural features or mass spectrometric behavior. Although we have made the programs flexible, extending the work just to new sets of data will undoubtedly introduce new problems.
Now that the usefulness of the programs has been demonstrated, we propose to couple the theory formation program more closely to data of more direct clinical relevance. For example, the mass spectrometry of amino acids and the aromatic acids frequently found in urine needs to be better understood before automatic analysis of the components of (the acid and neutral fractions of) urine is successful. Parts A and B of this proposal, in other words, can both be helped by the continuation of Part C.

The program is now limited to forming rules which are more descriptive of the sample than explanatory. We are currently working on ways of generalizing the descriptive rules so that they are more truly general. Drs. Sridharan and Buchanan have started experimenting with computer programs which generalize the rules in various ways. Mr. Carl Farrell is currently working on a computer program for his Ph.D. thesis which allows systematic exploration of various methods of generalizing on rules. His work investigates the efficacy of different control structures as well as different inductive rules.

2. The programs are now used by chemists, but not without a fair amount of help from the programming staff. We must overcome some of the barriers to facile use before the programs can be counted as successful. For example, putting the data in the correct format can be made easier, as can defining constraints on the search space and modifying parameter values. The programs do not now require the chemist to know LISP. However, we propose to develop easier access to control of the programs through careful design of the user interface. Depending on hardware limitations, we would also like to provide a time-shared, graphics-oriented interface.

3. The descriptive form of rules mentioned above may be inherent in the conceptual framework we have chosen for the rule formation program.
The program uses a "ball and stick" model of molecular structures, so it is no surprise that situations and actions in rules are simply described. We wish to explore more sophisticated models of mass spectrometry with the hope of discovering how a program could search the space of possible models during rule formation. This is still a very challenging problem. We have so far concentrated on more practical aspects of theory formation - i.e., producing results of immediate utility. But we feel strongly that we must grapple with the outer reaches of the problem in order to arrive at meaningful solutions.

PUBLICATIONS -- PART C

B.G. Buchanan, E.A. Feigenbaum, J. Lederberg, "A Heuristic Programming Study of Theory Formation in Science", in Proceedings of Second International Joint Conference on Artificial Intelligence, Imperial College, London (September, 1971). (Also Stanford Artificial Intelligence Project Memo No. 145, Computer Science Dept. Report CS-221)

B.G. Buchanan, E.A. Feigenbaum, and N.S. Sridharan, "Heuristic Theory Formation: Data Interpretation and Rule Formation". In Machine Intelligence 7, Edinburgh University Press (1972).

B.G. Buchanan and N.S. Sridharan, "Rule Formation on Non-Homogeneous Classes of Objects", submitted for presentation at the Third International Joint Conference on Artificial Intelligence (Stanford, August, 1973).

PART D: APPLICATIONS OF CARBON(13) NUCLEAR MAGNETIC RESONANCE SPECTROMETRY TO ASSIST IN CHEMICAL STRUCTURE DETERMINATION

PART D. CARBON-13 NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY

The goal of our Heuristic DENDRAL research is to develop rapid, accurate and flexible computer techniques for identifying unknown steroids and other biologically important compounds from spectroscopic data. We have made significant progress toward this goal: Our system is currently capable of correctly analyzing high-resolution mass spectra of estrogenic steroids and mixtures thereof.
As we extend our methods to the more complex problems presented by other steroid classes, and eventually by other types of biologically important molecules, we will find it necessary to have available sources of structural information other than mass spectroscopy. Carbon-13 nuclear magnetic resonance (CMR) spectroscopy is an ideal candidate.

Basically, the CMR experiment measures the extent to which each carbon nucleus in the sample molecule is shielded from an applied magnetic field. This shielding, or chemical shift, is caused by the distribution of electrons around the nucleus, and is determined by the carbon's hybridization and local chemical environment. Other investigators have determined that the shift of a carbon is strongly dependent upon the nature and placement of substituents at nearby centers, and that to a first approximation these substituent effects are additive. Thus, the CMR spectrum of a compound contains information which rather straightforwardly can be related to the possible local environments of each carbon.

The structural information provided by CMR data complements that from mass spectroscopy, and there is relatively little redundancy between the two methods. Data from the latter represent molecular fragmentations, which take place most readily near functional groups. Thus, mass spectroscopy frequently gives structural information about the environments of such groups. In CMR spectroscopy, on the other hand, the chemical shifts of carbons in large alkyl moieties, far removed from functionality, are the best understood and the most predictable. Further, the [illegible] fragmentation of large molecules such as steroids can show the general pattern of substitution in the molecule, while CMR shifts are sensitive to specific local patterns. Because the two methods "mesh" so nicely, we see the development of analytic CMR techniques as an extremely fruitful field of research.
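The first-order additivity of substituent effects mentioned above can be sketched as a simple sum of increments. The numerical values below are placeholders chosen only for illustration; they are not taken from this proposal and are not calibrated shift parameters. The position labels (alpha, beta, gamma) and the function name are likewise assumptions.

```python
# Placeholder base value and per-neighbour increments in ppm.
# These numbers are illustrative assumptions, not measured parameters.
PLACEHOLDER_BASE = -2.0
PLACEHOLDER_INCREMENTS = {"alpha": 9.0, "beta": 9.5, "gamma": -2.5}


def predict_shift(neighbour_counts, base=PLACEHOLDER_BASE,
                  increments=PLACEHOLDER_INCREMENTS):
    """Predict a carbon's chemical shift under a purely additive model.

    neighbour_counts maps a position label to the number of substituent
    carbons at that distance from the carbon of interest.
    """
    shift = base
    for position, count in neighbour_counts.items():
        shift += increments[position] * count
    return shift


# A carbon with two alpha neighbours and one beta neighbour.
print(predict_shift({"alpha": 2, "beta": 1}))

# Additivity: one extra gamma neighbour changes the prediction by the same
# increment regardless of the rest of the environment.
print(predict_shift({"alpha": 2, "beta": 1, "gamma": 1})
      - predict_shift({"alpha": 2, "beta": 1}))
```

Because each environment contributes independently in such a model, an observed shift can be compared against the predictions for a limited number of candidate local environments, which is what makes the spectrum straightforward to relate to structure.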
Our eventual aim is to completely define the structures of unknown compounds using only these two sources of information.

We are well equipped to study this field. In our Chemistry department, we have a Varian XL-100 (Fourier-transform) nuclear magnetic resonance spectrometer, one of the most sensitive and flexible instruments currently available for CMR work. We have competent investigators in our Chemistry and Computer Science departments who are interested in, and in fact currently working on, the project. Finally, we have had considerable experience with computerized structure analysis, and much of what we have learned can be applied to the CMR problem.

We have already begun investigating the use of CMR data in automated structure analysis, with our initial study focussed upon the acyclic amines. The analysis of low-resolution mass spectra of large amines is not capable of discerning the structures of long alkyl chains, so we felt that this class of molecules would provide a good test of CMR methods. Ms. Hanne Eggert of our group has obtained the CMR spectra of over 100 acyclic amines, and has derived an accurate set of predictive rules relating structure to chemical shifts. Dr. Raymond E. Carhart has used these rules to develop a computerized approach to the identification of amine structures from observed CMR spectra (see attached manuscript). The program, entitled AMINE, has proven to be extremely selective: The analysis of the CMR spectrum of trioctyl amine, for example, yields only seven possible structures, though the molecule has over 700 million structural isomers. In contrast, the analysis of the low-resolution mass spectrum of triheptyl amine gives nearly 2000 solutions out of a possible 38 million isomers. These results illustrate the tremendous amount of structural information which CMR spectroscopy can provide.
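The two selectivity figures quoted above can be put on a common scale by computing the fraction of the isomer space that survives each analysis. The sketch below uses the rounded counts from the text ("over 700 million", "nearly 2000"), so its results are order-of-magnitude estimates only; the function name is an assumption.

```python
import math


def selectivity(candidates, isomers):
    """Return (surviving fraction, orders of magnitude of reduction)."""
    fraction = candidates / isomers
    return fraction, -math.log10(fraction)


# Rounded counts quoted in the text.
cases = {
    "CMR, trioctyl amine": (7, 700_000_000),
    "low-resolution MS, triheptyl amine": (2000, 38_000_000),
}

for label, (candidates, isomers) in cases.items():
    fraction, orders = selectivity(candidates, isomers)
    print(f"{label}: fraction {fraction:.1e}, about {orders:.1f} orders of magnitude")
```

On these rounded figures the CMR analysis narrows the isomer space by roughly eight orders of magnitude, against roughly four for the low-resolution mass spectrum, even though the CMR example starts from the larger space.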
This source of information has, in general, been ignored in steroid-identification research, primarily because large amounts of sample (50 milligrams or more for steroids) are needed to obtain reliable CMR spectra. However, CMR spectroscopy is still a relatively new field, and the sensitivity of current instruments is far from the threshold which new technologies can provide. We expect the minimum sample size to drop to the sub-milligram level in the future, and with such sensitivity, the CMR spectrometer could be a powerful tool in biochemical and medical research. If this tool is to be utilized to its fullest extent, it is important that we begin now to develop the concepts and techniques needed in the interpretation of CMR data.

We propose, then, to study various classes of steroids in a manner analogous to the amine study, with the goal of developing a program which can "reason out" steroid structures from CMR data, perhaps in combination with mass-spectral data. Ms. Eggert has already collected CMR data on a variety of keto-substituted androstanes and cholestanes to assess the effect of the carbonyl group on the chemical shifts of the steroid-skeleton carbons, and has, in the process, uncovered some mistaken CMR shift assignments published in the literature. We will study a variety of functional groups in this way, deriving general rules for predicting the spectra of more complex steroids. As these rules emerge, we will couple them with the computerized heuristic-search and structure-generation techniques which we have developed in our previous mass- and CMR-spectroscopy research.

PUBLICATIONS -- PART D

R.E. Carhart and C. Djerassi, J. Chem. Soc. (Perkin II), submitted for publication (see attached preprint).

H. Eggert and C. Djerassi, J. Amer. Chem. Soc., in press.
Proofs (if required) by air mail to
Professor Carl Djerassi
Department of Chemistry
Stanford University
Stanford, California 94305

Applications of Artificial Intelligence for Chemical Inference. XI. Analysis of Carbon-13 NMR Data for Structure Elucidation of Acyclic Amines

Raymond E. Carhart* and Carl Djerassi, Departments of Computer Science and Chemistry, Stanford University, Stanford, California, 94305, U. S. A.

This paper describes a computer program, entitled AMINE, which uses a set of predictive rules to deduce the structures of acyclic amines from their empirical formulae and Carbon-13 NMR (CMR) spectra. The results, summarized in Tables 2-5, of testing the program on 102 amines indicate that AMINE is quite accurate and selective, even for large amines with many millions of structural isomers, and demonstrate that the computerized analysis of CMR data can be a powerful analytical tool. The logical structure of the program is outlined here, including a section on the general problem of spectrum matching. Generalizations of the methods used by AMINE are suggested.

I. INTRODUCTION

In recent years, there has been a substantial amount of research directed toward the computerized identification of molecular structure from mass-spectroscopic, NMR, and infra-red data. Our Heuristic DENDRAL program, which relies primarily upon mass-spectral data, has been shown to be quite accurate for certain classes of saturated, acyclic, monofunctional compounds, and more recently, the methods have been extended to the estrogenic steroids (ref. 3b). There are limitations to the information content of mass-spectral data, however, particularly when compounds are considered which have long, perhaps highly branched alkyl chains.
An analysis of the mass spectrum of triheptylamine, for example, yields about 2000 solution structures, and although this is only a small fraction of the roughly 40 million (non-stereochemical) isomers of C21H45N, it is still an impractically large number. The problem is that alkyl moieties do not give characteristic fragmentation patterns, and in fact, most spectroscopic methods are relatively insensitive to their structure. However, recent studies indicate that C-13 nuclear magnetic resonance (CMR) spectroscopy is an exception. For several classes of compounds, rules have been obtained which allow one to predict the CMR spectrum of a substance from its molecular structure, and in all cases, the rules indicate that the chemical shift of any carbon, even one in a large alkyl chain-end, depends heavily upon branching at nearby centers. Thus, it appears that CMR spectroscopy, either alone or in combination with other methods, could be a powerful tool in the computerized analysis of molecular structure. This paper outlines the methods by which such an analysis may be carried out for the acyclic amines, and describes a FORTRAN IV computer program (ref. 10), entitled AMINE, in which these methods are implemented.