V. PROJECT DESCRIPTIONS

CORE RESEARCH & DEVELOPMENT

A. DENDRAL

Project: DENDRAL Realtime
Investigators: Edward Feigenbaum, Joshua Lederberg, and Carl Djerassi
Depts. of Chemistry, Computer Science, and Genetics

The DENDRAL project involves collaboration among the Instrumentation Research Laboratory (IRL), operating under NASA grant NGR-05-020-004; investigators operating under NIH grant RR00612; and ACME. The emphasis of the DENDRAL-ACME effort is computer science, while that of the IRL-ACME effort is data acquisition and computer control of instruments.

The DENDRAL project aims at emulating in a computer program the inductive behavior of the scientist in an important but sharply limited area of science: organic chemistry. Most of the work is addressed to the following problem: given analytic data (the mass spectrum) of an unknown compound, infer a workable number of plausible solutions, that is, a small list of candidate molecular structures. To complete the task, the DENDRAL program then deduces the mass spectrum predicted by the theory of mass spectrometry for each of the candidates and selects the most productive hypothesis, i.e., the structure whose predicted spectrum most closely matches the data.

The project has designed, engineered, and demonstrated a computer program that manifests many aspects of human problem-solving techniques. It also works faster than human intelligence in solving problems chosen from an appropriately limited domain of types of compounds, as illustrated in the cited publications.

Some of the essential features of the DENDRAL program include:

Conceptualizing organic chemistry in terms of topological graph theory, i.e., a general theory of ways of combining atoms.

Embodying this approach in an exhaustive HYPOTHESIS GENERATOR. This is a program which is capable, in principle, of "imagining" every conceivable molecular structure.
Organizing the GENERATOR so that it avoids duplication and irrelevancy, and moves from structure to structure in an orderly and predictable way.

The key concept is that induction becomes a process of efficient selection from the domain of all possible structures. Heuristic search and evaluation are used to implement this "efficient selection."

Most of the ingenuity in the program is devoted to heuristic modifications of the GENERATOR. Some of these modifications result in early pruning of unproductive or implausible branches of the search tree. Other modifications require that the program consult the data for cues (pattern analysis) that can be used by the GENERATOR as a plan for a more effective order of priorities during hypothesis generation. The program incorporates a memory of solved sub-problems that can be consulted to look up a result rather than compute it over and over again. The program is also designed to make it easy for the chemist to enter new ideas when he perceives discrepancies between the actual functioning of the program and his expectation of it.

The DENDRAL research effort has continued to develop along several dimensions during Fiscal 1973. The mass spectra of some previously uninvestigated compounds were recorded. The computer program has been extended to analyze the mass spectra of a more complex class of compounds, using new kinds of data. The artificial intelligence work on theory formation and program generality has also progressed.

The techniques of artificial intelligence have been applied successfully for the first time to a problem of direct biological relevance, namely the analysis of the high resolution mass spectra of estrogenic steroids. The performance of this program has been shown to compare favorably with the performance of trained mass spectroscopists. (See Smith, et al.
(1972).) Of particular significance in this effort were, in addition to exceptional performance, the potential for analysis of estrogens without prior separation and for generalization of the programming approach to other classes of molecules.

Because of the structure of the Heuristic DENDRAL program for estrogens, it is immaterial whether the spectrum to be analyzed is derived from a single compound or a mixture of compounds. Each component is analyzed in turn, in terms of molecular structure, independently of the other components. This facility, if successful in practice, would represent a significant advance in the technique of mass spectrometry. Many problem areas, because of the physical characteristics of samples or limited sample quantities, could be successfully approached using the spectra of the unseparated mixtures. Even in combined gas chromatography/mass spectrometry (GC/MS), many mixture components will be unresolved, and an analysis program must be capable of dealing with these mixtures.

We have, in collaboration with Prof. H. Adlercreutz of the University of Helsinki, recently completed a series of analyses of various fractions of estrogens extracted from body fluids and supplied to us by Prof. Adlercreutz. These fractions (analyzed by us as unknowns) were found to contain between one and four major components, and structural analysis of each major component was carried out successfully by the above program. These mixtures were analyzed as unseparated, underivatized compounds. The implications of this success are considerable. Many compounds isolated from body fluids are present in very small amounts, and complete separation of the compounds of interest from the many hundreds of other compounds is difficult, time-consuming, and prone to result in sample loss and contamination.
We have found in this study that mixtures of limited complexity, which are difficult to analyze by conventional GC/MS techniques without derivatization (which frequently makes structural analysis more difficult), can be rationalized even in the presence of significant amounts of impurities. A manuscript on this study has been submitted to the Journal of the American Chemical Society.

In the past year we have extended our library of high resolution mass spectra of estrogens to include 67 compounds. These data represent an important resource and have been included (as low resolution spectra for the moment) in a collection of mass spectra of biologically important molecules being organized by Prof. S. Markey at the University of Colorado.

The Heuristic DENDRAL program for complex molecules has received considerable attention during the last year in order to remove information and program strategies that are specific to one compound class. By removing information which is specific to estrogens, the program has become much more general. This effort has resulted in a production version of the program which is designed to allow the chemist to apply the program to the analysis of the high resolution mass spectrum of any molecule with a minimum of effort. Given the spectrum of a known or unknown compound, the chemist can supply the following kinds of information to guide analysis of the mass spectrum:

a) Specification of the basic structure (superatom) common to the class of molecules.
b) Specification of the fragmentation rules to be applied to the superatom, in the form of bond cleavages, hydrogen transfers, and charge placement.
c) Special rules on the relative importance of the various fragments resulting from the above fragmentations.
d) Threshold settings to prevent consideration of low intensity ions.
e) Available metastable ion data and the way these data are subsequently used -- to establish definitive relationships between fragment ions and their respective molecular ions.
f) Available low ionizing voltage data -- to aid the search for molecular ions.
g) Results of deuterium exchange of labile hydrogens -- to specify the number of, e.g., -OH groups.

We have been very successful in testing the generality of the program, with particular emphasis on other classes of biologically important molecules. We have used the program in analysis of high resolution mass spectra of progesterone and some methylated analogs, a small number of androstane/testosterone related compounds, steroidal sapogenins, and n-butyl-trifluoroacetyl derivatives of amino acids.

The Heuristic DENDRAL performance program described above is an automated hypothesis formation program which models "routine", day-to-day work in science. In particular, it models the inferential procedures of scientists identifying components, such as those found in human body fluids. The power of this program clearly lies in its knowledge about the various classes of compounds normally found in body fluids; it is this knowledge that allows the compounds to be identified.

The Meta-DENDRAL program described in this part is a critical adjunct to the performance program because it is designed to supply the knowledge which the performance program uses. Theory formation is essential in order to carry out the routine analyses, either by hand or by computer. However, the staggering amount of effort required to build a working theory (even for a single class of compounds) holds back the routine analyses. The goal of the Meta-DENDRAL program is to form working theories automatically (from collections of experimental data) and thus reduce the human effort required at this stage. By shortening the time between collecting data for a class of compounds and understanding the rules underlying the data, the Meta-DENDRAL program will speed the development of diagnostic procedures.
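As a rough illustration of the generator ideas described earlier in this section (exhaustive enumeration, early pruning of branches that cannot succeed, a heuristic plausibility filter, and a memory of solved sub-problems), the following Python sketch may be helpful. It is not DENDRAL code and makes strong simplifying assumptions: it enumerates elemental compositions for a nominal (integer) mass rather than molecular structures, it considers only C, N, O, and H, and its only plausibility heuristic is the usual hydrogen saturation limit. All names in it are hypothetical.

```python
from functools import lru_cache

# Nominal (integer) atomic masses. Hydrogen is placed last so that it
# absorbs whatever mass the heavier atoms leave over.
ELEMENTS = (("C", 12), ("N", 14), ("O", 16), ("H", 1))


@lru_cache(maxsize=None)  # memory of solved sub-problems
def compositions(mass, idx=0):
    """Every tuple of atom counts for ELEMENTS[idx:] whose masses sum to mass."""
    if idx == len(ELEMENTS):
        return ((),) if mass == 0 else ()
    atom_mass = ELEMENTS[idx][1]
    found = []
    # The range bound prunes any branch that would overshoot the target mass.
    for count in range(mass // atom_mass + 1):
        for rest in compositions(mass - count * atom_mass, idx + 1):
            found.append((count,) + rest)
    return tuple(found)


def is_plausible(counts):
    """Heuristic filter: hydrogens may not exceed the saturation limit 2C + N + 2."""
    c, n, o, h = counts
    return h <= 2 * c + n + 2


def plausible_compositions(mass):
    """Exhaustive generation followed by the plausibility filter."""
    return [cand for cand in compositions(mass) if is_plausible(cand)]


def formula(counts):
    """Render counts in ELEMENTS order, e.g. (2, 0, 1, 6) -> 'C2O1H6'."""
    return "".join(f"{sym}{k}" for (sym, _), k in zip(ELEMENTS, counts) if k)
```

For example, `plausible_compositions(46)` includes `(2, 0, 1, 6)`, i.e. two carbons, one oxygen, and six hydrogens, while the chemically absurd all-hydrogen answer `(0, 0, 0, 46)` is rejected by the filter. In the real program the analogous filtering is done on molecular structures and is guided by cues taken from the spectrum itself.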
Detailed accounts of this research are available in the DENDRAL Project annual report to the National Institutes of Health, in several papers already published, and in manuscripts submitted for publication.

1. For pertinent reviews see: C. G. Hammar, B. Holmstedt, J. E. Lindgren and R. Tham, Advan. Pharmacol. Chemother., 7, 53 (1969); J. A. Vollmin and M. Muller, Enzymol. Biol. Clin., 10, 458 (1969).
2. J. R. Althans, K. Biemann, J. Biller, P. F. Donaghue, D. A. Evans, H. J. Forster, H. S. Hertz, C. E. Hignite, R. C. Murphy, G. Petrie and V. Reinhold, Experientia, 26, 714 (1970).
3. H. Fales, G. Milne and N. Law, reported in Medical World News, February 19, 1971.
4. E. Jellum, O. Stokke and L. Eldjarn, The Scandinavian Journal of Clinical and Laboratory Investigation, 27, 273 (1971).
5. A. L. Burlingame and G. A. Johanson, Anal. Chem., 44, 337R (1972).
6. H. S. Hertz, R. A. Hites and K. Biemann, Analytical Chemistry, 43, 681 (1971); S. L. Grotch, ibid., 43, 1362 (1971).
7. E. A. Feigenbaum, B. G. Buchanan, and J. Lederberg, "On Generality and Problem Solving: A Case Study Using the DENDRAL Program". In Machine Intelligence 6 (B. Meltzer and D. Michie, eds.), Edinburgh University Press (1971). (Also Stanford Artificial Intelligence Project Memo No. 131.)
8. A. Buchs, A. B. Delfino, C. Djerassi, A. M. Duffield, B. G. Buchanan, E. A. Feigenbaum, J. Lederberg, G. Schroll, and G. L. Sutherland, "The Application of Artificial Intelligence in the Interpretation of Low-Resolution Mass Spectra", Advances in Mass Spectrometry, 5, 314.
9. B. G. Buchanan and J. Lederberg, "The Heuristic DENDRAL Program for Explaining Empirical Data". In Proceedings of the IFIP Congress 71, Ljubljana, Yugoslavia (1971). (Also Stanford Artificial Intelligence Project Memo No. 141.)
10. B. G. Buchanan, E. A. Feigenbaum, and J. Lederberg, "A Heuristic Programming Study of Theory Formation in Science."
In Proceedings of the Second International Joint Conference on Artificial Intelligence, Imperial College, London (September, 1971). (Also Stanford Artificial Intelligence Project Memo No. 145.)
11. B. G. Buchanan, A. M. Duffield, and A. V. Robertson, "An Application of Artificial Intelligence to the Interpretation of Mass Spectra". In Mass Spectrometry: Techniques and Applications (G. W. A. Milne, ed.), John Wiley & Sons, Inc. (1971), p. 121-77.
12. D. H. Smith, B. G. Buchanan, R. S. Engelmore, A. M. Duffield, A. Yeo, E. A. Feigenbaum, J. Lederberg, and C. Djerassi, "Applications of Artificial Intelligence for Chemical Inference VIII. An Approach to the Computer Interpretation of the High Resolution Mass Spectra of Complex Molecules. Structure Elucidation of Estrogenic Steroids", Journal of the American Chemical Society, 94, 5962-5971 (1972).
13. B. G. Buchanan, E. A. Feigenbaum, and N. S. Sridharan, "Heuristic Theory Formation: Data Interpretation and Rule Formation". In Machine Intelligence 7, Edinburgh University Press (1972).
14. H. Brown, L. Masinter, and L. Hjelmeland, "Constructive Graph Labeling Using Double Cosets". Discrete Mathematics (in press). (Also Computer Science Memo 318, 1972.)
15. B. G. Buchanan, Review of Hubert Dreyfus' "What Computers Can't Do: A Critique of Artificial Reason", Computing Reviews (January, 1973). (Also Stanford Artificial Intelligence Project Memo No. 181.)
16. D. H. Smith, B. G. Buchanan, R. S. Engelmore, H. Adlercreutz and C. Djerassi, "Applications of Artificial Intelligence for Chemical Inference IX. Analysis of Mixtures Without Prior Separation as Illustrated for Estrogens". Submitted to the Journal of the American Chemical Society.
17. D. H. Smith, B. G. Buchanan, W. C. White, E. A. Feigenbaum, C. Djerassi and J. Lederberg, "Applications of Artificial Intelligence for Chemical Inference X. Intsum. A Data Interpretation Program as Applied to the Collected Mass Spectra of Estrogenic Steroids".
To be submitted.

The preceding comments on DENDRAL involve Parts A and C as described in the table below. The balance of this section deals with Part B, instrumentation aspects.

Part A: Applications of Artificial Intelligence to Mass Spectrometry.
Part B(i): Mass Spectrometer Data System Development.
Part B(ii): Analysis of the Chemical Constituents of Body Fluids.
Part C: Extending the Theory of Mass Spectrometry by Computer.

ACME computer support for DENDRAL Part B has been treated as ACME core research activity during FY73. Excerpts from DENDRAL's annual report follow, detailing recent accomplishments.

The large volume of data which must be reduced and interpreted from each GC/MS analysis of a body fluid sample, together with the increasing number of samples which must be processed to be responsive to clinical needs, points to the need for more highly automated and reliable GC/MS systems. This portion of the proposal addresses the problems of developing and applying such automated systems from several points of view. First, we propose to investigate the integration of sophisticated computer analysis programs into data reduction, data interpretation, and instrument management functions in order to progressively relieve the chemist of performing these tasks manually. Second, we will maintain the daily operation of our GC/MS systems for the on-going investigation of clinical applications and the acquisition of data necessary for the development of automated interpretation programs.
Our overall objectives for automating GC/MS systems comprise a number of specific subgoals, including:

a) implementing highly automated and reliable systems for the acquisition and reduction of low resolution, high resolution, and metastable mass spectral data;
b) implementing a data system to support combined gas chromatography/high resolution mass spectrometry;
c) automating the location and identification of constituents of body fluid extracts from gas chromatogram and mass spectrum information for the routine application of these techniques to clinical problems; and
d) investigating the intelligent closed loop control of mass spectrometer systems in order to optimize the data acquired relative to the task of data interpretation.

A. Mass Spectrometer Data System Automation

Concentrating initially on the MAT-711 spectrometer, we have made significant progress toward a reliable, automated data acquisition and reduction system for scanned low and high resolution spectra. This system is largely failsafe and requires no operator support or intervention in the calculation procedures. Output and warnings to the operator are provided on a CRT adjacent to the mass spectrometer. The system contains many interactive features which permit the operator to examine selected features of the data at his leisure. The feedback currently provided to the operator to assist in instrument set-up and operation can just as well be routed to hardware control elements for these functions, thereby allowing computer maintenance of optimum instrument performance. Progress in this area combines our efforts in hardware and software improvements:

HARDWARE - The basic system consists of the mass spectrometer interfaced to a PDP-11/20 computer for data acquisition, pre-filtering, and time buffering into the ACME time-shared 360/50.
The more complex aspects of data reduction are done in the 360/50 since the PDP-11 has limited memory and arithmetic capabilities. New interfaces for mass spectrometer operation and control have been developed. The interfaces can handle (through an analog multiplexer) several analog inputs and outputs, which requires that the PDP-11 computer be relatively near the mass spectrometer. We now have the capability for the following kinds of operation through the new interfaces:

i) Computer selection of digitization rate.
ii) Computer selection of data path (interrupt mode or direct memory access (DMA)).
iii) Direct memory access for faster operation in the data acquisition mode.
iv) Computer selection of analog input and output channels.
v) Sensing of several analog channels through a multiplexer (e.g., ion signal, total ion current).
vi) Magnet scan control. This control can be exercised manually or set by the computer. It controls both time of scan and flyback time. Coupled with selection of scan rate, any desired mass range can be scanned at any desired scan rate.
vii) The computer can monitor the mass spectrometer's mass marker output as additional information which will be used to effect calibration.

SOFTWARE - Automatic instrument calibration and data reduction programs have been developed to a high degree of sophistication. We can now accurately model the behavior of the MAT-711 mass spectrometer over a variety of scan rates and resolving powers. The spectrometer operator depends on our instrument diagnostic routines to indicate successful operation or to help point to instrument malfunctions or set-up errors. Some features of these programs are described below.

i) Data Acquisition. Programs have been written which permit acquisition of peak profile data at high data rates using the PDP-11 as an intermediate data filter and buffer store between the mass spectrometer and ACME.
This allows data acquisition to proceed even under the time constraints of the time-sharing system. Storage of peak profiles rather than all data collected has greatly reduced the storage requirements of the program and saves time, as the background data (below threshold) are removed in realtime. An automatic thresholding program is in operation which statistically evaluates background noise and thresholds subsequent data accordingly. Amplifier drift can thus be compensated. We have developed some theoretical models of the data acquisition process which suggest that high data acquisition rates are not necessary to maintain the integrity of the data. Demonstration of this fact with actual data has helped relieve the burden of high data rates on the computer system, particularly as imposed by GC/MS operation, and permits more data reduction to be accomplished in realtime or, alternatively, reduces the required data acquisition computer capacity.

ii) Instrument Evaluation. A high resolution mass spectrometer operating in a dynamic scanning mode is a complex instrument, and many things can go wrong which are difficult for the operator to detect in realtime. In order for the computer to assist in maintaining data quality, it must have a model of spectrometer operation on the basis of which data quality can be assessed, processing suitably adapted, and instrument performance optimized. We have developed a program which monitors the state of the mass spectrometer.

iii) Data Reduction.
[Figure: Sample element definition, TOD databank form ("Standard TOD Databank Element Definition (Time Oriented Data)"). Legible column headings include Long Name, Short Name, and Units; the remaining form fields are not recoverable.]
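The automatic thresholding and peak-profile storage described under Data Acquisition can be illustrated with a minimal Python sketch. The report does not say which statistic the thresholding program uses, so the rule below (mean of a background stretch plus k standard deviations, with k = 3) is an assumption, as are the function names and the sample values.

```python
from statistics import mean, stdev


def noise_threshold(background, k=3.0):
    """Threshold estimated from a stretch of signal known to contain no peaks.

    ASSUMPTION: mean + k * standard deviation; the original system's
    statistic is not given in the report.
    """
    return mean(background) + k * stdev(background)


def extract_peak_profiles(samples, threshold):
    """Keep only contiguous runs of samples above the threshold.

    Returns a list of (start_index, values) pairs, one per peak profile,
    so that below-threshold background never has to be stored.
    """
    profiles = []
    current = None
    for i, value in enumerate(samples):
        if value > threshold:
            if current is None:
                current = (i, [])
            current[1].append(value)
        elif current is not None:
            profiles.append(current)
            current = None
    if current is not None:  # a peak that runs to the end of the scan
        profiles.append(current)
    return profiles


background = [2, 3, 2, 4, 3, 2, 3, 3]          # hypothetical digitizer counts
scan = [3, 2, 9, 30, 12, 3, 2, 3, 15, 40, 18, 4, 2]
threshold = noise_threshold(background)
peaks = extract_peak_profiles(scan, threshold)
```

With these made-up values the threshold falls between 4 and 5 counts, and `peaks` is `[(2, [9, 30, 12]), (8, [15, 40, 18])]`: two profiles are kept and the seven background samples are discarded. Re-estimating the threshold from fresh background at intervals is one way the compensation for amplifier drift mentioned above could be obtained.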