Nanes CoA. Stearns G. Eigendorf T.F. Thomas D. L. Winter D. L. Fishel G.W. Wilcox S. Hammerunm AeM. Hogg S.A. Wikstrom D. J. Harvey D.E. Games M.A. Grayson C.H. Williams Jr. vw J. Lehman TT. SaKat Address NASA-106-1 21000 Brookpark Rd. D. of Chen U of British Columbia D. of Chem. U. of Mo, KC Rm. 228 R + D Continental Oil Co. D. of Chem. | Kent State Univ. 1600 W. Smile Rd. Ferndale, MI “DP. of General and Org. U. of Copenhagen DK-2100 D. of Chen. University of Alberta Med. Univ. of S.C. Pharmacology Pharmacology Dept. South Parks Road Dept. of Chem. Univ. Dept 221 Bldg. 33 McDonnell Douglas Research Labs UNICAMP/QUIMICA 13.100 Campinas Sp. 208 Progress Ave. Basic Research LAB- Tomay [pousTeies Inc. (ll Tesrreo College P.O. CF1 1XL United Kingdon Cleveland, Ohio Vancouver, 8 B.C., Canada Kansas City, Mo Ponco City, OK Kent, Ohio Copenhagen, Denmark Edmonton Alberta, Canada Charleston, S.C. Oxford, OX 13dT Box 78 Cardiff St. Louis, Mo. Brazil Hamilton, Ohio KAMA KURA 248 JAPAN 44135 64110 74601 44242 48220 T6G2ZG2 29401 63166 45013 te ad Appendix 4 Letter Sent to Mass Spectroscopists Responding to Questionnaire DENDRAL. Program Availability on SUMEX The Stanford University Medical Experimenta! computer facility (SUMEX) has been established at Stanford with the support of the Biotechnology Resources Branch, National Institutes of Health. Its primary mission is resource sharing, where the resources in this case are complex computer programs applied to health- research problems and the computing facility on which these programs can be used via a nationwide computer network. SUMEX is actively encouraging the development of a collaborative community of users of this facility. The DENDRAL Project at Stanford is one of the initial collaborative projects on SUMEX. This project is making its programs available to the outside community within the limits of available resources. Because resources are limited, the future may bring a more restrictive policy of access than that presently in force. Until that time, however, the SUMEX policy is to encourage as many qualified and interested persons as possible to access the program on a trial basis, to determine the potential applicability of such programs to ongoing re- search. The major programs available now are outlined below: 1!) PLANNER--Infers possible structures of unknown compounds (singly or as mixtures) given a mass spectrum and fragmentation rules of the class of compounds to which the unknown(s) presumably belongs. (See Smith et al., J. Amer. Chem. Soc., 94, 5962 (1972); ibid., 95, 6078 (1973)). 2) INTSUM--Given a set of known, related structures and the mass spectrum corresponding to each structure, INTSUM suggests possible fragmentation processes which resulted in the observed ions, and then summarizes the results in terms of processes which are general to the class of structures, and those which are specific to certain members of the class. (See Smith et al., tetrahedron, 29, 3117 ($973)). 3) CONGEN--CONGEN (constrained structure generation) accepts as input known structural features of an unknown molecule (whose elemental composition is known) and produces all structural isomers consistent with these data. The features and constraints are entered in an interactive session with the program and results can be drawn at a terminal or further constraints added based on examination of new data. CONGEN represents our initfal version of a program for computer-assisted structure elucidation. The structure generator which underlies CONGEN has been described (See Masinter et al., J. Amer. Chem. Soc., 96, 7/02 (1973) and ibid., 7714 (1974)). 4) MOLECULAR 10N DETERMINATION--Given a (low or high resolution) mass spectrum in which the molecular ion may or may not be present, this program suggests a ranked list of candidate molecular ions. (See G. Dromey, B. G. Buchanan, D. H. Smith, J. Lederberg and C. Djerassi, J. Org. Chem., in press (March 1975)). For additional information on these programs and access to SUMEX, write to Professor Joshua Lederberg, SUMEX Project, Department of Genetics, Stanford University, Stanford, California 94305, or Professor Carl Djerassi or Dr. Dennis H. Smith, Department of Chemistry, Stanford University, Stanford, California 94305. It will be helpful to indicate the basis of your interest and intended applications although it is understood that trial use is a prerequisite to a considered answer to such a question. Appendix 95> Draft of Manuscript for American Chemical Society Symposium on Computer Networking in Chemistry NETWORKING AND A COLLABORATIVE RESEARCH COMMUNITY: A CASE STUDY USING THE DENDRAL PROGRAMS. Raymond &. Carhart*, Suzanne M. Johnson, Dennis H. Smith, Bruce G. Buchanan, R. Geoffrey Dromey, and Joshua Lederberg. pepartments of *Computer Science, Genetics, and Chemistry, Stantord University, Stanford, California, 94305. Computer Science is one of the newest, but also one of the least "cumulative" of the sciences. Gordon (1) has recently pointed out the uosetting disparity between the number of potentially sharable programs in existence and the number which are easily accessible to a given researcher. Although some mechanisms exist for the systematic exchange of program resources, for example the World List of Crystallographic Computer Programs (2), a great deal of programming effort is duplicated among different research groups with common interests. The reasons for this are understandable: these groups are separated by geography, by incompatibilities in computer facilitie and by a lack of a means to keep abreast of a a rapidly chanaing field . ‘he emergence of more economical technoloagies for data cowrtunications vrovices, in crincinle, a method for lowering toes seodrentical and operational barriers; for creating, through computer networking, remote sites at which functionally specialized cavabilities are concentrated. The SUMEX-AIM (Stanford University Medical EXperimental computer - Artificial Intelligence in Medicine) project is an experiment in reducing this principle to practice, in the soecific area of artificial intelligence research applied to nealth sciences. The SUMEX-AIN computer facility (3) is a National Shared Computing kesource being developed and operated by Stanford University, in partnership with and with financial support from the niotechnology Resources Branch of the the Division of kesearch Resources, National Institutes of Health. It is national in scope in that a major portion of its computing capacity is being made available to authorized research groups throughout the country by means of communications networks. Aside from demonstrating, on managerial, administrative and technical levels, that such a national computing resource is a viable concept, the primary objective of SUMEX-AIM is the building of a collaborative research community. The aim is to encourage individual participants not only to investigate applications of artificial intelligence in health science, but also to share their programs and discuss their ideas with other researchers. This dlaces a responsibility upon SUMEX-AIM to develop effective means of communication among community members and among the programs they write. It also places resnonsibility upon those members to design and document programs that readily can be used and understood by others. Another asnect of the SUMEX facility is providing service to individuals whose interest is in using, rather than developing, the available computer programs. Although this is not a primary consideration, it is an essential part of the growth of these programs. Most of the SUMEX-AIM projects have formed, or are forming, their own user communities which provide valuable “real world" experience. Figure 1 depicts the typical interaction of such a project with its user community, and with other projects, ‘he participation by users in program development is not just restricted to suggestions, but can also include software created by comnuter-oriented users to satisfy special needs. In some projects, methods are being considered to further promote this kind of participation. The purposes of this paper are threefold: first, to indicate the range of research projects currently active at SUMEX; second, to describe in detail one of these projects, DENDRAL, which is of particular interest to chemists; and third, to discuss some problems and possible solutions related to networking and community-building. I. Research Activities at SUMEX-AIM The community of participants in SUMEX-AIM can be divided geographically into local (i. e., Stanford-based) projects and renote projects, and helow is given a brief description of the major representatives of each. Communication with the remote projects is accomplished through one or both of the communications networks shown in Figure 2. In most cases, connection with SUMEX-AIM from these remote sites involves only a local telephone call to the nearest network “node”. The SUMEX-AIM system is itself undergoing constant improvement which deserves to be called research, and thus a third section is included nere to represent system developments. kemote projects the kutaers project. Originating from Rutgers University are several research efforts designed to introduce advanced methods in computer science - particularly in artificial intelligence and interactive data base systems - into specific areas of biomedical research. One such effort involves the development of comouter-based consultation systems for diseases of the eye, specifically the establishment of a national network of collaborators for diagnosis and recommendations for treatment of glaucoma by computer. Another project concerns the BELIEVER program, which represents a theory of now people arrive at an interpretation of the social actions of others. SUMEX-AIM provides an excellent medium for collaboration in the development and testing of this theory. The Rutgers project includes, in addition, several fundamental studies in artificial intelligence and system design, which provide much of the support needed tor the development of such complex systems. The DIALOG project. The DIAgnostic LOGic project, based at the University of Pittsburgh, is a large scale, computerized medical diagnostic system that makes use of the methods and structures of artificial intelligence. Unlike most other computer diagnostic orograms, which are oriented to differential diagnosis in a rather limited area, the NIALOG system has been designed to deal with the qeneral problem of diagnosis in internal medicine and currently accesses a medical data base which encompasses approximately fifty erent of the major diseases in internal medicine. the MISL Project. The Medical Information Systems Laboratory at the University of Illinois (Chicago Circle campus) has been established to explore inferential relationships between analytic data and the natural history of selected eye diseases, both in treated and untreated forms. This project will utilize the SUMEX-AIM™ resource to build a data base which could then be used as a test bed for the development of clinical decision support algorithms. Distributed Data-Base System for Chronic Diseases. this project, based at the University of Hawaii, seeks to use the SUMEX~AIM facility to establish a resource sharing project for the development of computer svstems for consultation and research, and to make these systems available to clinical facilities from a set of distributed data bases. The radio and satellite links which comrose the communication network known as the ALOHANET, in conjunction with the APPANE'T, will make these programs available to other Hawaiian islands and to remote areas of the Pacific basin. This project coulc well have a significantly beneficial effect on the quality of health care delivery in these locations. Modelling of Higher Mental Functions. A project at tne University of California at Los angeles is using the SUMEX-AI# facility to construct, test, and validate an improved version of the computer simulation of paranoid processes which has heen developed. ‘hese simulations nave clinical implications for the understanding, treatment, and orevention of paranoid disorders. the current interactive version (PAFPY) has been running on SUMEX-AIM and nas provided a basis for improvement of the future version’s language recognition capability. Local projects The Protein Crystallography Project. fhe Protein Crystallography project involves scientists at two different universities (Stanford and the University of California at San Diego), pooling their respective talents in protein crystallogranrnyv and computer science, and using the SUMEX-AIM facility as the central repository for vorograms, data and other information of common interest. The general objective of the project is to aroly problem solving techniques, which have emerged from artificial intelligence research, to the well known "phase problem" of x-ray crystalloqrachy, in order to determine the three-dimensional structures of proteins. phe work is intended to be of practical as well as theoretical value to both computer science (particularly artificial intelligence research) and protein crystallography. The MYCIN project. MYCIN is an evolving computer rrodqram that has been developed to assist physician nonspecialists with the selection of therapy for patients with bacterial infections. fhe project has involved both physicians, with exoertise in the clinical pharmacology of bacterial infections, and computer scientists, with interests in artificial intelligence and medical computin:. the “YCIn orogram attempts to model the decision processes of the medical exverts. It consists of three closely integrated components: the consultation System asks auestions, makes conclusions, and gives advice; the Explanation System answers questions from the user to justity the program’s advice and explain its methods; and the Rule-Acauisition System permits the user to teach the system new decision rules, or to alter pre-existing rules that are judyed to be inadeavate or incorrect. The DENDRAL project. This project, being of narticular chemical interest, is described in detail in Section II. Through the SUNEX-AIM facility DENDRAL has gained a growing community ot production-level users whose experience with the programs is a valuable quide to further development. Although technically users, some members of this community might better be described as collaborators because they have provided SUMEX-AIM with various special-purpose programs which are of interest to other chemicts and which extend the usefulness of the DENDRAL programs. SUMEX-AIM System Development Current research activities at SUMEX-AIM are develonringa along several lines. On a system development level there are ongoing projects designed to make the system more user oriented. Currently, the system can be expected to provide help to the user who is confused about what is expected in response to a certain prompt. A "2" typed by the user, will, in most cases, provide a list of possible responses from which to choose. Also available in response to typing "HELP" to the monitor is a general help description containing voointers to files likely to be of interest to a new user. In an effort to facilitate communication between collaborators, a program called CONFER has been developed to provide an orderly method for multiple participant teletype "conference calls". gasically, the orogram acts as a character processor for ali tne terminals linked in the conference, accepting input from only one at a time, and passing it out to the remaining terminals. In this wav, the conference, in effect, has a "moderator" terminal, thus allowing for a more orderly transfer of ideas and information. SUMEX-AIM is also aware of the necessity of making its facilities available for trial use by potential users and collaborators. 10 this end, a GUEST mechanism has been established for persons who wish to have brief, trial access to certain programs they feel may be ot value to them, and about which they would like to obtain more knowledge. ‘this provides a convenient mechanism whereby persons, wlio have been qiven an appropriate phone number and LOGIN procedure, can dial up SUMEX-AIM and receive actual experience usin: a program they may only have heard about. Another area of system development currently being explore? at SUMEX-AIM is that of creating a comprehensive "bulletin board" facility where users can file "bulletins", that is, messages of interest to the SUNEX-AI® community. The facility will also alert wovrs to new bulletins wnich are likely to be of interest to re, ceterzined by individual user-interest profile. Il. UEit:nAL - Chemical Applications of Interactive Comeutin’ in a Network Environment ihe major research interest of the DEND&AL Project at stanford University is application of artificial intelligence technicucs for chemical inference, focusing in particular on molecular structure elucidation. vortions of our research are in the area of combined qas chromatography/high resolution mass spectrometry and include instrumentation and data acauisition hardware and software development. This area is beyond the scope of this report; we focus instead on the concurrent development of programs to assist chemists in various phases of structure elucidation beyond the point of initial data collection. SUMEX-AIM provides the comruter supnort for development and application of these programs. Another aspect of our researcn is our commitment to snare developments among a wider community. We feel that several of our programs are advanced enough to be useful to chemists engaged in related work in mass spectrometry and structure elucidation in general. These programs are written primarily in the programming jangquage INTERLISP, and thus are not easily exportable (excertions are indicated subsequently). SUMEX-AIM provides a mechanism for allowing others access to the programs without the requirement for any special vrogramming or computer system expertise. The availability of the SUMEX facility over nationwide networks allows remote users to access the programs, in many instances via a local telephone call. much of the following discussion is preliminary Decause our projrams have only recently been released for outside use. Some announcement of their availability has been made, and other announcements will occur in the near future, through talks, publications in press, demonstrations and informal discussions. Althougn most of our experience has been with local users, they nave been good models of remote users in that their wcrevious exposure to the actual programs and computer systems is minimal. Their experience has been extremely useful in helping us to smooth out clumsy interactions with programs and to locate and fix program buaqs. such polishing is important for programs which may be utilized hy users from widely differing backgrounds with respect to computers, networks and time sharing systems. We are in the processes of building a community of remote users. We actively encourage such use for two reasons: 1) we feel the programs are capable of assisting others in solving certain molecular structure problems, and 2) such experience with outside users will be a tremendous assistance in increasing the power of our programs as the programs are forced to confront new real-world problems. he remainder of this section outlines the programs which are available via SUMEX, the utilization of these programs in helping to solve structure elucidation problems and the limitations we see to their use. we discuss current applications of the programs to our research and the research of other users to illustrate better the variety of potential applications and to stimulate an interchanye of ideas. where appronriate, we point out current difficulties witn tne use both of our programs and of SUMEX. New applications and wider use will certainly change the nature of these problems; we strive to solve current oroblems, but new ones will always arise to take their place. DENDRAL Programs we have several programs which we employ in dealing with various aspects of problems involving unknown structures. Some of these programs are exportable, while the remainder are available at SUMEX. The availability of each program is discussed below. Our initial emphasis in studying applications of artificiel intelligence for chemical inference was in the area of mass spectrometry(4-6). This emphasis remains because many of our problems require mass spectrometry as the analytical tool of choice in providing structural information on small quantities of sample. sore recently, we have been developing a program (CONGEN, below) directed at more general aspects of structure elucidation. ‘This has extended the scope of problems for which we can provide computer assistence. we will begin, however, with discussion of the mass spectrometry orograms. The examples used in the discussion are characteristic of our current research problems, although we have focused on relatively simple problems to keep the presentation brief. we trace, in what might be chronological terms, the application of the programs to various phases of a structure problem. In this way we hove to illustrate the place of eacn program in the analysis. te begin bv discussing preprocessing of mass spectral data (CLEANUP and MOLION). Subsequent analysis of such data in terms of structure is then covered (PLANNER). ‘The use of CONGEN is discussed for problems which cannot be handled py the previous programs. Finally, we discuss efforts to discover, with the use of the computer, systematics in tne behavior of known substances in the mass spectrometer as a means of extending the knowledge of the system for applications in new areas (INTSUM and RULEGEN). Applications to Molecular Structure Problems ihe first three programs, CLEANUP, the library-search program and MOLION are in a sense utility programs, but all three play a critical role in processing mass spectral data. Subsequent applications of programs (e.g., PLANNER) for more detailed spectral analysis in terms of structure depend on the successful treatment of the data by CLEANUP and MOLION, while the library search program filters out common spectra which need not undergo a full analysis. the examples used are drawn from our collaboration with persons in the Genetics Research Center at Stanford Hospital. The exoerimental data which are collected are the results of combined gas chromatographic/low resolution mass spectral (GC/LRMS) analysis of various fractions (chemically fractionated and derivatized where necessary) of body fluids, e.g. blood, urine. A typical exneriment consists of 58-68 individual mass spectra for each fraction, taken sequentially over time as the various components, largely separated from one another, elute from the gas chromatograph and pass into the maSs spectrometer. Each mass spectrum consists of the mass analyzed fragment ions of the component(s) in the mass spectrometer at the ‘time the spectrum was taken. Such spectra are related, indirectly, to the molecular Structure of the component(s). CLEANUP(7). The individual mass spectra obtained from fractionated GC/LRMS analysis are auite often poor representations of corresvconding ssectra taken from pure compounds. They can be contaminated by the presence of additional peaks and/or distortions to the intensities of existing peaks in the spectrum. Fragment ions from either the lioguid phase of the GC column or from components incompletely separated by the gas chromatograph are are responsible for the contamination. We have developed a program, referred to here as CLEANUP, which examines all mass spectra in a GC/LRMS run, selects those spectra which contain ions other than background impurities, and remove contributions from background and overlapping components. A spectrum results which compares favorably with the spectrum of a pure component. Biller and Biemann (8) have developed a similar but less powerful program. For example, the CLEANUP program detected components at points marked with a vertical bar in the plot of total ion current vs. scan number (time), Figure 3. Note that overlapping components were detected under the envelopes of the GC peaks in the region of scans 485-488, 525-529 and 539-552. We focus our attention on the spectrum recorded at scan 492. The raw data, prior to cleanup, are vresented in Figure 4 (top). The spectrum resulting from CLEANUP is presented in Figure 4 (bottom). Note that the large ions (e.g., m/e 207, 221 and 315) from background impurities are removed, and that the intensity ratios of peaks at lower masses (e.g., 51 and 77) have been adjusted to reflect their true intensities in the spectrum. The CLEANUP program is capable of detection of quite low-level components in complex mixtures as indicated by some of the areas of the total ion current plot (Figure 3) where components were detected. It is completely general because nothing in the program code is sensitive to the types of compounds analyzed or the characteristics of possible impurities associated with the compounds or from the GC column. Its major limitation is that mass spectra must be taken repetitively during the course of a GC/MS run. Its performance is enhanced when such spectra are measured closely in time. The program is offered via SUMEX as an adjunct to use of our other programs; it is not offered as a routine service. Because the program is written in FORTRAN, we routinely use it on our data acquisition computer system so as not to burden SUMEX with tasks better done elsewhere. Similarly, we would assist other frecuent users to mount the program on their own systems, Library Search. With a set of "clean" mass spectra available, the next problem is identification of the various components. Over the course of several years, libraries of mass spectral data have been assembled(9). These libraries can be very useful in weeding out from a group of spectra those which represent known compounds(lt)). Clearly, one should spend time on solving the structures of unknown compounds, not on rediscovering old ones. The CLEANUP program provides mass spectra which are of sufficient quality to expect that known compounds would be identified easily from such libraries. Insert A - Dennis” sep. page. This brief example illustrates the obvious value and limitations of library searching. The most interesting compounds for subsequent analysis are those which are unknown. The fractions of urine extracts are replete with unidentified compounds because of the inadequacy of current library compilations. As new compounds are identified they are, of course, added to the library so that future analyses need not reinvestigate the same material. we currently perform library searching on our data acauisition and reduction computer systems. we can, if necessary, offer limited library search facilities via SUMEX. However, because commercial facilities are available (e.g., over the GE network), routine library search service is not available on SUMEX. MOLION(11). At this stage we are left with a collection of mass spectra of unknown compounds. The library search results may have provided some clues as to the type of compound present, e.g., compound class. Structure elucidation now begins in earnest. The key elements in voroblems of structure elucidation are the molecular weight and empirical formula of a compound. Without these essential data, the structural possibilities are usually too immense to proceed further. Mass spectrometry is frequently used to determine molecular weights and formulae, but there is no guarantee that the mass spectrum of a compound displays an ion corresponding to the intact molecule. For example, many of the derivatives of the amino acid fractions of urine display no molecular ions. When we are given only the mass spectrum (and for GC/MS analysis a mass spectrum may be all that is available) we must somehow predict likely molecular ion candidates. The program MOLION performs this task. Given a mass spectrum, it predicts and ranks likely molecular ion candidates independent of the presence or absence of an ion in the spectrum corresponding to the intact molecule. The published manuscript(1ll) provides many examples of the performance of the program. The mass spectrum of an example, unknown X, (which we will pursue in more detail below) is given in Figure 5. The results obtained from MOLION are summarized in Table I. The observed ion at m/e 263 is ranked as the most likely candidate. Table I. Results of Molecular Ion Determination for the Unknown Compound, X, whose Mass Spectrum is Presented in Figure 5. CANDIDATE RANKING INDEX 263.8 196 397.0 41 299.0 38 295.8 34 281.8 25 The MOLION program is written to operate on either low or high resolution mass spectra. The program has certain limitations which have been summarized in detail previously(1l). MOLION is available on SUMEX. A FORTRAN version, initially for low resolution mass spectra, is being written so that the program can be run on smaller computers and exported to others. However, it will continue to be available via SUMEX so that others can access it easily. MOLION is contained within PLANNER as one of the available methods for detecting candidate molecular ions. PLANNER(12). The PLANNER program is designed to analyze the mass spectrum Of a compound or of a mixture of related compounds, “Because there is no ab initio way of relating a mass spectrum of a complex organic molecule to the structure of that molecule, PLANNER requires fragmentation rules for the class of compounds to which the unknown belongs. This is its major limitation. For our example the class was unknown, forcing us to resort to other means of assistance. Applications and limitations of PLANNER have been discussed extensively(12,13). The program is very powerful in instances where mass spectrometry rules are strong (i.e., general, with few exceptions). In instances where rules are weak or nonexistent, additional work on known structures and spectra may yield useful rules to make PLANNER applicable (see INTSUM and RULEGEN, below). One useful feature of PLANNER is its ability to analyze the svectra of mixtures in a systematic and thorough way. Thus, it can be applied to spectra obtained as mixtures when GC/MS data are unavailable or impossible to obtain. PLANNER is available in an interactive version over SUMEX, requiring three kinds of information as input: the high or low resolution mass spectrum, the characteristic skeletal structure for molecules in the specific compound class, and the fragmentation rules for the class. Additional knowledge about the unknown can be used by the program to constrain the structural possibilites. PREDICTOR. The purpose of the BDENDRAL predictor is to make a testable prediction for each candidate structure and suggest crucial data points that would allow a chemist to distinguish among the candidate structures. The program is a simulation of the mass spectrometer that predicts both mass spectral peaks and metastable peaks for each candidate. A rudimentary additional program looks for mass spectral ions and metastable ions that are unique for each candidate, thereby providing means of disconfirming or confirming individual candidates, CONGEN(14,15). Structure problems are usually not solved with mass spectrometry alone. Even when sample size is too limited for obtaining other spectroscopic data, knowledge of chemical isolation and results of derivatization procedures freauently acts as powerful constraints on structural possibilities. Larger amounts of sample permit determination of other spectroscopic data. Taken together, this information allows determination of structural features (substructures) of the molecule and constraints on the plausibility of ways in which the substructures may be assembled. The CONGEN program is capable of providing assistance in solution of such problems. CONGEN performs the task of construction, or generation, of structural isomers under constraints. The program accepts as input known structural fragments of the molecule ("superatoms") and any remaining atoms (C,N,O,P,...), together with constraints on how they may be assembled. It is based on the exhaustive structure generator(16,17) and extensions (18) which permit a stepwise assembly of structures. In an interactive session with the program, a user supplies structural information determined by his own analysis of the data (perhaps with the help of the above programs), together with whatever other constraints are available concerning desired and undesired structural features, ring sizes and so forth. The program builds structures in a series of steps, during which a user can interact further with the procedure, for example, to add new constraints. Although very much a developing program, its ability to accept user-inferred constraints from many data sources makes CONGEN a general tool for structure elucidation which we are making available in its current form. For the unknown X, the observed fragment ions from the molecular ion (M) at m/e 263 (Figure 5) suggest several structural features when coupled with the knowledge of the chemical derivatization procedures used on this fraction of the urine extract. The ion at m/e 194 represents loss of 69 amu, probably CF3, from fragmentation of a trifluoroacetyl derivative of an amine. This suggests the partial structure 2, Figure 5. The ions at m/e 19 (M-74 amu) and m/e 162 (+181 amu) suggest the characteristic fragmentation of an n-butyl ester resulting from the second derivatization procedure, formation of the n-butyl esters of free carboxylic acid functions. This suggests the partial structure 1, Figure 5. Taken together, all the above information implies (if no other elements are present) that the empirical formula contains an odd number of nitrogen atoms, at least three oxygen atoms, three fluorine atoms and at least seven carbon atoms. Interestingly, there is only one plausible empirical formula under these constraints, C11H12NO3F3. Structural fragments ("Superatoms") 1] and 2 were supplied to CONGEN, together with the remaining four carbon atoms and three degrees of unsaturation (that is, rings plus multiple bonds). with no additional constraints, 155 structures result. The inclusion of other plausible constraints (e.g., no allenes, acetylenes, cyclopropenes, cyclobutenes) reduces the number of structural candidates to just the two isomeric forms of 3, Figure 5. This problem represents a simple example of a large class of such problems. Although a chemist could probably reach the same conclusions quickly in this case, in the general case, piecing together potential solutions is not a trivial task. Although still a developing program, CONGEN is, capable of considerable assistance in a wide variety of structure problems. Some areas of current application are summarized in the subsequent section. It is already proving its value in structure elucidation problems by suggesting solutions with a guarantee that no plausible alternatives have been overlooked. The program has a great deal of flexibility. Many of the types of constraints normally brought to bear on structure elucidation problems can be expressed. However, some types of constraints cannot be easily expressed (e.g., disjunctions of features and stereo- constraints). Recent work by our group and Wipke’s(19) will make it possible to add considerations of stereoisomerism relatively easily (a good example of collaboration via SUMEX). We are depending on a broad user community to help us guide further development of CONGEN. Knowledge Acouisition INTSUM(28) and RULEGEN. When the mass spectrometry rules for a given class of compounds are not known, the INTSUM and RULEGEN programs can help a chemist formulate those rules. Essentially, these programs categorize the plausible fragmentations for a class of compounds sy looking at the mass spectra of several molecules in the class. All molecules are assumed to belong to one class whose skeletal structure must be specified. Also, the mass spectra and the structures of all the molecules must be given to the program. INTSUM collects evidence for all possible fragmentations (within user-specified constraints) and summarizes the results. For example, a user may be interested in all fragmentations involving one or two bonds, but not three; aromatic rings may be known to be unfragmented; and the user may be interested only in fragmentations resulting in an ion containing a heteroatom. Under these constraints, the program correlates all peaks in the mass spectra with all possible fragmentations. The summary of results shows the molecules whose spectra display evidence for each particular fragmentation,along with the total (and average) ion current associated with the fragmentation. The RULEGEN program attempts to explain the regularities found by INTSUM in terms of the underlying structural features around the bonds in question that seem to "direct" the fragmentations. For example, INTSUM will notice significant fragmentation of the two different bonds alpha to the carbonyl group in aliphatic ketones. It is left to RULEGEN to discover that these are both instances of the same fundamental alpha-cleavage process that can be predicted any time a bond is alpha to a carbonyl group. These programs are part of the so-called Meta-DENDRAL effort, whose general goal is to understand rule formation activities. Both INTSUM and RULEGEN are available as interactive programs on SUMEX, the former being much more highly developed than the latter. Although these programs can be very useful to chemists interested in finding new mass spectrometry rules, they require having the collection of mass spectra and molecular structure descriptions available in one computer file. Because of this, they have been used mostly by chemists at Stanford. Applications and Resource Sharing The DENDRAL programs are being developed to serve a broad community of chemists with structure elucidation problems. Our experience is admittedly limited. In this section we discuss some of the applications, both local and from remote sites, where these programs have proven useful. CLEANUP. This program has been developed in collaboration with the research staff of the Genetics Research Center at Stanford. Because that staff is working on problems in which few assumptions can be made about the samples, the CLEANUP program has been made very general. For example, the program works with high or low resolution spectra, and makes no assumptions about the actual GC columns used for separating components. This program will be tried next on the GC/HRMS data collected on the MAT=-711 in the mass spectrometry laboratory of the Stanford chemistry department. MOLION. The molecular ion determination program has been used in conjunction with CLEANUP on mass spectrometry problems in the Genetics kesearch Center. Because of the few assumptions that can be made about the samples in the GRC, the MOLION program has been made very general. We have incorporated this program as part of the PLANNER, as a powerful, class-independent method for inferring the composition of the molecular ion before structure analysis. whe anticipate wide use of the program when the FORTRAN version is available for export as well as stand-alone use on SUMEX. PLANNER. The planning program has been used to infer plausible placement of substituents around a skeletal structure for numerous test problems in which the class of the sample was known and the fragmentation rules for the class were known. Those tests have resulted in a program that we believe is general. We have applied this program to unknown mixtures of estrogenic steroids(13). We are preparing to use PLANNER for screening mass spectra of marine sterols to identify quickly those spectra of known compounds and to suggest structures for spectra of new compounds. CONGEN. CONGEN is being used locally and from remote sites in a wide variety of applications. We have used it for construction of ring systems under constraints(21) and for generation of structures of chlorocarbons(22). We have investigated several monoterpenoid and sesquiterpenoid structure problems to suggest solutions and to ensure that all alternatives had been considered. We are currently investigating the scope of terpenoid isomerism. Two problems relating to unknown photochemical reaction products have been analyzed and results used to suggest further experiments. In most cases we do not know the precise problems under study by remote users, only that they are using the program. CONGEN will perhaps be the most widely used (by remote users) program of those mentioned above as accessible through SUMEX. This is primarily a result of the wider scope of problems which might benefit from use of the program. However, the need for remote users to have their mass spectral data available at SUMEX for analysis present a significant energy barrier to use of the programs which require these data. INTSUM and RULEGEN. INTSUM is essentially a production program now, and is being used as such in a variety of applications involving correlations of molecular structures with their respective spectra. Recent or current applications include analysis of the mass spectra of progesterones and related steroids, androstanes, macrocyclic antibiotics, insect juvenile hormones and phytoecdycones. These studies serve to develop fragmentation rules which, if of sufficient generality, can in turn be used in PLANNER in the study of unknown compounds. III. Problems related to networking During this first year of operation, the SUMEX-AIM facility has encountered a variety of problems arising from its network availability. In most cases, there has been no clear precedent for the handling of these situations, in fact, many problem-areas still reflect the influences of a yet-developing policy. The hope is that this presentation and discussion of problems and their solutions may give foresight to others who contemplate networking or network use. The problems to be discussed can be loosely associated into three classes; those related to the management of the facility, those pertaining to research activities on the system, and those involving psychological barriers to network use. Managerial problems "Gatekeeping." The most general problem faced by the organizers of the SUMEX-AIM facility is the question of "gatekeeping." In order to insure a high quality of pertinent research, some kind of refereeing system is needed to assess the value of proposed new projects. The organizers of the facility would seem to be the best source of such judgements; yet, because we are both organizers and members of the SUMEX community, there is a danger that our decisions would unfairly favor local priorities. In order to establish credibility in SUMEX-AIM as a truly national resource, a management system has been instituted that allocates a defined fraction (initially 50%) of the SUMEX resource to external users, under the jurisdiction of an independent national committee (the AIM advisory group). The remaining 58%, allocated for local use, contains a portion for flexible experiments outside of local projects, but on our own responsibility. Choice of computer and operating system. A second management level problem is the choice of a computer and operating system which optimize the usefulness of the facility for a majority of users, and which encourage intercommunication between remote collaborators. Because SUMEX-AIM is intended to be used primarily for applications of artificial intelligence, and because interactive LISP (INTERLISP[ref]) is a vrimary language in this type of work, the choice of TENEX[ref] as an operating system was dictated somewhat by necessity. TENEX incorporates multiple address spaces, thereby allowing multiple "fork" structure and paging, a design which is necessary to create the large- memory virtual machine required by INTERLISP. The PDP-1@ is a popular machine for interactive computing of all sorts in university research environments, and thus an added benefit of this choice was expected - the possibility of easily transferring to SUMEX programs developed at other sites. Many of these programs were written not under TENEX but under the 16/58 monitor supplied by the manufacturer. Because a large and useful program library was already available under the 10/5 monitor, one of the design criteria of TENEX was compatibility with such programs; when a 18/58 program is run under TENEX, a special “compatibility package" of routines is invoked to translate 16/5 monitor calls into equivalent TENEX monitor calls. Although the concept is sound, we have found that in practice very few programs written for the 1@/589 monitor are able to run under TENEX without extensive modification. Other problems with TENEX include weaknesses in the support of peripheral devices and the lack of a default line-editor. The latter has caused a proliferation of editing programs, and some confusion has resulted because editor conventions vary from program to program. These difficulties have dampened somewhat our initial enthusiasm for the TENEX system. Nonetheless, TENEX provides some features which are crucial to a comfortable network environment. The standard support programs included with this system facilitate both the sending of messages to other users (either at the same site or at other sites on the ARPA network) and the transfer of data and programs from site to site on that network; also, the ability to "link" two or more terminals allows users to communicate easily and immediately. Both the linking and message facility have been found to be invaluable aids in inter-group communications and in such problems as interactive program debugging. When two terminals are linked, their output streams are merged, thus allowing each terminal to display everything typed at the other terminal. Since only the output stream is affected under these circumstances, it is still possible for each terminal to be used to provide input to separate programs, in addition to being used in a conversational mode. Maintaining a livable system. This third management-level problem is multi-faceted, with major areas of concern being resource allocation, security and file backup. Each of these issues involves keeping the system comfortably useful for the menbers of the SOMEX community. Although these difficulties are felt at sites which serve only a local community, they are accentuated by network connection, As noted above, the computational resources of the SUMEX~AIM facility are apportioned by the AIM advisory group and SUMEX management. some extensions to the basic TENEX system have been made to reflect this apportioning in the actual use of the facility. Basically, it was recognized that users of the facility are members of groups working on specific projects, and it is among these projects that the facility is apportioned. Disk space and cpu cycles are now distributed among groups instead of among individual users. For example, a user may exceed his individual disk allocation somewhat without any ill effect, so long as the total allocation of his group remains within the limits. Similarly, a Reserve Allocation Scheduler has been added to TENEX which tries to match the administrative cycle distribution over a ninety second time frame. Thus a particular group cannot dominate the machine if a lot of its members are logged in at one time. It is typical for usage of a facility to peak through the middle hours of the day. Indeed, one of the advantages of having users from around the country is the spreading of the load caused by the difference in time zones. Even so, the facility could offer better service if more people would shift their main usage hours toward either end of the day. To encourage “soft-scheduling” within groups on the system, SUMEX-AIM publishes a weekly plot of diurnal loading . This plot shows the total number of jobs on the system as well as the number of LISP jobs, since these jobs seem to make the biggest demands of system resources. The result has been an increased awareness by users of system loading and a noticeable increase in the number of users at all hours of the night and early morning. Protection for a computer system covers a range of ideas. It means the ability to maintain secrecy - for example, to guarantee the privacy of patient records. It also guarantees integrity by assuring that programs and data are not modified by an unauthorized party. Questions of protection generally become more interesting and complex as more sharing is involved. Consider the example of a proprietory program which generates layouts given a user ‘s circuit data. The program owner demands assurance that he will be paid whenever his program is used and that copies of the program cannot be made. The user wants guarantees that his data sets cannot be destroyed or copied for a competitor. yet the user must have access to the program and the program must have access to the data. Unable to support such complicated examples of protection, SUMEX-AIM assumes that sharing takes place between friendly users. This is not to imply that issues of protection and sharing have not appeared. For example, in an effort to improve the human engineering of programs for public use, the capability of recording a session has been built into several of the programs. studied by the program designers to pinpoint confusing aspects of programs, these recordings serve to improve program design. Since the issue of violation of privacy has been raised, some cf these programs now request permission to record a session before doing so. At this time, any quarantee of privacy must be provided by the program designer because TENEX itself does not have the ability to render the protection . The general design for systems offering "state of the art" protection involves a tolerance for failure; that is, if a potential offender succeeds in breaking through some of the defenses, he still does not place the entire computer system at his mercy. Encrypting of data files provides an additional line of defense. This method is used by at least two calendar or appointment programs on the computer. At this time, however, there are no general encrypting facilities available and users must do this for themselves as needed. Tenex provides the usual keyword protection at login time and a measure of file protection. Owners of a file may assign a protection number which specifies some combination of READ, WRITE, EXECUTE, or APPEND access to a file for owners, members of a group, or other users. This level of protection is basically enough to prevent accidents and most mischief. System programmer s around the country are aware of a number of TENEX bugs which permit this access to be violated. One user of our system found a way to place himself in a mode where he could modify any file on the system. To date, we have no examples of such activity~ actually having a deleterious effect on SUMEX-AIM, To make the use of SUMEX-AIM programs easily available on a trial basis for prospective users, a "guest" account system has been established. Since this makes logging into SUMEX-AIM so easy, it has invited some misuse by people using those accounts to play the computer games. A proposed extension to the system now being implemented is a special "guest EXEC" which would extend the protection of the TENEX monitor by allowing guest accounts access to only a more restricted set of programs. In order to assure the user maximum protection against loss of valuable work, SUMEX operates a multi-level file backup system. In addition to routine file backup system there are facilities to enable the user to selectively archive his or her disk files. By issuing a simple command to the TENEX executive the user can transmit a message to the operator to copy specified files to magnetic tape. Each such file is copied to two magnetic tapes within 24 hours of issuing the archive command. File retrieval is affected by a similar process. The user also has the alternative option of being able to lodge files in a special backup directory. Files are held in this directory until the next exclusive file dump (see below) at which time they are deleted. In this way the user can remove files from his directory at his own choosing knowing they will be archived by the exclusive dump. On a system level, an effort is made to maintain file backups such that the maximum possible loss, in the event of a crash fatal to the file system, would amount to no more than one day's work. Once each day all files that have been read or written within the last 48 hours are dumped onto magnetic tape. Files that exist for 48 hours are thus held on two separate tapes. The rotation period for files dumped in this way is 6@ days. Once each week a full file dump is made to separate disk storage. Bach such dump is kept for two weeks at which time it is replaced by a new file dump. rach month there is a full system dump from disk to magnetic tape. Files can be recovered from the system backup by sending a message to the operator specifying the file name(s) and when the file was last read or written (if such information is available). Excessive demand for production programs. One of the concepts behind the creation of a shared resource is elimination of the problems which arise when large, complex computer programs are exported. Since, in theory, exportability is no longer a problem, there is greater latitude in choice of a language in which program development can take place. In the case of some of the DENDRAL programs, it was thought that program development should take place in INTERLISP, a language that lends itself well to the artificial intelligence nature of these programs, but does not lead to particularly efficient run-time code. In order to ascertain the usefulness of these programs and to determine what areas remained in need of work, chemist collaborators were sought. As these users increased in number and began to use the programs more frequently, it became obvious that the inherent slowness of the predominately LISP code was affecting the whole system as well as handicapping the efficient use of the DENDRAL programs. Additionally, some of the chemist-users who were finding the programs most useful and who were most enthusiastic about their potential use, were versons who were working in industry. Although, in one sense, this interest from industry could be interpreted as an indication of the “real-world” usefulness of the programs, it came aS rather a surprise to both SUMEX and DENDRAL personnel. The fact that SUMEX-AIM is funded by NIH as a national resource vrohibits the facility from providing a service, at taxpayer ’s expense, to a private industry. Although there is precedent for a site funded via government grant to charge a fee for service, such an arrangement leads to highly complicated bookkeeping, and is contrary to the essential purpose of SUMEX~-AIM; to be a research-oriented rather than service-oriented facility. This leaves the industrial users in the position of being more than willing to pay for the use of the programs, but of having no mechanism whereby they can be charged. Furthermore, the fact that the programs are coded in LISP for a highly specialized environment, almost guarantees the impossibility of export, except to an almost identical computer system. An intermediate solution that will help to solve the problem of industrial users on SUMEX and will help to alleviate the system loading resulting from heavy usage of LISP coded production programs, is to mount CONGEN on a closely related computer which is operated on a fee for service basis. However, in order to make this transfer economically feasible, it has become evident that it will be necessary to recode the LISP sections of the program into a more efficient and easily exportable language. Research-oriented problems Community mindedness. Those involved in computer science research at SUMNEX face a general problem which is absent or greatly lessened at non-network sites; the problem of community mindedness. The network provides a large and varied set of other researchers and users who have an interest in their work. Although the network-TENEX combination provides new forms of communication with these remote parties, the traditional means of fully describing the use and structure of a complex program, a detailed person-to-person discussion, is not convenient. Comprehensive documentation gains importance in such a situation, and within the DENDRAL project a great deal of time has been needed in the development of program descriptions which are adeauate for a diverse audience. Also, in both DENDRAL and MYCIN, effort nas been and is being directed toward "human engineering" in program design; to provide the user with commands which assist him in using the programs, in understanding the logic by which the programs reach certain decisions and in communicating questions or comments on the programs” operation to those responsible for development. Such "housekeeping" tasks can often be neglected, yet are guite important in smoothing interaction with the community. Choice of programming language. High level programming languages which are desianed for ease of program development are frequently poor as production=-level languages. This is because developmental languages free the researcher from a raft of programming details, thus allowing him to concentrate upon the central logical issues of the problem, but the automatic handling of these details is seldom optimal. Also, because such languages tend to be specialized for certain computers and operating systems, the exportation of programs can be a serious problem. One solution to these problems is the recoding of research-level programs into more efficient language when fast and exportable versions are needed. Networking greatly eases the problem of exportability, but can also aggrevate the the problem of efficiency. As mentioned in the previous section, the DENDRAL programs, which are undergoing constant development, found a substantial number of production-level users. pecause of the inefficiencies of INTERLISP (a 59- to 109-fold improvement in running time is not uncommon when an INTERLISP program is translated into FORTRAN), this use adversely impacted the entire system. Because the DENDRAL programs are quite large and complex, their translation into other languages is impractically tedious. A partial solution to this problem is provided by the TENEX operating system, which allows some interface between programs written in different languages. With such intercommunication, time-consuming segments of an INTERLISP program which are not undergoing active development can be reprogrammed in another more efficient language. The developmental parts of the program are left in INTERLISP, where modifications can easily be made and tested. The CONGEN program uses three languages; INTERLISP, FORTRAN and SAIL[ref]. The SAIL segment was added when a new feature, whose implementation was fairly straightforward, was included in CONGEN. Since then, the SAIL portion gradually has been taking over some of the more time-consuming tasks. this method allows a balance in the tradeoff between ease of program development and efficiency of the final program. Accumulation of expert knowledge in knowledge-based programs. Just as statistics-based programs need to worry about accumulation of large data bases, knowledge based programs need to worry about the accumulation of large amounts of expertise. The performance of these programs is tied directly to the amount of knowledge they have about the task domain -- in a phrase, knowledge is power. Therefore, one of the goals of artificial intelligence research is to build systems that not only perform as well as an expert but that also can accumulate knowledge from several experts. Simple accretion of knowledge is possible only when the "facts", or inference rules, that are being added to the program are entirely separate from one another. It is unreasonable to expect a body of knowledge to be so well organized that the facts or rules do not overlap. (If it were so well organized, it is unlikely that an artificial intelligence program would be the best encoding of the problem solver.) One way of dealing with the overlap is to examine the new rules on an individual basis, as they are added to the system in order to remove the overlap. This was the strategy for developing the early DENDRAL programs. However, it is very inefficient and becomes increasingly more difficult as the body of knowledge grows. The problem of removing conflicts, or potential conflicts, from overlapping ru]2®s becomes more acute when more than one exnert adds new rules to the knowledge base, Of course, the advantages of allowing several experts to “teach" the system are enormous -- not only is the crogram’s breadth of knowledge potentially greater than that of a single expert, but the rules are more apt to be refined when looked at by several experts. On the other hand, one can expect not only a greater volume of new rules but a higher percentage of conflicts when several experts are adding rules. Having a computer program that can accumulate knowledge presupposes having an organization of the program and its knowledge base that allows accumulation. If the knowledge is built into the program as sequences of low-level program statements -- as often happens -- then changing the program becomes impossible. Thus current artificial intelligence research stresses the importance of separating problem-solving knowledge from the control structure of the program that uses that knowledge. Another problem , at a political rather than a programming level, becomes apparent with one accumulation process: how does the program distinguish an expert from a novice? In the MYCIN program we have circumvented the problem by having the program ask the current user for a keyword that would identify him as an expert. It is then a bureaucratic decision as to which users are given that keyword. There is nothing subtle in this solution, and one can imagine far better schemes for accomplishing the same thing. The point here is that not every user should have the privilege of changing rules that experts have added to the system, and that some safeguards must be implemented. "Human nature" barriers to SUMEX use Countering disbelief. There is sometimes a tendency among those unfamiliar with the capabilities and limitations of computers and computer programs to express disbelief. This is not disbelief in the sense of worrying that the programs have errors and produce erroneous results. Indeed, the fact that a problem is being done by a computer seems to generate some faith that it might be right, or at least significantly reduces questions about correctness. The disbelief is that programs, which are designed to model, or to emulate, human problem solving will not be capable of useful performance. This, of course, is the classic argument against artificial intelligence - we think in mysterious ways and have such a complex brain that a computer program must be inferior. In some cases, authors of artificial intelligence programs have brought such criticism upon themselves by not stressing limitations, or by making extravagant claims. In the DENDRAL project, we have tried to counter this type of disbelief in a number of ways. We have tried to stress that our programs are designed to assist, not replace chemists. We have always discussed limitations to give a reasonable perspective on capabilities vs. limitations of a program. Most importantly however, we have focused on those aspects of problems which are amenable to systematic analysis, i.e., those problems which can be done manually, but only with difficulty and with the consumption of a great deal of time which a chemist could better spend on more productive pursuits. Examples of this would include the application of PLANNER to mixtures where all fragmentations may have to be considered as possible fragments of every molecular ion, the systematic analysis by INTSUM