I. RESOURCE IDENTIFICATION
Project Title: Resource-Related Research Computers and Chemistry

Principal Investigator: Dr. Edward A. Feigenbaum
Telephone (415) 321-2300 Ext. 4&78

Department: Computer Science
School: Humanities and Sciences
Site: Stanford University, Stanford, California

Total Project Period: May 1, 1971 through April 30, 1974

Business and Administrative Official: K. D. Creighton
Telephone (415) 321-2300 Ext. 2251
NATIONAL INSTITUTES OF HEALTH

Report Period: From 1/1/72 To 12/31/7[?]
Name of Resource: Resource-Related Research Computers and Chemistry
Resource Address: Computer Science Department, Stanford University,
Stanford, California
Principal Investigator: Dr. Edward A. Feigenbaum, Professor
Grantee Institution: Stanford University, Stanford, California
Grantee Institution Official: [name illegible], Assistant Research
Administrator

II. RESOURCE OPERATIONS
A. Description of Progress
1. Overview

The Heuristic DENDRAL Project at Stanford University is an
interdisciplinary research effort. The task area is a many-faceted
problem of interest to medicine, chemistry, and computer science.
Because the actual work has been divided into separate sub-problems
along lines of scientific expertise, an overview is given here to
establish the context of progress in each area, details of which are
described in subsequent sections.

Following the organization of the original proposal, the progress
and plans are organized into Parts A, B, and C, representing the
different research efforts included within the scope of the proposal.
Part A is aimed at enhancing the reasoning power of the existing
Heuristic DENDRAL performance program so that eventually it may become a
useful working tool for mass spectrum analysts. The goals of Part B
include the closed-loop control of a mass spectrometer in realtime by a
version of the Heuristic DENDRAL program; and the development of mass
spectrum analysis techniques for certain classes of biologically
important compounds. Part C concerns the development of the Meta-DENDRAL
program, an attempt to achieve automatic theory formation in the
area of mass spectrometry.

During this year we have made continued progress in each of the
three major project areas. The following sections describe our progress
and plans in detail. The highlights are:

Part A: (1) Analysis of high resolution mass spectra of estrogens
and estrogen mixtures.
(2) Completion of the algorithm for generating cyclic
structures.

Part B: (1) Development of hardware and software for routine data
acquisition on the Varian-MAT 711 mass spectrometer,
sending data to the IBM 360/50 computer at the Medical
School's ACME facility.

(2) Preliminary work on analysis of the chemical components
of urine. An initial application of this work for
analyzing the urine of premature infants.

Part C: (1) Completion of the data interpretation program, the
first part of automatic theory formation. Application
of this program to new sets of data.

(2) Continued work on rule formation, the second main stage
of theory formation.

The problem we have chosen to work on - the application of
artificial intelligence to mass spectrometry - remains a richly varied
problem domain. Its interest to medicine, analytic chemistry, and
computer science has not diminished. We have discovered aspects of the
problem which are more difficult than we initially thought. On the
other hand, we have made more progress with other aspects in the last
year than we would have predicted.

Interpretation of mass spectra requires the judicious application
of a very large body of knowledge, whether it is done by a chemist or by
a computer. Part of our work centers on acquiring new knowledge of mass
spectrometry and codifying old knowledge. This means running and
analyzing the mass spectra of unstudied classes of compounds as well as
putting mass spectrometry rules into the computer program. These tasks
have required the development of artificial intelligence techniques
necessary to apply the chemical knowledge efficiently.

Part A. APPLICATIONS OF ARTIFICIAL INTELLIGENCE TO MASS SPECTROMETRY

Objectives:

The overall objective of this part of the research is extension of
the Heuristic DENDRAL program to analysis of the mass spectra of complex
organic molecules. This overall objective encompasses several
sub-tasks, all of which represent critical steps in building a powerful
program in an incremental fashion. Thus the current status of the
program permits operation to continue in a routine, production mode
wherein problem areas within the scope of the program can be
investigated while extensions of the program are under development. The
following specific objectives reflect both applications of the existing
program and ongoing program development:

I) Assess the capabilities and limitations of the programming
techniques for estrogenic steroids analyzed as unknown compounds and
mixtures of compounds.

II) Generalize the programming techniques to ensure a high level
of compound class independence.

III) Apply the techniques to other classes of steroids, alkaloids,
and amino acids.

IV) Develop the cyclic structure generator for inclusion into the
Heuristic DENDRAL program and explore the potential of the generator as
an analytical aid of general utility.

V) Refine planning rules to infer compound classes or molecular
substructures to minimize structures considered by the DENDRAL
algorithm.

VI) Exploit ancillary information which can be obtained from
other mass spectral techniques such as metastable ion spectra, low
ionizing voltage spectra and mass spectral pattern shifts in
isotopically or substituent labeled molecules.

VII) Design experimental strategies to collect, using the
techniques of part VI, only those ancillary data required by DENDRAL to
effect a solution or minimize ambiguities.

VIII) Structure the programs to utilize and/or request data from
other spectroscopic techniques (e.g., proton magnetic resonance (PMR),
carbon-13 magnetic resonance (CMR), infra-red (IR) or chemical
techniques, such as isotopic labeling with deuterium).

IX) Explore the theoretical bases for mass spectral fragmentation
processes to improve existing mass spectral theory.

X) Implement production analysis programs on the ACME computer
facility to permit closer integration with the mass spectral data
acquired and reduced on this facility.

Progress:

The following discussion of this task area of the proposal is keyed
to the sub-task objectives described above:

I) The techniques of artificial intelligence have been applied
successfully for the first time to a problem of direct biological
relevance, namely, the analysis of the high resolution mass spectra of
estrogenic steroids. The performance of this program has been shown to
compare favorably with the performance of trained mass spectroscopists
(see Smith et al., 1972). The operation of this program has been
detailed in this publication, a copy of which is attached. Briefly, the
program was designed to emulate the thought processes of an expert as
far as possible. High resolution mass spectral data are searched for
evidence indicating possible substituent placements about the estrogen
skeleton. Molecular structures allowed by the mass spectral data are
tested against chemical constraints, and candidate solutions are
proposed. Further details of the performance in analysis of more than
thirty estrogen-related derivatives are presented in the above
publication.

Of particular significance in this effort were, in addition to
exceptional performance, the potential for analysis of mixtures of
estrogens WITHOUT PRIOR SEPARATION, and for generalization of the
programming approach to other classes of molecules. The last topic is
discussed in more detail in (II) and (III) following.

Because of the structure of the Heuristic DENDRAL program for estrogens,
it is immaterial whether the spectrum to be analyzed is derived from a
single compound or a mixture of compounds. Each component is analyzed,
in terms of molecular structure, in turn, independently of the other
components. This facility, if successful in practice, would represent a
significant advance of the technique of mass spectrometry. Many problem
areas, because of physical characteristics of samples or limited sample
quantities, could be successfully approached utilizing the spectra of
the unseparated mixtures. Even in combined gas chromatography/mass
spectrometry (GC/MS) (see proposal section Part B-2 below), many
mixture components will be unresolved and an analysis program must be
capable of dealing with these mixtures.

We have, in collaboration with Prof. H. Adlercreutz of the University of
Helsinki, recently completed a series of analyses of various fractions
of estrogens extracted from bodily fluids and supplied to us by Prof.
Adlercreutz. These fractions (analyzed by us as unknowns) were found to
contain between one and four major components, and structural analysis
of each major component was carried out successfully by the above
program. These mixtures were analyzed as unseparated, underivatized
compounds. The implications of this success are considerable. Many
compounds isolated from bodily fluids are present in very small amounts
and complete separation of the compounds of interest from the many
hundreds of other compounds is difficult, time-consuming and prone to
result in sample loss and contamination. We have found in this study
that mixtures of some complexity (<10 components), which are difficult
to analyze by conventional GC/MS techniques without derivatization
(which frequently makes structural analysis more difficult), can be
rationalized even in the presence of significant amounts of impurities.

A manuscript on this study will be submitted shortly. Because of the
potential generality of this technique we will continue our
investigations of estrogens and begin studies on mixtures of other
steroids.

In the past year we have extended our library of high resolution mass
spectra of estrogens to include 67 compounds. These data represent an
important resource and will tentatively be included (as low resolution
spectra for the moment) in a collection of mass spectra of biologically
important molecules being organized by Prof. S. Markey at the University
of Colorado. These data are being used extensively in developing the
program strategies for Meta-DENDRAL (see Part C, below).

II) The Heuristic DENDRAL program for complex molecules has
received considerable attention during the last year in order to remove
compound class specific information or program strategies. By removing
information which is specific to estrogens, the program has become much
more general. This effort has resulted in a production version of the
program which is designed to allow the chemist to apply the program to
the analysis of the high resolution mass spectrum of any molecule with a
minimum of effort. Given the spectrum of a known or unknown compound,
the chemist can supply the following kinds of information to guide
analysis of the mass spectrum:

a) Specification of basic structure (superatom) common to the
class of molecules.

b) Specification of the fragmentation rules to be applied to the
superatom, in the form of bond cleavages, hydrogen transfers and charge
placement.

c) Special rules on the relative importance of the various
fragments resulting from the above fragmentations.

d) Threshold settings to prevent consideration of low intensity
ions.

e) Available metastable ion data and the way these data are
subsequently used -- to establish definitive relationships between
fragment ions and their respective molecular ions (see VI, below).

f) Available low ionizing voltage data -- to aid the search for
molecular ions (see VI, below).

g) Results of deuterium exchange of labile hydrogens -- to specify
the number of, e.g., -OH groups (see VI, below).
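
The items above can be pictured as a single record handed to the
program. The following Python sketch is illustrative only and is not
taken from the DENDRAL programs; the class name AnalysisInputs, its
field names, and the helper apply_threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnalysisInputs:
    """Hypothetical record of the chemist-supplied items (a) through (g)."""
    superatom: str                                            # (a) common skeleton
    fragmentation_rules: list = field(default_factory=list)   # (b) cleavages, H transfers, charge
    fragment_weights: dict = field(default_factory=dict)      # (c) relative importance of fragments
    intensity_threshold: float = 0.0                          # (d) minimum intensity considered
    metastable_pairs: list = field(default_factory=list)      # (e) (parent mass, fragment mass)
    low_voltage_masses: list = field(default_factory=list)    # (f) masses seen at low voltage
    labile_hydrogens: Optional[int] = None                    # (g) from deuterium exchange


def apply_threshold(peaks, inputs):
    """Keep only (mass, intensity) peaks at or above the threshold of item (d)."""
    return [(m, i) for (m, i) in peaks if i >= inputs.intensity_threshold]
```

Only item (d) is exercised here; the other fields simply record what
the chemist would supply.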

In the case of a known compound this procedure may be used to
validate fragmentation rules developed on other, related compounds.
This mode will be used extensively in testing the output of the data
interpretation program (see Part C, below).

In the case of unknown compounds, rules with known generality for
related, known structures may be used to determine the structure of the
unknown. This mode has been used extensively for estrogens and will be
extended to other classes (see III, below).

III) The first step away from estrogen analysis was initially
going to be to the analysis of pregnanes, another biologically important
class of steroids. A review of the mass spectrometry literature,
however, revealed a paucity of information on the mass spectral
fragmentation behavior of these molecules. Without fragmentation rules
we cannot proceed with spectral analysis. We have, therefore, collected
the high resolution mass spectra of approximately 50 pregnane related
compounds. The data interpretation program (see Part C, below) will be
used extensively to help elucidate the fragmentation mechanisms
involved. This study has already achieved the result of clarifying,
through the use of high resolution data, the interpretation of mass
spectra of the small number of pregnanes reported in the literature
which were recorded only under low resolution conditions. Peaks have
been found which have elemental compositions different from those
assigned by past studies.
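
To illustrate why high resolution data can overturn a low resolution
assignment, the following Python sketch lists the C/H/N/O compositions
whose exact mass falls within a tolerance of a measured mass. It is an
illustration only, not the project's procedure. The function name
candidate_compositions and its parameters are hypothetical; the atomic
masses are the standard monoisotopic values; the electron mass is
neglected and no valence check is made.

```python
# Monoisotopic atomic masses (carbon-12 scale).
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def candidate_compositions(target, tol=0.003, max_c=30, max_h=60, max_n=4, max_o=6):
    """Return (C, H, N, O) counts whose summed mass is within tol of target."""
    hits = []
    for c in range(max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = target - (c * MASS["C"] + n * MASS["N"] + o * MASS["O"])
                if rest < -tol:
                    break  # adding more oxygen only overshoots further
                h = round(rest / MASS["H"])  # nearest whole number of hydrogens
                if 0 <= h <= max_h and abs(rest - h * MASS["H"]) <= tol:
                    hits.append((c, h, n, o))
    return hits
```

At nominal mass 28, for example, C2H4 (28.0313), N2 (28.0061) and CO
(27.9949) coincide at low resolution but are separated by this search
when the tolerance is a few millimass units.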

We have also collected a total of 26 spectra of three classes of
quinazolone and quinolone alkaloids for which mass spectra have not been
previously recorded. As fragmentation mechanisms are developed for
these classes, they will be tested against the known structures, and in
the case of the quinazolone alkaloids tested against a set of nine
compounds for which spectra have not been determined and which then can
be treated as unknowns.

In connection with the goals of Part B-2 (see below) we will
shortly commence a study of derivatized amino acids
(N-trifluoroacetyl-O-butyl esters). These are derivatives of choice for
GC/MS analysis of amino acids whether derived from, e.g., bodily fluids
or geological samples. This will be an important first step in
integration of the data analysis programs with GC/HRMS data on urine
extracts, as essentially no high resolution mass spectral studies have
been carried out on constituents of urines.

IV) The cyclic structure generator now rests on a firm
mathematical foundation such that we are confident of its thoroughness
and ability to generate structures with PROSPECTIVE elimination of
duplicate structures. The prospective nature of the generator is a
necessity for efficient implementation, as retrospective checking of
each generated structure to eliminate redundancies is too time
consuming. The necessary concepts have recently been transformed into
an operating algorithm.

The next step in its development will be to implement constraints
on the generator so that greater flexibility is possible. For example,
in many cases the chemistry of a situation dictates that certain
structural types may be present, or that others must be absent. The
generator will use this information as constraints. We have planned a
set of constraints which are useful to the chemist, for example, numbers
of rings as opposed to double bonds, ring sizes, ring fusions, and so
forth, and have begun developing ways to incorporate these constraints
without compromising the requirements for thoroughness and
non-redundancy. Mr. Larry Masinter, Dr. N. S. Sridharan, and Mr. Larry
Hjelmeland have been key personnel in bringing the algorithm to
completion and implementing it.

A manuscript will soon be submitted describing for chemists the
core of the cyclic structure generator, the labelling algorithm. This
algorithm is capable of construction of all isomers, of wholly cyclic
graphs, which may be formed by labelling the nodes of a cyclic skeleton
with atoms (e.g., C, N, O) or labelling the atoms of the skeleton with
substituents (e.g., -CH3, -OH). Through the use of graph theory, group
theory, and the symmetry properties of cyclic graphs the labelling
algorithm avoids construction of redundant isomers by identification of
equivalent node positions on the graph structure before labelling takes
place. It is indicative of the complexity of this problem and the
importance of its solution to both chemists and mathematicians that it
has remained unsolved (until now) despite attention for over 100 years.
A manuscript describing the underlying mathematical theory has been
submitted to DISCRETE MATHEMATICS.
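
The counting problem behind the labelling algorithm can be illustrated
for the simplest skeleton, a single ring. The Python sketch below is
NOT the prospective algorithm described above: it generates every
arrangement and then discards symmetry-equivalent ones retrospectively
by keeping one canonical representative, which is exactly the brute
force approach the project set out to avoid. The function names are
hypothetical.

```python
from itertools import permutations


def ring_symmetries(n):
    """All rotations and reflections of an n-node ring, as index lists."""
    syms = []
    for shift in range(n):
        syms.append([(i + shift) % n for i in range(n)])   # rotation
        syms.append([(shift - i) % n for i in range(n)])   # reflection
    return syms


def distinct_ring_labelings(labels):
    """Distinct ways to place the given atom labels on a ring of the same size."""
    n = len(labels)
    syms = ring_symmetries(n)
    seen = set()
    for arrangement in set(permutations(labels)):
        # Canonical form: the smallest image under any ring symmetry.
        canon = min(tuple(arrangement[j] for j in perm) for perm in syms)
        seen.add(canon)
    return sorted(seen)
```

For a six-membered ring holding four carbons and two nitrogens,
distinct_ring_labelings("CCCCNN") returns three arrangements,
corresponding to the 1,2-, 1,3- and 1,4-placements of the nitrogens.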

The cyclic structure generator in its entirety (encompassing
acyclic, wholly cyclic and combinations thereof) will be described
separately. Apart from the labeling algorithm the remainder of the
problem involves, first, the combinatorics of assignment of atoms to
cycles or chains, and second, construction of acyclic radicals to attach
to the rings using the well known principles of acyclic DENDRAL.
Manuscripts describing the mathematical and chemical aspects of the
structure generator are in preparation.

Over the summer we were fortunate to have the help of Prof. Harold
Brown, a visitor to Stanford from the Dept. of Mathematics at Ohio State
University. He brought to the problem a depth of mathematical analysis
which was important for finishing the design of the algorithm and
working out details of its implementation. He was largely responsible
for the manuscripts describing the graph theory of the labeling
algorithm and the graph theory of the structure generation algorithm.

The cyclic structure generator makes it possible to define the
boundaries, scope and limitations of organic chemistry as a whole,
rather than simply the acyclic part of it. As an indication of the
complexity of chemistry in terms of numbers of possible structures, take
the example of C6H6. The most familiar molecule with this molecular
formula is benzene. Yet there are more than 200 topological isomers for
C6H6 (with valence constraints) of which only 15 are totally acyclic.
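
The valence constraints fix the total number of rings plus double bonds
that every isomer of a formula must contain. The Python sketch below
applies the standard count for C, H, N and O with their usual valences
(4, 1, 3, 2); the function name is hypothetical and the sketch is an
illustration, not part of the structure generator.

```python
def rings_plus_double_bonds(c, h, n=0, o=0):
    """Rings plus double bonds for a neutral CcHhNnOo molecule.

    A triple bond counts as two. Oxygen (valence 2) does not change the
    count, so the argument o is accepted only for readability.
    """
    twice = 2 * c + 2 + n - h
    if twice < 0 or twice % 2:
        raise ValueError("formula is not a valid neutral closed-shell molecule")
    return twice // 2
```

For C6H6 the count is 4: benzene uses one ring and three double bonds,
while a totally acyclic isomer must spend all four on double or triple
bonds.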

The first use of the generator has been to create a dictionary of
carbocyclic skeletons. This time-consuming task would otherwise have to
be done each time a new molecular formula is presented. The dictionary
is structured to contain keys as to type of skeleton, number of rings,
ring fusion, and so forth, so that the constraints mentioned previously
are simple to exercise in the context of the dictionary.

We feel that the cyclic structure generator has the potential of
acting as the focal point for an interactive laboratory analytical tool.
Constrained by inferences obtained from data (such as MS, IR, etc.) and
from chemical treatments, such a generator would, under control by the
chemist, be a powerful proposer of an exhaustive set of candidate
solutions based on available data. This concept will certainly be
developed further as we improve both our capabilities for inference from
scientific data and our techniques for using the generator.

V) Efforts in analysis of mass spectra have to this point been
relatively restricted in terms of the types of structures which may be
considered. As our knowledge base and the scope of the program increase
it is necessary to consider general planning rules. These rules are
used in initial examination of a mass spectrum to determine which
compound class might be represented so that subsequent analysis utilizes
rules for that class. One approach was used successfully in the past
analysis of saturated aliphatic monofunctional (SAM) compounds. For
more general utility, however, other approaches must be considered. The
following areas are presently under investigation:

a) How best to exploit a version of library matching procedures to
ease the computational burden on DENDRAL when dealing with routine
analyses of mixtures of compounds that have previously been at least
partially characterized. In this way attention can be focused on those
previously uncharacterized components. This aids planning in that
effective library matching procedures frequently provide hints as to
molecular structure even when the correct spectrum is absent from the
library. Mr. Larry Hjelmeland and Mr. Mark Stefik have been
investigating library matching procedures which fit our needs.

b) Utilize ion series spectra (Smith, 1972), an extension of the
planning procedure for SAM compounds, in conjunction with the specific
information embodied in a high resolution mass spectrum, which yields
not only formulae but the implicit number of rings plus double bonds;
both items serve as powerful limitations on compound class.

c) For complex molecules which may contain several functional
groups we have explored and are continuing exploration of incorporation
of molecular substructures into the planning scheme. Thus rather than
infer a class or particular skeleton, inferences are made about specific
functional groups (e.g., -NH2, -OH) or substructures (e.g.,
-CH2-CH2-CH3). This is the form in which information from other
spectroscopic techniques is available, and we plan to extend our present
capabilities for planning based on this information (see VIII, below).
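
The library matching mentioned in item (a) can be sketched in general
terms. The procedures under investigation are not specified in this
report, so the Python sketch below substitutes a generic measure, the
cosine similarity between two spectra held as nominal-mass to intensity
tables. All names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra given as {mass: intensity}."""
    dot = sum(a[m] * b[m] for m in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(unknown, library, top=3):
    """Return the top (score, name) pairs from library, best first."""
    scored = sorted(
        ((cosine_similarity(unknown, spectrum), name)
         for name, spectrum in library.items()),
        reverse=True,
    )
    return scored[:top]
```

A high score against a library entry that is not the correct compound
is the kind of structural hint described in item (a).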

VI) There are several additional techniques available to the mass
spectroscopist other than recording the conventional mass spectrum.
These techniques are used routinely in everyday research as they provide
considerable complementary data which frequently are of great assistance
in rationalization of the conventional spectrum, either in terms of
structure or fragmentation mechanisms. We have modeled the Heuristic
DENDRAL program for complex molecules to use data from these additional
techniques in much the same way as a chemist does. We have the
capability of determining the following three types of data on our mass
spectrometers and using them in the program.

a) Metastable Ion (MI) Data. Metastable ions provide a means for
relating fragment ions to molecular ions in a mass spectrum. This
information is extremely important in two contexts. In examination of
the spectrum of a known compound, the existence of a metastable ion
provides strong evidence that a given fragment ion arises at least in
part in a single decomposition process from an ion of higher mass (not
necessarily the molecular ion). Investigations of this type are
necessary to establish that a set of fragmentation processes which are
to be used as rules to guide the Heuristic DENDRAL program are in fact
viable processes and occur in a known manner. An example of the utility
of these observations has been investigations of metastable ion data in
the mass spectra of estrogens (Smith, Duffield and Djerassi, 1972).

The second context is, in the case of analysis of mixtures of
compounds, a determination of which fragment ions in a very complex
spectrum are related to which molecular ions. We have explored the
analysis time and specificity of results as a function of the amount of
metastable ion data available on a mixture and noted one to two orders
of magnitude reduction in computer time to arrive at single, correct
solutions for various mixture components (rather than 5-20 possible
solutions limited by the conventional mass spectrum alone). These
results will be reported in detail in the description of the analysis of the
estrogen mixtures (see I, above).

Metastable ions are those which are formed by fragmentation
processes occurring during the flight of an ion after formation and
acceleration. These fragmentation processes may occur at any point
along the flight path of ions through the mass spectrometer. Because of
the complex behavior of metastable ions formed in magnetic or electric
fields, they are usually studied in field-free regions of a mass
spectrometer. Earlier work was directed at ions formed in a field-free
region just prior to entering a magnetic field (mass analysis). This is
the only method available for metastable ion studies for a single
focussing mass spectrometer. The metastable ions formed in this region
appear as diffuse peaks superimposed on the normal mass spectrum. The
mass positions of these metastable ions, however, satisfy
(mathematically) several relationships of pairs of normal ions. This
lack of specificity and frequent difficulties in accurately determining
the mass positions have caused us to turn our attention to studies of
so-called "defocussed" metastable ions. A conventional double focussing
mass spectrometer possesses two field-free regions where metastable ions
may be studied. One field-free region lies between the electric sector
and the magnetic sector. This region can be used to study metastable
ions of the type discussed above. The other field-free region lies
between the ion source and the electric sector. Metastable ions formed
in this region can be examined by de-tuning the instrument (defocussing)
so that normal ions are not observed, but metastable ions are. This
procedure allows establishment of specific relationships between ions
involved in a metastable decomposition so that the original ion which
decomposes during flight, and its decomposition product, can be
identified. This technique has led to much more useful information for
the Heuristic DENDRAL program, as illustrated earlier in this section.
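
The lack of specificity of the diffuse peaks can be made concrete. The
report does not state the relationship, but the standard textbook one
for a singly charged ion of mass m1 decomposing to mass m2 in the
field-free region before the magnetic sector is an apparent mass
m* = m2 squared divided by m1. The Python sketch below, with
hypothetical function names, lists every pair of normal peaks
consistent with one observed m*.

```python
def apparent_mass(parent, fragment):
    """Apparent mass m* = m2**2 / m1 of the diffuse metastable peak."""
    return fragment * fragment / parent


def candidate_transitions(m_star, peaks, tol=0.1):
    """All (parent, fragment) pairs of normal peaks consistent with m_star."""
    ordered = sorted(peaks, reverse=True)
    pairs = []
    for parent in ordered:
        for fragment in ordered:
            if fragment < parent and abs(apparent_mass(parent, fragment) - m_star) <= tol:
                pairs.append((parent, fragment))
    return pairs
```

With normal peaks at masses 100, 81, 80 and 72, a diffuse peak at 64.0
fits both 100 to 80 and 81 to 72, which is the ambiguity that the
defocussing technique removes by identifying the decomposing ion
directly.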

b) Low Ionizing Voltage (LV) Data. The key to successful
operation of the Heuristic DENDRAL program is correct inference of the
molecular ion(s) and molecular formula(e) in a given mass spectrum. In
the past, metastable ion data were used to assist the program in correct
identification of molecular ions. This procedure has now been
supplemented, making the program cognizant of LV data. At lower
ionizing voltages, molecular ions are formed with lesser amounts of
excess internal energy. Most classes of molecules (those that display
significant molecular ions) can be analyzed at a sufficiently low
ionizing voltage that only molecular ions are observed, as the internal
energy is not sufficient to allow fragmentation. This technique was
used extensively in the analysis of estrogen mixtures and the resulting
data simplify the program's task of determining molecular ions.

c) Isotopic Labeling. We have previously described how isotopic
labeling of labile hydrogens with deuterium aids analysis. For example,
the last phase of the analysis of spectra of complex molecules involves
several "chemical" checks on the validity of proposed structures. The
knowledge of the number of hydroxyl groups can be a powerful filter to
reject certain candidate structures. Isotopically labeled molecules
have permitted a detailed examination of fragmentation processes of
complex molecules utilizing comparisons of metastable ion spectra of
labeled and unlabeled molecules (Smith, Duffield and Djerassi, 1972).

Future work will involve suggestions by a program of likely sites
of hydrogen transfer in the course of fragmentation. Elucidation of
fragmentation processes is a part of the Meta-DENDRAL effort (Part C,
below). More detailed specification of these processes can be effected
by isotopic or substituent labeling of molecules and we feel that a
program is capable of suggesting the necessary experiments.

In addition, we are exploring the feasibility of using C13 NMR data
to complement mass spectrometry data. Its initial use will be to
determine the branching structure of alkyl chains away from the
heteroatom in aliphatic monofunctional compounds. Dr. Ray Carhart, an
NIH post-doctoral fellow, is working on this problem together with Ms.
Hanne Eggert, a visiting scholar from the University of Copenhagen,
Denmark. Substantial work on the C13 NMR theory of amines has been
described in a manuscript (by Eggert & Djerassi) to be submitted soon.

VII) Designs of experimental strategies represent a crucial link
between the Heuristic DENDRAL program and the instrument control aspects
of this proposal (see Part B-1, below). We have begun planning ways in
which the program, cognizant of intermediate results, can suggest
additional collection of data that will be required for an unambiguous
determination of structure, or at least to minimize ambiguities. These
suggestions can ultimately be translated into control parameters sent
back to the mass spectrometer. In any real-time data collection scheme
involving small amounts of sample, time is of the essence. It is
crucial to select those data which are necessary and sufficient and to
avoid collection of redundant or spurious data. We feel an
"intelligent" program can supervise the data collection and analysis to
fulfill this goal and can accomplish the task in real-time.

VIII) The Heuristic DENDRAL program for SAM molecules is already
structured to accept additional spectroscopic data in the forms of
GOODLIST and BADLIST specifying molecular substructures which are
present or absent. We have deferred implementation of this more general
approach to the Heuristic DENDRAL program for complex molecules until
the cyclic structure generator is ready. Up until now, any such data
from other techniques have been used retrospectively to check candidate
structures for the requisite functional groups or substructures. Now
that the structure generator is available, we will begin implementation
of the GOODLIST and BADLIST for cyclic molecules.
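
The retrospective use of such data described above amounts to a filter
over candidate structures. The Python sketch below is illustrative only:
it represents each candidate as a set of substructure names, which is a
simplifying assumption and not the representation used by the program,
and the function name is hypothetical.

```python
def filter_candidates(candidates, goodlist=(), badlist=()):
    """Keep candidates containing every GOODLIST item and no BADLIST item.

    candidates maps a candidate name to the set of substructures it contains.
    """
    good, bad = set(goodlist), set(badlist)
    return [
        name
        for name, substructures in candidates.items()
        if good <= set(substructures) and not (bad & set(substructures))
    ]
```

Applying the same two lists inside the generator, rather than
afterwards, would prevent unwanted structures from being built at all.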

IX) We have begun to explore ways in which to predict the mass
spectral behavior of molecules without the need to resort to the
classical method of determining many mass spectra followed by empirical
generalizations. Quantum mechanics may be capable of providing this
information. With Dr. Gilda Loew, we have been investigating extended
Huckel molecular orbital theory in an attempt to predict some
qualitative indications of the propensity of bonds to fragment. Our
initial efforts have been aimed at the estrogenic steroid estrone, and a
manuscript will shortly be submitted describing these results. Briefly,
calculated net atomic charges appear to have little bearing on
subsequent fragmentation of the molecule. Bond densities (which are
related to bond strengths), however, provide some indication of which
bonds are likely to undergo scission in the first step of a
fragmentation. We are attempting to extend these results to other
molecules, specifically, amino acids.

The ability to predict features of mass spectra given only a
molecular structure would be an important advance both within the
context of Heuristic DENDRAL and for mass spectrometry and theoretical
chemistry as a whole.

X) A version of Stanford 360/LISP has been mounted on the Medical
School's ACME computer system. This version, available to us in the
overnight batch processing operation, has proven useful for running
production versions of programs. Because our mass spectral data are
acquired and reduced via ACME, this facility has removed the need for
transferring data from ACME to the campus facility. We regret to
report, however, that this version of LISP is not available to us in the
time sharing mode during the day when mass spectral data are collected.
Thus, although routine data analysis is facilitated, there is no
immediate prospect for integration of DENDRAL into the real-time aspects
of the problem. For the near future these activities will be simulated
through batch processing to enable us to develop the necessary
techniques for real-time interaction.

Plans:

In most cases, the plans for future work are embodied in and
dictated by the progress we have made so far. Many of the plans,
therefore, are outlined in the Progress section, above. As a brief
summary, then, we plan the following activities, again keyed to the
sub-task objectives:

I) We plan to continue with analysis of additional estrogen
mixtures from bodily fluids in view of the excellent performance of the
program so far.

II) We feel we have achieved a high level of class independence in
our present program. As more classes are analyzed we expect that
further "cleanup" may be necessary, but easy to carry out.

III) Extend Heuristic DENDRAL for complex molecules to the classes
for which spectral data are or shortly will be available: pregnanes,
cholestanes, the above alkaloids and amino acid derivatives.

IV) Constraints will be developed for the cyclic generator that are
easily understood by chemists and easily implemented in the computer
program.

V) Planning rules for compound class determination will receive
considerable attention as Heuristic DENDRAL is extended.

VI, VII) We understand how to use this additional information.
Work needs to be done on algorithms to determine which experiments to do
and how best to do them to minimize consumption of valuable samples.

VIII) As the structure generator is developed, we plan to implement
it in Heuristic DENDRAL so that constraints imposed by spectroscopic
data may be used effectively.

IX) We plan to analyze amino acids using molecular orbital theory
to extend the theoretical basis for prediction of mass spectra.

X) We plan to simulate in as much detail as possible the
interaction between Heuristic DENDRAL and the mass spectrometer to
direct data collection in an intelligent fashion.

Part B-i. EXTENSIONS OF THE COMPUTER-MASS SPECTROMETER SYSTEM.

Objectives:

Data acquisition in real-time from the Varian-MAT 711 mass
spectrometer with analysis of these data by Heuristic DENDRAL is the
primary objective of this section of the research. We ultimately seek a
substantial degree of control by computer program over the acquisition
of data from the mass spectrometer. With sufficient computer power it
is possible to accomplish the control within the time scale of GC/MS
operation. A rationale for this approach and our efforts toward devising
suitable programs to achieve this goal are described above under Part A.

The following operational parameters of the mass spectrometer are
desirable and amenable to control: magnetic scan speed and mass range
of scan, slit widths (to adjust to high or low resolution operation), ion
optical stops (to increase resolution in the metastable defocussed
mode), accelerating or electrostatic sector voltages, ionizing voltage
(to switch from normal to low ionizing voltage), and rate and
temperature of probe heating when the direct insertion probe is used to
introduce samples into the mass spectrometer. Control of GC conditions
is also possible.

Progress:

The Varian-MAT 711 mass spectrometer was formally accepted by
Stanford University on Nov. 5, 1971. Prior to this time the instrument
installation and performance tests went extremely smoothly. Shortly
after acceptance, however, a series of electronic and mechanical
malfunctions occurred which necessitated a visit of an engineer from
Germany for a period of several weeks. Since that time the instrument
has been used routinely in all its operating modes including ultra-high
resolution peak matching, scanning at high resolution for accurate mass
measurement, GC/MS operation, low ionizing voltages, and metastable
defocussing. This instrument has now assumed the entire burden of data
acquisition for DENDRAL-related activities.

There are two activities related to the goals of this Part
which have proceeded in parallel with gaining familiarity with the new
instrument. These activities are improving the software (programming)
for data acquisition and reduction, and developing new hardware for the
initial efforts toward instrument control.

Software.

Great advances have been made in the programming for data
acquisition and reduction, particularly since the arrival of Mr. Tom
Rindfleisch, who helps direct the Instrumentation Research Laboratory's
efforts in the DENDRAL mass spectrometry area. The following items
indicate these advances.

a) Data Acquisition. Programs have been written which permit
acquisition of peak profile data at high data rates using the PDP-11 as
an intermediate data filter and buffer store between the mass
spectrometer and ACME. This allows data acquisition to proceed even
under the time constraints of the time sharing system. Storage of peak
profiles rather than all data collected has greatly reduced the storage
requirements of the program and saves time as the background data (below
threshold) are removed in real-time. An automatic thresholding program
is in operation which statistically evaluates background noise and
thresholds subsequent data accordingly. Amplifier drift can thus be
compensated. We have developed some theoretical models of the data
acquisition process which suggest that high data acquisition rates are
not necessary to maintain the integrity of the data. Proof of this
theory with actual data would greatly relieve the burden of high data
rates on the computer system, particularly as imposed by GC/MS
operation, and permit considerably more data reduction to be
accomplished in real-time. Statistical and observed models of peak
profiles have suggested certain design changes in the hardware (see
below).
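
The automatic thresholding described in item (a) can be illustrated with
a minimal sketch. It is written in Python for readability and is not the
PDP-11 program itself. The function names, the rule "mean plus k standard
deviations", and the value k = 3 are assumptions made for illustration;
the report states only that background noise is evaluated statistically
and that below-threshold data are discarded in real-time.

```python
from statistics import mean, stdev


def estimate_threshold(background, k=3.0):
    """Estimate a threshold from signal-free samples.

    Assumed rule: baseline (mean) plus k times the noise (standard
    deviation). Re-estimating it before each scan would follow slow
    amplifier drift.
    """
    return mean(background) + k * stdev(background)


def extract_peak_profiles(samples, threshold):
    """Keep only runs of consecutive samples above the threshold.

    Returns a list of (start_index, [intensities]) peak profiles;
    everything at or below the threshold is dropped.
    """
    profiles = []
    current = None
    for i, value in enumerate(samples):
        if value > threshold:
            if current is None:
                current = (i, [])      # a new peak profile begins here
            current[1].append(value)
        elif current is not None:
            profiles.append(current)   # the profile ended on this sample
            current = None
    if current is not None:            # scan ended inside a peak
        profiles.append(current)
    return profiles
```

Storing only the (start index, intensities) pairs is what reduces the
storage requirement: the long stretches of background between peaks never
reach the larger computer.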

b) Instrument Evaluation. A high resolution mass spectrometer
operating in a dynamic scanning mode is a complex beast. There are many
things that can go wrong which yield effects which may be invisible to
the operator. Furthermore, mode changes during closed loop operation
require instrument adjustments which must be computer controlled. It
is, therefore, necessary that the computer have a model of spectrometer
operation on the basis of which data quality can be assessed and
processing suitably adapted as well as instrument performance optimized.
To ensure that the instrument is operating properly and high quality
data are being gathered, we have devoted some time to development of a
program which monitors the state of the mass spectrometer. This
preliminary program checks the following items:

i) Data acquisition parameters, i.e., the threshold,
specifically determined peak width and intensity criteria, the number of
peaks and the data storage utilized.

ii) Calibration of the mass/time scale, storage of same to be
used as a model for subsequent spectra, output of mass range over which
scale is calibrated, calibration peaks missed, if any, and a graph of
extrapolation error versus mass. Any irregularities in this output
point to scan problems.

iii) The dynamic resolution versus mass is determined and
output as a graph. This allows the operator to adjust to constant
resolution over the mass range.

All output and warnings to the operator are provided on a CRT
adjacent to the mass spectrometer immediately after a scan. Although
this program works for the present time only with the calibration
compound, PFK (no additional sample), it provides a basis for a general
mechanism to monitor data quality to prevent wasting valuable samples
when the instrument malfunctions.

The program contains many interactive features which permit the
operator to examine selected features of the data at his leisure. He
may display any selected peak profiles, obtain listings of calculated
masses, plot a spectrum from the data and so forth.

In the longer term, as more quantitative experience is gained with
operating the MAT 711 in various modes and as instrument control
hardware is completed, models relating instrument parameters to control
functions and interactions will be developed. These will allow
strategies to be planned for automated mode switching and performance
optimization needed for intelligent control of data collection and
reduction processes.

c) Data Reduction. A program has been written which allows
automatic reduction of high resolution data based on the results of the
prior instrument evaluation spectrum. This program uses parameters
supplied by the operator prior to running the sample. Calibration of
the mass/time curve is effected by mapping each spectrum into the
calibration model developed previously. Separation of reference
compound peaks (PFK) from unknown sample peaks is accomplished by a
pattern recognition algorithm which compares the relationships between
sequences of reference peaks in the calibration run with the set of
possible corresponding sequences in the sample run. The candidate
sequence is selected which best approximates calibrated performance
within constraints of internally consistent scan model variations. This
approach minimizes the need for selection criteria such as greatest
negative mass defect for reference peaks, the validity of which cannot
be guaranteed. Excellent performance results from using sequences
containing 19 reference peaks.

Mass calculation is accomplished with an algorithm based on a
detailed evaluation of the behavior of the mass/time curve as a function
of mass. Determination of elemental compositions proceeds utilizing a
new, rapid and efficient algorithm developed by Prof. Lederberg. This
program has made a previously onerous task (much human intervention)
into an automatic one. This is an important step toward fully automatic
data acquisition and reduction.
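
To make concrete what "determination of elemental compositions" from an
accurate mass involves, a minimal sketch follows. It is a plain
enumeration over C, N and O counts with the hydrogen count solved
directly, and it is explicitly not Prof. Lederberg's algorithm, whose
details are not given in this report. The function name, the element
limits and the default tolerance are assumptions; the monoisotopic masses
are the standard values on the carbon-12 scale.

```python
# Monoisotopic masses (carbon-12 scale) of the most abundant isotopes.
MASSES = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def compositions(target_mass, tolerance=0.003, limits=None):
    """List (C, H, N, O) counts whose mass lies within tolerance.

    Brute-force illustration only: loops over C, N and O, then solves
    for the hydrogen count. Returns [((c, h, n, o), mass), ...] sorted
    by absolute mass error, best match first.
    """
    limits = limits or {"C": 30, "H": 60, "N": 5, "O": 10}
    upper = target_mass + tolerance
    hits = []
    for c in range(limits["C"] + 1):
        mass_c = c * MASSES["C"]
        if mass_c > upper:
            break
        for n in range(limits["N"] + 1):
            mass_cn = mass_c + n * MASSES["N"]
            if mass_cn > upper:
                break
            for o in range(limits["O"] + 1):
                mass_cno = mass_cn + o * MASSES["O"]
                if mass_cno > upper:
                    break
                # The remaining mass must be made up by hydrogens.
                h = round((target_mass - mass_cno) / MASSES["H"])
                if 0 <= h <= limits["H"]:
                    mass = mass_cno + h * MASSES["H"]
                    if abs(mass - target_mass) <= tolerance:
                        hits.append(((c, h, n, o), mass))
    return sorted(hits, key=lambda hit: abs(hit[1] - target_mass))
```

For example, calling compositions(270.16198, tolerance=0.002) ranks
C18H22O2, the composition of estrone, first. A practical program would
also apply chemical constraints (valence rules, plausible element ratios)
to prune candidates, which this sketch omits.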

Hardware.

The gas chromatograph has been successfully interfaced to the mass
spectrometer. An oscilloscope has also been incorporated with the
spectrometer to supplement the strip chart recorder, to simplify initial
adjustment of the instrument and to monitor every spectrum.

New interfaces for mass spectrometer operation and control have
been developed. They have been designed around the PDP-11 computer as
this computer represents our means of real-time interaction with the
mass spectrometer. The interfaces can handle (through an analog
multiplexer) several analog inputs and outputs which require that the
computer be relatively near the mass spectrometer. This move has
recently been accomplished, as the computer used to reside in a separate
building. We now have the capability for the following kinds of
operation through the new interfaces.

i) Computer selection of digitization rate

ii) Computer selection of data path (interrupt mode or direct
memory access (DMA))

iii) Direct memory access for faster operation in the data
acquisition mode.

iv) Computer selection of analog input and output channels.

v) Sensing of several analog channels through a multiplexer (e.g.,
ion signal, total ion current).

vi) Magnet scan control. This control can be exercised manually or
set by the computer. It controls both time of scan and flyback time.
Coupled with selection of scan rate, any desired mass range can be
scanned at any desired scan rate.

vii) The computer can monitor the mass spectrometer's mass marker
output as additional information which will be used to effect
calibration.

Another important development has been a signal conditioner for the
ion signal which incorporates a box-type integrator to sum the ion
signal between A/D converter readings. This modification should lessen
ion statistical uncertainties in intensity values and thus ultimately
improve peak position determinations in time and mass.
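
The effect of the box-type integrator can be shown with a small numerical
analogy. The real device is analog hardware ahead of the A/D converter;
the two functions below are hypothetical and only contrast reading one
instantaneous value per conversion with summing everything that arrived
since the previous conversion.

```python
def point_sample(signal, step):
    """Read one instantaneous value every `step` samples (no integrator)."""
    return signal[step - 1::step]


def boxcar_integrate(signal, step):
    """Sum the signal over each window of `step` samples.

    Mimics an integrator that accumulates the ion signal between
    successive A/D readings and is reset after each reading.
    """
    return [
        sum(signal[i:i + step])
        for i in range(0, len(signal) - step + 1, step)
    ]
```

With signal = [1, 2, 3, 4, 5, 6] and step = 3, point sampling keeps
[3, 6] while the integrator reports [6, 15]. If ion arrivals follow
counting statistics, the relative uncertainty of a reading falls roughly
as the inverse square root of the number of ions it contains, so using
all of the signal between readings gives steadier intensity values.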

Plans:

As in Part A, many of the plans are mentioned in the above Progress
sections. Again, a brief summary would include the following:

I) Continue improvement of the high resolution data acquisition
and reduction programs. Pay particular attention to increased speed and
tasks which may be carried out in real-time in the small computer,
leaving ACME for those tasks requiring large compute power.

II) Develop a data acquisition and reduction system to be used in
initial studies of the GC/MS system. Initially this system will operate
at low resolution to avoid sensitivity problems in the time constraints
imposed by GC operation. The real goal is high resolution operation of
the system as we solve sensitivity problems. Some programming and
experiments have already been done in this area.

III) Explore the GC/MS system and its interface for optimum
conditions for the urine samples and related mixtures extracted from
other bodily fluids (see Part B-ii, below).

IV) Develop additional hardware to exercise specific control
functions as necessary for on-line mode changes and instrument
performance optimization.

V) Develop better analytical models for the behavior of the mass
spectrometer to yield more accurate data (masses and intensities).

VI) Finish study of ion signal treatment and related digitization
rate requirements.

VII) Develop software communication between DENDRAL, ACME and the
PDP-11 so that ACME-generated (via DENDRAL) requests can be serviced at
the mass spectrometer and resulting data returned promptly.

Part B-ii. CHEMICAL CONSTITUENTS OF URINE.

Urine is known to contain several hundred organic compounds. The
separation (gas chromatography) and hence identification (mass
spectrometry) of these components would be an extremely difficult task.
To simplify the separation problem the urine is chemically separated
into four fractions as illustrated in the following diagram.

            URINE (pH = 1, internal standards added)
                          |
                          |  ether extraction
                          |
          +---------------+--------------+
          |                              |
     ether phase                   aqueous phase
     (free acids)                        |
          A              +---------------+--------------+
                         |               |              |
                  (carbohydrates)  (amino acids)    hydrolysis
                         C               B              |
                                               +--------+--------+
                                               |                 |
                                          ether phase      aqueous phase
                                      (hydrolysed acids)   (amino acids)
                                               D                 E

The experimental procedure used for working with a urine sample is
as follows. To an aliquot (25 ml.) of a 2 hour urine sample is added
4N hydrochloric acid until the pH is 1. Two internal standards,
n-eicosane and ?-amino octanoic acid are then added. Ether extraction
isolates the free acids (fraction A) which are then methylated and
analysed by gas chromatography-mass spectrometry. An aliquot of the
aqueous phase (2 ml) is concentrated to dryness, reacted with
n-butanol/hydrochloric acid followed by methylene chloride containing
trifluoroacetic anhydride. This procedure derivatizes any amino acids
(or water soluble amines) which are then subjected to GC/MS analysis
(fraction B).

If desired another 2 ml aliquot of the aqueous phase can be
derivatized for the detection of carbohydrates (Fraction C). Our
experience has been that this fraction generally contains few components
and it can be eliminated without detriment to the overall urine
analysis.

Concentrated hydrochloric acid (1.25 ml) is added to the urine
(12.5 ml) after ether extraction and the mixture hydrolysed for & hours
under reflux. Ether extraction affords the hydrolysed acid fraction (D)
which is then methylated and analysed by GC/MS. A portion of the
aqueous phase (2 ml) from hydrolysis of the urine is concentrated to
dryness and derivatized and analysed for amino acids (Fraction E) as
described under step B.

Urinary output from any individual will vary to some extent with
diet. In order to suppress the problem of dietary variation it was
decided to monitor the urine of premature infants in the Stanford
Nursery of the Pediatrics Department. These infants are sustained on a
carefully regulated diet and their hospital confinement is usually of
the order of one month such that their urinary excretion could be
investigated as a function of time.

Preliminary studies on approximately 20 urine samples from
premature infants provided the experience necessary for a selection of
the best operational techniques for chromatographic separation. This
work has been carried out in the Department of Genetics where a suitable
gas chromatograph and mass spectrometer were available. The mass
spectrometer (Finnigan Quadrupole, model 1015) used to date in this
investigation is interfaced for data acquisition to the ACME computer
system. During the gas chromatography-mass spectrometric analysis of a
urine fraction over six hundred mass spectra are recorded in 45 minutes.
A data system is mandatory to handle this avalanche of data and until
one is functioning on the Varian-MAT 711 mass spectrometer we anticipate
using the quadrupole instrument for the routine analysis of urine.

In the preliminary study of 20 urine samples from premature babies
the only abnormal metabolite observed was p-hydroxyphenyl lactic acid
which occurred in three of the samples. This compound's presence
reflects the known ability of some premature infants to metabolize
p-hydroxyphenyl pyruvic acid to the corresponding lactic acid. In all
cases we observed the excretion of p-hydroxyphenyl lactic acid return to
normal levels after several days presumably as particular enzyme
functions became operative in the child.

Following these preliminary studies a joint program was formalized
between the Departments of Genetics and Pediatrics to investigate late
metabolic acidosis of the premature. A copy of the protocol to be used
in this investigation is attached to this report.

At this time several urine samples from premature infants have been
investigated but only one child was acidotic when the urine sample was
collected. This urine sample was definitely abnormal and it appeared to
contain large quantities of p-hydroxy mandelic acid and p-hydroxyphenyl
lactic acid. These abnormal metabolites were present in each of three
daily samples of urine submitted to GC/MS analysis. It is interesting
that the occurrence of p-hydroxyphenyl lactic and p-hydroxy mandelic
acids in urine has been associated with abnormally high tyrosine levels
while in our case tyrosine is present in normal concentrations.

The investigation of acidotic premature infants, although just
commencing, shows promise that any organic acids causing acidosis will
be identified by our analytical techniques.

In addition to these clinical aspects described above, work is
continuing on the computer analysis of the mass spectra generated from
urine specimens. Work has progressed on the construction of library
lookup routines operating on data tapes obtained from Dr. Egil Jellum,
Oslo, Norway, a former collaborator in our laboratory.

Part C. EXTENDING THE THEORY OF MASS SPECTROMETRY BY A COMPUTER

Objectives:

Theory formation in science is both an intriguing problem for
artificial intelligence research and a problem area in which scientists
can benefit greatly from any help the computer can give. While the
ill-structured nature of the theory formation problem makes it more a
research task than an application, we hope to provide computer programs
which are of some practical help to the theory-forming scientist.

Mass spectrometry is the task domain for the theory formation
program, called Meta-DENDRAL, as it is for the Heuristic DENDRAL
program. It is a natural choice for us because we have developed a
large number of computer programs for manipulating molecular structures
and mass spectra in the course of Heuristic DENDRAL research and because
of the interest in mass spectrometry among collaborative researchers
already associated with the project. This is also a good task area
because it is difficult, but not impossible, for human scientists to
develop fragmentation rules to explain the mass spectrometric behavior
of a class of molecules. Mass spectrometry has not been formalized to
any great degree, and there remain gaps in the theory, but discovering
new explanatory rules and systematizing them is taking place throughout
the country, albeit slowly.

We have described the design and partial implementation of the
Meta-DENDRAL program in a paper presented at the 7th Machine
Intelligence Workshop (Edinburgh, Scotland, June, 1972). A copy of that
paper is attached and should be consulted for details. It will be
published in the proceedings of the conference (Machine Intelligence 7,
B. Meltzer & D. Michie, eds., in press).

Our objective is to explore the theory formation problem for mass
spectrometry within the context of AI research. As mentioned earlier we
hope to produce intermediate programs which will aid chemists in
formulating new pieces of theory as well. The following subgoals have
guided our research along one dimension, although we have often been
forced to consider other dimensions of the problem. The discussions of
progress and future work are structured around these subgoals.

(1) Collect a suitable set of known mass spectra together with
representations of the molecular structures from which the spectra were
derived.

(2) Summarize and interpret the data with respect to possible
explanations of the individual data points. This re-representation of
the data is a critical step in extracting explanatory rules, for the
data points are, for the first time, associated with possible
mechanistic origins ("causes").

(3) Peruse the summary to make plans for intelligent rule
formation. Any of the possible mechanisms described in the
summary-interpretation phase could be incorporated in a rule of mass
spectrometry. But planning will allow the rule formation program to
start with explanatory rules which are likely to make good reference
points for the whole rule formation process.

(4) Incorporate the possible mechanisms into general rules (rule
formation). By bringing more and more of the descriptive mechanisms
under rules, the rule formation program explains more and more of the
original data points. This is difficult for many reasons, however. For
instance, the rules must be general enough to avoid writing a new rule
for each data point. Yet there are numerous ways of generalizing rules,
with few prospective guidelines to focus attention on the elegant
generalizations which explain many data points simply. Various
alternatives for rule formation, which we are exploring, are described
in the progress section.

(5) Evaluate the rules to decide retrospectively whether each
proposed rule is worth keeping or not. If so, it may be further
modified in light of more data. If not, it will be discarded in favor
of rules which are simpler, explain more data, or are otherwise better
suited for incorporation into the emerging theory.

(6) Codify the rules into a theory. Although a set of
phenomenological rules can predict the mass spectral behavior of the
class of molecules, further codification is needed to increase the
explanatory power of the rules. This may mean something as "simple" as
collapsing rules or subsuming rules under one another. Or, at a deeper
level, it may mean finding relationships and principles which explain
why the phenomenological rules are good predictors.

(7) Finally, it will be necessary to compare alternative theories
(at whatever level) that come out of the program in order to choose the
best one. Part of this research means experimenting with different
criteria of "best" theory. Although the philosophical literature is
full of suggested criteria, no one has ever tried to make them precise
enough for use in a program.

Progress:

Meta-DENDRAL has progressed in the last year within several of the
problem areas mentioned above. The attached paper (MI 7) describes much
of our progress in mapping out a detailed strategy for attacking the
problem. In addition, we have explored many issues related to
alternative design or implementation strategies. The unedited notes of our
frequent group meetings are attached to show the issues discussed and
some of the direction of our experimentation.

(1) Collection of mass spectrometry data was no problem because of
the files kept for the Heuristic DENDRAL program and the availability of
the mass spectrometer. Deciding which set of data to explore, however,
was more difficult.

We had initially hoped to do theory formation for a large heterogeneous
class of molecules in order to test the ability of the program to
separate classes of molecules with dissimilar mass spectrometric
behavior and group the similar classes of molecules. We had initially
started working with the collection of saturated aliphatic
monofunctional compounds and their mass spectra, already collected for
previous Heuristic DENDRAL work. Later it was decided that we could
make a more direct assault on the theory formation problem by choosing a
set of homogeneous compounds whose mass spectrometry was already well
characterized. It was hoped that we could formulate rules which
corresponded closely with the known characterizations after examining
only a small number of compounds and their spectra (tens of compounds,
not thousands). The class of molecules chosen was the class of
estrogenic steroids. This was an especially good choice because (a) the
estrogens have been studied extensively - and thus there are known rules
with which to compare the program's "discovered" rules - and (b) the
estrogens, partly because of their biological interest, are not well
enough characterized - thus the intermediate results of the program's
analysis of estrogen mass spectra are interesting and immediately useful
to science.

(2) The computer program for data interpretation and summary has
been well developed. While it is never safe to call a program
"finished", this program has reached the stage where we have turned it
over to the chemists who want to look at explanatory mechanisms for the
mass spectra of many compounds. Ordinarily, this is such a tedious task
that chemists are forced to limit their analysis to a very few
mechanisms of interest. The computer program, on the other hand,
systematically explores the space of possible mechanisms and collects
evidence for each.

This program is described in the Machine Intelligence 7 paper, and the
results obtained by running it with many estrogen spectra are discussed
in a manuscript to be submitted. Mr. William C. White has been largely
responsible for coding the program in LISP. The program runs in the
overnight LISP system at the Medical School's ACME facility. It is
currently being used by Dr. Steen Hammerum, a post-doctoral fellow in
chemistry from the University of Copenhagen, to summarize the
fragmentations found in the spectra of alkaloids.

As always, we have modified the program many times after it produced its
initial results in order to add new items of information to the summary
or to reformat the summary - both aimed at making the program a more
useful tool for chemists instead of just a computer science research
tool. In a sense this is a diversion. But we feel it is important in
interdisciplinary research to satisfy many goals (within the project) to
maintain the high motivation and cooperative spirit which have
characterized this project from the start.

(3) Planning before rule formation is necessary because there is
so much information in the summary of possible fragmentations found in
the data. It is desirable to collect all the information to avoid
missing unanticipated mechanisms which occur frequently throughout the
compounds in the data. But even the summary of the mechanisms is
voluminous enough to obscure the "obvious" rules just waiting to be
found.

In a planning program currently being implemented by Mr. Steven Reiss,
the computer peruses the summary looking for mechanisms with "strong
enough" evidence to call them first-order rules of mass spectrometry.
Our criteria for strong evidence may well change as we gain more
experience. For the moment, the program looks for mechanisms which (a)
appear in almost all the compounds (80%) and (b) have no viable
alternatives (where viable alternatives are those alternative
explanations which are frequently occurring and cannot be
disambiguated).
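
The two selection criteria just stated can be written down as a short
sketch. It is in Python rather than the project's LISP, and the data
layout is hypothetical: each mechanism in the summary is assumed to carry
the set of compounds in which it was found and the names of competing
mechanisms it cannot be disambiguated from. The 80% figure comes from the
text; the 50% cut-off for calling an alternative "frequently occurring"
is an assumption, since no value is given.

```python
def first_order_rules(summary, n_compounds,
                      min_fraction=0.8, viable_fraction=0.5):
    """Select mechanisms with strong, unambiguous evidence.

    summary maps a mechanism name to a dict with
      "compounds":    set of compound ids showing the mechanism
      "alternatives": set of mechanism names explaining the same
                      peaks that cannot be disambiguated from it
    A mechanism is kept if (a) it appears in at least min_fraction of
    the compounds and (b) none of its alternatives is itself frequent
    (appearing in at least viable_fraction of the compounds).
    """
    def frequency(name):
        return len(summary[name]["compounds"]) / n_compounds

    selected = []
    for name, entry in summary.items():
        if frequency(name) < min_fraction:
            continue                      # fails criterion (a)
        viable = [alt for alt in entry["alternatives"]
                  if alt in summary and frequency(alt) >= viable_fraction]
        if not viable:                    # passes criterion (b)
            selected.append(name)
    return sorted(selected)
```

An empty result corresponds to the case discussed next: no reliable
rules can be found, which suggests splitting the compounds into more
homogeneous subgroups before going further.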

The program will be made much more sophisticated as we gain more
experience with it. Even the output of this crude program, however, is
useful to humans who first want to see the highly reliable, unambiguous
rules which can be formulated. If there are none, of course, there is
little point in pressing ahead blindly. This is an indication that some
modifications need to be made, for example, splitting up the original
set of compounds into more homogeneous subgroups. On the other hand, if
some likely rules can be found, these will serve as "anchor points" for
disambiguation of other sets of mechanisms and also serve as a "core" of
rules to be extended and modified in the course of detailed rule
formation.

(4) The process of rule formation is the most difficult to define
precisely. We have explored various strategies which are described
briefly below and discussed in the attached notes of meetings. Although
we have in hand programs which formulate rules from the summary data, we
are not completely satisfied with any of them. Thus, much work remains
to be done on rule formation.

The following outline, written by Dr. Sridharan and taken from our
internal working notes, encapsulates the dimensions of the rule
formation problem we have considered and some of our explorations within
those dimensions. Not all of the items presented there have been
explored by writing computer programs, although we intend to do much of
this in the future. Part I of this encapsulation presents two ways of
characterizing theories. The formal representation mentioned in I-A was
developed in the Machine Intelligence 7 paper. The less formal
characterization of I-B is the subject of much of the philosophy of
science literature which we are researching.

Rule Formation Work in Meta-DENDRAL

I. Theory Representation and Formalization of Theory Formation Task

   A. Formal Representation
      i) Kinds of theory classes
         Action based, Partial, 0-1 theories
      ii) Set theoretic framework and theory definition using
         Generalized Cover Theory
      iii) Definition of spaces: of theories, of rules, of
         situations, of actions

   B. Characterization of Theories
      i) How much prior chemistry assumed.
      ii) How much ms theory assumed/Consistency
      iii) Internal consistency
      iv) Simplicity/complexity
      v) Testability/falsifiability
      vi) Performance with respect to data, predictive performance
      vii) Predictive scope, Generality
      viii) Explanatory power
      ix) Projectability
      x) Degree of instantiation
      xi) Ambiguity
      xii) Efficiency

II. Exploration of Methodology and Paradigms

   A. Model Building
      i) Statistical analyses
      ii) Discrete, charge localized model
      iii) Fluid flow class of models
      iv) Quantum Mechanical model

   B. Deriving S-A Rules
      i) Derive S-A rules from model and data
      ii) Derive S-A rules from summarization of data
         a) Constructive method
            Generalization, Specialization, Validation,
            Evaluation and Codification
         b) Generative method
            Generation, Validation and Heuristic guidance

III. Confrontation with the Realities of Data

   A. Large volumes of data
   B. Richness or high information density in data
   C. Ambiguity
   D. Limitation to the significance of data
      a) Recording resolutions
      b) Reproducibility limits
   E. Need to watch for errors and mistakes in data, besides
      the need to manage data in the presence of such errors

Part II of the outline of Meta-DENDRAL work points to numerous
places in the discussion notes concerning questions of the level of
theory to be built and the program strategies to be used. We have
concentrated on level II-A-ii - a more or less descriptive model of
mass spectrometry written in terms of discrete atoms, bonds, and
electronic charge. The programs already written, with one exception,
use this model. The exception is the statistical programming work by
Professor Ed Blaisdell, a visitor to Stanford last summer from the
chemistry department of Juniata College (Huntingdon, Pennsylvania). The
programs he developed attempted to derive a regression model from
statistical analysis of the data in order to predict the strength of
processes as a function of properties of the molecule. Items iii and iv
of II-A are models of mass spectrometry which computer programs could
conceivably work in. But our discussions, as yet, have not led to
actual programs which will allow us to try out our ideas with some
precision.
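
To illustrate the kind of regression described above, the following is a
minimal sketch only, not a reconstruction of Professor Blaisdell's
programs. It fits a one-variable least-squares line that predicts the
strength of a process from a single molecular property. The property
(a substituent count), the data values, and the names fit_line and
predict are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the property does not vary; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predicted process strength for property value x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: substituent count at a bond versus the observed
# strength of the corresponding fragmentation process. Invented values.
substituents = [0, 1, 2, 3]
strengths = [1.0, 3.0, 5.0, 7.0]

model = fit_line(substituents, strengths)
predicted = predict(model, 4)
```

A realistic version would use several properties of the molecule at once
and would have to cope with the data limitations listed in Part III of
the outline.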

The strategies mentioned in Part II-B all fit within Artificial
Intelligence paradigms, but so far we have little guidance on how to
choose a good strategy. Part II-B-i refers to a Gelernter-like strategy
of problem solving in which, in our case, a rough model of mass
spectrometry in the program serves as a reference for checking the
plausibility of proposed additions to the theory being built, say by
statistical analysis. The so-called constructive model (II-B-ii-a) of
the rule formation process is the one the programs have been working
with mostly. It is the one described at the beginning of this section
as the method we are following. While this is true, we do not wish to
exclude the other methods from consideration until some detailed
experiments have been performed. The generative method (II-B-ii-b) is
the closest to the well-known heuristic search paradigm of Artificial
Intelligence programs. Mr. Carl Farrell is pursuing this approach in
his Ph.D. dissertation (directed by E. A. Feigenbaum and B. G.
Buchanan). Outlines of his dissertation and computational procedure are
attached to this report for reference.
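
The three parts of the generative method (generation, validation, and
heuristic guidance) can be sketched as a best-first search over
candidate rules. This is an illustrative sketch under stated
assumptions, not Mr. Farrell's procedure: a rule is taken to be a set of
bond features, an observation is a pair (features of a bond, whether the
bond cleaved), and the feature names, the scoring function, and the
expansion budget are all hypothetical.

```python
import heapq
from itertools import count


def score(rule, observations):
    """Validation: cleaved bonds the rule covers minus uncleaved ones."""
    hits = sum(1 for feats, cleaved in observations
               if rule <= feats and cleaved)
    misses = sum(1 for feats, cleaved in observations
                 if rule <= feats and not cleaved)
    return hits - misses


def generate_rules(observations, max_expansions=50):
    """Best-first search from the most general rule (no features)."""
    vocabulary = sorted(set().union(*(feats for feats, _ in observations)))
    start = frozenset()
    tie = count()  # tie-breaker so the heap never compares frozensets
    frontier = [(-score(start, observations), next(tie), start)]
    seen = {start}
    best_rule, best_score = start, score(start, observations)
    for _step in range(max_expansions):
        if not frontier:
            break
        # Heuristic guidance: expand the best-scoring candidate first.
        neg_score, _order, rule = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_rule, best_score = rule, -neg_score
        # Generation: specialize the rule by adding one more feature.
        for feature in vocabulary:
            child = rule | {feature}
            if child not in seen:
                seen.add(child)
                heapq.heappush(
                    frontier,
                    (-score(child, observations), next(tie), child))
    return best_rule, best_score


# Hypothetical observations; feature names are invented for illustration.
observations = [
    (frozenset({"alpha_to_carbonyl", "ring_bond"}), True),
    (frozenset({"alpha_to_carbonyl"}), True),
    (frozenset({"ring_bond"}), False),
    (frozenset({"aromatic"}), False),
]

best_rule, best_score = generate_rules(observations)
```

On this toy data the search settles on the single-feature rule
"alpha_to_carbonyl", which covers both cleaved bonds and neither
uncleaved one.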

The last section of the outline (III) covers a large part of the
discussions in our meetings this year. Because we are working with
real, and not ideal, experimental data, our rule formation problem is
much more complex than, say, grammatical inference problems as
currently formulated. Working in an idealized task domain could remove
these difficulties, but we feel we would thereby lose much of the
fascinating complexity of this problem.

(5-7) Many discussions have taken place on the topics of rule
evaluation, codification of rules into theories, and theory evaluation.
However, we have considered it premature to begin writing computer
programs for these tasks until the rule formation problem itself is on
firmer ground.

Plans:
Our plans for the coming year are to focus on specific gaps and
problems in the design and implementation of the theory formation
research now in progress. In particular, we will continue working with
the mass spectra of estrogens, concentrating especially on the rule
formation subtask described above.
We expect the programs to contribute to the formulation of new
theory by humans for specific classes of molecules. At the same time,
we expect to capture in the program more of the judgmental elements of
rule formation.