RR-00612 Annual Report Sec. 5

5.2 Remote Users of SUMEX

 

Because the SUMEX computer is available via both the TYMNET
and ARPANET communication networks, scientists in many parts of
the world can directly access the Dendral programs on SUMEX.
Primary usage is centered on CONGEN, although INTSUM is also
beginning to gain a following. Although access points to SUMEX
are widespread, they are frequently not diverse enough to
accommodate the dispersed group of scientists who have expressed
an interest in using one of the Dendral programs. For example,
Dr. Joseph Baker of the Roche Institute of Marine Pharmacology in
Dee Why, Australia, is looking at the possibility of accessing
SUMEX by using International Direct Distance Dialing (IDDD).

5.3 Chemists Communicating by Mail

Many scientists interested in using DENDRAL programs in
their own work are not located near a network access point.
Users of this type choose to use the mail to send details of
their structure elucidation problem to a Dendral Project
collaborator at Stanford.

5.4 Chemical Problems Posed to CONGEN

Following is a list of CONGEN users, and a brief summary of
their program interests during the past year.

1. Dr. Roger Hahn, Syracuse University. While at
Stanford he used CONGEN to help solve the structures
of photoproducts by obtaining all possibilities under
available constraints and designing NMR experiments
to differentiate the possibilities. This work will be
published soon.

2. Dr. William Epstein, University of Utah. During a
demonstration of CONGEN, he posed a problem to verify
that the structural possibilities he determined for
an unknown were in fact all possibilities. The
structure of methyl santolinate has been published
(see Epstein, et al., J.C.S. Chem. Commun., 590
(1975)).

3. Dr. Clair Cheer, University of Rhode Island. While
on sabbatical at Stanford, Dr. Cheer has worked on a
number of structure elucidation problems using CONGEN
including Briareine D and [+]-Palustrol (Cheer et
al., Tetrahedron Letters, 1807 (1976)). Work is
continuing on the structure of another marine natural
product, presumably a cembrenolide, for which there
are currently seven possibilities.

4. Dr. Jerrold Karliner, Ciba-Geigy Corporation. Dr.
Karliner has solved several structural problems using
CONGEN, including materials with flame retardant
properties, an impurity in a production sample and
nitrogen heterocycles being investigated for
pharmacological activity. CONGEN enabled reduction
of the number of possibilities to the point where
subsequent experiments led to unambiguous structural
assignment.

5. Dr. Gino Marco, Ciba-Geigy Corporation. He has used
CONGEN to help solve structures of conjugates of
pesticides with sugars and amino acids.

6. Dr. Milton Levenberg, Abbott Laboratories. He has
worked on the structure of a compound with mild
antibiotic activity, isolated from a fermentation
broth. There are currently ten structural
possibilities, reduced to that number from the 33
initially determined using CONGEN by additional
experimental data.

7. Dr. David Pensak, DuPont. He is currently learning
to use CONGEN and plans to evaluate its utility for
structural problems of some of his coworkers.

8. Dr. Douglas Dorman, Eli-Lilly. He is using CONGEN to
assist in structure elucidation of metabolites of
microorganisms shown to have pharmacological
activity. He has worked on five such problems,
including a current one where the developing MSPRUNE
capabilities are being used.

9. Dr. L. Minale, Napoli, Italy. We have worked with
him by sending him structural alternatives for
proposed structures for some marine natural products
(Pallescensins, Tetrahedron Letters, 1417 (1975)) and
cyclic diethers from the lipid fraction of a
thermophilic bacterium (J.C.S. Chem. Commun., 543
(1974)).

10. Dr. K. Nakanishi, Columbia University. We have
worked with him by sending him structural
possibilities for termite defense compounds
(structure finally solved by X-ray crystallography).
This trial plus a live demonstration to one of his
students has resulted in efforts toward continued
collaboration on other insect defense secretions and
exploration of the possibility of his direct access
to SUMEX.

11. Dr. L. Dunham, Zoecon Corporation. We have
collaborated with him on the use of INTSUM for mass
spectral fragmentation studies of insect juvenile
hormones.

12. Dr. A. G. Gonzales, Tenerife, Spain. We have
recently sent him structural alternatives for
constituents of Laurencia Perforata (Tetrahedron
Letters, 2499 (1975)), and expect to continue
discussions on the structures of these compounds.

13. Dr. T. Irie, Sapporo, Japan. We have recently sent
him structural alternatives to published structures
on constituents of Laurencia Glandulifera
(Tetrahedron Letters, 821 (1974)) and expect to
continue discussions on this problem.

14. Dr. C. J. Persoons, Delft. We have corresponded
with him on structural alternatives for cockroach sex
pheromones (Periplanone-B, Tetrahedron Letters, 2055
(1976)), and he has agreed to further collaboration
on new problems.

15. Dr. F. Schmitz, University of Oklahoma. We
explored for him structural alternatives for an
unknown diterpenoid hydrocarbon. We obtained 25
possibilities, of which only four obeyed the isoprene
rule.

16. Dr. J. Baker, Roche Institute of Marine
Pharmacology, Australia. We plan collaboration with
Dr. Baker on the sterol fractions of various marine
organisms and are exploring ways for him to access
CONGEN.

17. Dr. E. E. VanTamelen, Stanford University. We have
used the developing reaction features of CONGEN to
explore structural possibilities for both chemical
and biogenetic cyclization products of squalene-oxide
congeners. We have suggested alternatives to
proposed structures and helped to design experiments
to differentiate them.

18. Dr. J. C. Braekman, Brussels. Dr. Braekman visited
Stanford as a part of continuing collaboration in
marine chemistry with Dr. Tursch’s group. While at
Stanford he explored the use of CONGEN on current
problems in marine natural products, and worked on
the problems of Drs. Irie and Gonzales (see above).
He is currently exploring access to CONGEN from
Brussels, via TYMNET.

Some problems have arisen as a result of the Dendral
commitment to working with outside chemist users. The primary
difficulty is that the Dendral project, as one of the many
projects which use the SUMEX facility, is allocated a certain
portion of system resources. Therefore, support of an extensive
body of outside users means that resources to support these users
must be diverted from the research goals of the project.

In encouraging new users, Dendral must be careful to state
that access to Dendral programs might have to be restricted in
the future if system loading becomes extensive. Understandably
then, some scientists are reluctant to invest time in learning to
use a complicated, although potentially useful, program which they
may well only be able to use on a temporary basis. One solution
to this problem is to make the available programs as efficient as
possible, and/or to make it possible to distribute copies of the
program to other sites.

Use of CONGEN by working scientists has turned up one major
area in which additional information to the user was thought to
be necessary. CONGEN users unanimously indicated their desire
for a method of determining what percentage of the whole problem
was solved at any moment, i.e., what fraction of the total number
of possible structures is represented by the number already
generated. In a prototype system we have implemented the Cntrl-I
and Cntrl-S user information interrupts to show how far CONGEN
has progressed. If, for example, someone who has generated 357
structures is told that this represents 1 percent of the total
possible structures, they immediately know that they do not want
to finish generating all the structures. Even if there were
enough space, 40,000 structures would be far more than they would
want to see.
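
The arithmetic behind such a progress report is simple. The following is a minimal Python sketch, assuming only that the program can report the number of structures generated so far and a whole-number percentage of the problem completed. The function name estimate_total is hypothetical and is not part of CONGEN.

```python
def estimate_total(generated, percent_done):
    """Project the total number of structures from partial progress.

    generated    -- structures produced so far
    percent_done -- whole-number percentage of the problem completed
    """
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return generated * 100 // percent_done


# The example from the text: 357 structures at 1 percent complete.
print(estimate_total(357, 1))  # 35700
```

The projected total of roughly 36,000 is of the same order as the 40,000 structures quoted in the text, presumably because the reported 1 percent is a rounded figure.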

We implemented another user-oriented facility for an
invited paper presented at the 172nd American Chemical Society
meeting, in August of 1976. Special features were added for
character-oriented, screen-addressable CRT terminals to give
users an informative visual interface to CONGEN, an otherwise
complex program. The dynamic field of view provided by this type of
terminal was used to advantage to give the chemist-user a
continuous, graphic summary of both the information he has
supplied to the program and the dynamic use of that information
by the program.

6 Stereochemistry in CONGEN

We have started the complex task of giving CONGEN the
capability of recognizing stereochemical features of molecules
and using stereochemical information in structure determination.
The ability to recognize stereochemical features would allow, for
example, the generation of all stereoisomers of a given
topological structure with or without constraints. The ability
to use stereochemical information would allow the determination
of constraints on stereoisomer (and topological isomer)
generation caused by, for example, partial knowledge of relative
or absolute stereochemistry of structural fragments, knowledge of
overall molecular chirality (or lack of), absolute and relative
stereochemistry from circular dichroism measurements, and so
forth. Thus far, only the topological information (constitution)
has been recognized and used by CONGEN.

The first stage of this development is to produce a program
which generates all the stereoisomers of a given topological
structure. This program will be placed at the end of the
existing CONGEN program. The present report describes the
development of the theory and algorithm for stereoisomer
generation and the progress on the programming of this algorithm.

6.1 Algorithm

The carbon stereoisomers of a given topological structure
are in correspondence with the double cosets:

TSG[A4] / TSG[S4] / CSG
in which:

1. TSG[A4] is the wreath product of the Topological
Symmetry Group and the alternating group A4. This group
expresses the invariance of a carbon stereoisomer to all even
permutations of the ligands connected to any carbon stereocenter.

2. TSG[S4] is the wreath product of TSG and the symmetric
group S4. This group expresses the invariance of the
connectivity of a topological structure to all permutations of
ligands connected to any carbon center.

3. CSG is the Configurational Symmetry Group and is
isomorphic to the TSG represented on the two-valued
configurations of the carbon stereocenters.

The cosets of TSG[A4] in TSG[S4] correspond to the 2^m
maximum possible stereoisomers, where m is the number of carbon
stereocenters. The effect of the group CSG on these cosets is to
collect the possible stereoisomers into equivalence classes of
distinct stereoisomers. Intuitively this corresponds to the mental
process of considering all possible stereoisomers of a
topological structure and collecting those equivalent by
symmetry.

The algorithm to generate stereoisomers from a CONGEN
topological structure must perform three transformations:

1. The connection table (CT) corresponding to the CONGEN
topological structure must be modified to include only those
carbon centers which need be considered as stereocenters. That
is, methylenes, methyls, carbons with gem-dimethyls, etc., do not
exhibit configurational stereochemistry. A prefilter must act on
the CT and return a Stereocenter Connection Table (SCT).

2. The TSG which comes from CONGEN must be modified to
give the CSG described above.

3. Given the SCT and CSG, the possible distinct
stereoisomers must be generated. This involves an implementation
of the theory presented in the previous paragraphs. Further
details of this algorithm are given in the next section.

6.2 Programming Progress

All programming is being done in the SAIL language.

1. The development of a program to perform the prefilter
function on the connection table is currently in progress. The
CT will first be scanned to eliminate methylenes and methyls and
then iteratively scanned to find identical achiral substituents
on common carbons (gem-dimethyl, gem-diethyl, etc.).

2. A program to obtain the configurational symmetry group
(CSG) from the topological symmetry group (TSG) has been written.
The elements of TSG are allowed to act on the connection table
and the parity of the permutations on each stereocenter is
determined. The permutation with these parity designations is
the desired element of CSG.

3. A program which, when given the SCT and CSG, will
generate all distinct stereoisomers has been written. Special
use is made of the fact that all elements of the CSG will be
hyperoctahedral group elements. That is, CSG will be a subgroup
of the wreath product Sn[S2], called the hyperoctahedral group,
where n is the number of stereocenters. The order 2 group, S2,
is represented by the two-valued configuration of each carbon
stereocenter. This two-valued nature of each stereocenter’s
configuration is easily represented by a single two-valued bit
which makes a very compact machine representation. The program
has the capability of representing the hyperoctahedral group by
bit permutations and reversals. This will accommodate any
conceivable symmetry and any stereochemistry resulting from
carbon (or analogous element) configurations.
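
The step that turns a TSG element into a CSG element (item 2 above) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAIL program itself: the connection table layout (each stereocenter mapped to its four ligands in a fixed reference order), the atom labels, and the function names are all hypothetical. For each stereocenter the sketch records where the center is sent and whether the induced permutation of its ligands is odd, in which case the configuration bit is reversed.

```python
def permutation_parity(indices):
    """Return 0 for an even permutation of 0..k-1, 1 for an odd one."""
    inversions = sum(
        1
        for i in range(len(indices))
        for j in range(i + 1, len(indices))
        if indices[i] > indices[j]
    )
    return inversions % 2


def signed_permutation(ct, automorphism):
    """Represent one topological symmetry on the stereocenters.

    ct           -- dict: stereocenter -> tuple of its four ligands in a
                    fixed reference order (hypothetical SCT layout)
    automorphism -- dict: atom -> image atom, covering centers and ligands
    Returns dict: center -> (image center, reversal bit).
    """
    result = {}
    for center, ligands in ct.items():
        image = automorphism[center]
        target_order = ct[image]
        # Position, in the image center's reference order, of each moved ligand.
        indices = [target_order.index(automorphism[lig]) for lig in ligands]
        result[center] = (image, permutation_parity(indices))
    return result


# Two equivalent centers joined to each other, each with ligands H, O, M.
ct = {
    "C1": ("H1", "O1", "M1", "C2"),
    "C2": ("H2", "O2", "M2", "C1"),
}
swap = {"C1": "C2", "C2": "C1", "H1": "H2", "H2": "H1",
        "O1": "O2", "O2": "O1", "M1": "M2", "M2": "M1"}
print(signed_permutation(ct, swap))  # {'C1': ('C2', 0), 'C2': ('C1', 0)}
```

In this example the symmetry exchanges the two centers without reversing either configuration bit, so the configurations 01 and 10 fall into one equivalence class while 00 and 11 remain distinct, giving three stereoisomers.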

As an example, consider the problem of the number of
stereoisomers of inositol, (CH(OH))6. The CSG can be obtained
from the TSG as described above and when input with the
stereochemical connection table to this segment of the program,
the desired 9 isomers are found and output as canonical
structures based on the original atom numbering. (This will
probably not be the final choice for a canonical
stereostructure.) The interfacing of these segments of the
stereoisomer generator and the interfacing with the existing
CONGEN program is also in progress.
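
The inositol count can be reproduced with a short Python sketch that uses the bit representation described above. It is an illustration under stated assumptions rather than the report's program: each ring carbon carries one bit (hydroxyl above or below the ring), rotations of the ring permute the bits, and turning the ring over reverses the ring order and reverses every bit. The function names are hypothetical, and the canonical form chosen here (the smallest bit pattern in each equivalence class) is only one possible convention.

```python
def ring_symmetry_elements(n):
    """Signed permutations for an n-membered ring with n stereocenters.

    Each element is (perm, flip_mask): center i moves to perm[i], and the
    bits set in flip_mask are reversed afterwards.
    """
    all_bits = (1 << n) - 1
    elements = []
    for k in range(n):
        # Rotation about the ring axis: positions shift, no reversal.
        elements.append(([(i + k) % n for i in range(n)], 0))
        # Turning the ring over: ring order is reversed and every
        # up/down bit is reversed.
        elements.append(([(k - i) % n for i in range(n)], all_bits))
    return elements


def apply_element(element, config, n):
    """Apply one signed permutation to a configuration held as a bit pattern."""
    perm, flip_mask = element
    moved = 0
    for i in range(n):
        if (config >> i) & 1:
            moved |= 1 << perm[i]
    return moved ^ flip_mask


def distinct_stereoisomers(n):
    """One canonical bit pattern per symmetry-equivalence class."""
    elements = ring_symmetry_elements(n)
    canonical = {
        min(apply_element(e, config, n) for e in elements)
        for config in range(1 << n)
    }
    return sorted(canonical)


isomers = distinct_stereoisomers(6)
print(len(isomers))  # 9
```

Of the 64 possible bit patterns for six stereocenters, the twelve symmetry elements leave nine equivalence classes, in agreement with the nine isomers quoted above.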

7 The GC/HRMS DATA SYSTEM

7.1 Improvements to the Data System

The introduction of the gas chromatograph (GC) into the
high resolution mass spectrometry (HRMS) system produced a number
of problems in data reduction that are not present without the
GC. The primary problem is the increase in the number of mass
peaks in a spectrum from the column bleed of the GC. This makes
the problem of separating calibration and reference peaks from
the true sample peaks a much more difficult problem. A number of
improvements have been introduced to the software to solve this
problem.

The instrument is calibrated by injecting a sample of
perfluorokerosene (PFK) and running REFRUN. This collects a
spectrum which can be calibrated by looking for various
characteristic peaks in the spectrum. The masses of certain
peaks are stored on a file. Once these calibration peaks have
been identified, the masses can be used to interpolate and find
the mass of all other peaks in the spectrum. The results of a
satisfactory reference run are stored on a file, as well as being
listed on a line printer.

The spectrum of the sample is taken by running SAMRUN, which
collects a spectrum of the sample and PFK. The main problem now
is finding the peaks from PFK, and using them to calculate the
masses of the peaks from the sample. The first ten calibration
peaks are located by applying a template, or pattern matching
algorithm to the data. This template assumes that
characteristics of the mass spec will change only systematically
with time. This has proven to be a very successful and sensitive
method of locating calibration peaks. Once the initial ten peaks
are located, the program scans the data by taking four
calibration peaks and, using a model of the scan, projecting for
a fifth. Once this is located the masses of the peaks in between
the calibration peaks are interpolated, and a decision is made on
whether a given peak was in the reference run, or is truly a
sample peak. The four calibration masses are shifted so that the
calibration peak just projected becomes one of the four, and the
process is repeated until masses have been assigned to all of the
peaks.
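
The projection and interpolation steps can be illustrated with a short Python sketch. It assumes an exponential scan law (the report describes the scan as an exponential downscan), so that the logarithm of mass is linear in time. The function names, the least-squares fit, and the synthetic numbers are illustrative assumptions and are not taken from SAMRUN.

```python
import math


def fit_scan(times, masses):
    """Least-squares fit of ln(mass) = a + b * time to calibration peaks."""
    n = len(times)
    logs = [math.log(m) for m in masses]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    b = sxy / sxx
    return y_mean - b * t_mean, b


def project_time(a, b, mass):
    """Time at which a calibration mass is expected under the fitted scan."""
    return (math.log(mass) - a) / b


def interpolate_mass(t, t_lo, m_lo, t_hi, m_hi):
    """Mass of a peak at time t lying between two calibration peaks."""
    frac = (t - t_lo) / (t_hi - t_lo)
    return math.exp(math.log(m_lo) + frac * (math.log(m_hi) - math.log(m_lo)))


def scan_time(mass):
    """Synthetic downscan used for the demonstration: mass = 600 * exp(-t / 4)."""
    return -4.0 * math.log(mass / 600.0)


calib = [431.0, 419.0, 381.0, 369.0]      # four located calibration masses
times = [scan_time(m) for m in calib]
a, b = fit_scan(times, calib)
expected = project_time(a, b, 331.0)      # where to look for the fifth peak
```

In use, a window around the projected time would be searched for the fifth calibration peak, after which the window of four calibration peaks is shifted by one and the fit repeated.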

Problems occur when, during projection, either no peak or
more than one peak is found as a calibration peak. If no peaks
are found, the mass is counted as missing, and the next
calibration mass is searched for. Since the calibration peaks
are chosen as being among the most prominent peaks in the
spectrum, the problem in this case is usually not that the peak
is absent. The more common problem is that there are so many data
peaks from the GC that more than one peak shows up as a candidate
for the calibration peak. If the program chooses the wrong peak
as the calibration peak, the crawl through the data quickly goes
bad. Various schemes have been tried to minimize this problem.
Originally the first peak in the window was chosen, since PFK has
a very large negative mass defect. This produced occasional
problems, however. Next a more sophisticated approach was tried.
If the projection produced multiple candidates for the
calibration peak, the two peaks closest to the projection were
selected, and another projection was done from each of those.
The one giving rise to the least total projection error was
selected. We found one batch of data, however, where it happened
that at one section of the spectrum, two incorrect peaks produced
a total error less than the two correct peaks. Neither of these
algorithms uses any information from the reference run, so an
attempt was made to fold in the information from the reference
run in the case of an ambiguity.

The spectrum is taken in an exponential downscan, i.e. high
mass to low mass on an exponential curve. The only two
parameters of this curve that can change are the time offset of
the curve and the time constant of the exponential. The template
mentioned earlier assumes that either of these parameters can
change and attempts to find a set of peaks in the sample run that
map most accurately into the reference run. This mechanism works
well only in low masses, however, since in the higher masses the
curve is more gaussian than exponential. The template can be
written for this, but the amount of space and time required for
it made it appear impractical for the system. On examination of
data it became obvious that the time constant of the exponential
changed very little, if at all, from the reference run to the
sample run. This means that a very good approximation of where a
given calibration peak should appear can be obtained by merely
adding in the time shift from the reference run. The final
algorithm that resulted goes through the following steps: 1) the
fifth calibration peak is projected; 2) if there is more than
one candidate, then a projection is done on the two closest
calibration peaks; 3) if both of these peaks project to another
peak (there is still ambiguity, in other words), the peak which
is closest to the time in the reference run, based on an
exponentially weighted time shift from the previous calibration
peaks, is chosen. This has proven to be fast and reliable on the
data tested so far, including data that had produced incorrect
results from the previous two methods.
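
The final tie-breaking step can be sketched in Python as follows. This is a minimal illustration, assuming that the time shift of each located calibration peak is its sample-run time minus its reference-run time and that more recent peaks are weighted more heavily. The decay factor of 0.5 and the function names are assumptions, since the report does not give the weighting actually used.

```python
def weighted_shift(shifts, decay=0.5):
    """Exponentially weighted mean of past time shifts.

    shifts -- sample-run time minus reference-run time for the calibration
              peaks already located, oldest first
    decay  -- weight ratio between successive peaks (an assumed value)
    """
    weight, total, norm = 1.0, 0.0, 0.0
    for shift in reversed(shifts):      # the most recent peak weighs most
        total += weight * shift
        norm += weight
        weight *= decay
    return total / norm


def choose_candidate(candidate_times, reference_time, shifts, decay=0.5):
    """Pick the candidate closest to the shifted reference-run time."""
    expected = reference_time + weighted_shift(shifts, decay)
    return min(candidate_times, key=lambda t: abs(t - expected))


best = choose_candidate([50.02, 50.12, 50.30], 50.0, [0.10, 0.12, 0.11])
print(best)  # 50.12
```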

7.2 Changes in the Operating System

The current operating system for the PDP-11, DOS 9, has
produced a number of problems. Poor keyboard interaction,
generally slow response time, and extremely slow system programs,
while surmountable, are factors that make the system difficult to
use. We decided to look at the feasibility of changing to either
RSX-11M or RT-11. RSX-11 proved to be too big and much more
flexible than needed. RT-11, however, had several advantages over
DOS 9. The keyboard interaction is easier to use and more suited
to a real time environment. The IO queuing structure is much
simpler and faster, although the file structure is not as
flexible as DOS 9. In addition the system itself is much faster,
and there is a noticeable improvement in time of just loading
programs. Based on this, a decision was made to switch from DOS
to RT-11.

The conversion of programs from DOS to RT-11 has proven to
be much more work than originally expected. The main problems
have been incompatibilities between the two versions of FORTRAN
and the different linkage editors. Since all the programs in the
high resolution system are overlaid, this second factor has
proven to be a major problem, since some of the logic in the
program must be reworked to make the overlay correct. The
conversion effort has been aided by several factors, however. The
system and system programs are often several times faster than
similar programs under DOS. As an example, to link the REFRUN
portion of the High Resolution system takes about 30 minutes
under DOS, whereas the same program takes about 5 minutes under
RT-11. The FORTRAN compiler and MACRO assembler are also faster.

The conversion, and software development in general, has
been greatly improved by the addition of a teletype line from the
SUMEX PDP-10 to the PDP-11. Programs have been written to transfer
files between the two systems. This has had the effect of
switching literally all of the editing to SUMEX because of the
superior editor. The ease and speed of the file transfer make
it practical to make even minor modifications of a program on
the PDP-10, and then transfer the edited version to the PDP-11. This
process of using SUMEX to develop software will continue with the
release of MAINSAIL, a machine independent language. MAINSAIL is
a dialect of SAIL, which is a dialect of ALGOL 60. It has
undergone many design changes since its original inception, but
has been released in a limited version for the PDP-10. The main
value of MAINSAIL is that programs written on one machine will be
directly transportable to another with no modification. This
allows us to write, test and debug software systems on SUMEX,
which leaves only the machine dependent portion of the system
(for example the actual real time data acquisition) to be worked
out on the PDP-11. This not only gives the programmer better
tools (such as superior editors) but also frees up the PDP-11 for
production work.

7.3 New Developments

In addition to upgrading old versions of the high
resolution system, work is being done on creating a low
resolution system for the MAT 711. The ultimate aim is to collect
data that can be run through CLEANUP, a program that resolves
multiple spectra under a single GC peak, and cleans up the final
spectra. The problem with the current system is that we cannot
scan fast enough to provide CLEANUP the data it needs. The high
resolution system requires resolution good enough to separate
sample peaks from the reference peaks. If the scan is sped up
past a certain point, SAMRUN can no longer separate the peaks,
and therefore cannot calibrate the run. At the same time,
CLEANUP requires that at least 7 spectra be taken across a GC
peak to ensure resolution of multiple spectra. The fundamental
problem then is that an alternate method of calibrating the mass
spectrum, without using known calibration peaks, must be found
before scan speeds required by CLEANUP can be achieved. The most
direct solution to this is to directly measure the magnetic field
strength of the instrument, and use it to calculate the mass
that is being observed. To do this we inserted a hall probe
between the poles of the magnet, and connected it to the data
acquisition system on the PDP-11/20.

The main problems with the hall probe are as follows: 1) to
make sure that the ion reading and the hall probe reading are
simultaneous; 2) to ensure that the correct hall reading can be
assigned to the correct ion reading; 3) to determine the
reproducibility of hall readings versus mass being observed in
both dynamic (scanning) and static situations; and 4) to decide if
the probe has the speed and accuracy to calibrate the instrument.
The first two problems are a matter of hardware. The
configuration of the original data collection system is as
follows: the ion detector goes to an A/D converter, which is
connected to a DMA. The DMA is on an 11/20, which has a data
collection system, SAQMON, running. This performs various low
level filtering and buffering operations. The DMA is actually a
low level processor which counts the number of samples taken,
stores them into successive memory locations, and interrupts the
central processor when a block of data has been collected. The
timing of the sample collection is controlled by a quartz crystal
clock. On each timing pulse, a signal is sent to the A/D on the
ion detector to convert that value to a digital number. To
accommodate the hall probe, the DMA was modified so that on the
timing pulse, the start signal is sent simultaneously to both the
A/D on the ion detector and the A/D on the hall probe. The DMA
then services both of the A/D’s, and stores the readings in
successive memory locations. The net result is that when the DMA
interrupts the central processor, the block of data is a set of
pairs of readings, an ion reading and the hall reading for that
time. This solves both of the first two problems, since we now
have the ion reading and the hall reading connected both in time
and location.
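
The layout of a collected block can be illustrated with a short Python sketch. It assumes that within each pair the ion reading is stored first and the hall reading second; the report states only that the two readings occupy successive memory locations, so the ordering, the function name, and the sample values are assumptions.

```python
def split_pairs(block):
    """Split an interleaved DMA block into (ion, hall) reading pairs.

    block -- flat sequence laid out as ion0, hall0, ion1, hall1, ...
             (the ion-first ordering is an assumption)
    """
    if len(block) % 2:
        raise ValueError("block must hold an even number of readings")
    return list(zip(block[0::2], block[1::2]))


print(split_pairs([812, 3001, 640, 2997, 15, 2993]))
# [(812, 3001), (640, 2997), (15, 2993)]
```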

The last two problems, testing the reliability and
reproducibility of the hall probe, require new software. We are
currently modifying portions of the calibration mechanism of the
high resolution system to calculate masses for a large number of
hall readings.
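
One simple way to turn a hall reading into a mass, shown here only as a sketch, is the ideal magnetic-sector relation in which the observed mass is proportional to the square of the field at fixed accelerating voltage and radius. Whether the probe follows the field closely enough for such a relation to hold while scanning is exactly what the tests described above are meant to establish. The assumption that the hall reading is directly proportional to the field, the function names, and the numbers are all illustrative.

```python
def calibrate_constant(hall_reading, known_mass):
    """Proportionality constant k in mass = k * B**2, from one known peak."""
    return known_mass / hall_reading ** 2


def mass_from_hall(hall_reading, k):
    """Mass implied by a hall reading under the ideal B-squared law."""
    return k * hall_reading ** 2


k = calibrate_constant(2048.0, 512.0)     # illustrative numbers only
print(mass_from_hall(1024.0, k))          # 128.0
```

Halving the field reading quarters the computed mass, which is the behavior the ideal relation predicts.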

8 META DENDRAL

The success of any reasoning program is strongly dependent
on the amount of domain-specific knowledge it contains. This is
now almost universally accepted within AI, partly because of
DENDRAL’s success. Because of the difficulty of extracting
specific knowledge from experts to put into the program, many
years ago we began to explore the problems of efficiently
transferring knowledge into a program. We have looked at two
alternatives to "hand-crafting" each new knowledge base:
interactive knowledge transfer programs and automatic theory
formation programs. In this enterprise the separation of domain-
specific knowledge from the computer programs themselves has been
a critical component of our success.

One of the stumbling blocks with the interactive knowledge
transfer programs is that for some domains there are no experts
with enough specific knowledge to make a high performance problem
solving program. We were looking for ways to avoid forcing an
expert to focus on original data in order to codify the rules
explaining those data because that is such a time-consuming
process. Therefore we began working on an automatic rule
formation program (called Meta-DENDRAL) that examines the
original data itself in order to discover the inference rules for
that part of the domain.

The problem solving paradigm for Meta-DENDRAL is also the
plan-generate-test paradigm used in Heuristic DENDRAL. In this
case one part of the program (RULEGEN) generates plausible rules
within syntactic and semantic constraints and within desired
limits of evidential support. The model used to guide the
generation of rules is particularly important since the space of
rules is enormous. The planning part of the program (INTSUM)
collects and summarizes the evidential support. The testing part
(RULEMOD) looks for counterexamples to rules and makes
modifications to the rules in order to increase their generality
and simplicity and to decrease the total number of rules.

Meta-DENDRAL successfully formulated rules of mass
spectrometry that were new to the science. These rules, along
with a discussion of the methodology, were published in the
scientific literature [Report HPP-76-4]. The program was tested
to see if it could rediscover the rules of mass spectrometry for
two classes of chemical compounds that were already well
understood (amines and estrogenic steroids). Then it was applied
to three classes of compounds whose mass spectrometry was not as
well known (mono-, di-, and tri-ketoandrostanes). The program
produced three sets of rules that explained much of the
significant data for these classes. The time for manual rule
formation for these data was estimated to be several months.

Progress was made on generalizing the Meta-DENDRAL program,
and rules for a new domain were successfully discovered by the
program. A scientific paper on this application was submitted
for publication [Report HPP-77-4]. The new application was
learning rules for interpreting signals from C13-NMR
spectroscopy. The instrument produces data points in a bar graph
in response to the resonance of each carbon-13 nucleus in the
sample. The rules describe an environment of a C13 atom and
predict a resonating frequency range for every atom that matches
the description. The Meta-DENDRAL program needed some
modification because the rules are predicting ranges of data
points, and not precise processes, as for the mass spectrometry
version.

The RULEGEN component of Meta-DENDRAL was demonstrated to
work with its heuristic search paradigm. Guidance from a model
of mass spectrometry is an important feature of RULEGEN. Also,
the program uses problem data for pruning possible rules (and all
more specific rules formed from those). The amount of data
examined during the search is very large and the space of rules
is immense, so the search needs to be rather coarse in order to
produce plausible, but not necessarily optimal, rules.

The RULEMOD program for "fine-tuning" Meta-DENDRAL’s newly-
discovered rules was finished. This program provides a number of
important subtasks, including merging similar rules, making rules
more specific or more general, and filtering out the weakest
rules. RULEMOD checks for counterexamples to rules and uses this
information in all of the named tasks. Because of the expense of
computing counterexamples to possible rules, this computation is
delayed until Meta-DENDRAL has a set of plausible rules, rather
than computing counterexamples on each possible rule examined in
the search of the rule space.

A report was written on the AI methodology underlying Meta-
DENDRAL. The major idea developed in this report is that knowledge
of the domain can be used effectively to guide a learning
program. The major difference between Meta-DENDRAL and
statistical learning programs is that Meta-DENDRAL uses a strong
model of mass spectrometry, including any assumptions the user
cares to make about the domain, to guide the formation of
explanatory rules.

9 C13 NMR SPECTROMETRY

13C NMR was selected as a new application area for the rule
formation program, Meta-DENDRAL. The algorithms used for mass
spectrometry rule formation were extended to 13C NMR and used to
obtain a set of rules for paraffins and acyclic amines.
These two classes were chosen since compounds in these classes
are known to show a strong correlation between structural
environment and shift. Thus, the programs could be tested
knowing that the underlying basis for the form of the rule was
valid.

The form of the rule is
substructure ---> shift range.

A sample rule generated is
C-C*-C-X- ---> 19.85 <= (delta sub C) <= 21.3.

The asterisk in the substructure description denotes the
atom for which the shift is predicted. Only topological
descriptors were used to construct the substructures. The
addition of stereochemical terms is a topic of current work.

It was necessary to change RULEGEN so that the left-hand
sides of rules were expanded outward from a carbon atom rather
than from a bond. The right-hand side of the rule is associated
with a range rather than a precise mass as in the mass
spectrometry program. This modification also required changes in
the rule search procedure. The user sets two parameters which
guide the rule search. These parameters are MINIMUM-EXAMPLES,
which requires each rule to explain at least a given number of
peaks in the training set, and MAXIMUM-RANGE, which defines the
acceptable shift range for a rule. These parameters regulate the degree of
specificity or generality of the rules.
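
As an illustration only, the two search parameters can be read as a
filter on candidate rules. The sketch below is written in Python, not
in the language of the program described here. The names Rule,
passes_search_limits and example are hypothetical, and the peak
identifiers attached to the sample rule are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    substructure: str     # topological description; '*' marks the predicted carbon
    low: float            # lower bound of the predicted shift range
    high: float           # upper bound of the predicted shift range
    explained: frozenset  # ids of training-set peaks the rule explains (hypothetical)

    @property
    def shift_range(self) -> float:
        return self.high - self.low


def passes_search_limits(rule, minimum_examples, maximum_range):
    """Keep a rule only if it explains enough peaks within a narrow enough range."""
    return (len(rule.explained) >= minimum_examples
            and rule.shift_range <= maximum_range)


# The sample rule from the text, with three invented peak ids attached.
example = Rule("C-C*-C-X-", 19.85, 21.3, frozenset({1, 2, 3}))
```

Raising the minimum number of examples or lowering the maximum range
rejects more candidate rules, which is how the two settings trade
generality against specificity.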

From the set of rules generated, a subset is selected
corresponding to the "best" set which still covers all the
training set data. The best rule is selected by calculating

(number of peaks predicted/(range ** 2)).

Data which are predicted by the best rule are removed and
the next best rule is found for the remaining data using the
criterion given above. This process is repeated until all data
are explained.
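
The selection loop described above can be sketched as a greedy cover.
This is a minimal Python illustration under stated assumptions, not
the program's actual code: the function name select_best_rules and the
rule table are hypothetical, ranges are assumed to be strictly
positive, and on each pass only peaks not yet explained are counted in
the numerator.

```python
def select_best_rules(rules, peaks):
    """Greedy selection of a rule subset that covers the training peaks.

    rules maps a rule name to (set of peak ids it predicts, shift range).
    Shift ranges are assumed to be strictly positive.
    """
    remaining = set(peaks)
    chosen = []
    while remaining:
        # Score = number of still-unexplained peaks predicted / range ** 2.
        best = max(
            rules,
            key=lambda name: len(rules[name][0] & remaining) / rules[name][1] ** 2,
        )
        newly_explained = rules[best][0] & remaining
        if not newly_explained:
            break  # no rule explains any of the remaining peaks
        chosen.append(best)
        remaining -= newly_explained
    return chosen, remaining


# Invented rule table: name -> (peaks predicted, shift range).
rules = {
    "A": ({1, 2, 3, 4}, 2.0),
    "B": ({1, 2}, 1.0),
    "C": ({3, 4, 5}, 1.5),
    "D": ({5}, 0.5),
}
chosen, unexplained = select_best_rules(rules, {1, 2, 3, 4, 5})
```

With this invented table the narrow rule D is taken first, then B,
then C. The broad rule A is never chosen because dividing by the
square of the range penalizes wide ranges heavily.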

In order to test the informational content of the rules
generated, a second program was written which applied the rules to
a list of candidate molecules and ranked the molecules. First,
all possible structural isomers for a given empirical formula
were generated using CONGEN. The rules were applied to each of
the possible isomers and spectra were predicted. The predicted
spectra were compared to a known spectrum from a compound
with the same empirical formula. The structural isomers were
ranked according to a comparison score to determine how well the
correct compound was distinguished from its isomers, on the basis
of the predictive rules.
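
The ranking step can be sketched as follows. The comparison score used
by the program is not given in this report, so the score below (the
summed distance from each observed shift to the nearest predicted
shift range) is an assumption chosen only to make the example
concrete. The function names, the candidate names and all shift
values are likewise invented, and the candidates are supplied by hand
rather than generated by CONGEN.

```python
def distance_to_range(shift, low, high):
    """Distance from an observed shift to a predicted range (0 if inside)."""
    if shift < low:
        return low - shift
    if shift > high:
        return shift - high
    return 0.0


def mismatch(observed, predicted):
    """Assumed comparison score: lower means a closer match.

    predicted must contain at least one (low, high) range.
    """
    return sum(
        min(distance_to_range(shift, low, high) for low, high in predicted)
        for shift in observed
    )


def rank_candidates(observed, candidates):
    """Return candidate names ordered from best to worst match."""
    return sorted(candidates, key=lambda name: mismatch(observed, candidates[name]))


# Invented data: one observed spectrum and three hypothetical isomers.
observed = [14.0, 23.0, 32.0]
candidates = {
    "isomer_a": [(13.5, 14.5), (22.0, 24.0), (31.0, 33.0)],
    "isomer_b": [(10.0, 12.0), (22.0, 24.0), (40.0, 42.0)],
    "isomer_c": [(13.5, 14.5), (25.0, 27.0), (31.0, 33.0)],
}
ranking = rank_candidates(observed, candidates)
```

The position of the correct compound in such a ranking is what the
structure selection test measures.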

The details of the generation of rules and the use of rules
for structure selection can be found in a paper recently
submitted for publication [Report HPP-77-4].

The 13C NMR rule formation program was applied to a set of
paraffins and acyclic amines. The program generated 138 rules to
cover 435 data peaks. The rules generated were applied in a
structure selection test for the structural isomers of C9H20 and
C6H15N. No structures with these empirical formulas were
included in the training set. Twenty-four C9H20 and eleven
C6H15N 13C NMR spectra were available to act as unknowns in the
structure selection test. The results of the structure ranking
applied to these spectra are shown below.

EMPIRICAL    NUMBER OF            NUMBER OF CANDIDATES RANKING
FORMULA      CANDIDATE ISOMERS    1st       2nd-5th    6th-9th

C9H20        35                   20/24     3/24       1/24
C6H15N       39                   8/11      2/11       1/11

The performance of the rules in discriminating among
similar structures not included in the training set data
demonstrated the informational content of the rules.

10 BUDGET

Budget information relevant to future funding was submitted
with the renewal proposal to the BRP.

11 RECENT PUBLICATIONS OF THE HEURISTIC PROGRAMMING PROJECT

(Only publications related to computers in chemistry are shown.)

HPP-76-1        D.H. Smith, J.P. Konopelski and C. Djerassi,
                "Applications of Artificial Intelligence for Chemical
                Inference. XIX. Computer Generation of Ion
                Structures", Organic Mass Spectrometry, 11: 86, (1976).

HPP-76-2        Raymond E. Carhart and Dennis H. Smith, "Applications of
                Artificial Intelligence for Chemical Inference XX.
                Intelligent Use of Constraints in Computer-Assisted
                Structure Elucidation", Computers in Chemistry (in
                press).

HPP-76-3        C.J. Cheer, D.H. Smith, C. Djerassi, B. Tursch, J.C.
                Braekman and D. Daloze, "Applications of Artificial
                Intelligence for Chemical Inference XXI. Chemical
                Studies of Marine Invertebrates - XVII. The Computer-
                Assisted Identification of [+]-Palustrol in the Marine
                Organism Cespitularia sp., aff. subviridis",
                Tetrahedron, 32: 1807, Pergamon Press, (1976).

HPP-76-4        B.G. Buchanan, D.H. Smith, W.C. White, R.J. Gritter,
                E.A. Feigenbaum, J. Lederberg, and Carl Djerassi,
                "Applications of Artificial Intelligence for Chemical
                Inference XXII. Automatic Rule Formation in Mass
                Spectrometry by Means of the Meta-DENDRAL Program",
                Journal of the American Chemical Society, 98: 6168
                (1976).

HPP-76-5        T.H. Varkony, R.E. Carhart and D.H. Smith, "Applications
                of Artificial Intelligence for Chemical Inference
                XXIII. Computer-Assisted Structure Elucidation.
                Modelling Chemical Reaction Sequences Used in Molecular
                Structure Problems", in "Computer-Assisted Organic
                Synthesis", W.T. Wipke, Ed., American Chemical Society,
                Washington, D.C., in press.

HPP-76-6        D.H. Smith and R.E. Carhart, "Applications of Artificial
                Intelligence for Chemical Inference XXIV. Structural
                Isomerism of Mono- and Sesquiterpenoid Skeletons",
                Tetrahedron, 32: 2513, Pergamon Press (May 1976).

HPP-76-10       Bruce G. Buchanan and Dennis Smith, "Computer Assisted
                Chemical Reasoning", in Proceedings of the III
                International Conference on Computers in Chemical
                Research, Education and Technology, Plenum Publishing,
                (1976).

HPP-77-4        T.M. Mitchell and G.M. Schwenzer, "Applications of
                Artificial Intelligence for Chemical Inference. XXV. A
                Computer Program for Automated Empirical 13C NMR Rule
                Formation", (Submitted to JACS, January 1977).

HPP-77-6        Bruce G. Buchanan and Tom Mitchell,
STAN-CS-77-597  "Model-Directed Learning of Production Rules",
                Submitted to the Proceedings for the Workshop on
                Pattern-Directed Inference Systems in Hawaii,
                (February, 1977).

HPP-77-11       Dennis H. Smith and Raymond E. Carhart, "Structure
                Elucidation Based on Computer Analysis of High and Low
                Resolution Mass Spectral Data", Proceedings of the
                Symposium on Chemical Applications of High Performance
                Spectrometry, University of Nebraska, Lincoln, (in
                press).