* has

tes

the

Interim Report to the: © , NASA STAR No. N66~14075

National Aeronautics and Space Administration
Grant NsG 81-60 - NASA CR-68899

: Instrumentation Research Laboratory

Technical Report No. 1040

 

SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY

 

 

AND HAMILTON CIRCUITS

ak.

 

A General Outline of the DENDRAL System

submitted by Joshua Lederberg
Professor of Genetics
School of Medicine
Stanford University
Palo Alto, California _ ?

‘January 12, 1966

Studies related to this report have been supported by research grants from the
National Aeronautics and Space Administration (NsG 81-60), National Science
Foundation (NSF G-6411), and National Institutes of Health (NB-04270, AI-5160,
and FR-00151). .

i
FOREWORD. This contribution is intended as an introductory survey of the
topological concepts that underlie the DENDRAL system for chemical structure
notation. The main purpose of the system is to provide a language in which a
computer program can frame hypotheses of organic chemistry. For example, a
program to generate all the isomers of a given formula has already been imple-

mented.

This introduction is especially intended for users who wish only
a general outline of DENDRAL rather than its full details of syntax. Some
notation is necessarily used. This resembles the definitive DENDRAL forms,
but the complete manual should be used as a definitive statement of the lan-

guage.
- “. SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY
| Sys AND HAMILTON CIRCUITS
Joshua Lederberg
Genetics Department
, Stanford University School of Medicine
‘. Palo Alto, California
The structural formula for an organic molecule is a paragon of a topological
graph, that is, the connectivity relations of a set of atoms. True, we
recognize more than one type of connection, doubie, triple, and non-covalent
“bonds, as well as single bonds. However, from an electronic standpoint the special
bonds could just as well be denoted as special atoms. The structural graph .
does not specify the geometry, that is, the bond distances and bond angles of
the molecule. In fact, this is known for only a small proportion of the
enormous number of organic molecules whose structure is very well known from
a topological standpoint. Most of the syllabus of elementary organic chemistry
thus comprises & survey of the topological possibilities for the distinct ways
in which sets of atoms may be connected, subject to the rules of valence. The
student then also learns rules which prohibit some configurations as unstable
or unrealizable (and may later earn his scientifie reputation by justifying
or overturning one of these rules). The field of organic chemistry has,
however, reached its present stature without many benefits from any general
analysis of molecular topology. These benefits might arise in applications
at two extremes of sophistication: the teaching of chemical principles to
college undergraduates, and to electronic computers. They may also apply to
the vexatious problens of nomenclature and systematic methods of information
retrieval.

Although the topological. character of chemical graphs was recognized by
2.-
the first topologists, very little work has been done on the explicit classifi-
cation of the graphs having the most chemical interest. Some difficult.
problems, e.g., the enumeration of polyhedra, remain unsolved. However, the
main obstacle may be the seeming triviality of the problems, many topologists
being quite unsatisfied with systems restricted to 2= or 3-dimensional space.

This article will review some elementary features of graphs that may be
used for a systematic outline of organic chemistry. The same theory has the
broader significance of classifying the possible nets of relationships among
the members of a set of objects. For present purposes, our graphs will be
undirected, that is, any connections are reciprocal and unpolarized. Further-
more, our atoms have a maximum valence of 4. When we come to cyclic structures
we shall have occasion to study an even more restricted set of graphs, those
in which every node has a valence of 3. | 7

A problem statement might be: enumerate all the distinct structural
isomers ‘of a given elementary composition, say C,H,NO,. This is tantamount to oy
producing all the connected graphs that can be constructed from the atoms of
| the formula, linked to one another in all distinct ways, compatible with the
valence established for each element (4, 3, and 2 for C, N, 0, respectively).

For compactness, H can be left implicit, being later restored at every unused.

valence.

’

Our main approach throughout this article is mapping, a rule of correspon-

dence between a part of the chemical structure and a part of some abstract

graph. Thus, each atom may be mapped on to a node: each pond to an edge or

link of the graph. For further analysis, however, it will be important to map

: from complexes ‘of the structure to elements of a graph. The abstract graphs

lend themselves to canonical forms, i.e., @ choice among equivalent representations

ro
ir)
‘

 
3.

- according to precise rule. Since the root problem is generally not that of
producing all possible combinations of atoms, but recognizing which forms are
“unique, this is of utmost importance. Chemistry will re-emerge after a few
levels of abstraction. | |

These principles have been elaborated in a computer-oriented language
-"Denaral-6h" which is described more fully elsewhere for the purpose of
possible implementation in programming systems (Lederberg, 1964).

Trees are l-connected graphs, i.e., can be separated into two parts by
cutting any link, They correspond to the acyclic structures of organic
‘chemistry. How may we establish a canonical form for a tree, after first
noting its order (number of nodes). | |

_

The first step might be to find some unique place to begin the description.

A tree must have at least two terminals, and may have many more if highly

branched; these are therefore not very suitable. However, each tree has a
unique center. In fact Jordan (1869) showed that any tree has two kinds of
center, @ mass-center.and a radius-center. Each center has a unique place in
any tree; the two may or may not coincide. ( |

To find the radius-center, the tree is pruned one level at a time, being
cut back one link from every terminal at each level. This will leave, finally
an ultimate node or node-pair (in effect, edge) as the center; the radius of
the graph is the number of levels of pruning needed to reach the center.

To identify the mass-center of a tree, we must consider the two or more

_ branches that join to each non-terminal node. The center is the node whose

'"  Dranches have the most evenly balanced allocation of the remaining mass (node=

count) of the tree. This is the same as to say that none of the pendant
branches exceed half the total mass. A mass of even number allows the possi=

vility of the center being a node pair or edge which joins equal halves,
4.

Either of the centers (Fig. 1) is unique, and so could solve our problem

of defining a canonical starting point of a description. The center of mass |
s more pertinent to finding a list of isomers, which of course enjoy the

same mess. The radius-center is ill-adapted for this, but matches con-
vantional nomenclature, which is based: on finding the longest linear path,
i.e., a diameter. The diameter is not necessarily unique. For example, urea
nas three diameters, N - C -N andN-C=0 (twice), but just one radius-
canter, the c atom. The problem of generating isomers is the main justifica~
tion for adopting the mass-center ‘over the radius-center to work out canonical
POEMS

In chemical terms, the center divides the graph into two or more radicals.
These radicals can be ordered by obvious compositional principles, giving rise
to @ canonical description of the whole graph in a linear code. Thus arginine
becomes (C~C-N~C(N)-N C-C(N)-C(0)-0} or, in a parenthesis-free notation
with some abbreviations »2.N.C.:NN 2..NC. 200 | . Any linear code
hes an implicit number syaten: each atom is numbered according to when it is
denoted in the string.

Some thirty years ago, Henze and Blair (1931) showed how Jordan's principle
could be used for the enumeration of isomers of saturated hydrocarbons and
some simple derivatives of them. Here, the nodes are all the same (carbon
atoms) and the enumeration can proceed by recursion from smaller to larger
complexes. For example, for the isomers of undecane, C5 4oy> one atom is desig~

nated as center, leaving 10 to be allocated among 2, 3 or 4 branches. Only the

following partitions satisfy the rules (leaving dissymmetry out of account):
co ° boo
-

Ng
De

BRANCHES

| LJ.
2 oF 545
5 of 1 3'3°3
—~E] | 2.4.
. 3,3,4
ae aes
Qo 22s
| a ~15243,8
2,2,2,4
| 2,2,3,3

some
Oe ee,

To complete the solution, one must have calculated the number of alkyl radicals

“Co, C,, etc. To illustrate with Cai

The radical must have an apical atom, leaving the rest to be partitioned
dn ali distinct ways among 1, 2 or 3 pendant branchés, the radicals of the next

level. Thus we have: — ok
<

Ja -
c<—[ J]. 1,1,2
NO

. The count of C, radicals is thus derived from the table for “C,, taking i from
‘
4

lton-«~ 1, and the process may be itcrated as far as needed, i.e., until

partitions into units, Clo» prevail. | No deep mathematical insight is needed to
6.
verify that the first steps of the alkyl series C,, C,, C3, ¢, . have

1,3,2,4 forms respectively.

No closed algebraic expression has been found for this enumeration.
sawever, the recursive expansion was done by hand (Henze and Blair, 1931) with a
fav trivial errors found by a computer check; no organic chemist will be
surprised by the enormous scope of his field. (Table 1).

The total range of acyclic compounds is of course very much larger than
these subsets. ‘At each step, instead of partitioning a mere number of nodes, an
allocation to constituent radicals takes account of the kind as well as number
of unused atoms. However, the specification of a hierarchy of ordering, which
may be done almost arbitrarily to suit computational convenience, permits the
same principles to be applied to a complete enumeration of structural isomers

of a given composition, for example of alanine, C,H/NO.» (fable 2.)

Cyclic Structures
Cyclic graphs are much less tractable, since every path will return back

an
to the complex, and a center is less easily defined. Sufficient reminder of
the taxonomic difficulties posed by rings is the popularity of the Ring Index
1964) wherein the "11524 rings known to chemistry" are laid out, together with a

profusion of synomyous and alternative numbering systems to map them as nodes. _

For example, naphthoyl pyridine would ultimately form a tree, Ry - a Ro » Ry and Roe °
0
We now consider the domain of strictly cyclic structures. These are 2~

connected graphs, since at least 2 (sometimes more) links must be cut in order .

40 separate the graph.
For further analysis, we distinguish the trivalent vertices of the structure

‘
atoms thet join 3 paths, or branch points. We can then construct the full set of
T

abstract, trivalent graphs. Define a path as q link or an unbranched chain of
links and atoms. The paths between vertices of the structure can then be
mapped onto the edges of an abstract graph which is regularly trivalent or
trihedral. To illustrate, observe how pyrene is mapped onto an abstract graph

of 6 vertices, indeed, the abstract prism.

 

 

 

-CCC-

 

 

 

-CCC-

 

 

 

 

 

 

-CC- . oe

 

' Pyrene
(a) (b) (ce) (a)
Some vertices are hevalent, in so-called spiro forms, but these graphs
can be mappedoonto 3-valent graphs by expanding each k-valent node into a pair
7

of 3-valent nodes. That is, a becomes ‘ —\ There is an obvious
"relationship between the number of vertices and the number of rings conventionally

ascribed to a structure. We start with, say, benzene, O vertices, and-1 ring.

Then naphthalene, 2 vertices and 2 rings. Each additional ring entails 2

more vertices. Hence, for r rings andn vertices .

rail * n/2 , |
and for these trivalent graphs, n must be an even integer. Recalling that a he
valent vertex maps into 2 3-valent nodes, we can write |

rsl+tn/2+q

for q 4evalent vertices. This calculation agrees with the Ring Index rule which

i

‘counts rings as the number of cuts needed to convert a ring structure into a

tree.

As each edge joins 2 nodes, a trivalent graph of order n will have -
3n/2 edges.

Enumerating the trivalent graphs. A trivalent graph may have several
8.

representations, and some effort may b> required to relate them to one another,
and to decide which form is to be regacded as a canonical reference for mapping
purposes. Thus, the graphs of Figure 2 are all topologically equivalent or
isomorphic. This is to say, they all represent the same connections of node
to (three) nodes. A meaningful enumeration must unify these isomorphisms. Fure
thermore, it should relate to a convenient code by which to refer to each
graph, better still, to embody a reconstruction. Finally, it should generate
- an obvious numbering of the nodes and edges. .

Hamilton circuits. A practical key to the solution of this problem, as
to many other network problems, taxes advantage of the Hamilton circuits found
in most of the abstract graphs having chemical interest. ‘A Hamilton circuit
(HC) is a round trip through the g-aph that traverses each node just once.
It therefore uses n edges, leaving out n/2 edges. Figure 3 is Hamilton's
own example, the dodecahedron, proposed by him as 4 parlor game, each node
representing a city that the round-the-world traveller would not wish to.
revisit. The utility of HC representations will become evident. )

Finding all HC's of a graph may be a challenging game, but it is reduced
to a merely tedious algorithm on the computer. Start from an arbitrary node.
Trace a path as through a maze, each node presenting a binary choice of aaa
different edges. If the chosen path reverts to a node already visited, pack=
track one step. A successful path has n correct choices. Thus, at most
27scarch steps will exhaust all possible paths; in practice, closer .to 1/n
times this number will be needed to identify all the HC's. Even for n up
to 20 this is a modest task. And if the work has been done once, finding any

HG, at perhaps n-fold less effort, will enable @ given graph to be related to the -
9.
previously established set.
A typical problem in graph manipulation is to establish whether two

complicated graphs are isomorphic. In the long run, this might require
testing all possible permutations of nodes, with a scope of Factorial (n).
_ Aton = 20, this number is an utterly uncomputable 2.4 x 1018 steps. On the
other hand, if two graphs are isomorphic, they must have the same HC's, found
vith at most 279 = 10° steps. |

oA convenient representation of a HC maps the nodes and edges of the circuit
as Vertices and vounding edges of a regular polygon. The remaining n/2 edges
then form chords, each node being one of the two termini of one chord. A
description of the graph then needs only some notation for the n/2 chords.
First, we should canonicate the orientation of the polygon, having chosen
to initialize the HC arbitrarily among n nodes and 2 directions (the rotational
and reflectional symmetries of the polygon). Each node is joined by some
chord having a certain span. The span list can be put in cyclic order, where
it is invariant under rotation; i.e., immaterial which node is selected as
starting point. The effect of reflection is also easily computed. If the
span list is regarded as a number, its minimum value under rotation/reflection
becomes the canonical form. For example, an 8-node graph might be represented (Figure 4
: by any one of the span lists 17522663, 31752266, etc., or the reflections
75226631, etc. Of these, one quickly finds that 17522663 is the lowest-valued,
“hence the canonical form. Similarly, when other HC's are found for the same
graph, they can be compared, and the .lowest-valued of them chosen as the ~ |
reference graph.

The same procedure establishes a canonical ordering of the nodes and
10.

edges. For the latter, we take the HC sequence (the polygon) first, then
each chord in order of first reference.

The spon list has n terms. Only n/2 are necessary, since each chord
ai scscxaed to twice in the span list. For an abbreviated code, simply omit
the second reference. 17522663 becomes 1522. Indeed, one less character still
suffices, the last chord being completely determined by the ones previously
built. he chord list (152), or an alphabetic equivalent (8AEB) whose leading
numeral merely reminds us of the order of the graph, then encodes the graph in
a canonical form (Figure 4). Furthermore, the graph can be reconstructed from

«he code by retracing the steps just recited. Caution: Unlike span lists,

the abbreviated chord lists cannot be freely rotated.

Chord lists can be computed by an obvious combinatorial procedure, with
the help of a few tricks to save some fruitless effort. Most arbitrary lists
become internally inconsistent after a limited number of initial characters; the
number of combinations that must be tested is therefore considerably less than
muy eppear. Additional restrictions can also be put on prospectively. In
this way, exhaustive lists of trivalent graphs have been computed -~ Table 3
{taken from the DENDRAL report) shows their scope. To unify isomorphisms,
the complete list of HC's is computed for each chord list.
Apart from the rotation of the polygon, two or more incongruent HC's may
be present in a Graph. No general principle is known, except that graphs with
high symmetry tend to have the fewest incongruent HC's. Tutte (1946) proved
that any edge of a polyhedron must be involved in an even number (not excluding 0)
of HC's, and that if a polyhedron admits one HC, it must admit at least three. |
Classification of trivalent graphs. Two important, independent criteria

i.

\
il.

of abstract graphs are (1) planarity, and (2) level of connectedness.

A planar graph is one that can be represented on the plane without edges
crossing over one another. The graph need not be drawn as an HC-polygon, which
rarely lacks crossing chords: Figure 3. is certainly planar» Kuratowski has ~
shown that any trivalent non-planar graph must contain 6CC (Figure 5b). .
Fortunately, this condition is easily recognized in the building of span lists.
As the surface of a polyhedron can be mapped onto the plane, planarity is a
necessary condition for an abstract polyhedron.

In practice, nonplanar graphs are so far unknown in organic chemistry
(barring coordination complexes); however, they might in principle be realized,

@.g., by the hypothetical Figure 5d. ‘ ae

Connectedness is the least number of cuts that will anywhere separate: the
graph. The 3-connected planar graphs are the abstract
convex polyhedra. Intuitively, it is obvious that a region bounded only by
2 edges would be unable to enclose a volume. Steinitz (see Lyusternik, 1963)
showed that every 3-connected planar trivalent Graph could be realized as a
polyhadron. These graphs have, naturally, attracted some interest as a
meeting point of topology and classic Greek geometry. Nevertheless, a complete
enumeration is still unknown. In 1901, Brickner published figures of the
trivalent polyhedra for n < 16; in an abstract and unpublished manuscript (1928)
he also showed 1250 for n 218. This work, done by hand over several decades, .

was xepeated on the computer by Grace (1965) who found some errors in Brilckner's
12.
listings, and found 12h9, However, even this census admits some possibility
of being incomplete, though this is remote. Grace generated the polyhedra

by induction as all possible slicings of the faces of smaller polyhedra. This
produces many isomorphisms which must be unified; for this, Grace used a
criterion, “equisurroundedness", which is already known to be too weak, albeit
for much larger graphs. Therefore, it cannot be rigorously shown that the list
or 129 has not excluded additional forms, equisurrounded, but not isomorphic
with the stated set. The analysis of HC's could afford an independent avenue
of corroboration at relatively low cost.

The polyhedra play an important role in the classification of cyclic graphs
but have no remarkable chemical significance except that they represent the
most tightly caged polycyclic structures)’ Note that many unfamiliar iso-
morphisms are generated by portraying a polyhedron as a planar mesh, i.e., as
projected within an arbitrarily chosen face, called the base. The projection
can be visualized as the view of the polyhedron from a point just outside the
place of the face chosen as base (Figure 2). / -

HC-free graphs. These are promptly encountered in the 2-connected series,
starting with ng (8(AC:8,1:A) Figure 6). An analysis of the conditions for no-HC
illuminates some of the combinatorial processes involved in building graphs. --
Since all the graphs for n = 6 have HC's, an HC-free graph is generated by &
particular mode of union of HC's of lower order. The simplest mode is
bilineal, one edge is cut on each of two smaller graphs and reunited. If
either of the edges involved is barred from any HC of its graph, the bilineal
union will be HC-free. This follows, since the union introduced nodes which

must be traversed by a path known to be forbidden.
13.

In general, an HC-free graph can be canonicated by dissecting it into the
- largest cirbuits it contains. The dissertions are first completed across the |
bilineal (2-connecting) unions. If any resulting subgraphs are still HC-free, we
must consider HC-free polyhedra as a mathematical, if not a pragmatic chemical .
possibility. |
HC-free polyhedra. Tait believed that all convex trihedral polyhedra
- contained HC's and his conjecture was indéed unchallenged for over 60 years.
However, Tutte (1946) refuted the conjecture with an example ingeniously proven
to be HC-free, though with 46 vertices it would defy exhaustive search.
Chemical graphs of this order (24 rings) are out of range of systematic prediction,
but the argument gives further insight into the combinatoric of abstract graphs.
We deal here with the process of trilineal union. This can be done in
all possible ways by extracting one node from any source polyhedron, leaving
3 cut edges. This 3-cut graph can then replace one node of another graph.
However, to influence the possibility of forming an HC, the edges must be
subject to some restrictions distinguishing the 3-cut complex from a single ,
node. The node poses no restrictions. That is, its 3 edges are available in
any pairwise combination, thus any one of 3 ways. If the corresponding edges
of the source graph have the same property, i.e., none of the 3 edges is either_
compulsory or forbidden, then the Zcut graph will not influence the occurrence |
of an EC. By induction, the lower order polyhedra that already contain some
3-connected regions can be passed over in looking for special graphs. A
systematic survey of the few 4~connected, ~ i.e., b-connected except for the

isolated nodes which are, of course, 3-connected, - graphs (Table 4) shows

the polyhedron (16CGDIGDF), the smallest with a special edge, namely that the
14,

ones marked are obligatory in any HC of the polyhedron (Figure 7). Tutte
then replaced 3 nodes of a tetrahedron with a 3-cut graph from (16CGDIGDF)
leading to the contradiction that all three edges from one node must be
included in any HC; hence there can be no HC in this graph of 46 = 4 + 3(14)
nodes. The cut graph can also be planted at two mutually-exelusive edges of
the pentagonal prism to give an HC-free polyhedron of 38 = 10 + 2(14) edges NY
This is clearly the smallest HC-free polyhedron with two 3-connected regions.

A smaller HC~free polyhedron may yet be found by analogous studies of
b-lineal'e’ana d-lineal unions, and if so, is just within the bounds of
reasonable computational effort.

If Grace's list of polyhedra is correct, every one through ng has an HC.
This conclusion is corroborated by a detailed consideration of the. properties
of the graphs 216 of table 3. By the inductive argument , forms with any
triangular face -- indeed, any 3-connected region ~- could be. passed over,
Greatly reducing the computational effort. Of course, from the smallest HC
free polyhedron, larger ones can be generated by replacing a node with a triangle
or larger 3~connected region.

The HC-free polyhedra can be classified by the same principles used for
bilineal unions, as complexes of the iargest cireuits united over the least

_ levels of connectedness. —
15.
While distant from chemical graphs of any reasonable size, these studies
- do furnish a clearer indication of the sufficiency of HC representations, and
_ of the sources of exceptions.
Recapitulation: the scope of anticipation and recognition. There is no
"perceptible Limit except the computation of HC's and of alternative dissections to |
| restrict the encoding of abstract graphs either as HC's or - ecanonicated
unions of HC's. These assignments also facilitate the recognition of isomor=
phisms between given graphs.

The anticipation of all possibilities poses a greater burden. However,
all the graphs up to Ny» (7 rings) have been tabulated together with their
isomorphisms and symmetries. The series expands so rapidly that further
extension would tax the output-printer, and before long the computer itself.

Mapping and symmetry. Having explored the trihedral graphs, we now return
to mapping chemical atoms on their nodes and bonds or linear chains on their

edges. Many graphs have substantial symmetry, and the corresponding by redundant °-
operations must be considered to decide on a canonical representation. Here
again, the HC's are helpful. If an HC is present, it can also be projected on
_ the same graph after any symmetry operation: ’ Therefore, the whole set of

symmetry operations is included within the list of the HC's, giving remarkable

ot economy of computational effort to the search for the symmetries, as well as

a straightforward expression of the operators. To describe a molecular
structure, it can be mapped on an arbitrary choice of form, and the result then
subjected to the symmetry operators.:- The canonical representation satisfies

some rule, say the highest order listing,. of the mapped elements. Thus, for
| | 216.
the morphine nucleus, we would have to choose among the 4 symmetries of its

" underlying grapht (Figure 8).

Since this choice is readily computable, the human user may be relieved of
the burden to make these tedious calculations.

Besides the linear paths of the cyclic structure, the mapping may also
include specifications for fused edges (l-hedral centers), heteroatom replacements
of vertices, and specifications of sterecasymmetry of vertices. The details
are inevitably fussy and are given elsewhere. After the mapping, each atom is
numbered in the order of its reference.

Merging cycles and trees. Each cyclic structure is now fully definea, with
‘rules for a canonical code and sumbersng of every atom. The structure can then
be handled as a node in a tree, the rumbering system allowing precise reference

for the point(s) of connection. .

Applications

This development was needed for a continuing effort to program the
‘automatic computation of structural hypotheses to be matched against various
| sets of analytical data, especially mass spectra. The growing sophistication of |
: instrumental methods has already begun to outdo the chemists capacity to interpret

the fesults. a mass s spectrometers are. now _ commercially ::

i
17.
available that can generate 10,000 spectra per second, the need for computational
assistance to make full use of such devices is self~evident. (Biemann & McMurray
1965; Lederberg 1964>) Such devices are also being considered for the automated

‘exploration of the planets, which puts even heavier demands on the local
| intelligence available to the systen.
These applications relate primarily to the possibility of anticipating
_ hypothetical structures. The language also provides a format for expressing
synthetic insights, i.e., the elementary reactions by which functional groups
“ean be altered or exchanged. We might then expect the ultimate development of
* computer programs which have been taught a few thousand unit processes, and
their limitations, and could be challenged to anticipate a synthetic route
from given precursors or to a given end product. Such programs might at
’ least assist the chemist by reminding of a few among myriad possibilities of
| combining the unit processes learned from the same chemist, or better, from
| a diverse school. For the moment we leave out of consideration the empirical
testing in its own laboratory of a few thousand routes chosen on the computer's
own initiative.

_The nomenclatural applications of any system of canonical forms are also
self-evident. We are very nearly at the point where linear notation may again
‘be dispensable, since the computer should be able to interpret structural graphs
as such. However, a mathematically complete system of classification of
structures is still important, regardless of the notation in which the
structures are expressed. |

The simple graph-theoretical: ideas of DENDRAL could be implemented with a

| number of possible notations. The one adopted for DENDRAL « 64 aims to emulate
18.

traditional notation for all linear chains, only the most obvious abbreviations,
like "3." for "C.C.C.", and a "repeat" symbol, arbitrarily "/", being laid on.
The user must of course understand the principles and notation for the abstract
eyclic graphs. However, it would be quite reasonable to produce an abridged
version of the Ring Index which would list the carbocyclic equivalents of
expected forms, and allow the most unskilled assistant to transcribe structural
‘data in a form readily matched to DENDRAL.

Some examples of structural codes the isomers of alanine, Table 2 are
appended as a challenge to puzzle-ninded readers. Hopefully the tedious manual
of detailed specifications (Lederberg 196%a) is not required reading for
pragmatic understanding of the system. |

There are of course many alternative approaches to notation reviewed by
a National Academy of Sciences Comnittee (1964) and appearing from time to time
in the Journal of Chemical Documentation. As far as I know none of them has
been addressed to the exhaustive prediction of cananical forms and most of
them are too complicated to be easily adaptable to this end. |

Syntax and induction. One of the motives for this study was to uncover
the kinds of problems that would be encountered in computer-emulation of the
process of scientific induction from experimental data. A necessary step is a
means of generating a set of relevant hypotheses. I have been impressed with
both the difficulty and the utility of establishing a precise syntactical
framework for the range of hypotheses ‘even ina field as well structured as
organic chemistry. )

Some years ago, Woodger (1937) attempted to axiomatize developmental and

genetic biology. His efforts were perhaps too remote from the experimental
gata now available. However, he may have pointed the way to a more feasible
‘enterprise, to establish a precise syntax for hypothetical statements in
piology. This is a more modest ain, since it does not purport to deduce whieh
: statenents are correct. However, there is every good reason why computers
should compete very successfully in the exercises of model-building that

_ preoccupy many biologists today, and with advantage to Le rigor with which

| _ they are put together. :

 
FUOTANUTEO
SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY AND HAMILTON CIRCUITS

Footnote p. 9.

liunile this paper was being revised, another algorithm requiring only about

10 n® steps was discovered and programmed for routine use. It depends on

(1) growing a subgraph, adding one node at a time, (2) defining the list of
possible circuits at each level by recursion from the list of previous level, and
(3) looking ahead some steps to choose nodes which close facets of the graph

s0 as to minimize the size of the list that must de maintained.

Footnote to p. 12.

eme speculative "polyhedrqnes” have been discussed by Schults, H.P.: Topological

Organic Chemistry. Polyhedranes and Prismanes. J. Org. Chem. 30, 1361 (1965).

Footnote to p. 13.

3rhis is no longer true. With a new algorithm’, Tutte's graph was exhausted in
29 seconds of 7090 time. The same algorithm is also very. apt for finding the

largest circuits and for forbidden edges.

Footnote to p. 1h.

Monts had already been found by other workers as disclosed in private communications:

D. Barnett, University of Washington and J. Bosak, Bratislava.

Footnote to p. Lh.

>putte (1960) quotes an example. with 22h nodes! If any HC-free polyhedron has
fewer than 38 nodes it probably has one 3-connected region. My own investigations

leave no encouragement for such an example at less than N36°
FOOTNOTES CONTINUED 2

Footnote to p. 15.

6, note the following conjecture, that the symmetries of any abstract convex

trihedral polyhedron can be realized in a geometrical polyhedron in 3-space
with reflection, i.e. can be assigned to a point group. However, this
conjecture is not a premise of the method indicated for finding the symmetries.
The conjecture is plainly inapplicable to 2-connected or to non=-planar graphs,
I would be grateful for any refutation, or a formal proof, new or otherwise.

: ‘
i.
"2.
3.
k,
De

6.
Te

8.

9.

10.

12.

13.

14.

15.

REFERENCES .

J. Lederberg, DENDRAL-64, A System for Computer Construction, Enumeration i
and Notation of Organic Molecules as Tree Structures (NASA CR 57029). 1964):

C. Jordan, Sur les assemblages de lignes. Journal fiir die reine und
angewandte Mathematik 70, 185 (1869).

H. R. Henze and C. M. Blair. The number of isomeric hydrocarbons of the

methane series. J. Am. Chem. Soc. 53, 3077 (1931).

A. M. Patterson, L. T. Capell, D. F. Walker. The Ring Index, 2nd Edition,
American Chemical Society, 1960. Supplement I, 1963. Supplement II,
1964,

W. T. Tutte. On Hamiltonian circuits. J. London Math. Soc. 21, 98 (1946).
L. A. Lyusternik. @onvex Figures and Polyhedra. Dover, New York, 1963.

M. Bruckner. Vielecke und Vielfléche. Teubner, Leipzig, 1900.
D. W. Grace. Computer Search for Non-Isomorphic Convex Polyhedra. Stanford
Computation Center Technical Report No. CS15 (1965).

P. G. Tait. Listinge fTopologie. Phil. Mag. (5), 17, 30-46 (1884);
Scientific Papers, Vol. II, 85-98.

K. Biemann and W. McMurray. Computer-aided interpretation of high
resolution mass spectra. Tetrahedron Letters No. 11, pp- 647-653 (1965).

J. Lederberg. Computation of Molecular Formulas for Mass Spectrometry.
San Francisco, Holden-Day Inc., 1964.

National Academy of Sciences - National Research Council: Survey of
Chemical Notations Systems. Publication 1150. 196).

J. H. Woodger. The Axiomatic Method in Biology. Cambridge, The University
Press. 1937.

D. Perry. The number of structural isomers of certain homologs of methane
and methanol. J. Am. Chem. Soc. 5h, 2918 (1932).

W. @. Tutte. A non-Hamiltonisn planar graph, Acta Math. Acad. Sci. 11,
371 (1960). a St .
oye

Ge NY2S.0-

Sg ee . hoe

>
4.

a—C)—

 

 

‘,

Sd

 

Gee NY2/G

et

oy N™

|
7 Fig. 1. ° Centers of trees: r (radius-center), and m (mass-center). Two
| examples, A., methionine, and B., leucine; The diagrams were plotted by a

| computer program from punch cards coded for each structure as indicated.

\
  

 

 

 

 

 

 

(a) (b)
BENZOPERYLENE .
| .WITH MAPPING OF
BENZOPEAYLENE
, . a . e .
‘| ZN
,———“
. ‘ ‘ . * . ,
asc ~tC~<“S*‘i‘é aE. tC*Ct«é ABEFW
OA Coy OR... OR
AW rr | oo ACOGHJ
(<)
Fig. 2. (a) Benzoperylene and its mapping on a polyhedron (b) which has

four isomorphic planar meshes, i.e. four kinds of faces, as labelled. (c) is
the equivalent Hamilton circuit. Do not confuse the lettered labels of the
nodes with abbreviated code for this graph whibh is 10BCC. The reader may

enjoy satisfying himself that these graphs are indeed isomorphic (equi-connected). :
 

Fig. 3. Hamilton's Hamilton circuit. The abstract dodecahedron, represented

as a planar map of 20 nodes.
UY)

17522663 31752266 63175226

Q

26631752 27663175 ” © 52266317

or A
€8AEB)
Sork
. 2orsB
75226631
Fig. 4 Symmetries and encoding of a cyclic trivalent graph with 8 nodes.

There are 16 syometry operation (8 rotational X 2 reflection). Shown are

8 rotations, and a reflection that could be combined with each of these. With
each figure is also a span list; the canonical choice of the 16 (not all distince)
is the lowest valued span list, 17522663, calculated with the upper rightmost node

as the initial. Thia can then be reduced to the code AEBB, or even more econo-

mically AEB, as outlined in the text.

T5OPKS 21

OVUG

66317522

OO

q aang iy

SAOTTOJ uotqdeg
(a) 0) te) (a)

Figs 5.

.Non planar graphs. (a) and (b) are Kuratowski's fundamental forns,

_4-valent and 3-valent respectively. At least one of these must be included in .

any nonplanar graph. (c) is a projection of (b) as a tetrahedron with an additional.

internal chord,’ and (d) ie a hypothetical molecular structure that maps on to (c).
Figure 6
Caption follows

 

 

 

 

 

 

 

 

 

-©
e

GC esse
9G OU
OC FS Pe
S 9 &s

 

 

 

 

8CEC 8(6AC:8,1:2)

Fig. 6 The cyclic, trivalent planar graphs with 8 or fewer nodes. Where
possible, these are represented as Hamilton circuits, the nodes of the graph
being projected as vertices of a polygon which constitutes the circuit, the
remaining edges shown as chords. Each of these figures can also be drawn as a
planar map. The codes are abbreviated forms from which the graph can be recon-
structed. Note that 8BCC and 8BDD are tsomorphic deepite tha incongruence of the
Hamilton circuits. The abatract polytiedra of thie list include two degenerate
forma (-, circle; 2, hosohedron) and 4B, tetrahedron; 68C, prism; 8 CEC, cube;
8BCC = 8BDD, pentagonal wedge. One of these graphe, 8(6C:8,1:2) has no Hamilton
circuit, and is classified as a union which splices the 8'th edge of graph 6AC
with the l'st edge of graph 2. Complete lists of the graphs through 12 nodes

are presented in Lederberg (1965).
 

(4) : Cb)

 

 

Ce) | : 7 (4)

Fig. 7. A graph with special edges and two HC-free polytiedra. (a) has 16
nodes. The marked edges are included in any HC of- the graph. Hence the 3-cut
(b), with 15 nodes, obligates the marked edge as part of an HC of any graph in
which (b) is inserted. This leads to a contradiction, i.@., no Hamilton circuit

in (c) Tutte'’s graph, with 46 nodes and (d) with 38 nodes.
eS

 

 

 

CANONICAL
MAP

HAMILTON CIRCUIT REPRESENTATIONS |

“Fig. 8. Morphine nucleus: symmetry. and choice for coding. The dashed edge
7---8 stands for the spiro~ (quadrivalent) center in the morphine ring; however,
4 permutations are possible under the symmetry operations. In the canonical form,
after account is taken of the mapping of the chemical graph onto the abstract graph,
this edge is labelled 2—-=3, The canonical map would be coded as
(8BDD-N.3,$, » 23503,,C) each comma marking the next edge of the map. This code

~

ds sufficient input for the computer program to reconstruct the molecular structure
10

l2
13
14
15
°16
17
18
19
20
21
22
23
24
25

table 1. Enumeration of isomeric alkanes (disregarding stereoisomerism), from
methane to pentacontane. The values marked * disagree in some digits with the
values calculated manually by Henze and Slair (1931) and Perry (1932). While

this is an amusing exercise for the computer, the discrepancies, needless to say,

will have no pragmatic chemical significance. In any case, a proportion of the

OoONnN PW PWN

ENUMERATION OF THE ALKANES
STERETSOMERISM DISREGARDED

Wee ee am emeanee |

355

802’
1858
4347°

10359

24894

60523°
148284'%
366319

910726

2278658

5731580
14490245

36797588

93839412:
240215803

617105614

1590507121:

4111846763.
10660307791:
27711253769

72214088660
188626236139
493782952902

1295297588128

3404490780161.

8964747474595
23647478933969

*

eee Tena Taal

165351455535782.
438242894769226'

1163169707886427

30914610118368656'

8227162372221203

21921834086683260
58481806621986230:
156192366474587200
417612400765371900
1117743651746931000

 

structures will be unrealizable owing to steric hindrance.
oe lel OeNeO eCMNeS Cee00 oOelel OeNeC oCe®lO Nee

eCeell OeNeO oNeCal C2000 eOeC aC Ce#NO oComlO OeCeN
oCoC2l Ne020 oNeCal OeC20 eOeCeC CaeNO oletlO OaNol

oCeoCal OeNeO oNeoCel OeOe? oCeelO CeN20 eCe2CO CoeNO
eCeCal OeOeN eNoCae €os00 oCeeCO CaNed eCe2CO No oCO
eCe Cel Neo e0O oN@CeS Ce0e0 eCeelO NeC#O eCaeCO CoNeO

oCaCeC NeOe0 eNeCeC 04620 eCeelO Nel oO . e€%5eCO CoO.N

oCelCeC OeNeO eNeC el Oe00C eCeelO O.CEN eC2eCO NeCeO

oCaCel O200eN oNaCel Coe90 eCoelO ONC eC=eCO NeOod

eS BC ol No e0D eleB®CN C2020 oCesCO Ce®NO oC%eCO OeCeN

eCe®CC NOLO eCeo®CN O6Ce0 oCeelCO CeeNO 0SEeCO OeNeoC

eCeoelC OeNeO eCosCN OeOeC eCoC30 CoNeO , eCezeCO CoeNO

eCe®CC OeOeN eCeoBCN C€e200 oSeC#O CeoOeN eC2eCO Nee lO

eCe*CC Nee OO eCBeoCN C2000 eCeC20 NoCeO BCeC el NeOeO

eCeCeN O4C#O eCHeoCN 04640 oCeol 20 NeOeC ECeoelC NeOeO

eCeCeN Ce200 eCeeCN O606C oCeC#O OeCeN BCeCeN CeOeD

eCeNel O6C20 eC*eCN Ce200 oCeC®O OeNeC EC eCeN Coe00

eCeNel Coe0O eCeCeO CaN20 eC eC2z0 CeeNO HECeNeC CeOe0

eNeCeC OeC20 oCeCeO CHNeD eCeC2O NoelO =ECeNeC Ce 200

oNeed Ce209 of el eO NeC#O C2020 CeNeO aNeCeC Ce0.0

eCeoeClN O4C€20 0Fel oO N2#C.0. C2000 CeOeN TMNeCeoC Coe00

eCeelN Co2#00 oCeCoO O4CON eC2#CeO NeCeO BCeeCN Ce000

eNeeCC O-Cz0 oC e CeO OeNsC eC=CeO NeOel =CeeCN Cee00

oNeeCC Co200 eCeCeO Cao®NO eC=CoO OeCeN BC eC eO CoNeO

eCeCEN CeOe0 eCeleO C2eNO oC=CeO OeNeo BE eC 00 CoON

oC ean OeC20 oC eOed CeN2O eC2#C.0 CoeNO 20 eC 20 NeC.O0

oCeoCaN O4600C Cee C2N.0 Cal eO Noe lO HO eCeO No Oe

eCeC2N Coe00 eCeOel NeC2O oOeoC#C CaNeO #CeCeO CoeNO

ef2CeN C0020 eCeOeC NECeO eOeCHC CeOeN =ZCeOeC CoNeO

eCeCaN OeCe0 oCeOed OeC2N eOeCFC NeCeO BC eel CoOeN

eC#CeN O00eC eCeOeC OeN2zC eOeCBC NeOeC BCeOeC Ne CeO

eCaCeN Cee00 oCeOeC Co®NO oOeC#C OeoCeN BCeO el Noe

eCeNal CeO00 oCeOed CaeNd oOo CFC OeNe Cl #CeOeC CeoeNO

oCeNal OeCeD eOelel CoNzO oOeC2C CoeeoNO BC eeCfO CoNeO

oCeNal 0206€ eOelel CaN.O e0,.C2C NeeCO eC oeCO CeOLN

oCeNal Co200 0OeC eC NeCH#O oCe®CO CoNeO BCeeCO NeCeoO

eCeNeC Ce000 eOeCeC NaCeO eCet®lO CoOeN BCeelO Noe

eCaNelC O0Ce0 eOeCed OsCaN eCe®CO Need ™CeelO CoeoNO

eCAaNeC OeOeC

Cosel CaN 0,0 Ceeel C20 OLN CooeD Cal OWN Cet®eO CoN CeO
CoeeC NaC 060 Ceol O6C NoO Coe®O CoC NO Cote CoN OC
CeomeC CoN O20 CmeeC OeC DoN CoteO CoC NsO Cot=eO NeC CeO
Come Nel O09 CaeeN CaCl 020 CateO CoC OeN Come Net O0€
Cmeel CoN 020 CoteN Cel 940 CzeeO Col NO C2#..0 CaN C20
Cmeel Nel O20 CaeeN Cel 040 C#—ee0 CoC ON CeeeD CeN O8C
Ceoael CeO N2O CeaoN €.0 C20 : CeeoeO CoN CHO CreeD NoC CeO
C oe0f OC NeO CeeeN C#O O6C CoeoeO Net CzO CaeeO NeC Ool
Coeel C#O NeOD Coo®N CeO C60 CoeeO CEN CeO NeeeC C#C 020
Cooel C#O0 ON CatPeN CeO O4C CoeeO CBN O6€ Newel CoO C20
Coot CoO NeO CmeeN CeO C20 Coeed NaC CeO Neoel CaO OC
Coe®l OeC NeO CmeeN CeO OC CoeoeD NaC 06€ : NeoeD CoC CeO
Come CoO NeO CeeeN O6C O4C Coe2®O CoN CeO NeoeD C#C CeO
CoBel CoO OaN Coane CoC N20 CoetO NeoC CoO NeeeDd CHC Oe€
Caeel CoO NeO CooeO CHC NeO

CoeesC CO NaO Cooeet OO NaC

Cooeel NO CHO CoooeN OO CHC

CeooesC OO CAN

Table 2, The isomers of alanine (.C..CN C.=00 ) systematically ordered in DENDRAL-64

: Notation. Each "." or "=" stands for a single or double bond respectively which

' must be satisfied by a trailing atom or radical. This will be the first previously
unreferenced item in the list to the right of the bond. A leading bond constitutes

a central link, which must then be followed by two radicals. A space is used

to separate the primary radicals for convenience in reading but has no coding
significance. Some 25 of these topological possibilities are recognized chemical

forms; an equal number are their tautomers. Most of the remainder are either p®roxides

' or Schiff bases or similar unstable forms. A few, like hydracrylaldoxime, (.C.C.0 C=N.0)
might be realizable but were not found in a cursory search of the literature.
, ‘

- oS oo Loe “*uoapayftod | B JO saoez jo Aaqunu s9yQ vey. ssaz suo SF svuL
he, oot = oe a | (S961) 22829 03 3uqpaossy * .
ne et So - (C961) S2eqzopoy UZ porszT zy

 

 

  

 

 

 

 

 

 

 
 

 

: Se Se fat (s96t) Szeqzepeq t UZ uneap sain3yzg y
. . re / ve oe “*quno9 STYI worz Pepny[oxs are sMIOJ oatdg
. “TI® sayyyusys y “xapuy Buzz ayq mors sa,dmexa maou jo eseues jo Junos aya aie s20yoe7q ay siequng
“. : Jee - os as . = 7 : : - 7 - a : _ " - . 7 : L 4 a . : . / / | ; *¢ sTqel
ce EET ATES Dr - (6 _ (zy. a Do
: . er 7 ” tS oC ed oS : a ‘ . i . a 7 , . . , - 7 ae , ee, " ee me wee . ) a " . * : .
— cl Ree mo} 7 in on aT we:
De preg SE ta) to (1h. Lee . Os
gag ; store oe gp “
oe lg eres wow e tide ty Ege ' [zi cz - To 6
a pes ee [<9] lel! os S20 et
ee see eee my cme aaa ue ee tees ote . Cee
+ [2] o€ 1 [oJeet : Ise} lest \ {¢] aU: 1 ae 2
4 ‘. Ste bepre teen t 1 ° oa tae ED
. . . i . . : - : a“
1 fas ; [olst (oz) ce. Iv}s 4 9
1: Sc ’ a ° : Ttrvelties 0th,
: ' us . CL Magee de eA eels
| [o)é. | [6] ot wot Ps og
. . wetter een ee , Wap , Po
; — TO ot 9 a
CO). ae ‘poe _ et ‘ € eo, Pee
:. : a? ne ae | 7 Bone ef
} pte . o ire al \ z eo Bs
a ae Sos Lo . Doe! . ‘ . re we *
Lg : j Fo so :

 

 

 

 

suojug aeuelg (aeuelTq—uony (aeueTa) eiposys[og soupy jesyusqD $9} I104
- sulog ayaney suojug . + jo 2z2qung

 

 

 

{NIAID USI] Fuey TNOWTTA , SIPRoTFO UOIATFWeH WIM nn oe arn an *
: [sydeaZ ,eozmayd usocuy jo ezaua3 pue) - 7 mo a. ” -
° SHAVUD INGIVATEL SIT9A9 40 INDOD