* has tes the Interim Report to the: © , NASA STAR No. N66~14075 National Aeronautics and Space Administration Grant NsG 81-60 - NASA CR-68899 : Instrumentation Research Laboratory Technical Report No. 1040 SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY AND HAMILTON CIRCUITS ak. A General Outline of the DENDRAL System submitted by Joshua Lederberg Professor of Genetics School of Medicine Stanford University Palo Alto, California _ ? ‘January 12, 1966 Studies related to this report have been supported by research grants from the National Aeronautics and Space Administration (NsG 81-60), National Science Foundation (NSF G-6411), and National Institutes of Health (NB-04270, AI-5160, and FR-00151). . i FOREWORD. This contribution is intended as an introductory survey of the topological concepts that underlie the DENDRAL system for chemical structure notation. The main purpose of the system is to provide a language in which a computer program can frame hypotheses of organic chemistry. For example, a program to generate all the isomers of a given formula has already been imple- mented. This introduction is especially intended for users who wish only a general outline of DENDRAL rather than its full details of syntax. Some notation is necessarily used. This resembles the definitive DENDRAL forms, but the complete manual should be used as a definitive statement of the lan- guage. - “. SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY | Sys AND HAMILTON CIRCUITS Joshua Lederberg Genetics Department , Stanford University School of Medicine ‘. Palo Alto, California The structural formula for an organic molecule is a paragon of a topological graph, that is, the connectivity relations of a set of atoms. True, we recognize more than one type of connection, doubie, triple, and non-covalent “bonds, as well as single bonds. However, from an electronic standpoint the special bonds could just as well be denoted as special atoms. The structural graph . does not specify the geometry, that is, the bond distances and bond angles of the molecule. In fact, this is known for only a small proportion of the enormous number of organic molecules whose structure is very well known from a topological standpoint. Most of the syllabus of elementary organic chemistry thus comprises & survey of the topological possibilities for the distinct ways in which sets of atoms may be connected, subject to the rules of valence. The student then also learns rules which prohibit some configurations as unstable or unrealizable (and may later earn his scientifie reputation by justifying or overturning one of these rules). The field of organic chemistry has, however, reached its present stature without many benefits from any general analysis of molecular topology. These benefits might arise in applications at two extremes of sophistication: the teaching of chemical principles to college undergraduates, and to electronic computers. They may also apply to the vexatious problens of nomenclature and systematic methods of information retrieval. Although the topological. character of chemical graphs was recognized by 2.- the first topologists, very little work has been done on the explicit classifi- cation of the graphs having the most chemical interest. Some difficult. problems, e.g., the enumeration of polyhedra, remain unsolved. However, the main obstacle may be the seeming triviality of the problems, many topologists being quite unsatisfied with systems restricted to 2= or 3-dimensional space. This article will review some elementary features of graphs that may be used for a systematic outline of organic chemistry. The same theory has the broader significance of classifying the possible nets of relationships among the members of a set of objects. For present purposes, our graphs will be undirected, that is, any connections are reciprocal and unpolarized. Further- more, our atoms have a maximum valence of 4. When we come to cyclic structures we shall have occasion to study an even more restricted set of graphs, those in which every node has a valence of 3. | 7 A problem statement might be: enumerate all the distinct structural isomers ‘of a given elementary composition, say C,H,NO,. This is tantamount to oy producing all the connected graphs that can be constructed from the atoms of | the formula, linked to one another in all distinct ways, compatible with the valence established for each element (4, 3, and 2 for C, N, 0, respectively). For compactness, H can be left implicit, being later restored at every unused. valence. ’ Our main approach throughout this article is mapping, a rule of correspon- dence between a part of the chemical structure and a part of some abstract graph. Thus, each atom may be mapped on to a node: each pond to an edge or link of the graph. For further analysis, however, it will be important to map : from complexes ‘of the structure to elements of a graph. The abstract graphs lend themselves to canonical forms, i.e., @ choice among equivalent representations ro ir) ‘ 3. - according to precise rule. Since the root problem is generally not that of producing all possible combinations of atoms, but recognizing which forms are “unique, this is of utmost importance. Chemistry will re-emerge after a few levels of abstraction. | | These principles have been elaborated in a computer-oriented language -"Denaral-6h" which is described more fully elsewhere for the purpose of possible implementation in programming systems (Lederberg, 1964). Trees are l-connected graphs, i.e., can be separated into two parts by cutting any link, They correspond to the acyclic structures of organic ‘chemistry. How may we establish a canonical form for a tree, after first noting its order (number of nodes). | | _ The first step might be to find some unique place to begin the description. A tree must have at least two terminals, and may have many more if highly branched; these are therefore not very suitable. However, each tree has a unique center. In fact Jordan (1869) showed that any tree has two kinds of center, @ mass-center.and a radius-center. Each center has a unique place in any tree; the two may or may not coincide. ( | To find the radius-center, the tree is pruned one level at a time, being cut back one link from every terminal at each level. This will leave, finally an ultimate node or node-pair (in effect, edge) as the center; the radius of the graph is the number of levels of pruning needed to reach the center. To identify the mass-center of a tree, we must consider the two or more _ branches that join to each non-terminal node. The center is the node whose '" Dranches have the most evenly balanced allocation of the remaining mass (node= count) of the tree. This is the same as to say that none of the pendant branches exceed half the total mass. A mass of even number allows the possi= vility of the center being a node pair or edge which joins equal halves, 4. Either of the centers (Fig. 1) is unique, and so could solve our problem of defining a canonical starting point of a description. The center of mass | s more pertinent to finding a list of isomers, which of course enjoy the same mess. The radius-center is ill-adapted for this, but matches con- vantional nomenclature, which is based: on finding the longest linear path, i.e., a diameter. The diameter is not necessarily unique. For example, urea nas three diameters, N - C -N andN-C=0 (twice), but just one radius- canter, the c atom. The problem of generating isomers is the main justifica~ tion for adopting the mass-center ‘over the radius-center to work out canonical POEMS In chemical terms, the center divides the graph into two or more radicals. These radicals can be ordered by obvious compositional principles, giving rise to @ canonical description of the whole graph in a linear code. Thus arginine becomes (C~C-N~C(N)-N C-C(N)-C(0)-0} or, in a parenthesis-free notation with some abbreviations »2.N.C.:NN 2..NC. 200 | . Any linear code hes an implicit number syaten: each atom is numbered according to when it is denoted in the string. Some thirty years ago, Henze and Blair (1931) showed how Jordan's principle could be used for the enumeration of isomers of saturated hydrocarbons and some simple derivatives of them. Here, the nodes are all the same (carbon atoms) and the enumeration can proceed by recursion from smaller to larger complexes. For example, for the isomers of undecane, C5 4oy> one atom is desig~ nated as center, leaving 10 to be allocated among 2, 3 or 4 branches. Only the following partitions satisfy the rules (leaving dissymmetry out of account): co ° boo - Ng De BRANCHES | LJ. 2 oF 545 5 of 1 3'3°3 —~E] | 2.4. . 3,3,4 ae aes Qo 22s | a ~15243,8 2,2,2,4 | 2,2,3,3 some Oe ee, To complete the solution, one must have calculated the number of alkyl radicals “Co, C,, etc. To illustrate with Cai The radical must have an apical atom, leaving the rest to be partitioned dn ali distinct ways among 1, 2 or 3 pendant branchés, the radicals of the next level. Thus we have: — ok < Ja - c<—[ J]. 1,1,2 NO . The count of C, radicals is thus derived from the table for “C,, taking i from ‘ 4 lton-«~ 1, and the process may be itcrated as far as needed, i.e., until partitions into units, Clo» prevail. | No deep mathematical insight is needed to 6. verify that the first steps of the alkyl series C,, C,, C3, ¢, . have 1,3,2,4 forms respectively. No closed algebraic expression has been found for this enumeration. sawever, the recursive expansion was done by hand (Henze and Blair, 1931) with a fav trivial errors found by a computer check; no organic chemist will be surprised by the enormous scope of his field. (Table 1). The total range of acyclic compounds is of course very much larger than these subsets. ‘At each step, instead of partitioning a mere number of nodes, an allocation to constituent radicals takes account of the kind as well as number of unused atoms. However, the specification of a hierarchy of ordering, which may be done almost arbitrarily to suit computational convenience, permits the same principles to be applied to a complete enumeration of structural isomers of a given composition, for example of alanine, C,H/NO.» (fable 2.) Cyclic Structures Cyclic graphs are much less tractable, since every path will return back an to the complex, and a center is less easily defined. Sufficient reminder of the taxonomic difficulties posed by rings is the popularity of the Ring Index 1964) wherein the "11524 rings known to chemistry" are laid out, together with a profusion of synomyous and alternative numbering systems to map them as nodes. _ For example, naphthoyl pyridine would ultimately form a tree, Ry - a Ro » Ry and Roe ° 0 We now consider the domain of strictly cyclic structures. These are 2~ connected graphs, since at least 2 (sometimes more) links must be cut in order . 40 separate the graph. For further analysis, we distinguish the trivalent vertices of the structure ‘ atoms thet join 3 paths, or branch points. We can then construct the full set of T abstract, trivalent graphs. Define a path as q link or an unbranched chain of links and atoms. The paths between vertices of the structure can then be mapped onto the edges of an abstract graph which is regularly trivalent or trihedral. To illustrate, observe how pyrene is mapped onto an abstract graph of 6 vertices, indeed, the abstract prism. -CCC- -CCC- -CC- . oe ' Pyrene (a) (b) (ce) (a) Some vertices are hevalent, in so-called spiro forms, but these graphs can be mappedoonto 3-valent graphs by expanding each k-valent node into a pair 7 of 3-valent nodes. That is, a becomes ‘ —\ There is an obvious "relationship between the number of vertices and the number of rings conventionally ascribed to a structure. We start with, say, benzene, O vertices, and-1 ring. Then naphthalene, 2 vertices and 2 rings. Each additional ring entails 2 more vertices. Hence, for r rings andn vertices . rail * n/2 , | and for these trivalent graphs, n must be an even integer. Recalling that a he valent vertex maps into 2 3-valent nodes, we can write | rsl+tn/2+q for q 4evalent vertices. This calculation agrees with the Ring Index rule which i ‘counts rings as the number of cuts needed to convert a ring structure into a tree. As each edge joins 2 nodes, a trivalent graph of order n will have - 3n/2 edges. Enumerating the trivalent graphs. A trivalent graph may have several 8. representations, and some effort may b> required to relate them to one another, and to decide which form is to be regacded as a canonical reference for mapping purposes. Thus, the graphs of Figure 2 are all topologically equivalent or isomorphic. This is to say, they all represent the same connections of node to (three) nodes. A meaningful enumeration must unify these isomorphisms. Fure thermore, it should relate to a convenient code by which to refer to each graph, better still, to embody a reconstruction. Finally, it should generate - an obvious numbering of the nodes and edges. . Hamilton circuits. A practical key to the solution of this problem, as to many other network problems, taxes advantage of the Hamilton circuits found in most of the abstract graphs having chemical interest. ‘A Hamilton circuit (HC) is a round trip through the g-aph that traverses each node just once. It therefore uses n edges, leaving out n/2 edges. Figure 3 is Hamilton's own example, the dodecahedron, proposed by him as 4 parlor game, each node representing a city that the round-the-world traveller would not wish to. revisit. The utility of HC representations will become evident. ) Finding all HC's of a graph may be a challenging game, but it is reduced to a merely tedious algorithm on the computer. Start from an arbitrary node. Trace a path as through a maze, each node presenting a binary choice of aaa different edges. If the chosen path reverts to a node already visited, pack= track one step. A successful path has n correct choices. Thus, at most 27scarch steps will exhaust all possible paths; in practice, closer .to 1/n times this number will be needed to identify all the HC's. Even for n up to 20 this is a modest task. And if the work has been done once, finding any HG, at perhaps n-fold less effort, will enable @ given graph to be related to the - 9. previously established set. A typical problem in graph manipulation is to establish whether two complicated graphs are isomorphic. In the long run, this might require testing all possible permutations of nodes, with a scope of Factorial (n). _ Aton = 20, this number is an utterly uncomputable 2.4 x 1018 steps. On the other hand, if two graphs are isomorphic, they must have the same HC's, found vith at most 279 = 10° steps. | oA convenient representation of a HC maps the nodes and edges of the circuit as Vertices and vounding edges of a regular polygon. The remaining n/2 edges then form chords, each node being one of the two termini of one chord. A description of the graph then needs only some notation for the n/2 chords. First, we should canonicate the orientation of the polygon, having chosen to initialize the HC arbitrarily among n nodes and 2 directions (the rotational and reflectional symmetries of the polygon). Each node is joined by some chord having a certain span. The span list can be put in cyclic order, where it is invariant under rotation; i.e., immaterial which node is selected as starting point. The effect of reflection is also easily computed. If the span list is regarded as a number, its minimum value under rotation/reflection becomes the canonical form. For example, an 8-node graph might be represented (Figure 4 : by any one of the span lists 17522663, 31752266, etc., or the reflections 75226631, etc. Of these, one quickly finds that 17522663 is the lowest-valued, “hence the canonical form. Similarly, when other HC's are found for the same graph, they can be compared, and the .lowest-valued of them chosen as the ~ | reference graph. The same procedure establishes a canonical ordering of the nodes and 10. edges. For the latter, we take the HC sequence (the polygon) first, then each chord in order of first reference. The spon list has n terms. Only n/2 are necessary, since each chord ai scscxaed to twice in the span list. For an abbreviated code, simply omit the second reference. 17522663 becomes 1522. Indeed, one less character still suffices, the last chord being completely determined by the ones previously built. he chord list (152), or an alphabetic equivalent (8AEB) whose leading numeral merely reminds us of the order of the graph, then encodes the graph in a canonical form (Figure 4). Furthermore, the graph can be reconstructed from «he code by retracing the steps just recited. Caution: Unlike span lists, the abbreviated chord lists cannot be freely rotated. Chord lists can be computed by an obvious combinatorial procedure, with the help of a few tricks to save some fruitless effort. Most arbitrary lists become internally inconsistent after a limited number of initial characters; the number of combinations that must be tested is therefore considerably less than muy eppear. Additional restrictions can also be put on prospectively. In this way, exhaustive lists of trivalent graphs have been computed -~ Table 3 {taken from the DENDRAL report) shows their scope. To unify isomorphisms, the complete list of HC's is computed for each chord list. Apart from the rotation of the polygon, two or more incongruent HC's may be present in a Graph. No general principle is known, except that graphs with high symmetry tend to have the fewest incongruent HC's. Tutte (1946) proved that any edge of a polyhedron must be involved in an even number (not excluding 0) of HC's, and that if a polyhedron admits one HC, it must admit at least three. | Classification of trivalent graphs. Two important, independent criteria i. \ il. of abstract graphs are (1) planarity, and (2) level of connectedness. A planar graph is one that can be represented on the plane without edges crossing over one another. The graph need not be drawn as an HC-polygon, which rarely lacks crossing chords: Figure 3. is certainly planar» Kuratowski has ~ shown that any trivalent non-planar graph must contain 6CC (Figure 5b). . Fortunately, this condition is easily recognized in the building of span lists. As the surface of a polyhedron can be mapped onto the plane, planarity is a necessary condition for an abstract polyhedron. In practice, nonplanar graphs are so far unknown in organic chemistry (barring coordination complexes); however, they might in principle be realized, @.g., by the hypothetical Figure 5d. ‘ ae Connectedness is the least number of cuts that will anywhere separate: the graph. The 3-connected planar graphs are the abstract convex polyhedra. Intuitively, it is obvious that a region bounded only by 2 edges would be unable to enclose a volume. Steinitz (see Lyusternik, 1963) showed that every 3-connected planar trivalent Graph could be realized as a polyhadron. These graphs have, naturally, attracted some interest as a meeting point of topology and classic Greek geometry. Nevertheless, a complete enumeration is still unknown. In 1901, Brickner published figures of the trivalent polyhedra for n < 16; in an abstract and unpublished manuscript (1928) he also showed 1250 for n 218. This work, done by hand over several decades, . was xepeated on the computer by Grace (1965) who found some errors in Brilckner's 12. listings, and found 12h9, However, even this census admits some possibility of being incomplete, though this is remote. Grace generated the polyhedra by induction as all possible slicings of the faces of smaller polyhedra. This produces many isomorphisms which must be unified; for this, Grace used a criterion, “equisurroundedness", which is already known to be too weak, albeit for much larger graphs. Therefore, it cannot be rigorously shown that the list or 129 has not excluded additional forms, equisurrounded, but not isomorphic with the stated set. The analysis of HC's could afford an independent avenue of corroboration at relatively low cost. The polyhedra play an important role in the classification of cyclic graphs but have no remarkable chemical significance except that they represent the most tightly caged polycyclic structures)’ Note that many unfamiliar iso- morphisms are generated by portraying a polyhedron as a planar mesh, i.e., as projected within an arbitrarily chosen face, called the base. The projection can be visualized as the view of the polyhedron from a point just outside the place of the face chosen as base (Figure 2). / - HC-free graphs. These are promptly encountered in the 2-connected series, starting with ng (8(AC:8,1:A) Figure 6). An analysis of the conditions for no-HC illuminates some of the combinatorial processes involved in building graphs. -- Since all the graphs for n = 6 have HC's, an HC-free graph is generated by & particular mode of union of HC's of lower order. The simplest mode is bilineal, one edge is cut on each of two smaller graphs and reunited. If either of the edges involved is barred from any HC of its graph, the bilineal union will be HC-free. This follows, since the union introduced nodes which must be traversed by a path known to be forbidden. 13. In general, an HC-free graph can be canonicated by dissecting it into the - largest cirbuits it contains. The dissertions are first completed across the | bilineal (2-connecting) unions. If any resulting subgraphs are still HC-free, we must consider HC-free polyhedra as a mathematical, if not a pragmatic chemical . possibility. | HC-free polyhedra. Tait believed that all convex trihedral polyhedra - contained HC's and his conjecture was indéed unchallenged for over 60 years. However, Tutte (1946) refuted the conjecture with an example ingeniously proven to be HC-free, though with 46 vertices it would defy exhaustive search. Chemical graphs of this order (24 rings) are out of range of systematic prediction, but the argument gives further insight into the combinatoric of abstract graphs. We deal here with the process of trilineal union. This can be done in all possible ways by extracting one node from any source polyhedron, leaving 3 cut edges. This 3-cut graph can then replace one node of another graph. However, to influence the possibility of forming an HC, the edges must be subject to some restrictions distinguishing the 3-cut complex from a single , node. The node poses no restrictions. That is, its 3 edges are available in any pairwise combination, thus any one of 3 ways. If the corresponding edges of the source graph have the same property, i.e., none of the 3 edges is either_ compulsory or forbidden, then the Zcut graph will not influence the occurrence | of an EC. By induction, the lower order polyhedra that already contain some 3-connected regions can be passed over in looking for special graphs. A systematic survey of the few 4~connected, ~ i.e., b-connected except for the isolated nodes which are, of course, 3-connected, - graphs (Table 4) shows the polyhedron (16CGDIGDF), the smallest with a special edge, namely that the 14, ones marked are obligatory in any HC of the polyhedron (Figure 7). Tutte then replaced 3 nodes of a tetrahedron with a 3-cut graph from (16CGDIGDF) leading to the contradiction that all three edges from one node must be included in any HC; hence there can be no HC in this graph of 46 = 4 + 3(14) nodes. The cut graph can also be planted at two mutually-exelusive edges of the pentagonal prism to give an HC-free polyhedron of 38 = 10 + 2(14) edges NY This is clearly the smallest HC-free polyhedron with two 3-connected regions. A smaller HC~free polyhedron may yet be found by analogous studies of b-lineal'e’ana d-lineal unions, and if so, is just within the bounds of reasonable computational effort. If Grace's list of polyhedra is correct, every one through ng has an HC. This conclusion is corroborated by a detailed consideration of the. properties of the graphs 216 of table 3. By the inductive argument , forms with any triangular face -- indeed, any 3-connected region ~- could be. passed over, Greatly reducing the computational effort. Of course, from the smallest HC free polyhedron, larger ones can be generated by replacing a node with a triangle or larger 3~connected region. The HC-free polyhedra can be classified by the same principles used for bilineal unions, as complexes of the iargest cireuits united over the least _ levels of connectedness. — 15. While distant from chemical graphs of any reasonable size, these studies - do furnish a clearer indication of the sufficiency of HC representations, and _ of the sources of exceptions. Recapitulation: the scope of anticipation and recognition. There is no "perceptible Limit except the computation of HC's and of alternative dissections to | | restrict the encoding of abstract graphs either as HC's or - ecanonicated unions of HC's. These assignments also facilitate the recognition of isomor= phisms between given graphs. The anticipation of all possibilities poses a greater burden. However, all the graphs up to Ny» (7 rings) have been tabulated together with their isomorphisms and symmetries. The series expands so rapidly that further extension would tax the output-printer, and before long the computer itself. Mapping and symmetry. Having explored the trihedral graphs, we now return to mapping chemical atoms on their nodes and bonds or linear chains on their edges. Many graphs have substantial symmetry, and the corresponding by redundant °- operations must be considered to decide on a canonical representation. Here again, the HC's are helpful. If an HC is present, it can also be projected on _ the same graph after any symmetry operation: ’ Therefore, the whole set of symmetry operations is included within the list of the HC's, giving remarkable ot economy of computational effort to the search for the symmetries, as well as a straightforward expression of the operators. To describe a molecular structure, it can be mapped on an arbitrary choice of form, and the result then subjected to the symmetry operators.:- The canonical representation satisfies some rule, say the highest order listing,. of the mapped elements. Thus, for | | 216. the morphine nucleus, we would have to choose among the 4 symmetries of its " underlying grapht (Figure 8). Since this choice is readily computable, the human user may be relieved of the burden to make these tedious calculations. Besides the linear paths of the cyclic structure, the mapping may also include specifications for fused edges (l-hedral centers), heteroatom replacements of vertices, and specifications of sterecasymmetry of vertices. The details are inevitably fussy and are given elsewhere. After the mapping, each atom is numbered in the order of its reference. Merging cycles and trees. Each cyclic structure is now fully definea, with ‘rules for a canonical code and sumbersng of every atom. The structure can then be handled as a node in a tree, the rumbering system allowing precise reference for the point(s) of connection. . Applications This development was needed for a continuing effort to program the ‘automatic computation of structural hypotheses to be matched against various | sets of analytical data, especially mass spectra. The growing sophistication of | : instrumental methods has already begun to outdo the chemists capacity to interpret the fesults. a mass s spectrometers are. now _ commercially :: i 17. available that can generate 10,000 spectra per second, the need for computational assistance to make full use of such devices is self~evident. (Biemann & McMurray 1965; Lederberg 1964>) Such devices are also being considered for the automated ‘exploration of the planets, which puts even heavier demands on the local | intelligence available to the systen. These applications relate primarily to the possibility of anticipating _ hypothetical structures. The language also provides a format for expressing synthetic insights, i.e., the elementary reactions by which functional groups “ean be altered or exchanged. We might then expect the ultimate development of * computer programs which have been taught a few thousand unit processes, and their limitations, and could be challenged to anticipate a synthetic route from given precursors or to a given end product. Such programs might at ’ least assist the chemist by reminding of a few among myriad possibilities of | combining the unit processes learned from the same chemist, or better, from | a diverse school. For the moment we leave out of consideration the empirical testing in its own laboratory of a few thousand routes chosen on the computer's own initiative. _The nomenclatural applications of any system of canonical forms are also self-evident. We are very nearly at the point where linear notation may again ‘be dispensable, since the computer should be able to interpret structural graphs as such. However, a mathematically complete system of classification of structures is still important, regardless of the notation in which the structures are expressed. | The simple graph-theoretical: ideas of DENDRAL could be implemented with a | number of possible notations. The one adopted for DENDRAL « 64 aims to emulate 18. traditional notation for all linear chains, only the most obvious abbreviations, like "3." for "C.C.C.", and a "repeat" symbol, arbitrarily "/", being laid on. The user must of course understand the principles and notation for the abstract eyclic graphs. However, it would be quite reasonable to produce an abridged version of the Ring Index which would list the carbocyclic equivalents of expected forms, and allow the most unskilled assistant to transcribe structural ‘data in a form readily matched to DENDRAL. Some examples of structural codes the isomers of alanine, Table 2 are appended as a challenge to puzzle-ninded readers. Hopefully the tedious manual of detailed specifications (Lederberg 196%a) is not required reading for pragmatic understanding of the system. | There are of course many alternative approaches to notation reviewed by a National Academy of Sciences Comnittee (1964) and appearing from time to time in the Journal of Chemical Documentation. As far as I know none of them has been addressed to the exhaustive prediction of cananical forms and most of them are too complicated to be easily adaptable to this end. | Syntax and induction. One of the motives for this study was to uncover the kinds of problems that would be encountered in computer-emulation of the process of scientific induction from experimental data. A necessary step is a means of generating a set of relevant hypotheses. I have been impressed with both the difficulty and the utility of establishing a precise syntactical framework for the range of hypotheses ‘even ina field as well structured as organic chemistry. ) Some years ago, Woodger (1937) attempted to axiomatize developmental and genetic biology. His efforts were perhaps too remote from the experimental gata now available. However, he may have pointed the way to a more feasible ‘enterprise, to establish a precise syntax for hypothetical statements in piology. This is a more modest ain, since it does not purport to deduce whieh : statenents are correct. However, there is every good reason why computers should compete very successfully in the exercises of model-building that _ preoccupy many biologists today, and with advantage to Le rigor with which | _ they are put together. : FUOTANUTEO SYSTEMATICS OF ORGANIC MOLECULES, GRAPH TOPOLOGY AND HAMILTON CIRCUITS Footnote p. 9. liunile this paper was being revised, another algorithm requiring only about 10 n® steps was discovered and programmed for routine use. It depends on (1) growing a subgraph, adding one node at a time, (2) defining the list of possible circuits at each level by recursion from the list of previous level, and (3) looking ahead some steps to choose nodes which close facets of the graph s0 as to minimize the size of the list that must de maintained. Footnote to p. 12. eme speculative "polyhedrqnes” have been discussed by Schults, H.P.: Topological Organic Chemistry. Polyhedranes and Prismanes. J. Org. Chem. 30, 1361 (1965). Footnote to p. 13. 3rhis is no longer true. With a new algorithm’, Tutte's graph was exhausted in 29 seconds of 7090 time. The same algorithm is also very. apt for finding the largest circuits and for forbidden edges. Footnote to p. 1h. Monts had already been found by other workers as disclosed in private communications: D. Barnett, University of Washington and J. Bosak, Bratislava. Footnote to p. Lh. >putte (1960) quotes an example. with 22h nodes! If any HC-free polyhedron has fewer than 38 nodes it probably has one 3-connected region. My own investigations leave no encouragement for such an example at less than N36° FOOTNOTES CONTINUED 2 Footnote to p. 15. 6, note the following conjecture, that the symmetries of any abstract convex trihedral polyhedron can be realized in a geometrical polyhedron in 3-space with reflection, i.e. can be assigned to a point group. However, this conjecture is not a premise of the method indicated for finding the symmetries. The conjecture is plainly inapplicable to 2-connected or to non=-planar graphs, I would be grateful for any refutation, or a formal proof, new or otherwise. : ‘ i. "2. 3. k, De 6. Te 8. 9. 10. 12. 13. 14. 15. REFERENCES . J. Lederberg, DENDRAL-64, A System for Computer Construction, Enumeration i and Notation of Organic Molecules as Tree Structures (NASA CR 57029). 1964): C. Jordan, Sur les assemblages de lignes. Journal fiir die reine und angewandte Mathematik 70, 185 (1869). H. R. Henze and C. M. Blair. The number of isomeric hydrocarbons of the methane series. J. Am. Chem. Soc. 53, 3077 (1931). A. M. Patterson, L. T. Capell, D. F. Walker. The Ring Index, 2nd Edition, American Chemical Society, 1960. Supplement I, 1963. Supplement II, 1964, W. T. Tutte. On Hamiltonian circuits. J. London Math. Soc. 21, 98 (1946). L. A. Lyusternik. @onvex Figures and Polyhedra. Dover, New York, 1963. M. Bruckner. Vielecke und Vielfléche. Teubner, Leipzig, 1900. D. W. Grace. Computer Search for Non-Isomorphic Convex Polyhedra. Stanford Computation Center Technical Report No. CS15 (1965). P. G. Tait. Listinge fTopologie. Phil. Mag. (5), 17, 30-46 (1884); Scientific Papers, Vol. II, 85-98. K. Biemann and W. McMurray. Computer-aided interpretation of high resolution mass spectra. Tetrahedron Letters No. 11, pp- 647-653 (1965). J. Lederberg. Computation of Molecular Formulas for Mass Spectrometry. San Francisco, Holden-Day Inc., 1964. National Academy of Sciences - National Research Council: Survey of Chemical Notations Systems. Publication 1150. 196). J. H. Woodger. The Axiomatic Method in Biology. Cambridge, The University Press. 1937. D. Perry. The number of structural isomers of certain homologs of methane and methanol. J. Am. Chem. Soc. 5h, 2918 (1932). W. @. Tutte. A non-Hamiltonisn planar graph, Acta Math. Acad. Sci. 11, 371 (1960). a St . oye Ge NY2S.0- Sg ee . hoe > 4. a—C)— ‘, Sd Gee NY2/G et oy N™ | 7 Fig. 1. ° Centers of trees: r (radius-center), and m (mass-center). Two | examples, A., methionine, and B., leucine; The diagrams were plotted by a | computer program from punch cards coded for each structure as indicated. \ (a) (b) BENZOPERYLENE . | .WITH MAPPING OF BENZOPEAYLENE , . a . e . ‘| ZN ,———“ . ‘ ‘ . * . , asc ~tC~<“S*‘i‘é aE. tC*Ct«é ABEFW OA Coy OR... OR AW rr | oo ACOGHJ (<) Fig. 2. (a) Benzoperylene and its mapping on a polyhedron (b) which has four isomorphic planar meshes, i.e. four kinds of faces, as labelled. (c) is the equivalent Hamilton circuit. Do not confuse the lettered labels of the nodes with abbreviated code for this graph whibh is 10BCC. The reader may enjoy satisfying himself that these graphs are indeed isomorphic (equi-connected). : Fig. 3. Hamilton's Hamilton circuit. The abstract dodecahedron, represented as a planar map of 20 nodes. UY) 17522663 31752266 63175226 Q 26631752 27663175 ” © 52266317 or A €8AEB) Sork . 2orsB 75226631 Fig. 4 Symmetries and encoding of a cyclic trivalent graph with 8 nodes. There are 16 syometry operation (8 rotational X 2 reflection). Shown are 8 rotations, and a reflection that could be combined with each of these. With each figure is also a span list; the canonical choice of the 16 (not all distince) is the lowest valued span list, 17522663, calculated with the upper rightmost node as the initial. Thia can then be reduced to the code AEBB, or even more econo- mically AEB, as outlined in the text. T5OPKS 21 OVUG 66317522 OO q aang iy SAOTTOJ uotqdeg (a) 0) te) (a) Figs 5. .Non planar graphs. (a) and (b) are Kuratowski's fundamental forns, _4-valent and 3-valent respectively. At least one of these must be included in . any nonplanar graph. (c) is a projection of (b) as a tetrahedron with an additional. internal chord,’ and (d) ie a hypothetical molecular structure that maps on to (c). Figure 6 Caption follows -© e GC esse 9G OU OC FS Pe S 9 &s 8CEC 8(6AC:8,1:2) Fig. 6 The cyclic, trivalent planar graphs with 8 or fewer nodes. Where possible, these are represented as Hamilton circuits, the nodes of the graph being projected as vertices of a polygon which constitutes the circuit, the remaining edges shown as chords. Each of these figures can also be drawn as a planar map. The codes are abbreviated forms from which the graph can be recon- structed. Note that 8BCC and 8BDD are tsomorphic deepite tha incongruence of the Hamilton circuits. The abatract polytiedra of thie list include two degenerate forma (-, circle; 2, hosohedron) and 4B, tetrahedron; 68C, prism; 8 CEC, cube; 8BCC = 8BDD, pentagonal wedge. One of these graphe, 8(6C:8,1:2) has no Hamilton circuit, and is classified as a union which splices the 8'th edge of graph 6AC with the l'st edge of graph 2. Complete lists of the graphs through 12 nodes are presented in Lederberg (1965). (4) : Cb) Ce) | : 7 (4) Fig. 7. A graph with special edges and two HC-free polytiedra. (a) has 16 nodes. The marked edges are included in any HC of- the graph. Hence the 3-cut (b), with 15 nodes, obligates the marked edge as part of an HC of any graph in which (b) is inserted. This leads to a contradiction, i.@., no Hamilton circuit in (c) Tutte'’s graph, with 46 nodes and (d) with 38 nodes. eS CANONICAL MAP HAMILTON CIRCUIT REPRESENTATIONS | “Fig. 8. Morphine nucleus: symmetry. and choice for coding. The dashed edge 7---8 stands for the spiro~ (quadrivalent) center in the morphine ring; however, 4 permutations are possible under the symmetry operations. In the canonical form, after account is taken of the mapping of the chemical graph onto the abstract graph, this edge is labelled 2—-=3, The canonical map would be coded as (8BDD-N.3,$, » 23503,,C) each comma marking the next edge of the map. This code ~ ds sufficient input for the computer program to reconstruct the molecular structure 10 l2 13 14 15 °16 17 18 19 20 21 22 23 24 25 table 1. Enumeration of isomeric alkanes (disregarding stereoisomerism), from methane to pentacontane. The values marked * disagree in some digits with the values calculated manually by Henze and Slair (1931) and Perry (1932). While this is an amusing exercise for the computer, the discrepancies, needless to say, will have no pragmatic chemical significance. In any case, a proportion of the OoONnN PW PWN ENUMERATION OF THE ALKANES STERETSOMERISM DISREGARDED Wee ee am emeanee | 355 802’ 1858 4347° 10359 24894 60523° 148284'% 366319 910726 2278658 5731580 14490245 36797588 93839412: 240215803 617105614 1590507121: 4111846763. 10660307791: 27711253769 72214088660 188626236139 493782952902 1295297588128 3404490780161. 8964747474595 23647478933969 * eee Tena Taal 165351455535782. 438242894769226' 1163169707886427 30914610118368656' 8227162372221203 21921834086683260 58481806621986230: 156192366474587200 417612400765371900 1117743651746931000 structures will be unrealizable owing to steric hindrance. oe lel OeNeO eCMNeS Cee00 oOelel OeNeC oCe®lO Nee eCeell OeNeO oNeCal C2000 eOeC aC Ce#NO oComlO OeCeN oCoC2l Ne020 oNeCal OeC20 eOeCeC CaeNO oletlO OaNol oCeoCal OeNeO oNeoCel OeOe? oCeelO CeN20 eCe2CO CoeNO eCeCal OeOeN eNoCae €os00 oCeeCO CaNed eCe2CO No oCO eCe Cel Neo e0O oN@CeS Ce0e0 eCeelO NeC#O eCaeCO CoNeO oCaCeC NeOe0 eNeCeC 04620 eCeelO Nel oO . e€%5eCO CoO.N oCelCeC OeNeO eNeC el Oe00C eCeelO O.CEN eC2eCO NeCeO oCaCel O200eN oNaCel Coe90 eCoelO ONC eC=eCO NeOod eS BC ol No e0D eleB®CN C2020 oCesCO Ce®NO oC%eCO OeCeN eCe®CC NOLO eCeo®CN O6Ce0 oCeelCO CeeNO 0SEeCO OeNeoC eCeoelC OeNeO eCosCN OeOeC eCoC30 CoNeO , eCezeCO CoeNO eCe®CC OeOeN eCeoBCN C€e200 oSeC#O CeoOeN eC2eCO Nee lO eCe*CC Nee OO eCBeoCN C2000 eCeC20 NoCeO BCeC el NeOeO eCeCeN O4C#O eCHeoCN 04640 oCeol 20 NeOeC ECeoelC NeOeO eCeCeN Ce200 eCeeCN O606C oCeC#O OeCeN BCeCeN CeOeD eCeNel O6C20 eC*eCN Ce200 oCeC®O OeNeC EC eCeN Coe00 eCeNel Coe0O eCeCeO CaN20 eC eC2z0 CeeNO HECeNeC CeOe0 eNeCeC OeC20 oCeCeO CHNeD eCeC2O NoelO =ECeNeC Ce 200 oNeed Ce209 of el eO NeC#O C2020 CeNeO aNeCeC Ce0.0 eCeoeClN O4C€20 0Fel oO N2#C.0. C2000 CeOeN TMNeCeoC Coe00 eCeelN Co2#00 oCeCoO O4CON eC2#CeO NeCeO BCeeCN Ce000 eNeeCC O-Cz0 oC e CeO OeNsC eC=CeO NeOel =CeeCN Cee00 oNeeCC Co200 eCeCeO Cao®NO eC=CoO OeCeN BC eC eO CoNeO eCeCEN CeOe0 eCeleO C2eNO oC=CeO OeNeo BE eC 00 CoON oC ean OeC20 oC eOed CeN2O eC2#C.0 CoeNO 20 eC 20 NeC.O0 oCeoCaN O4600C Cee C2N.0 Cal eO Noe lO HO eCeO No Oe eCeC2N Coe00 eCeOel NeC2O oOeoC#C CaNeO #CeCeO CoeNO ef2CeN C0020 eCeOeC NECeO eOeCHC CeOeN =ZCeOeC CoNeO eCeCaN OeCe0 oCeOed OeC2N eOeCFC NeCeO BC eel CoOeN eC#CeN O00eC eCeOeC OeN2zC eOeCBC NeOeC BCeOeC Ne CeO eCaCeN Cee00 oCeOeC Co®NO oOeC#C OeoCeN BCeO el Noe eCeNal CeO00 oCeOed CaeNd oOo CFC OeNe Cl #CeOeC CeoeNO oCeNal OeCeD eOelel CoNzO oOeC2C CoeeoNO BC eeCfO CoNeO oCeNal 0206€ eOelel CaN.O e0,.C2C NeeCO eC oeCO CeOLN oCeNal Co200 0OeC eC NeCH#O oCe®CO CoNeO BCeeCO NeCeoO eCeNeC Ce000 eOeCeC NaCeO eCet®lO CoOeN BCeelO Noe eCaNelC O0Ce0 eOeCed OsCaN eCe®CO Need ™CeelO CoeoNO eCAaNeC OeOeC Cosel CaN 0,0 Ceeel C20 OLN CooeD Cal OWN Cet®eO CoN CeO CoeeC NaC 060 Ceol O6C NoO Coe®O CoC NO Cote CoN OC CeomeC CoN O20 CmeeC OeC DoN CoteO CoC NsO Cot=eO NeC CeO Come Nel O09 CaeeN CaCl 020 CateO CoC OeN Come Net O0€ Cmeel CoN 020 CoteN Cel 940 CzeeO Col NO C2#..0 CaN C20 Cmeel Nel O20 CaeeN Cel 040 C#—ee0 CoC ON CeeeD CeN O8C Ceoael CeO N2O CeaoN €.0 C20 : CeeoeO CoN CHO CreeD NoC CeO C oe0f OC NeO CeeeN C#O O6C CoeoeO Net CzO CaeeO NeC Ool Coeel C#O NeOD Coo®N CeO C60 CoeeO CEN CeO NeeeC C#C 020 Cooel C#O0 ON CatPeN CeO O4C CoeeO CBN O6€ Newel CoO C20 Coot CoO NeO CmeeN CeO C20 Coeed NaC CeO Neoel CaO OC Coe®l OeC NeO CmeeN CeO OC CoeoeD NaC 06€ : NeoeD CoC CeO Come CoO NeO CeeeN O6C O4C Coe2®O CoN CeO NeoeD C#C CeO CoBel CoO OaN Coane CoC N20 CoetO NeoC CoO NeeeDd CHC Oe€ Caeel CoO NeO CooeO CHC NeO CoeesC CO NaO Cooeet OO NaC Cooeel NO CHO CoooeN OO CHC CeooesC OO CAN Table 2, The isomers of alanine (.C..CN C.=00 ) systematically ordered in DENDRAL-64 : Notation. Each "." or "=" stands for a single or double bond respectively which ' must be satisfied by a trailing atom or radical. This will be the first previously unreferenced item in the list to the right of the bond. A leading bond constitutes a central link, which must then be followed by two radicals. A space is used to separate the primary radicals for convenience in reading but has no coding significance. Some 25 of these topological possibilities are recognized chemical forms; an equal number are their tautomers. Most of the remainder are either p®roxides ' or Schiff bases or similar unstable forms. A few, like hydracrylaldoxime, (.C.C.0 C=N.0) might be realizable but were not found in a cursory search of the literature. , ‘ - oS oo Loe “*uoapayftod | B JO saoez jo Aaqunu s9yQ vey. ssaz suo SF svuL he, oot = oe a | (S961) 22829 03 3uqpaossy * . ne et So - (C961) S2eqzopoy UZ porszT zy : Se Se fat (s96t) Szeqzepeq t UZ uneap sain3yzg y . . re / ve oe “*quno9 STYI worz Pepny[oxs are sMIOJ oatdg . “TI® sayyyusys y “xapuy Buzz ayq mors sa,dmexa maou jo eseues jo Junos aya aie s20yoe7q ay siequng “. : Jee - os as . = 7 : : - 7 - a : _ " - . 7 : L 4 a . : . / / | ; *¢ sTqel ce EET ATES Dr - (6 _ (zy. a Do : . er 7 ” tS oC ed oS : a ‘ . i . a 7 , . . , - 7 ae , ee, " ee me wee . ) a " . * : . — cl Ree mo} 7 in on aT we: De preg SE ta) to (1h. Lee . Os gag ; store oe gp “ oe lg eres wow e tide ty Ege ' [zi cz - To 6 a pes ee [<9] lel! os S20 et ee see eee my cme aaa ue ee tees ote . Cee + [2] o€ 1 [oJeet : Ise} lest \ {¢] aU: 1 ae 2 4 ‘. Ste bepre teen t 1 ° oa tae ED . . . i . . : - : a“ 1 fas ; [olst (oz) ce. Iv}s 4 9 1: Sc ’ a ° : Ttrvelties 0th, : ' us . CL Magee de eA eels | [o)é. | [6] ot wot Ps og . . wetter een ee , Wap , Po ; — TO ot 9 a CO). ae ‘poe _ et ‘ € eo, Pee :. : a? ne ae | 7 Bone ef } pte . o ire al \ z eo Bs a ae Sos Lo . Doe! . ‘ . re we * Lg : j Fo so : suojug aeuelg (aeuelTq—uony (aeueTa) eiposys[og soupy jesyusqD $9} I104 - sulog ayaney suojug . + jo 2z2qung {NIAID USI] Fuey TNOWTTA , SIPRoTFO UOIATFWeH WIM nn oe arn an * : [sydeaZ ,eozmayd usocuy jo ezaua3 pue) - 7 mo a. ” - ° SHAVUD INGIVATEL SIT9A9 40 INDOD