Page ! of 2

Research Proposal Submitted to the National Science Foundation

*

Proposed Amount_113,544 Proposed Effective Date 6/1/76 _ Proposed Duration_24
(months )

Title MOLGEN: A Computer Science Application to Molecular Genetics

 

Principal Investigator Submitting
Institution Stanford University

Soc. Sec. No.

 

Department _ Computer Science

 

 

 

 

Institution . .
M

(if different from submitting institution) Branch/Campus__Main

Address Address Stanford, California 94305

Branch/Campus

 

Co-Principal Investigator [Edward A. Feigenbaum Soc. Sec. No.

Co-Principal Investigator Joshua Lederberg Soc. Sec. —

4

If renewal request previous NSF grant No.

 

Stanford University

Make grant to
(name of institution or organization to which grant should be nade )

 

 

 

Endorsements:
Principal Investigator(s) Dept. Head Institutional Admin. Official -
Name Edward A. Feigenbaum Robert Floyd D'Ann Downey
creates tn h. Fg bee Seb t SGA A
Title Professor of Chairman, department Asst. Sponsored
Computer Science of Computer Science Projects Officer

 

Telephone Number

(415) 497-4878 (415) 497-2274 (415) 497-2883

Date
Page 2 of 2

Research Proposal Submitted to the National Science Foundation

Provosed Amount_ 113,544 Proposed Effective Date 6/1/76 Proposed Duration 24
(months)

Title MOLGEN: A Computer Science Application to Molecular Genetics

 

 

 

 

 

 

 

 

 

 

 

 

 

Principal Investigator Submitting
ti i Stanford Universit

Soc. Sec. No. Tnstitution tty

Department Computer Science
Institution .

Main
(af different from submitting institution) Branch/Campus

Address Stanford, California 94305
Address
Branch/Campus
Co-Principal Investigator Edward A. Feigenbaum Soc. Sec. No.
Co-Principal Investigator Joshua Lederberg Soe. Sec. No.

 

 

\

If renewal request previous NSF grant No.

 

Make grant to Stanford University
(neme of institution or organization to which grant should be made)

Endorsements:

 

 

 

 

 

rincipal Investigator(s) Dept. Head Institutional Admin. Offici -
Name Joshua Lederberg oshua Lederberg D'Ann Downey
signature at .
Title préicheor & Chairman profesloy & Chairman Asst. Sponsored
Department of Genetics Department of Genetics Projects Officer
Telephone Number
(415) 497-5801 (415) 497-5801 (415) 497-2883

 

ch

O
wp
1 PROLOGUE

The scientific text of this proposal is identical with that of
the companion from the University of New ‘“exico. The budget is
separate and it will be administered by Stanford University as
indicated in the budget page. We realize that this is an unusual
procedure but we feel it is justified as a means of bringing
geographically separate scientists together to work on common problems.
The personnel involved in the proposals began working together on the
organization of the project in June, 1975 when Prof. Martin visited
Stanford for the summer. We are proposing that Martin spend the major
part of the first year at Stanford, working closely with the Stanford
group on all phases of the foundation of the systen. This is strongly
supported by Prof. D. Morrison, chairman of the Division of Computing
and Information Science at UNM. The second year Martin will return to
UNM to continue working on special subparts of the system with the aid
of a UNM graduate research assistant. Constant connunication can be
maintained through the excellent SUMEX-AIM time sharing installation as
outlined in the Resource Section. of this proposal. At regular
intervals all participants in the project will meet at Stanford to

review progress and suggest new directions.
Section

Table of Contents

Subsection

Introduction ° . . ° . ° e . ° .

Background and Related Work . - © © © © © 8

2.1 Relationship to Other Research in Artificial
Intelligence. - . . . . . - «

2.2 Related Work in Computers Applied to Genetic
Biochenistry . . . . . . . . . .

2.3 Supporting Research at the Stanford Heuristi
Programming Project . . . . ° . . °

Detailed Plans for Work During the Proposal Period

3.1 Introduction . - . . 8 ~ 6 e
3.2 An Experimental Planning Program . ° . .
3.3 An Enzyme Simulation Program . ° - 8
3.4 Knowledge Base ., . . . . . . . . .
3.5 A DNA structure entry and editing systen .
Resources ° . e . e ° . . ° °
Bibliography . . . - . . . . .
Personnel . . . . . . ° . . .
Budget . . . ° . . . . . .

Page

16

18

23
23
26
31

35

39

41

43

47

48
1 Introduction

The general goal of the research proposed here is to develop a
computer program (termed MOLGEN) to assist a motecular geneticist in
planning laboratory experiments. It is to be an interactive systen,
drawing on both the expertise of the hunan geneticist and the expert
knowledge stored in the data base. The systen nust be convenient and
confortable for the geneticist, and will need to be sophisticated and
powerful from the point of view of computer science. The program will
itself be an experiment in the development of intelligent systens.

The need for such a. program is suggested by the recent growth
of technology in molecular genetics. In the past three years many new
site-specific restriction enzymes have been discovered (Nathans 1975).
These enzymes have been applied to physical mapping of chronosomes,
nucleotide sequence analysis of DMNA, isolation of genes, and
restructuring of DNA molecules. Such rapid growth extends throughout
the field, with continuing advances in separative and physical
techniques. With the many new procedures, there are possibilities for
novel combinations of techniques which broaden the horizon of possible
experiments. At the same time, it has become increasingly difficult
for any one scientist to keep track of what is available, as well as
the limitations and use of the newer techniques. Many "ingenious"
discoveries, in fact, can be viewed as a judiciously selected
combination of well established unit processes. Hence an intelligent

computer system is likely to be pragmatically useful in this area.
An immediate apvlication of the proposed svsten will be the
simulation of reconbination of randon segments of bacterial DNA cut by
restriction endonucleases. In experiments currently underway in
Professor Lederberg’s laboratory, geneticists would like to have
quantitative predictions of reaction component concentrations at any
particular time, given initial concentration of Bacillus subtilis DNA
and site-specific restriction enzymes. The conception of the eueyne
simulation program (see part III) in the MOLGEN systen was directed
precisely toward this kind of need. Part of this effort will involve
the creation of a knowledge base of restriction enzycves with associated
kinetic data, recognition site information, and characterization of
availability and degree of purity. Convenient access to a
conprehensive source of this information would itself be of
considerable value.

A subproject which should also prove very useful to the
geneticist will be the sunnarization of all enpirically established
sequences of ‘DNA and RNA, along with restriction enzyne site
specificity. Such information might play an important role in the
current effort of Professor Lederberg’s laboratory in dissecting the
genetic structure of Bacillus subtilis. Recognition sites in a wide
variety of naturally occuring nucleic acid species has been sunnarized
in a preliminary way by Sobell (Sobell 1973).

An important factor in selecting this domain was the

availability of a great many tangible problems covering a wide range of
difficulty. This means that it will be possible to develop the svysten
incrementally, and still be able to perform useful tasks even at the
outset.

That branch of computer science known as artificial
intelligence (AI) has already developed techniques for conputer-
assisted problem solving in the experimental sciences. Exanples of
successful projects are: the DENDRAL project at Stanford in the field
of mass spectrometry; the organic synthesis systems developed by Vipke,
Corey, and others; the MYCIN project at Stanford in the field of
chemotherapeutic advice; and Pople and Myer’s DIALOG system for nedical
diagnosis (Pople 1975). The joint effort of computer scientists and
experts in the experimental sciences has allowed the creation of
excellent computer systems. These systems are presently being used to
facilitate research in the experimental sciences. The ideas and
technology behind the projects are now available for application to an
experimental planning program for molecular genetics.

There appear to be two factors common to those task donains to
which computer-assisted problem solving techniques have been applied
successfully:

1) the development of systematic and logical methods of
analysis of problems and application of tools by the
experimental scientists themselves,

2) the possibility of solving problens in the donain
using a strongly constrained and consistent body of

knowledge.
In the current state of the art, heuristic problen solving prograns
containing a few thousand facts and heuristics about a task domain can
be built and used effectively. The existence of these two factors in a
domain suggests that the computer scientist can build an intellectual
bridge into the domain in a period of time that might be measured in
months rather than years.

Some problems of experimental molecular genetics currently
share both of these factors. For example, the growth of knowledge about
restriction enzymes has been important to genetics because of their
many applications to manipulation and separation of DNA structures.
Despite the large number of enzymes, each can be represented by a small
set of descriptors. This makes the task of representing the knowledge
far more limited than it might first appear.

One major issue of artificial intelligence that relates to this
research is how to plan experiments involving the choice of several
steps in a large space of possibilities. This includes such
fundamental points as when to design an entire experimental plan first
and then fill in details later (top-down), and when to design a
conplete step at a time (bottom-up). A second major issue will be the
development and management of large knowledge bases. The MOLGEN system
will need extensive files of knowledge about DNA structures, enzymes,
separative techniques, and physical processes. For exanple some of
this knowledge will involve mathematical models of kinetics; some will

be detailed sequences of base pairs; some will be "fuzzy" and inexact
descriptions of a particular process or structure. The knowledge bases
must be designed to allow as much of the data as possible to be handled
in a uniform manner. A third major computer science problem will be
the creation of a simulation program which can simulate the action of
enzymes on DNA structures. Many of the current techniques developed
for simulation in other domains may prove useful here. The NOLGEN
system will also have to deal with cases of incomplete and inexact
knowledge in the planning and the simulation stages. Molecular
genetics, as a task domain, offers problems which are limited and
structured enough to be feasible with available techniques, as well as
those which are challenging enough to advance the art.

The overall purpose, then, of providing computer assistance in
the planning process is to make it possible for an investigator to
explore a wide range of possible experiments. Calculations of the
effectiveness of computer assisted planning can only be approximate,
but the following is offered for a rough estimate of the potential.
Professor Lederberg has suggested that the knowledge actively used in
the planning of experiments in molecular genetics could be encoded as
approximately a thousand computer "rules". (This fits confortably in
the range used by current knowledge-based intelligent systems.) It has
also been observed that most experiments in molecular genetics involve
on the order of seven levels of formal decision once the context of a
particular goal has been set. We believe that fairly crude heuristics

can reduce the likely choices to perhaps ten. Ten major alternatives at
each level of planning yields a total "search space” of about ten
million alternatives. A search space of this size fits within the
limits of current AI programs but still demands a definition of more
powerful search-reduction heuristics for high performance. The purpose
of these estimates is not so much to give a true measure of program

performance, as to illustrate the notion that the nunber of

possible
experiments is indeed much greater than could actually be performed.
Prof. Lederberg has suggested that a nunber of important and perhaps
novel experiments are not being considered because of the difficulties

in keeping current in a rapidly expanding discipline. Meeting this

need is precisely the goal of the MOLGEN project.
2 Background and Related Work

2.1 Relationship to Other Research in Artificial
Intelligence
There are two primary reasons for proposing the MOLGEN project:
(1) to develop a high performance conputer progran
augmenting human intelligence in important research
areas in molecular genetics,
(2) to provide a context for testing and extending ideas in
the design of intelligent systens.
This section will discuss the existing scientific and technological
base and some proposed extensions and show the relationships between

MOLGEN and other AI programs.

2.1.1 Planning and Heuristic Search

The problem of planning an experiment fits into the paradiga of
heuristic search, a well established framework for conputer based
protiem solving. Problems in this framework can be represented as a
tree of subproblems, with solutions existing at unknown depths along
unknown paths. Judgmental rules and procedures, called heuristics, are
applied to direct the search. When we speak of the expertise or good
judgment of an expert, we are often referring to the heuristics he has
developed to search effectively.

There are typically a number of alternative procedures to

choose fron at each step in the course of an experiment in molecular
genetics. These alternate steps can be viewed as a family of

possible
experiments. This family may then be represented as a tree structure
so that the planning of a particular experinent is equivalent to
choosing a path in the tree. Each node represents a certain “state” in
the experiment where the physical entities being used by the
experimenter are in a particular condition and certain quantities are
known. For example, the physical entities may be a mixture of
oligonucleotides whose terminal base sequences are known. The activity
in the planning process centers around selecting an operation to carry
the experiment to the next state. These operations are chosen fron the
program’s knowledge base and are naturally expressed as transformations
from one state to another. For exanple, in the the context of
molecular genetics, a chronatography step could be represented as a
transformation which "transforms" a mixture (initial state) iato its
separated conponent parts (final state). Each particular type of
chromatography would be more specifically described by the parameters
of the physical entities which are active in the separation, For
example, gel electrophoresis would be described in terms of its ability
to separate oligonucleotides according to length. Part of the
knowledge embedded in the transformation would be a rule for estimating
the effectiveness of gel electrophoresis on any particular mixture of
oligonucleotides for the various gels which are available. Finally, a
completed plan is a sequence of transformations which when applied

successively to the initial state of the experinent transform it to the

desired goal or final state.
This model of problem solving based on the selection of a path
according to heuristic knowledge is often called the heuristic search
paradigm. It has been used as a model for a variety of problems since
such early projects in AI as the "Logic Theory “fachine"™ (1956) by
Newell and Simon. One dimension of progress in AI has been the

£

development of techniques for the creation of plans within the

heuristic search paradigm.

2.1.2 Strategies for Planning

Several algorithms have been developed to assist in choosing a "nininun
cost" path in a heuristic search tree (Nilsson 1971), where an
estimating function is available which can measure how "close" any
intermediate point in the experiment is to a final state. In order to
guarantee that the algorithms will find a ninimal path, the estimating
function must never overestimate the distance to the goal. For conplex
problems, a practical difficulty continues to appear in many contexts.
Simply stated, it is sonetimes best to retreat fron a goal in order to
get closer to it. In mathematical theorems, this arises in those cases
where it is easier to solve a more general theorem than a specific one.
In organic synthesis, it is sometimes better to build up a rather
complicated structure which seems "farther" from a target compound than
some current step in the synthesis, but from which a direct and often
elegant reaction will transform the complicated structure almost
directly to the desired product. |

In complex scientific domains, the distance estimating
functions are almost impossible to find, Even if such a metric can be
found, it is unlikely we can guarantee that! it never “overestimates” the
distance to a goal state. Nevertheless, in special cases, quantitative
distance measures can sometimes be found when discrinination among the
products of the alternative transformations is statistically testable.
Otherwise, planning programs must rely on domain specific or even goal
specific knowledge to guide the selection of the next step.

Some more general techniques for planning have been developed
for robot problem solvers. For example, the STRIPS systen (Fikes
1972), in the course of building long plans, attempts to assemble short
sequences of operators which may later prove useful. The idea is that
it is more efficient to assess the effect of the smaller plans and
reuse bike iuan Lo generate them again from scratcn tater. ine
aiyoritnms inctude metnodas ror disentangiing tne Local errectS or Ctnat
piece or a pian trom its immediate context and predicting its etrtrect in
a naw context.

Another planning idea of general application has been developed
in Sacerdoti’s work on hierarchical planning (Sacerdoti 1973). This
can be seen as a_ formalization of the common sense notion of planning
major steps first, and attending to details later. In our terms, it
means recognizing that some preconditions for use of a transformation
are more important than others. This ordering allows the nost
important criteria to be considered first. In contrast to earlier

methods for generating plans, which proceeded a step at a time toward a

10
goal, hierarchical planning involves sketching out a plan in several
passes of refinement of detail. This approach is applicable where it is
Poessibie tu andicatce an uruering ror tne criteria used in choosing tne
next step of a plan.

Our examination of several experiments in molecular genetics
suggests that this idea has a great deal of relevance. For example, in
dealing with enzyme data, it is far more important for most purposes
that an enzyme be applicable, i.e. that it have the desired effect on a
substrate, than that its most active pH be in a certain range. In
terms of planning, this suggests that "“applicability" is a more
important piece of planning information than is "pH range". Hence, the
adjustment of pH in an experiment is a detail which can be considered
during a later pass in the planning process. This ordering of
importance need not be assumed constant for all experiments, but may be
considered as part of the planning heuristics. The application and
development of these planning issues will be part of the interesting
computer science questions to be investigated in the MOLGEN project.

The problem of experiment planning can be compared to the
automatic generation of computer programs. The planner’s task is to
find a sequence of instructions that will transform the input into the
desired output. However, the tools that. the experiment planning
program will work with are conceptually simpler and the range of
possible "programs" or experiments is more restricted than those

utilized in the general problem. Although there are many unit

11
processes, the planning will be facilitated by needing little or no
recursion or looping, limited "data types", few steps, and in addition
will have a detailed knowledge of the subject matter. Thus the

heuristics necessary to plan experiments are more manageable.

2.1.3 Acquisition of Knowledge

One trend in the development of intelligent systems has been an
increase in the size of the knowledge base used by _ the prograns.
Domain specific knowledge is used to guide and hence speed their
search. The process by which a system acquires this knowledge is one
of the central design issues for an intelligent systen. The DENDRAL
programs, for instance, which are used in the analysis of mass spectra
of organic compounds (Buchanan 1969), acquired their chemical and nass
spectral knowledge by a painstaking process in which an AI scientist
worked with a chemist, eliciting from the chemist the theories, facts,
rules, and heuristics applicable to reasoning from mass spectra. This
same approach has been used successfully in protein structure
determination using x-ray crystallography and in diagnosis of glaucoma
eye disease (Kulikowski 1973).

More recent versions of the DENDRAL program have special built-
in editor programs for acquiring knowledge and data from the chemical
domain. For exanple, a chemical structure editor is used ‘in entering
Fragments of Chemical scetuccures, aud other editors speciairze in
acquiring chemical structure constraints and fragmentation rules.

Another approach, typified by the SECS progran for organic chenical
synthesis (Wipke 1975), involved use of a special Language (ALCHEM) for
describing organic reactions. For each of the areas of organic
synthesis expertise, a chemist specialist was responsible for encoding
in ALCHEN the particulars about the useful reactions in his specialty.
The idea was that a user of the SECS program could then draw upon the
combined knowledge from all the experts who had participated. This
process not only contributed to the growth of the knowledge base, but
also resulted in changes and additions to the ALCHEM language which
made it both more powerful and more convenient.

The MYCIN program, used for diagnosis and treatment planning of
infectious disease using antibiotics (Shortliffe 1973, 1975) uses a
more elaborate method for the acquisition of knowledge. In this systen,
knowledge is embedded in a collection of some 309 decision rules. New
rules can be added via a rule acquisition progran. This program
accepts rules in a stylized form of English, converts then to an
internal form, and then prints them out again in English to indicate
its understanding of the rule. The system also has a model of its
knowledge, in the form of abstracted descriptions of classes of rules.
It uses these both to help interpret the rule the physician has typed
in, and as acheck on its content. If, for exanple, the new rule
differs from most others in that class, the physician is warned, and
can then alter the rule if he wishes.

One of the more interesting components of the MYCIN program is

its explanation system. This concept was introduced earlier in

13
Vinograd’s SHRDLU progran (Winograd 1972), and refers to the ability of
a program to explain its activities to its user. For example, NHYCIN
may ask the physician a particular question about the results of a
culture. If the physician wants to know why the question was asked, he
can ask the system. It will then examine its reasoning, and indicate
the purpose of the question by displaying the relevant rule. The same
rules which guide the consultation are thus being used to explain the

program’s behavior.

2.1.4 Applications to MOLGEN

It is our goal in the development of the MOLGEN system to draw
upon and extend some of the ideas and techniques mentioned above. With
regard to planning, we will be experimenting with ways to relate
hierarchical notions of planning to the concepts of molecular genetics.
Furthermore, we will investigate techniques to make the NOLGEN programs
handle the diversity of goals in molecular genetics experiments, Note
that when a physician starts to use MYCIN, the program does not have to
ask him what he wants to do. The physician always wants to diagnose a
disease and prescribe treatment. For MOLGEN, however, the experimental
goals are as diverse as the experiments in molecular genetics.

We expect MOLGEN to draw upon the same knowledge base in its
efforts to work on a variety of experimental planning problems . In
spite of the diversity of experiments in molecular genetics, we have
categorized three main types of experiments as follows:

l. Given a structural hypothesis, plan experinents and

14
procedures which would verify the hypothesis. (This is
an example of structural analysis.)

2. Given starting and target DNA structures, plan
experiments and procedures which will transform the
initial structure to the final structure. (This is an
example of structural synthesis.)

3. Given an initial DNA structure, hypothesized final
structure, and a set of steps in an experiment to be
Carried out, Verify tite prouvabie currectness of ctne
final structure or show any other possible final
structures. (This is an example of a “proof” checker.)

An initial goal towards meeting the first two objectives is the
development of a planning program which will allow a user to “sketch”
out an experiment, and allow MOLGEN to fill in the details . An example
showing how this might proceed is given in section 3 of this proposal.

A second area for developing on the themes mentioned above is
in the atea vt Kuowleage acquisition and representation. ine nain
extension to previous ideas has to do with the handling of a diversity
of data types. MOLGEN must deal with a broad range of types of
knowledge --~ including DNA structures, enzymes, chromatographic
techniques and various physical processes. For each of these types of
data, MOLGEN is expected to assist the user in entering and modifying
the data and knowledge base. This should be done in a form convenient

to the user’s view of each particular type and yet at the same time in

15
a way which keeps” the program’s diverse collection of knowledge and
data types fron becoming too complex to be manageable. Similarly, the
explanation system must be implemented in a way that makes the
reasoning of the program accessible to the user, through all levels of
the knowledge base. A successful treatment for a broad spectrun of

diverse knowledge will be a major design goal for the MNOLGEN project.

2.2 Related Work in Computers Applied to Genetic
Biochemistry

We propose to provide in this section a summary limited to work
directly related to the simulation and planning aspects of the MOLGEN
project.

Some of the most advanced prior work in simulation has been
done in the area of enzyme kinetics. The basic approach has been to
set up the differential equations describing the chemical reactions ,
and then solve those equations on either a digital (Chance and Shephard
1969, Bates 1973, and Garfinkel 1968) or analog (Chance 1967) conputer
system. It is interesting that attempts have been made to facilitate
use of the systems by working geneticists, either by means of
interactive display mechanisms (Bates 1973), or an English-like
language for describing the chemical systems (Garfinkel 1968). We
intend to expand upon these user-oriented techniques during our system
development.

We also know of a system (Green 1970) which uses matrix

16
representations of the steady-state equations describing enzymes to
permit rapid calculation of the distribution of all enzymatic forms at
any given moment. We feel that the accurate modelling of this steady-
state behavior will prove vital to our enzymatic simulation systen,
although we still have to investigate the general applicability of the
matrix representation technique for multi-enzyme systems.

Crothers (1968, 1971) has developed several computer programs
to simulate various aspects of the process of de/renaturation. The
detailed mathematical models available on conformational changes
(Bloomfield 1974) will facilitate the computer simulation of these
processes,

The problem of representing the often “fuzzy" knowledge about
DNA structures has been considered in a model for the heterogeneity of
base composition (Elton 1974). The point is that the exact conposition
is rarely known for more than a few hundred bases. Elton’s model
classifies the DNA structures into "segments" with different underlying
base compositions, each segment categorized by an estimate of Length
and G-C content. Our representation model for DNA structures will
allow such classification of portions of DNA, as well as many other
factors like the distribution of important features (nicks, gaps, etc.)
along parts of the DNA strands.

Finally, we find of direct relevance to our design of
experimental plans the use by Khorana’s group (Powers 1975) of conputer

programs to help plan the detailed steps involved in gene syntheses.

17
The genes are synthesized in small segments and then joined by some
chemical or enzymatic means. The program, using knowledge of the
chemical environment, determines optimal segment lengths and order of
joining. This will be one of many experimental paradigms we hope the

MOLGEN system will encompass.

2.3 Supporting Research at the Stanford Heuristic
Programming Project

The Heuristic Programming Project at Stanford, of which the
MOLGEN research will be a part, is one of the few AI research groups
that emphasizes development of intelligent systems for use in the
sciences. This section discusses some of these intelligent systems
which are being developed and actively used. These interdisciplinary
projects serve to illustrate the range of experience that the MOLGEN

research can draw on.

26501 The DENDRAL Project

The Heuristic DENDRAL project was started by Professors
Lederberg and Feigenbaum in 1965, and has diversified and natured
considerably since that time. The DENDRAL program (Buchanan 1975 and
Appendix A) is an application of artificial intelligence to bionedical
molecular structure determination problems. The program is designed to
explain enpirical data, in particular, to interpret experimental data
from amass spectrometer, an analytical instrument used in organic

chemistry. It uses heuristic search techniques to explore plausible

18
candidate explanations of the empirical data in much the same way that
other artificial intelligence programs explore spaces of hypotheses.

A major contribution has been demonstrating that a large amount
of scientific knowledge can be managed and used by a conplex problen-
solving program. Another contribution has been showing that ATL
techniques can be successfully applied to scientific problems.

The DENDRAL molecular structure building program, CONGEN, is
currently used by chemists as a tool for building molecular structures

froa stcvucturai pieces wnicen wne cnemist uas anfercea Crom wnatever

ur

datd ate avaliabie. Tne DENDRAL FuaAnnen (Smatn L943) uses a Large
amount of knowledge specific to mass spectrometry to infer structural
constraints directly from empirical data. The DENDRAL PREDICTOR is a
simulation model of the mass spectrometer that allows the program to
make testable predictions of candidate hypotheses. META-DENDRAL is a
program for discovering new inference rules in the field of mass
spectrometry.

From the start, major emphasis has been placed on avoiding the
"outsider-insider" distinction by having trained chemists work on the
programs and make them useful to working scientists. This emphasis is

"problems conducted without

in contrast to other research on "toy
reference to the needs of working scientists. Since May 1974, the
project has moved to the interactive computing environment of the NIH-

funded SUMEX-AIM facility (Carhart 1975). Because of this, many

scientists outside this university have been able to use the DENDRAL

19
conputer programs to further their own research. These programs are
currently receiving heavy use from local users and outside users who
are investigating mass spectrometry problems for a variety of different

compound classes.

2.3.2 The MYCIN Project

The MYCIN project (Davis 1975 ) at Stanford is another
interdisciplinary research group including members of the Heuristic
Programming Project, whose goal is to apply artificial intelligence
reasoning programs to a scientific problem. The objective of this
MYCIN research is to develop ‘a computer-based system capable of
using the judgmental knowledge of experts to assist physicians in
selecting appropriate antimicrobial therapy for patients with
infectious diseases. The work concentrates initially on the use of
antimicrobial agents in the treatment of bacteremias.

The current MYCIN program utilizes data available from the
microbiology and clinical chemistry laboratories, plus the physician’s
response to computer-generated questions, to provide physicians with
consultative advice on antimicrobial therupy. Therapy selection takes
into account both the patient’s infections and the range of possible
identities of the organisms causing these infections. One contribution
of the MYCIN system has been developing general techniques’ for
explaining its reasoning steps to permit the program to explain all of
its actions and reasoning, including, for example, the deduction of the

identities of pathologic agents. Another important part of the MYCIN

20
systen is a rule-acquisition program for computer acquisition of
judgmental knowledge about those concepts which the program uses in
making deductions. This permits experts in the field of infectious
disease therapy to teach the system those therapeutic decision rules
which they use in their clinical practice. Making rules, exploring
MYCIN’s reasoning, or adding new knowledge to the system is done in a

stylized subset of English.

2.3.3 Protein Structure Modeling Project

The Protein crystallography project involves scientists at the
University of California at San Diego as well as members of the
Heuristic Programming Project. The general objective of the project is
to apply problem solving techniques in artificial intelligence to the
"phase problen" of x-ray crystallography, in order to determine the
three~dimensional structures of proteins. Specifically, the task is to
develop a computational system which can infer the tertiary structure
of a protein molecule in the absence of phase information normally
obtained fron nultiple isomorphous replacement procedures.

In the context of artificial intelligence, project objectives
center around knowledge acquisition and program organization. The
knowledge acquisition task involves formalizing the knowledge and
heuristics used by expert protein crystallographers to infer the
tertiary structure of proteins from x-ray crystallographic data, and to
formalize this expertise in appropriate data structures and heuristic

procedures. The program organization plans are variations on the

21
HEARSAY (Reddy 1973) model of cooperating "expert" modules in

program.

22
3 Detailed Plans for Work During the Proposal Period

3.1 Introduction

This section presents our overall view for the molecular
penetics system as well as a detailed list of specific goals for the
first year covered by this proposal. The systen will be conposed of
three major, interacting parts: an experiment planning systen, an
enzymatic action simulation program, anda collection of knowledge
bases containing the rules and heuristics of molecular genetics.

The experiment planning program will collect information about
a problem from the user, select an appropriate methodology (information
retrieval, simulation, hierarchical planning, or some conbination) and
then work interactively with him to solve the problen. Some questions
can be answered by information retrieval from a knowledge base, such as
a question about the enzymes” that function under certain pH or salt
concentrations. On the other hand, if a geneticist asked for a
prediction of the ratio of linear to circular DNA in a test tube after
10 minutes application of ligase, then a straightforward simulation
would be in order, In verifying that a given set of steps in an
experiment produces the final result, a combination of sinulation and
retrieval would be used by the planning progran. In more difficult,
higher order problems, such as structural analysis or synthesis,

hierarchical planning is needed, perhaps involving both retrieval and

simulation as well.

23
A simulation program will provide detailed nodelling of the
action of enzymes on DNA structures. Our goal is a program which will
be able to produce accurate enough results to answer questions like the
one posed above, i.e. what will happen after application of ligase for
10 minutes, 2 hours, etc.

The knowledge bases will be composed of collections of the
rules and heuristics used by geneticists, as well as facts about
enzymes, experimental methods, and physical processes like
renaturation. They must allow access in retrieval, simulation, and
planning modes, so provision for the representation of many types of
knowledge must be supplied.

Along with the major system components discussed above, certain
themes will remain dominant during all phases of system design.
Primary consideration will be given to making the system an easy and
natural tool for the geneticist to use. DNA structure entry, editing,
and display will be by way of an interactive progran with nunerous
user~oriented features. Explanation facilities (in the model of the
MYCIN system) will be provided whenever possible, and all knowledge
bases will be made easily extendable and modifiable. Building the
trust of the user is considered necessary to the continued developnent
of the system. The best way to provide for this will be to insure that
it is used vy geneticists, ana that a Strung interaction petween
computer scientists and genetics experts continues.

Tic Vetiainder OF tCuas patt of vur proposat wliit aiscuss in

24
decait our plan ror system devetopmenc in tne rirst year. preeiiy
sunmarized., our four major goals are:

1. Begin the construction of a rudimentary experiment planning
program and an associated planning knowledge base.

2. Build a simulation program which will apply basic enzymatic
actions to DNA structures and summarize the resulting
Structures.

3. Design the knowledge base containing rules and heuristics
used by the geneticists, enzyme knowledge, and knowledge of
experimental methods and physical processes.

4. Complete the construction of an interactive DNA structure

entry and editting systen.

25
3.2 An Experimental Planning Program

The central, and most interesting part of the Molecular
Genetics project will be the experiment planning program (PLANEX).
PLANEX is meant to be an interactive progran which conbines the
intuition and expert knowledge of a molecular genetics investigator
with the thoroughness of a conputer having a detailed knowledge base.
The investigator will sketch the initial conditions for an experinent
and the desired final condition. PLANEX will be developed to allow the
user to specify required or suggested intermediate steps. PLANEX will
Suggest intermediate steps, additional options, and verify the expected
results within the limits of the knowledge base The progran will
initially be designed to check the steps of an experiment, and possibly
Fill in the details between small steps. The direction of development
of PLANEX will be toward a program which can eventually take bigger
steps, interpret less precise requirements by the experimenter, and
offer more useful alternatives based on the knowledge base. The
investigator could request varying degrees of detail and, at all tines,
the heuristics and reasoning tools used by PLANEX as it evaluates
alternatives would be accessible to a user via an explanation systen.
Freed from the constraints of checking all details, the experimenter
could explore the possibilities of many experiments before choosing one
and also have novel experiments presented for his consideration.

A scenario for a possible run of PLANEX might be the folloving.

We suppose that a molecular biologist has a new restriction enzyme

26
(call it R12) and that he wants to consider alternative experiments for
determining its specificity, i. e.. the specific site on the DNA
molecule for its application. His first step might be to create an
enzyme description of R12, using the procedures for entering
information about any enzyme in the MOLGEN knowledge base. The
description would include such information as its nane, enzyme
classification ("“endonuclease” if nothing more specific is known),
IUPAC number, cost, availability, stability, salt activity tables,
substrate description, names and concentrations of impurities known to
be present.

When the available information about the new enzyme was
entered, the user would then call in PLANEX. He would tell PLANEX that
he wanted an experiment to determine the specificity of the enzyme R12.
PLANEX would ask if the user has digested some DNA to exhaustion and
determined the initial and final segment sizes. (Fron this, the length
of the restriction sequence can be estimated. We assume that this
length is estimated to be five nucleotides.) PLANEX now interprets the
user’s experiment as the following:

Given the initial state as follows:

Initial State: Segments of unknown base sequence (resulting fron
complete digestion of phage DNA by R12).

construct a sequence of steps to the final state where:

Final State: 1. The identities of the last few nucleotides on the
5” ends of the fragments have been determined.

2. The identities of the Last few nucleotides on the
3° ends of the fragments have been determined.

27
For simplicity, let us assune that the user is willing to limit
his initial goal to determining the identity of the terminal nucleotide
and that he wishes first to do this for the 5” ends of the fragments.
The cuvices fur doing tuis rtueiude

A) Label the 5° termini of the fragments with radioactive
phosphate groups followed by a separation procedure,

B) Convert the 5% end to hydroxyl using phosphotase. The
terminal base can then be distinguished from the other
nucleotides by chromatographic means after a 3% to 5’
exonuclease digestion.

Successful reasoning by PLANEX at this level will depend on
characterizing the available options. The nore general the
classifications and heuristics, the more apt PLANEX will be at
generating new conbinations of techniques. For example, the two
methods of terminal nucleotide analyses mentioned above would fit into

the following general scheme:

To determine the identity of the 5° terminal nucleotide on an
oligonucleotide,

1) "Label" the end nucleotide.

2) Break the oligonucleotide into pieces which can be
separated.

3) Identify the pieces which are labeled.

In this context, a "labeling" means any technique which makes
the piece containing the terminal nucleotide distinguishable in sone
separation and identification procedure. It would include the above
techniques as well as, for example, replacement of the terminal base in
a predictable way by a base analog (as in the "turnover" technique

using Polymerase).

28
Let us suppose that one of the experiments that the user wants
to consider at this point is method (8) fron above, that the Snake
Venon 3° to 5% exonuclease has been chosen to break the oligonucleotide
into oieces, and that the separation technique is a type of
chromatography capable of distinguishing nucleotides from nucleosides
and determining their identity. The experimental plan at this point
looks like the following:

Initial State: Mixture of oligonucleotides of average length
200 nucleotides with unknown 5° terminal

nucleotides.

Operation: Apply Phosphotase.

State: Mixture of oligonucleotides of average length
200 nucleotides with unknown 5° terminal
nucicusides.

~ . , oUt a + ad
Vpertairuus AppayY Olkane VettunM o Ly oO eCKAUMUL LEASES»

wlatc: raAtcuce ut fluGteurctiucs anu (cetmattar)

Nutacues.uecse

Vperacsvuns oO¢partate uutreusiues ifrum nuctievtiaes ana
determine identity of nucleosides.

Final State: Nucleosides have been identified. (Identity
of 5° terminal nucleotides of fragments has
been determined.)

At this point, the experiment is thoroughly outlined, although
there are a number of smaller steps still to be determined. The user
asks PLANEX to fill in some more details. This means that PLANEX should
generate the intermediate steps so that the "required input” for each

operation is matched by the "output" of the previous step, that is, so

that there is a complete sequence of states and operations fron the

29
initial state to the final state. In this case, PLANEX suggests the
use of Pancreatic endonuclease after the Phosphotase step and before
the Snake Venom step to reduce the length of the oligonucleotides as
required for more rapid action by the Snake VYenon exonuclease,
Similarly, a denaturation step may be inserted before the Phosphotase
step. The generation of both of these steps is caused by the
interpretation of the enzymatic knowledge base for the enzymes used in
the operation. At a finer level of detail, PLANEX will consider steps
which adjust the pH or ionic concentrations to maximize the reaction
yield. Heuristics, under user control, weigh the various
considerations which lead to the generation of these subgoals into a
hierarchy - so that the “more important" criteria are considered first.
Finally, the user may ask PLANEX to estimate the yields, costs, and
time required to perform the overall experiment.

At any point in a session, a user could backtrack to explore a
different possibility. The ability to compare several different
experiments is useful in cases where confirming experiments are used to
guard against experimental error. In many cases, planning would not
proceed to the end of an experiment -- as when the results ata
particular step dramatically affect the selection of the following
step. An example of this occurs in the prologue of the scenario
experiment, when PLANEX asked the user for information necessary to
estimate the length of the recognition sequence of the enzyme. Had the

user elected to determine more of the sequence than the end nucleotide,

30
this information would have been essential in choosing between methods
which use overlapping sequences. The user’s choice to identify only

the end nucleotide greatly simolified the experiment.

3.3 An Enzyme Simulation Program

Enzymes form the primary tools geneticists use to manipulate
DNA structures. The nost common types include exonucleases, which
break the backbone phosphodiester bond starting fron an end, gap, or
nick; endonucleases, which break a internal backbone bond; ligases,
which seal a break in the DNA backbone; and polymerases, which add
bases to a primed single DNA strand and fill in gaps in double strands.
As mentioned, a special type of endonuclease, the restriction enzyne,
functions to break the DNA backbone at very precisely specified sites.
All of these processes must be simulated to provide accurate modelling
of enzymatic action. One of the first processes that we will model will
be the ligation of endonuclease~generated DNA fragments into linear and
circular structures.

The sinulation program will operate in the following manner.
The program is given the detailed action to be carried out (e.g. apply
a 3° to 5° exonuclease) and the initial pool of the various types and
concentrations of DNA structures present. [It will decide, using advice
from the user, what structural features are important in this
experiment, and focus on those types as the simulation proceeds. The

program will choose an operator function and apply it to a structure

31
selected stochastically from the pool, producing a possihly new
structure. This may either increase the concentration of one of the
present structures (decreasing that of another) or add a new structure
to the pool. The process will continue until all structures are inert
to enzymatic action, or until specified time interval has passed.

One major representation difficulty for the simulation program
is that the nunber of DNA structures present in an actual experinent is
often in the billions. Offseting this problem is the fact that many of
the DNA structures can be considered essentially identical, but only
within the context of a particular experiment. That is, the criteria
under which structures may be considered to be identical are dependent
upon the particular experiment. For example, topology and lengths of
segment are most important in the ligation experiment mentioned above
and precise nucleotide sequences interior to the DNA chains are of
little significance. In other experiments, doninant features involve
the locations of nicks and gaps. During the simulation, a structure
must be “instantiated" from a description in the pool of structures to
a level of detail consistent with the intent of the experiment. Then
the enzyme action is carried out on the structure resulting perhaps in
several changes. Finally, the resulting structure must be
reincorportated into the pool. If it is "equivalent" to another
structure, then it is a simple matter to increase the appropriate
concentration. Otherwise, a new structure must be added to the pool.

The idea is to pick “equality criteria” and “instantiation details”

32
broad enough to keep conputations reasonable but narrow enough so that
the results of the simulation correspond to laboratory results.

A second problem in simulation is the handling of

impure
enzymes, as for exanple, an exonuclease with endonuclease impurities.
This may involve the construction of an event queue type of simulation
in which the minor enzymatic action occurs as often as the relative
concentrations indicate.

Finally, a difficulty occurs when not only qualitatively
accurate answers are required from a simulation progran, but also
precise values of DNA structure concentrations at any monent in
experimental real time. This means careful checking, probably by our
geneticist consultants in the laboratory, of all contradictory rate
constants, as well as possibly adding a level of mathematical rigor to
sone already designed models of physical processes, e.g. probability
of DNA ends ina test tube solution coming close enough to join.
Again, we wish to emphasize the human engineering aspects we intend to
‘puild into all of our programs and probably we can rely on our
experience with DENDRAL and MYCIN. Full facilities for exanining
intermediate results in a natural manner to geneticists will be
provided, as will powerful interactive methods of control. The user
will be able to easily modify rate constants, starting DNA
concentrations, and physical properties like temperature and pH during
the simulation, and he will be able to trace a process’ backwards and

restart from any point with new parameters. The simulator will

33
interact with the MDNA structure editor to allow facile entry,

modification, and display of all structures.

34
3.4 Knowledge Base
The knowledge which must be represented in a problen solving
systen can be classified into three major categories:
1. knowledge which can be computed using a fornal
algorithm
2. knowledge (rules or procedures) for which no well-
defined algorithm exists but for which good heuristics
(based on expertise in the field) exist or can be
deveivpeu.
3. factual data
A strong attempt will be made to represent knowledge in a
uniform manner. Every item in the base could be viewed by the systen
in terms of a transformation at sone level of detail. Sone
transformations combine, separate or modify substances, sone merely
increase knowledge. A planning progran could view all data in this
manner. Certainly much of the knowledge in the first two categories
can be represented by procedures or rules, while nany different data
structures will be used for the representation of factual data. Sone
of the factual data may be incorporated into an algorithn or heuristic
procedure, The knowledge base will be organized in a hierarchical
manner so that it is easy for the system to access specific subclasses
of inivrcmatitun, such as enzyme knowledge, speciric experimental
tecuniques, Or YUNA Structural aata.

Central to the design of the knowledge base will be ensuring

35
that data entry and modification by the expert geneticist is done ina
way natural to him. This means providing a descriptive language which
allows the geneticist to express the diverse types of knowledge in a
language that is appropriate to the problem domain. The MYCIN systen
offers an excellent example to follow. It translates the input of the
expert to an internal representation and then gives the expert a
paraphrase of the input. The expert can correct the paraphrase
interactively until he is satisfied that the program has understood
correctly. With the diversity of knowledge MOLGEN is intended to
handle, we may ultimately have several different language subsets for
specialized use.

It is particularly important in a rapidly growing field such as
molecular genetics, that the knowledge base be easy to modify and
expand. Again, the MYCIN example is an excellent one. Any user can
add new rules to his own working space. If these rules prove useful,
the system staff adds them to the MYCIN program.

A cdifiacull, suporcanct prooiem 1s tne cnecking tor internal
consistency of the knowledge base. Eventually, we hope to develop
methods to check the internal consistency of subsets of the knowledge
base. For example, inconsistency in the enzyme descriptions could
cause application errors which would appear as incorrect planning
steps. Checking for consistency of the enzyme subset of the knowledge
base could alleviate this problen.

Another feature of our knowledge base will be a Literature

36
reference or other source identification for each item represented.
This source documentation will be referred to by the explanation systen
and will also be directly retrievable.

All of the design criteria outlined for the knowledge base in
general apply to the enzyme knowledge. It can be ordered
hierarchically: by enzyme function, initial substrate, product
substrate, pH levels. There is knowledge that fits into each of the
three general categories mentioned above. Furthermore, the type of
information needed for each enzyme is similar: name, reference, basic
type, substrate description, reaction catalyzed, and modifying
information about parameters such as pH, salt concentration and
inhibitors.

We expect the design of our enzyme knowledge base to bea
dynamic process lasting at least a year. The description language will
surely change as geneticists attempt to supply information using it.
Building a reasonably complete file for basic experiments will take
time and effort for both computer scientists and geneticists.

An exanple of how an enzymatic description might be used by the
simulation and planning programs would serve to clarify the need for
conprehensive data. Ligase, and its simple function of "sealing" a
nick in the DNA backbone by making a single phosphodiester bond, has
been briefly mentioned previously. A straightforward simulation
problem would be to determine relative populations of circular and

linear DNA after given periods of time of application of ligase to

37
known DNA Structures. For this simulation to be accurate, vrecise rate
constants of ligase action, and how they are affected by conditions
like pH, salt concentration, temperature, etc. must be provided in the
enzyme knowledge base. In general, the simulator will be accessing the
chemical details of the enzymatic mechanism. The planning program,
however, requires more information on applicability of enzymes’ to the
problem being considered--what substrate will a given enzyme act on,
what types of DNA will compete with, or inhibit the desired enzynatic
action. For example, if the geneticist wished a plan for inserting a
segment of foreign DNA into a host molecule for replication, the
planning program would have to pick an appropriate ligase from a
selection of possible candidates. Discriminating factors would be
those just discussed, substrates and inhibitors, as well as how well
experimental conditions would fit in with the rest of the plan. To
summarize, the simulator needs "acting" information; the planner
requires "discriminating" information.

The organization of the knowledge base is central to the design
of the system. The enzyme knowledge base will be used to test the
ideas sketched here. Of course, we will need to add other types of
knowledge concerning heuristics for planning, information about
laboratory techniques and physical processes in order to have a

workable system.

38
3.5 A DNA structure entry and editing system

One of the basic routines proposed for the molecular genetics
progscu is ani Gdatur, Lut Una (EuwA), alireauy pattrairy Cumpaeccecu.e iiie
auUca a>) tu Have) al AMLYLatLiive routine whith atcepts “text eurtor =
Styrze CUunmmanus aLLtuwing casy Mabpubab ion OL VNA SLructures which are
Presented Lo tne user itt PpLCLOLLaAL LOLWs ine inspiration roe such an
editor is drawn from an analogous routine in the DENDRAL project which
facilitates the viewing and manipulation of chenical structures. The
creation of the chemical structure editor has brought the internal
representations of chemical structures out to the expert chemist user
in a form that is natural to hin and easy to use. The result has been a
tremendously increased use of DENDRAL by chemists and an innediate
incorporation of the tool by other programmers working on various parts
of the project. We expect the EDNA routine to be used as a basic tool
in many programs within the molecular genetics project.

In its completed form, EDNA will provide the user with the
ability to edit DNA structures, build large structures fron smaller
ones, view them with several optional levels of detail, and save then
on file. In many cases, structures and parts of structures will be
referenced by name. It would be a simple matter, for exanple, for a
user to read a "T6-phage” DNA structure from a file and print out its
genetic map or any other level of detail to the extent that it is known
by the system. New details could be entered using easy “insert

segment" or “edit segment" commands. EDNA would be called by other

39
prograns, for example, by the simulator. The simulator will call EDNA
so that the user can specify the initial DNA mixtures and again to
print out the results of the simulation or in explaining the actions on
structures.

Underlying the pictorial representations created by EDNA is an
internal list structure representation of DNA. For exanple, a
nucleotide is represented by a node which contains information to
distinguish between DNA and RNA, the pyrimidine and purine bases, as

ugen ug7u

well as their methylated derivatives. The node includes > :

and "H" pointers to other nodes in the structure representation
corresponding to the naturally occurring chemical bonds of the same
names. Nicks and gaps in the DNA can be represented implicitly in the
list structure. Other formalized types of nodes are used to represent
Seetivaias vi wna where tune intormation 18 Less complete, tnat 1S, wnere
the bases or the exact locations of particular features are not known.
The EDNA program is already partially written and tested. At
this time various routines for drawing structures at different levels
of detail are running as are the basic routines for manipulating the
nodes in the list space. Several trial structures have been drawn and
saved on files including some structures with hairpin configurations
and cthers inVulViug matnds anu gapse Tie sSstruccure euating cCunmanas
are currently being implemented and the methods for superimposing

higher biological orders of structure, for exanple, the superstructures

of genes and special codons, are still in the design stage.

40
4 Resources

The principal computer science personnel involved in the design
and construction of the system components described in part III of this
proposal will be Professor Nancy Nartin at the University of New
Mexico, and two computer science doctoral thesis students at Stanford
University, Peter Friedland and (ttark Stefik. Molecular genetics
knowledge, expertise, insights, techniques, and experimental heuristics
will be provided by the researchers in Professor Joshua Lederberg’s
laboratory at Stanford, particularly post-doctoral fellow Stanislav
Ehrlich, and graduate student Jerry Feitelson. Professor Lederberg
himself will provide substantial amounts of time on a regular basis for
directing the project from the genetics viewpoint. Professor Edward
Feigenbaun and Dr. Bruce Buchanan will direct the computer science
aspects of the project.

Offices for the MOLGEN project will be provided within the
Stanford Heuristic Programming Project so as to foster interaction and
exchange of ideas with workers on similar projects. Active projects
within the Heuristic Programming Project include DENPRAL, a knowledge-
based system for the analysis of organic conpounds from spectroscopic
data, MYCIN, a system for the diagnosis and treatment of infectious
disease, and a project for the determination of protein structures fron
x-ray diffraction data. Approximately thirty workers including
faculty, research associates, and graduate students are involved anong

the projects. All of these projects are active in the design of
intelligent systems for specific application areas and there has been
considerable benefit from exchange and comparison of ideas.

The superb computing facilities of the NIH-supported SUMEX~AIM
timesharing installation (Carhart 1975) will be available at no charge
to this project. The SUMEX~AIM facility, with Prof. Lederberg as
principal investigator, is a national resource for the application of
artificial intelligence techniques to problems in biology and medicine.
Resources to be provided will include all CPU-time and storage
required. Those involved at Stanford will be operating through hard-
wired or dial-up equipment to the SUMEX PDP-10, while those at the
University of New Mexico will access the system through either the ARPA
network or TYNMNET.

The SUMEX-AIM facility is a powerful interactive conputing
system open to a national community. SAIL (Stanford Artificial
Intelligence Language) and other high level languages are available and
supported by a large system staff. Many convenient text editors for
developing programs are provided. The TENEX operating system supports
flexible file handling and sophisticated storage management’ for a

highly interactive computing environment.

42
5 Bibliography

Bates, D. J. and Frieden, C., 1973. "A Small Computer Systen for the
Routine Analysis of Enzyme Kinetic Mechanisms," Comp. and Bioned.
Res., 6, op. 474-486. .

Bertazzoni, U., Ehrlich, S. 9D., and Bernardi, G., 1973. "Radioactive
Labeling and Analysis of 3’-terminal Nucleotides of DNA Fragnents,”
Biochimica et Biophysica Acta, 312, pp. 192-2901.

Bloonfield, V. <A., Crothers, D. M., and Tinoco, I., Jr., 1974.
Physical Chemistry of Nucleic Acids, Harper and Row.

Buchanan, B.G., 1975, "Applications of Artificial Intelligence to
Scientific Reasoning,"”, 2nd USA-JAPAN Conp. Conf. Proc., opp. 189-
194

Buchanan, B. G., Sutherland, G. L., and Feigenbaun, E. A., 1969.
"Heuristic DENDRAL: A Program for Generating Exploratory Hypotheses
in Organic Chemistry,” Machine Intelligence 4, pp. 121-157.

Carhart, R. E., Johnson, S. M., Smith, D. H., Buchanan, Be. G., Droney,
R. G., and Lederberg, J., 1975, "Networking and a Collaborative
Research Community: A Case Study Using the DENDRAL Prograns," to
appear in Computing Networking in Chemistry, Peter Lykos, ed.,
American Chemical Society Symposium Series, No. 19, 1975.

Chance, E. M., 1967. "A Computer Simulation of Oxidate
Phosphorylation," Conp. and Biomed. Res., 1, pp. 251-264.

Chance, E. M. and Shephard, E. P., 1969. "Automatic Techniques in
Enzyme Simulation,” Comp. and Biomed. Res., 2, pp. 321-328.

Corey, E. J. and Wipke, W. T., 1969. "Conputer-assisted Organic
Synthesis,” Science, 166, pp. 179-191.

Crothers, D. M., 1968. "Melting Curves for DNA,” Biopolymers 6, pp.

1391-1404.

Crothers, D. M., 1971. "Theory of the Influence of Oligonucleotide
Chain Conformation on Double Helix Stability," Biopolymers, 10, pp.
1809-1827.

Davis, R., Buchanan, B. G., Shortliffe, E. H., 1975, "Production Rules
as a Representation for a Knowledge-Based Consultation Progran,",
Conputer Science Department Report No. STAN-CS-75-519

Dugaiczyk, A., Boyer, H. W., Goodman, H. M., 1975. "Ligation of Eco Rl

Endonuclease-generated DNA Fragments into Linear and Circular
Structures," J. Mol. Bio., 96, pp. 171-184.

43
Ehrlich, S. D., Torti, G., and Bernardi, G., 1971. "Studies on Acid
Deoxyribonuclease. IX, 5’-hydroxy-terminal and Penultinate
Nucleotides of Oligonucleotides Obtained fron Cal Thymus
Deoxvribonucleic Acid," Biochemistry, 19, 2000-2009.

Elton, R. Ase, 1974. “Theoretical Models for Heterogeneity of Base
Conposition in DNA," J. Theo. Bio., 45, pp. 533-553.

Fikes, R. E., Hart, P. E., and Nilsson, WN. Je, 1972. "Sone New
Directions in Robot Problem Solving," in Machine Intelligence 7,
Edinburgh University Press, pp. 405-430.

Garfinkel, D., 1968. "A Machine-independent Language for the
Simulation of Conplex Chemical and Biochemical Systems," Conp. and
Biomed. Res., 2, op. 31-44.

Green, S. B. and Garfinkel, D., 1970. "Simulation of Enzyme Systems
Using a Matrix Representation,” Comp. and Biomed. Res., 3, pp. 166-
173.

Harris-Warrick, R. M., Ehrlich, S. D., Elkana, Y., and Lederberg, J.,
1975. "Reaction and Purification of Bacterial Genes by
Segmentation of DNA with Eco RI Endonuclease and Agarose~gel
Electrophoresis," Proc. Nat. Acad. Sci., USA, August 1975, pp.
2207-2211.

Hart, P., Nilsson, N., Raphael, B., 1968. “A Formal Basis’ for the
Heuristic Determination of Minimum Cost Paths," IEEE Trans. Sys.
Sci. Cybernetics, Vol. SSC-4, 2, op. 100-197.

Kelly, T. J., Jr. and Smith, H. 0.1, 1970. "A Restriction Enzyme fron
Hemophilus Influenzae. II. Base Sequene of the Recognition Site,"
J. Mol. Bio., 51, op. 393-409.

Kulikowski C A, Weiss S, Saifr A., 1973. "Glacoma Diagnosis and
Therapy by Computer," Proceedings of Annual Meeting of Ass. for
Reserch in Vision and Opthamology.

Nathans, D. and Smith, H. O., 1975. "Restriction Endonucleases in the
Analysis and Restructuring of DNA Molecules," Ann. Rev. of
Biochem., 44, pp. 273-193.

Nilsson, N. Je, 1971. Problen Solving Methods in Artificial
Intelligence, McGraw-Hill.

Pople, H. E., Myers, J. D., and Miller, R. A.~, 1975. "DIALOG: A Model
of Diagnostic Logic for Internal Medicine," Fourth Int. Joint Conf.
on Art. Intel., 2, pp. 848-855

44
Powers, G. J., Jones, R. L., Randall, G. A., Caruthers, M. H., van de
Sande, J. H., Khorana, H. Ge, 1975. "Optimal Strategies for the
Chemical and Enzymatic Synthesis of Bihelical Deoxyribonucleic
Acids," J. Am. Chen. Soc., 97, pp. 875-888.

Rau, D. and Klotz, Le. C., 1975. "A more Conplete Theory of DNA
Renaturation,"” J. Chem. Phys., 62, pp. 2354-2365,

Reddy, D. &., Erman, L. D., and Neely, R. Be, 1973, "A Model anda
Systen for Machine Recognition of Speech". IEEE Transactions on
Audio and Electroacoustics, AU-21, p229.

Sacerdoti, E. D., 1973. “Planning in a Hierarchy of Abstraction
Spaces," Third Int. Joint Conf. on Art. Intel., op. 412-422.

Shortliffe, E. H., Axline, S. G., Buchanan, B. G., Merigan, T. C., and
Cohen, S. N., 1973. “An Artificial Intelligence Program to Advise
Physicians Regarding Antimicrobial Theory,” Comp. and Bioned.
Res., 6, pp. 544-560.

Shortliffe, E. H., Davis, R., Axline, S. G., Buchanan, B. G., Green, C.

C., .and Cohen, S.N., 1975, "Computer-based Consultations in
Clinical Therapeutics: Explanation and Rule-acquisition
Capabilities of the MYCIN System," Conp. and Biomed. Res., 8, 303-
320.

Siklossy, L. and Dreussi, J., 1973. "An Efficient Robot Planner which
Generates its own Procedures,” inird Int. Jo1ne cone. on Art.
intel., Dp. 445-45uU.

DKLar, Jeo, YOC, Pes ana Weissman, De Me, iv7>. “petermination of
Cenes, Restriction Sites, and DNA Sequences Surrounding the 6S RNA
Template of Bacteriophage Lambda," Proc. Nat. Acad. Sei, USA, 72,
pp. 1817-1821.

Smith, D. H., Buchanan, B. G., Engelmore, R. S., Aldercreutz, H., and
Djerassi, C., 1973, “Applications of Artificial Intelligence for

Chemical Inference IX. Analysis of Mixtures without Prior
Separation as Illustrated for Estrogens,” J. Am. Chem. Soc., 95,
6078

Sobell, H. M., 1973, “Symnetry in Protein-Nucleic Acid Interaction

Advances in Genetics, Academic Press, pp 411-490

Sulkowski, E. and Laskowski, Me, Sr., 1962. "Mechanism of Action of
Micrococcal Nuclease on DNA," J. Bio.w Chem., 237, pp. 2620-2625,

Winograd, T., 1972. Understanding Natural Language, Academic Press.
Wipke, W. T., Gund, P., and Friedland, P., 1975. “ALCHEM: A Language
for Describing Chemical Transformations,” in preparation.

Wong, A. K. C., Reichert, T. A., Cohen, D. N., and Aygun B. O., 1974.

"s Generalized Method for Matching Informational Macromolecular
Code Sequences," Conp. in Bio. and Ned., 4, pp. 43-57.

46
7. BUDGET

48
 

NATIONAL SCIENCE FOUNDATION
fashington, D. C. 20550

RESEARCH GRANT PROPOSAL BUDGET (TWO YEAR TOTAL)
2Year Beginning 6/1/76

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Institution: Stanford University
Principal Investigator(s): E. A. Feigenbaum, J. Lederberg NS¥ Funded
Program Name: MOLGEN: A Computer Science Application Man—months Proposed
to Molecular Genetics Cal Acad Sum Amount
A. SALARTES AND WAGES:
1. Senior personnel: a
a. (Co) Principal Investigator J. Lederberg — = = -
(list by name) ..E..A..Feigenhaum............ 18 11,954
b. Faculty Associates (10%) (100%) ~
(list by name) cece cece cece cece sew encceees
(Sub-total) Sees esses sesaeses Cw eee nce ae
2. Other personnel (Non-faculty ) /
a. Research Assoc. (Post-doctoral)
(list separately by name if available,
otherwise give numbers
Bruce, G; Buchanan, Research Computer Scientist | }oc9) 6858.
b. Non-Fac. Professionals (Other)
(list separately--by category, giving number,
€.g. one computer programmer )
e. ( 3) Grad Students (Res. Asst.) ............ 32,478
d. ( ) Pre-Baccalaureate Students ...eseeceeee
e. ( ) Secretarial-Clerical ..... sea c ens eeune
f. (  ) Technical, Shop & Other ........ vee eeae
Total Salaries and Wages wo... eee eee ee eee 21,270
B. STAFF BENEFITS: .....000s Cece rte ncaa eee e cence aaa 9,71]
C. TOTAL SALARTES, WAGES AND STAFF
BENEFITS (A + B) .iccccccccececcceccreeeeeeseenesees 60,981
D. PERMANENT EQUIPMENT:
(List as Required)
_ Purchase of, two, computer terminals 3, 55180
Total Permanent Equipment 5,180
E. EXPENDABLE SUPPLIES AND EQUIPMENT ...........ee0000: . 1,000
F. TRAVEL: ,
1. Domestic .....04. Pree meet tener eee eet n eas eene 2,000
2. Foreign (list as required) ...... Lene eeee Lec eeeee
Total Travel wee ec c ec cc ccc cence cece nee ccnes ~ 2,000"
G. PUBLICATION COSTS ~ 400
H. COMPUTER COSTS (if charged as direct costs)
T. OTHER COSTS:
(itemize by major type) Terminal Maintenance ; 960
Communigatians .(tenminalzto-computer sy project, business 1,500
Total Other Costs 2,460
J. TOTAL DIRECT COSTS (C through I) ..... eee c cae e ee eeee . 72,021
K. INDIRECT COSTS: +
1. On Campus ..... OL weve ccc cccccccuccvecvuceaues hy 523
2. Off Campus ..6...%8 Of ceecceccccccccccvcccuccvaces
Total Indirect CostS ..secceseeseeecceceevees WY 523°
L. TOTAL COSTS (J plus K) wavs. ccc e ee eee eee See e cece eee 113,544
M. TOTAL CONTRIBUTIONS FROM OTHER SOURCES ........ce0ee%
N. 113,544

TOTAL ESTIMATED PROJECT COST ......... Geet eseeeenscas

 

 
 

NATIONAL SCIENCE FOUNDATION
Washington, D. C. 20550

RESEARCH GRANT PROPOSAL BUDGET
Year Beginning 6/1/76

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Institution: Stanford University .
Principal Investigator(s): E. A. Feigenbaum, J. Lederberg Bee Ended
Program Name; MOLGEN: A Computer Sctence Application Man—months Proposed
To Molecular Genetics Cal Acad Sum Amount
A. SALARTES AND WAGES:
1. Senior personnel: — seek
a. (Co) Principal Investigator J+ Lederberg ~ ~ ~ -
(list by name) .....E..A..Feigenhaum..(LQ%)... o 2,753
b. Faculty Associates
(list by name) Lees cece ccc eee eee etre tenance .
(Sub-total) ..... eee beeen ence eee ees
2. Other personnel (Non-faculty)
a. Research Assoc. (Post-doctoral)
(list separately by name if available,
otherwise give numbers)
bd. Non-Fac. Professionals ( ther)
(list separately--by category, giving number,
e.g. one computer programmer)
ec. ( 3) Grad Students (Res. Asst.) ............ 16,224
a. ¢ ) Pre-Baccalaureate Students .......-e eee
e. ( ) Secretarial—Clerical .icsceceeaee seen
f. ( ) Technical, Shop & Other ...........eaee a
Total Salaries and Wages .....- ec eee eee eee 18,977.
B. STAFF BENEFITS: ........00c00- eee eee ess essseereuees 3,409
C. TOTAL SALARIES, WAGES AND S'TAFF a
BENEFITS (A + B) ..ccecccccccnccceucceeseeeecuneeeucus 22 , 386
D. PERMANENT EQUIPMENT:
(List as Required) teak
_ Purchase ,of ,two computer. terminals. lee, 5,180
Total Permanent Equipment 5,180
E. EXPENDABLE SUPPLIES AND EQUIPMENT .....ccceeeee ta aeae 500
F. TRAVEL:
LT. Domestic we. cece ccc ccc cece eee tee eter e ne nene 1,000
2. Foreign (list as required) eee e een ceases teens
Total Travel .. cee ccc ce ccc cer een e en eeee 1,000
G. PUBLICATION COSTS 200
H. COMPUTER COSTS (if charged as direct costs)
T. OTHER COSTS:
(itemize by major type) Maintenance of computer terminals 480
Communicatiens ,(terminal-to-computer,, project, business 750
Total Other Costs Phone, postage 1,230
J. TOTAL DIRECT COSTS (C through I) we. ee eee ec eee eee 30,496
K. INDIRECT COSTS: -
1. On Campus ..... Of vec ccccccccvccacececcevecees 17,438
2. Off Campus .1...%8 Of Lecce cece cece cee cence ee eees
Total Indirect Costs wesc e cece cece cence ee eenee 17,438
L. TOTAL COSTS (J plus K) veces ece cee cec cece eee eeeneee . 47,934
M. TOPAL CONTRIBUTIONS FROM OTHER SOURCES ..... ve eeeeeee
N. 47,934

TOTAL ESTIMATED PROJECT COST 4.1... eee eeaes ae nee aee

 

 
 

RAYIONAL SCLBNCE FOUNDATLON
Washington, D. C. 20550

RESEARCH GRANT PROPOSAL BUDGET
Year Beginning 6/1/77

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Institution: Stanford University .
Principal Investigator(s): E. A. Feigenbaum, J. Lederberg NSF Funded
Program Name; MOLGEN: A Computer Science Application Man-months Proposed
to Molecular Genetics Cal Acad Sum Amount
A. SALARTES AND WAGES:
1. Senior personnel: setek
a. (Co) Principal Investigator J. Lederberg — _ se a
(list by name) ..E..A..Feigenkaum..........., 9 2 9,201
b. Faculty Associates (10%) {T00%)
(list by name) co.cc cece cece ee cece ee ee enes :
(Sub-total) ..cseesccccc cece tence cece cease
2. Other personnel (Non-faculty)
a. Research Assoc, (Post-doctoral)
(list separately by name if available,
otherwise give numbers) ;
Bruce, &., Buchanan,, Research .Compyter Sclentist 11 6 838
b. Non-Fac. Professionals (Other) (25%)
(list separately—-by category, giving number,
e.g. one computer programmer)
ce. (3 ) Grad Students (Res. Asst.) ...cceeeeeee 16,254
a. ( ) Pre-Baccalaureate Students ..........06. —
e. ( ) Secretarial-Clerical ..eseeeeeceeceeee ; _
f. ( ) Technical, Shop & Other ..........e000.
Total Salaries and Wages ............. cece eee 32,293
B. STAFF BENEFITS: ......... ee . 6,302
C. TOTAL SALARTES, WAGES AND STAFF
BENEFITS (A +.B) .csecccceccecccccececeeeeeeeenes Leeee 38,595 _
D. PERMANENT EQUIPMENT: -
(List as Required)
Total Permanent Equipment
E. EXPENDABLE SUPPLIES AND EQUIPMENT ....... see eee beweee . 500
F. TRAVEL:
LT. Domestic cece ccc cece cece ne ne eee enter aren sneeas 1,000
2. Foreign (list as required) ...ee ccc ce cece eu eecees
Total Travel ...... ee eee ee eee ewe eee wae 1,000
G. PUBLICATION COSTS , 200
H. COMPUTER COSTS (if charged as direct costs)
I. OTHER COSTS:
(itemize by major type) Terminal maintenance 480
Fommunications. (terminals to-computer. project, business | 790
Total Other Costs 1,230
J. TOTAL DIRECT COSTS (C through I) wi... ee eee cece ee ees AY, 525
K. INDIRECT cosTs: +
1. On Campus ..... Of cece ccc cccccecceuccucvcenene 24,085
2. Off Campus ..... b OL cece cece cece cet e ese eeuenn
Total Indirect Costs ........ Lecce eee eee ees 24,085
L. TOTAL COSTS (J plus K) ........ sate cece eee tereace . 65,610
M. TOTAL CONTRIBUTIONS FROM OTHER SOURCES ...........00.%
N. TOTAL ESTIMATED PROJECT COST .... ccc eee cece eee __ 65,610

 

 
BUDGET NOTES

Salary increases estimated at 10%, effective Sept. 1.

* Equal to 2/9 academic year salary.

** Qver two-year period, lease price exceeds purchase price plus maintenance.
However, leases can be arranged if administratively more convenient to NSF.

*&4Professor Lederberg's activity on this project will be done without charge
to the budget.

+ INDIRECT COSTS: On Campus

56% of Total Direct Costs thru 9/1/76
58% of Total Direct Costs thereafter.