Research Proposal Submitted to the National Science Foundation Proposed Amount $294,476 Proposed Effective Date 6/1/78 _ Proposed Duration 24 (months ) Title MOLGEN: A Computer Science Application to Molecular Genetics Principal Investigator Edward A. Feigenbaum supmitting Institution Stanford University Soc. Sec. No. Department Computer Science Institution Branch/C (if different from submitting institution) en/vanpus: Address Stanford, California Addres ’ ess 94305 Branch/Campus Co-Principal Investigator___Joshua Lederberg Soc.. Sec. » iim Co-Principal Investigator Soc. Sec. No. ‘ If renewal request previous NSF grant No. MCS 76-11649 Make grant to Stanford University (name of institution or organization to which grant should be made) Endorsements: Principal Investigator(s) Dept. Head Institutional Admin. Offic: Name Edward A. Feigenbaum / . . Signature Edit A, Fljende— Edin 4h Fecaebe 7 : ¢ Title Professor and Chairman Telephone Number 415-497-4079 Date October 27, 1977 October 27, 1977 Table of Contents Section Page Subsection 1. Introduction - ee leh elle ll 1 1. 1 Motivations e * e * e e e . a a e e a 4 1.2 Progress To Date . . «1 « © © «© © «© «© «© e 4 1.3 Plans for the Remainder of this Grant Period . . . 12 1.4 An Example of the Genetic Utility of Automated Experiment Design . . « «© » »© © « «# «© «© JY 2. Research Plans a 2.1 Building and Maintaining the Genetics Knowledge Base ee le le ee 2.2 Recognizing and Abstracting and Saving Successful Plans. . .« «© « «© »© «© «© «© « « 22 2.3 Understanding Experimental Discrepancies by Hypothesis Formation . .« « « «© »© «» «»© «© « 25 2.4 Reasoning by Analogy . . « «© 2 © «© «© «© « « 33 2.5 Performance Evaluation and Improvement of AI Knowledge-Based Systems > 8 8 @© e© ew eh hehe (C8 3. Significance oe ee ele le le le le ll 3.1 Significance to Computer Science . . . + «+ + « 44 3.2 Significance to the Conduct of Experimental Science and to Science Policy . . . +6 « « « 46 4. Budget . , . . ° ° e * . . e. @ . e . « 49 5. Resources : Appendix I _ GLOSSARY . Appendix II EDNA -- The Editor for DNA ‘Appendix III A Genetic Planning Example References . ii October 27, 1977 50 52 59 63 67 October 27, 1977 1 Introduction This application addresses the continuation of research on the applications of artificial intelligence (AI) (1) to exper imental molecular genetics. It is an extension of a longstanding effort to cultivate attention to ongoing laboratory research as a domain of explorations in artificial intelligence. Our major effort in this field had been in the DENDRAL project, with analytical organic chemistry as the object discipline. The present effort, "MOLGEN", focuses on a new object domain, namely molecular genetics. However, for reasons that will be elaborated, the focus is on programs to suggest experiment- planning sequences needed to solve a given structure, rather than on hypotheses about the structures themselves, which characterizes DENDRAL, Our primary motivation is to deepen our knowledge of the art and science of creating programs that reason with symbolic knowledge to aid human problem solvers. The task domain--molecular genetics--serves as a rich intellectual and scientific environment in which to develop and test our ideas. The major computer science issues we are addressing are: Present Grant period (June 1976 - June 1978) (1) Creation of a knowledge representation system with a knowledge acquisition package. The system, known as the Units Package, may be used to build a knowledge base in any suitable domain. It provides an object- centered approach for storage of both declarative and procedural information concerning all entities in the domain. Section 1.2.1 (2) Structured representation of process information, Procedures which simulate the action of the various processes in the domain form an integral part of the knowledge base. Moreover, the representation framework allows for inspection and acquisition of those procedures. Section 1.3.3.1. (3) Creation of program schemata and instances for general problem solving steps. Domain-independent knowledge about general problem solving methods also fits into the knowledge representation structure we have devised. Section 1.3.3.3. (1) For definitions of many of the genetic and computer science terms used throughout this proposal, see Appendix I, October 27, 1977 (4) Domain Specific Critics. Mechanisms for the activation of various domain specific strategies when certain predefined situations occur during the course of experiment design. Section 1.3.3.4. (5) Development of a specific planning strategy designed to provide high-performance for the class of genetic experiments known as discrimination experiments. The idea is based on indexing abstracted experimental designs to the types of structural features for which they have proven useful. Section 1.3.4.° Grant Renewal Period (June 1978 - June 1980) (6) Creating a more comprehensive genetics knowledge base. Expanding the knowledge base within the area of DNA Structural manipulation problems. This work will be done mainly by the Stanford Genetics Department. Section 2.1. (7) Abstracting and Saving Plans. Recognizing when newly- created experiment designs are worth saving and then generalizing those plans so they are useful for more than the specific problem environment which caused their generation. This work will be done in the Computer Science Departments of both Stanford and UNM. Section 2.2. (8) Making use of the process of hypothesis formation to help debug MOLGEN-produced experiment designs. This process is especially important in a domain like molecular genetics where incomplete knowledge about objects and processes is the rule rather than the exception. This work will be done in the Stanford Computer Science Department. Section 2.3. (9) Experiment planning by analogy. MOLGEN provides an excellent environment for exploring various types of analogical reasoning. We integrate problem-solving by analogy into’ the experiment design system as one of the possible tools for solving subproblems. This work will be done in the UNM Computer Science Department. Section 2.4 (10) Performance evaluation as an integral part of the knowledge representation and acquisition system. We view the process of evaluating a system's performance and suggesting improvements as an AI problem solving task. The strategies for this evaluation will be October 27, 1977 stored within the same framework as all other MOLGEN planning strategies. This work will done in both the Stanford and UNM Computer Science Departments. Section 2.5. We have designed our effort to facilitate generalization to other domains beyond genetics in future research and applications. A second motivation behind the proposed research is to develop tools that can benefit molecular geneticists. We believe there is substantial. benefit to be derived from programs that act as "intelligent assistants" to scientists. First of all, the sheer amount of detailed knowledge a scientist is expected to know makes it likely that good experiments are being missed. Second, we believe that an intelligent planning assistant can offer some help in reasoning about the consequences of combining experimental facts in many possible ways. A third motivation for applying artificial intelligence techniques to an experimental science like molecular genetics is to help us better understand the scientific method. The rigorous detail required for creating computer programs that assist in the performance of scientific tasks forces us to explicate concepts and procedures much more carefully than practicing scientists usually do. During the current grant period, the basic MOLGEN system will be completed. (See Section 1.2 for details). This system will have a modest genetics and problem solving knowledge base, an ability to design a subclass of genetics experiments and a representation and acquisition system capable of handling all.of the knowledge used by an experiment design system. During the renewal period we will expand the system's expertise and use the system to explore new problem solving capabilities. (Readers can skip to Section 2 for details of work to be completed during the renewal period). ‘The basic MOLGEN system will provide an excellent framework for exploring more sophisticated experiment design activities and problem solving techniques. We have found in our analysis of the activity of geneticists that designing experiments is intimately coupled with forming hypotheses about the failure of the experiment design in the laboratory, then testing the hypotheses and, as a result of the tests, gaining new knowledge wtich effects the design. The new hypothesis formation component . of MOLGEN will extend problem solving capabilities in this direction. Another important element in the repertoire of an expert human problem solver is the ability to form analogies between problems and to use the analogy to solve new problems. Our exploration of analogical reasoning within the MOLGEN system will lead to a new problem solving component integrated with the rest of the design system. New expertise will come in a direct manner from the expansion of the knowledge base by geneticists. At the same time we will increase our problem solving knowledge base by adding a component which monitors the problem solving October 27, 1977 experience of the system and abstracts new experiment designs for the design process. The generalization process developed for this monitor will be used by the analogical reasoning component. Our final extension to the MOLGEN system is the ability to evaluate its own performance. This evaluation will aid the development of all of the other extensions, ll Motivations The successful initiation of and perseverance with interdisciplinary research of this kind can be hindered or aided by many intangible factors in the local context, as well as by the personal idiosyncrasies of the participants. We need only point to the history of the DENDRAL project as evidence of our proven capability, for whatever reasons, to engage in such research in the Stanford University environment. It is inevitably true that no one person can embrace a professional level of critical insight over all the fields represented, but this has been surmounted by dint of hard work on mutual communication from all sides. More problematical is the problem of communicating our findings to a disparate group of disciplinary experts outside, e.g. to computer scientists and to molecular geneticists who customarily have had little need of perusing each other's disciplines. The best solution is to get these parties engaged not just in reading proposals or publications, but in the actual manipulation of the working programs -—- particularly with the help of computer networks that link dispersed sites. Even so, an amount of technical detail that could well try the patience and skill of the most tolerant reviewer is unavoidable if we are to offer a concrete, fair picture of the status of our efforts, especially in the formative Stages. We trust that, as is often true, the maturation of these efforts can be accompanied by a simplification in presentation, and Meanwhile we must appeal to the patience and good will of constructive critics. The supporting documents attached to this proposal offer a sample of our working papers and publications. Further documentation on any aspect of the present system is, of course, available. 1.2 Progress To Date The following sections describe our progress towards developing an experiment planning program. The first part discusses a knowledge management system. Most of our design and programming effort has been in this area. A. package of programs for the representation and acquisition of knowledge is discussed and our progress in using it in creating a genetics knowledge base is also described. 1,2 October 27, 1977 The next section focuses on our examination of genetics knowledge. It discusses various studies we have made of genetics experiments and what we have learned from them. A short section describes some work we did in representing the action of the ligase enzyme and how this was of benefit to work in Prof. Lederberg's laboratory. 1.2.1 Knowledge Base Research The success of MOLGEN as an experiment planner will depend on the quality of its knowledge base. Therefore, much of the research effort to date has been in the design and implementation of a knowledge representation and acquisition system. All of the information relevant to the planning process will be an explicit part of the knowledge base. The motivation for this aspect of the design is the necessity to expand the program capabilities in a modular fashion and to explain the rationale behind the program's planning behavior. We need to represent concepts (e.g. enzyme), instances (e.g. EcoRI), relationships among concepts, and relationships among instances. In addition, we need to represent processes. Two recent papers [42] [3] have delineated problems which arise in interpreting a knowledge base when the semantics of the representation are not clearly specified. We have purposely limited the expressive power of our representations to enable us to clearly define their semantics. The result of this work is the Unit Package. Although this package has been designed in the context of our genetics application, the package does not contain any genetics knowledge. At this time one other group within the Stanford Heuristic Programming Project, the AGE group, is using the package. Our knowledge base design has been reported in the literature [24]. Other documents relating to its use are included in the appendices. Also included is a sample session illustrating the acquisition of a DNA structure and a current summary of part of the knowledge base dealing with enzymes. Section 1.2.1.1, Section 1.2.1.2, and Section 1.2.1.3 describe the implemented portions of the representation package. Section 1.3.1 discusses some of the further developments for the knowledge base that are planned for the rest of the current grant period. 1.2.1.1 The Unit Package: Background and Relation to Other Research Design of the representation package (termed the Unit Package) was started in November 1976 although most of the programming has been done since March 1977. The design of the Unit Package profited from 1.2 October 27, 1977 the experience of other efforts e.g. KRL-O [1], SMALLTALK [20], and the work at the MIT AI Lab on frames [27] [21]. One important difference in perspective between the Unit Package research and many other efforts is that it is at once a knowledge acquisition system and a knowledge representation system. The knowledge acquisition system is in the same spirit as Davis' work on TEIRESIAS [7] where schemata were used to guide the acquisition of knowledge. Another distinguishing feature of the Unit Package is that it is also used to represent and acquire process information, that is, action and strategy knowledge (see Section 1.3.3.1). The package performs many other knowledge Management tasks for the system. The basic element of our representation is an entity called a unit. Units are either "prototypes" or "instances". Prototypes are used to represent Woods' intensions [41], Brachman's concepts [3]. That is, a prototype defines a class, that is, the knowledge expected for a particular concept in terms of slots with the appropriate fields (e.g, role, value/specification, default). (2) Individuals are defined as instances of prototypes. Subclasses can be formed with a special link called generalization/specialization. Concepts can only have one generalization and instances only one prototype. Other relationships among units can be expressed by the value/specification of a slot or by creating an explicit relation unit. The knowledge acquisition system is used to acquire both prototypes and instances. This system is being used by system builders to create the bootstrap network described in the next section and by Jerry Feitelson, our genetics research assistant, for defining and instantiating knowledge about several families of enzymes. (See Appendix II and the supporting documentation for examples of knowledge acquisition.) Other knowledge base management chores include automatic bookkeeping and knowledge integration. The bookkeeping chores simply keep a record of all pertinent information about the creation or modification of a unit. 1.2.1.2 The Bootstrap Knowledge Base and Object-centered Perspective One important aspect of the design of the system is that the knowledge base contains knowledge about its own data representations. We have provided what we term a "bootstrap knowledge base." It contains domain independent knowledge about commonly used data types. When (2) The slots in a prototype are similar to Brachman's DATTR links. A limited version of Brachman's MODALITY link is implemented, MODALITY specifies the necessity of the specific DATTR in instances. 1.2 , October 27, 1977 using our knowledge base in a new domain, an artificial intelligence researcher would probably start with the bootstrap knowledge base and then proceed to create units for the specific knowledge of his task area. Both the AGE and genetics knowledge bases have been started in this manner. . The bootstrap knowledge base serves to illustrate our approach to extensibility. Most of the bootstrap knowledge base is made up of the procedures which capture the knowledge in the system about primitive datatypes. To add a new datatype to our system, one needs to provide the knowledge base with procedures for some basic Operations -~ such as editing and printing. Actually, the same approach is used in the unit package for defining a new datatype as is used for defining a new enzyme. The process of defining new datatypes requires, however, an understanding of Interlisp because the primitive processes in the system are grounded in that lenguage. New datatypes must be defined together with their basic operations and entered into the knowledge base, The approach used for attaching the basic operations to the primitive data types is very similar to SMALLTALK [20] and KRL-0 {l]. As far as the knowledge base is concerned, in order to edit a DNA structure, one sends the DNA unit an edit message. Another way of Saying this is that the procedure for editing a DNA structure is attached to the DNA unit. Thus, indexing of the "edit procedure" is via the "DNA object" so that this is an application of the object- centered viewpoint introduced in the languages mentioned above. This Organization makes it easy to add new datatypes to the knowledge base. An object-centered viewpoint is used in our system for other applications as well. An example of this is the "inspector" idea discussed in Section 1.3.3.4 which illustrates the Significance of object~centered indexing during planning. In that example, particular knowledge about related planning situations is associated with the genetic object -- "pH", Generally only operations basic to primitive data types are represented in LISP code. Other process information is expressed as units. Instances of process units can also be associated with objects. 1.2.1.3 Considerations of Human Engineering Because we expect our knowledge base to be maintained by geneticists, it has been important to carefully consider the interaction expected between a geneticist user and our Unit Package. This effort has resulted in the crafting of a package which is particularly easy to use. This ease is verified by the acceptance of our system by both geneticists and users on the AGE project, 1.2 October 27, 1977 One part of human engineering has already been mentioned above. The system automatically records who has entered or modified any of the information in the knowledge base. It provides summaries of what is in the knowledge base and can indicate which areas of the knowledge base have been changed recently. (See the attached supporting documentation for a summary of a recent version of the genetics knowledge base.) Another example of this human engineering is the "User Profile" maintained in the unit editor. A user's profile contains such information as whether he is a programmer; whether he’ prefers verbose printouts, whether he likes to be informed when the system does things automatically for him (i.e. that he has not requested). This information helps the system to tailor its interaction with a user. In addition, terminal communication is handled via very general routines which check for type ahead and allow help information to be provided to the user at any point in the dialog. The result of this is that an experienced user can have very brief interactions with very little prompting from the system. An inexperienced user will be prompted at each step of the dialog and may type "?" at any time to get an explanation or example of what the program expects. (See the User's Guide to Units in the attached supporting documentation.) Many of the human engineering parts of the system take advantage of the facilities of InterLisp. For example, when a programmer is changing some procedures in the knowledge base, he need not explicitly request that the changed functions be recompiled. This part of the updating is performed automatically at the end of a session. Our final example of human engineering illustrates the fuzzy borderline between the technical problems of making the system easy to use and some more complicated issues of knowledge base management. Recognizing the importance of experimentation in representing knowledge, one of the facilities of the unit system is to merge knowledge bases. This makes it easy to update various knowledge bases when changes are made -- for example when the bootstrap knowledge base is updated. Part of the transfer protocol includes the facility for one unit to imply a requirement for another. For example, suppose that the bootstrap knowledge base has just been updated so that it has knowledge about the representation of an important data structure - say "trees", If this is useful in representing knowledge in the AGE system, they need only request to transfer the "tree unit" from the bootstrap knowledge base. Any other units essential to complete the knowledge about "tree representation or tree operations" will be transferred automatically at the same time. The more difficult aspect of merging knowledge bases concerns how the system can determine which units should be transferred when a particular unit is transferred. Restated -- how can the system find 1.2 October 27, 1977 all of the relevant knowledge about a unit? Our approach to this involves a method of inserting knowledge about relevance into the knowledge base itself, Considerations like those in the previous paragraph are simple enough from the implementation point of view and are described in the appendices. They are representative of the effort we have taken to make the units package easy for a geneticist (or a programmer) to use. 1.2.2 Studying the Process of Experiment Design in Molecular Genetics Considerable effort was given during the first fifteen months of the MOLGEN grant to the study of the process of designing experiments. Attempts were made to determine subsets of the domain which would provide realistic target areas for automated experiment design and to collect the strategies and domain knowledge used by experts for those subsets. Detailed descriptions of the results of this investigation are given in the appendices. The work included four major efforts: 1. Designing as many rational experimental plans as possible for an experimental problem from Prof. Lederberg's laboratory, the determination of the presence of poly-A sequences in DNA. (See attached supporting documents: "The Embedded Sequence Problem") 2. Collecting "skeletal" or abstracted plans for a wide variety of structural elucidation problems. These plans are meant to capture the re-usable aspects of an experimental plan for use in planning other experiments. They are an essential component of an approach to planning described in Section 1.3, 3. Analyzing in detail the logic and development of a single experiment performed in Prof. lLederberg's laboratory. This study considers both the knowledge which led to the successful experimental design, and also the process of hypothesis formation to account for experimental failures. (See attached supporting documents for a copy of [14]) 4. Analyzing a collection of experiments suggested by several geneticists as good examples of logically produced experimental designs in order to extract the knowledge needed for MOLGEN to design the experiments. 1.2 , October 27, 1977 Our design of the system has been strongly influenced by these studies. Rather than give detailed results in the body of this proposal, we present a summary of these influences. 1. We will concentrate our initial efforts on experiment planning to the subfield of restriction enzyme experiments and then expand to applications to plasmid and sequencing experiments. 2. We will represent genetic objects so that they may be viewed from a variety of different perspectives. Our studies showed that different perspectives led to different experimental algorithms. (See [40] for a Similar finding in program synthesis.) 3. While we are restricting the range of objects and processes as indicated in (1), we must represent many different types of knowledge about each object. Even in a single experiment, a great diversity of knowledge was needed to design the experiment [14]. 4, The planning system will be one component of an eventual genetics expert. The experimental design process is much more event-driven than anticipated. The eventual expert system will need a hypothesis formation component and techniques to exploit serendipity. For example, results of a particular transformation may indicate interesting sidelights which could be investigated in the context of the current experiment. 1.2.2.1 Simulating the Process of Enzymatic Ligation An attempt was made early in the first year of the current grant to simulate the action of enzymatic ligation on DNA structures. There were several motivations behind this effort. We wished to test our newly developed DNA structural representation and to learn something about representing the processes that modify those structures. Also, we were presented with an interesting problem where we could be of actual use to a laboratory geneticist (Dr. S. D. Ehrlich). The experiment. in question was one in which sticky ended DNA of uniform length was reacted with ligase to ‘join together ends. Each ligation could either join two different DNA structures or could circularize a single structure, In the former case the new structure would still have two sticky ends and the total number of structures in the sample (concentration) would be reduced by one. In the latter case 10 1.2 . October 27, 1977 the number of structures would remain the same, but one structure would be "inactivated" for further ligation. The problem was to predict what the total sample population would look like after some portion of the original monomeric structures had disappeared. That is, what percent of the total structures would be monomeric circles, dimers, dimeric Circles, and so on. These percentages would be compared with laboratory results in order to determine how many different ends types were present in the sample. (The more different end types, the greater the ratio of circles to linear and shorter structures to longer structures). The simulation we developed was based ona kinetic theory of ligation [11]. The simulation results correlated remarkably well with the actual laboratory experiment when a model with two different end types was chosen, This gave additional support to Dr. Ehrlich's conclusion that indeed two different structure types had been present. The program was then used for the related experiment of maximizing inserts of a small gene into a plasmid. The idea was to cut the plasmid with the same restriction enzyme that had produced the sticky- ended gene, join the two, and then recircularize the structure to produce a recombinant plasmid. The problem is that many other structures, monomeric circles, long linear concatemers, etc., can occur. The problem was to determine optimum starting concentrations of the two structures and the length of time to carry out the ligating process to maximize the desired product. The suggested values given by the simulation program were of considerable assistance in carrying out the laboratory operation. The major significance of this work for future MOLGEN development (besides giving a working geneticist assistance and encouraging his ccoperation in communicating problem-solving knowledge to us) was that this was our first attempt to capture the knowledge needed to simulate a genetic process. We made no effort to consider environmental side effects like temperature and pH, but the results were still quite encouraging. The major problems we discovered were twofold and related. First, the simulation program was hard to expand to other problems. Adding other relevant information, for example stereochemical effects, or about new processes such as cutting structures, would involve considerable additional programming. Second, all of the knowledge was hidden in the simulation program (written in SAIL). The program could offer no explanation of its predicted results, nor could the user make interactive changes to its guiding theory. These results led to our decision to make all knowledge, not just knowledge about static genetic objects, explicit in our knowledge base. A discussion of how this is being accomplished is given in Section 1.2.1 and Section 1.3.3. li 1.3 October 27, 1977 1.3 Plans for the Remainder of this Grant Period 1.3.1 Continuing Knowledge Base Management Research Our progress in developing a knowledge representation system for MOLGEN has been discussed above. Our continuing effort during the grant period is divided into (1) filling ina knowledge base with entries specific to a specialized subset of molecular genetics, and (2) expanding our representational capabilities to include knowledge about the process of experiment design. 1.3.2 A Genetics Knowledge Base for Restriction Enzyme Experiments The Units package described above will be used to acquire and store the knowledge needed to plan experiments. The initial knowledge base will be limited to the facts and heuristics necessary to reason about restriction enzyme experiments. This will include: 1. Units for basic concepts like DNA structure, enzymes, and samples. 2. Units for each individual restriction enzyme. 3. Units for laboratory techniques which make use of restriction enzymes. 4, Procedures which describe the action of restriction enzymes on DNA, After the restriction enzyme knowledge has been debugged, knowledge about the fields of plasmid technology and sequencing DNA will be added. 1.3.3 Knowledge About Processes and Planning The basic planning work in MOLGEN will be divided into two Major efforts. General domain-independent planning Strategies will be developed within the units knowledge representation system as part of the process of making an "AI toolbox." At the same time, work will proceed on a specific strategy tailored to provide high-performance for discrimination experiments in molecular genetics. The specific approach for discrimination experiments is discussed below after the discussion of our general approach to process representation and planning. 12 1.3 , oo October 27, 1977 1.3.3.1 Representation of Processes Extending the Unit Package to accommodate the representation of processes is the important next step in the development of the knowledge base system. We view both simulation (of a laboratory technique) and planning operations (e.g. selection of a subgoal) as processes. Our approach to the representation of processes has previously been discussed [24] [36]; the main ideas are reviewed below. We want to provide a facility that makes process description and debugging as easy as possible. For example, the system will provide an appropriate set of primitive operations for the operations on DNA models. The geneticists and the computer scientists will create prototypes for laboratory steps. Geneticists will then use these prototypes as guides during the acquisition of knowledge about particular laboratory tools. The same prototype/instance system will be used to represent planning operations. Many of the processes we want to represent can be grouped into classes such that there is a prototypical action for each class. That is, the individual processes (e.g. action of EcoRI) can be viewed as instantiations of a process concept (e.g. action of restriction enzymes) just as individual objects are the extensions of object concepts. Many researchers in program synthesis (such as [9]) have used a "program schema" to represent a prototypical program. The schema is then instantiated by the synthesis system to produce a concrete program. A program schema consists of a generalized program with abstract predicate, function and constant symbols, input/output specifications, and restrictions on possible instantiations of the abstract symbols. This notion of a program schema (we will use the expression "process schema" synonymously) can be implemented within the current unit representation. The abstract predicate , function, and constant symbols are slot names. The restrictions on possible instantiations are specified in the value/ specification field. The abstract function symbols may be restricted to be either system primitive functions or instantiations of other program schemata. I/O variables and relationships among them can also be specified via the slot mechanism and the "relation" units. Within a rule unit there is a_ special slot which contains the generalized procedure. Thus, a program schema can in fact be developed in a modular fashion: first create the simpler, or primitive schemata; then decide on the constant, predicate, and function symbols needed (i.e. the named component parts in a process description); then create the generalized procedure using these symbols as the named components. Implementation details of rule units are currently being examined. While in synthesis programs, the system creates the 13 1.3 October 27, 1977 instantiation from program segments, in the current design of MOLGEN, the user creates the instantiation. The acquisition system must be able to check instantiations to make certain that the restrictions and I/O specifications have been met. The use of process schemata and their representation as units has implications for planning also. The visibility of process components in slots is important so that other processes can examine and select them according to the values in the slots. General problems of such indexing methods and comparisons of systems which use these methods can be found in the literature [8] [36]. In many cases it is appropriate to associate the process elements of the system with definitions of the objects. This “object- centered" viewpoint has been expounded [1] and implemented as "attached procedures". Several different applications for attached procedures in MOLGEN are given [24]. These attached procedures can be system primitives or instances of process prototypes. 1.3.3.2 Representation of States and the Planning Network Any system for doing problem solving needs to be able to work with representations of domain worlds (often termed "world states") and representations of problem solving states. An example of a world state in MOLGEN is a sample which might contain a variety of hypothesized DNA Structures, enzymes, and other reagents. An example of a problem solving state is a sequence of laboratory steps and world states which represent several steps of an experiment. We find it convenient to consider a problem solving state as having levels of detail, with different kinds of problem solving information on the different levels. A description of our approach using such levels is given in [36], section V.3. Two main ideas have influenced our design. First, the world states and planning states are fairly complicated structures. It is necessary for a planning program to communicate with its users ebout its current planning state. In particular, it must be able to display its progress in an easily understandable form and it must be able to integrate suggestions from the user to alter its course. Rather than create separate programs for doing this, we have decided to represent the planning and world state knowledge using units in the knowledge base, The second idea is that the process of problem solving can be expressed adequately by a small number of basic operations. These operations are used repeatedly in the course of solving a complicated problem. This is discussed in the next section. 14 1.3 October 27, 1977 1.3.3.3 An Eclectic Perspective on Problem Solving -- The AI Toolbox When we study the solution to a problem in an unfamiliar domain, the first reaction is to be overwhelmed by the new detail and terminology. When one has mastered the terminology, the problem solving process can be viewed in better perspective. To be sure, many solutions remain brilliant and surprising. The majority of the solutions are easy to follow and we may recognize the solution process in the mind of the problem solver, Here he is sketching out some goals; now he is selecting an operator; now he is refining a step; now he has factored out a subproblem. We believe that the process of problem solving can be modelled by a small number of standard operations. These operations include such things as (1) Sketching out planning islands; (2) Proposing subproblems; (3) Testing for mismatch of goal statés; (4) Focusing attention on part of the problem; (5) Assigning time sequence to steps; (6) Selecting among competing choices for a given step; and (7) Splitting a problem into cases. Although these have been recognized as basic components of strategy, no existing system has integrated this breadth of strategy knowledge into a knowledge base. Dershowitz and Manna [9] have suggested the use of program schemata to represent general problem solving techniques such as Divide and Conquer. These program schemata are instantiated by the program synthesis system to obtain concrete programs. In our system the geneticist and/or programmer will create the instantiation. A major component of this research will be in the construction of such program schemata and instantiations. The package created will be termed the “AI toolbox", The creation of a library of schemata has been suggested by Gerhart [19]. The package will be built and tested in the context of a small number of genetic experiments. Computer science experiments with the package will include a measure of its performance and utility for expressing appropriate strategy for experiments. Of course, having a number of basic tools does not guarantee that a program will use them correctly. Competence in problem solving in the domain requires that the expert system know where and when to apply the standard methods. Two factors are of importance here. (1) Some tools govern the use of other tools. For example, "Focus processes" will determine where to concentrate effort in a problem and indirectly control the selection of strategies. (2) The specific knowledge for using a tool is precisely the -type of information which is left unspecified in the prototype and which must be supplied when concrete instances are acquired for the knowledge base. The AI tool box is discussed further in [36] in section V.4. 15 1.3 , October 27, 1977 1.3.3.4 Simplifying Process Specifications by Removing Exceptions Planning can be plagued by exceptional cases. If high level planning processes are burdened with the detail of the special cases, they can become cumbersome to update and debug. We are developing an object-centered approach, termed inspectors, for distributing the information about special cases throughout the knowledge base. This should help keep the planning processes clear and concise and also provide localized packets of information about the exceptions. The basic ideas of this process of removing exceptions can be illustrated by an example of a “selection process" in designing an experiment. The operation of selecting among available laboratory steps is a recurring operation during experiment design. The following bear on this process: l. The experimental goals -- e.g. to extract a section of a molecule. 2. World State specification -- a description of the laboratory sample, e.g. DNA structures. 3. The selection criteria -- e.g. availability, sensitivity, or functionality of the laboratory technique. 4. Verification criteria -- test for deciding after a simulation of the selected laboratory step whether the essential goals have been satisfied. 5. Failure instructions -- what to do if the chosen laboratory technique does not satisfy the goals. 6. Laboratory step specifications -- a list of the potentially applicable laboratory tools and their descriptions. Our approach to managing the information about a selection process is first to identify the general information and _ then to consider the alternatives for placing information about the special cases, For example, section V.4.4 of [36] considers the selection of an enzyme to make cuts around a region in a DNA molecule so that it may be extracted. The general information in this case is that the main determinants of selection are enzyme availability and information about where it will make ‘cuts. Special case information includes modifications on this basic idea according to unusual variations in the structure of the molecule -- e.g. “AT-rich regions", “hairpin loops", etc. It also includes information specific to the laboratory steps being selected -- e.g. nuclease contamination in an enzyme. 16 1.3 October 27, 1977 Interactions between goals often arise in the special cases (3). When action was taken to satisfy the preconditions of the enzyme (in this case, pH was changed), a side effect resulting from interaction with an unusual molecular feature interfered with the action of the enzyme. Specific advice about failures of this kind can be associated with the "pH inspector" -- To avoid conflicts of this kind an enzyme should be used at a suboptimal pH. Inspectors may be viewed as domain specific versions of planning critics as developed by Sussman [37] and Sacerdoti [32]. Generally there are many tradeoffs involved in deciding where to locate the information about special cases. The example above and the tradeoffs are described in detail in section V.4.4 of [36]. 1.3.3.5 Further Aspects of Knowledge Base Management One further area of work in knowledge base management that we will be pursuing is in developing a number of modest aids for a user for keeping track of a growing and evolving knowledge base. As discussed previously, we have already developed some facilities for automatic documentation of the knowledge base. We will be designing aids for a user in tracking down unexpected conflicts. For example, if two geneticists sharing a knowledge base have a somewhat different view of some aspect of the domain, one may make changes effecting the other's area. In the’ event of a failure, it would be useful in many occasions to get a summary of recent changes to the knowledge base. When achange is contemplated to the definition of some class of enzymes, it would be a simple matter to locate all of the rules which mention those enzymes. These rather simple aids are expected to be useful when fairly extensive changes to the knowledge base are contemplated because they will assist the user in being thorough about making his changes, 1.3.4 A Method for Designing Discrimination Experiments One of the applications of the genetics knowledge base will be the building of a high performance system for designing a variety of discrimination/analysis' type experiments. The goal in these experiments is to learn something about a given sample. of DNA structures. For example, are any poly A regions present, or do the structures carry tetracycline resistance? The basic method used for the experiment design system will be means-ends analysis combined with hierarchical planning as follows: (3) See Section III.2.5 of [36] for a discussion of recent techniques for handling interactions between goals. 17 1.3 October 27, 1977 1. A model structure containing the hypothesized featur (s) will be compared with a structure representing generalized DNA of the same type in order to ascertain exact differences. 2. These differences will be ordered and one selected as a basis for initial planning. 3. An experimental’ strategy, ranging from very specific (e.g. if a bubble is present then denature and use EM) to very general (e.g. label the feature and then look for the label) will be selected from a library of such "skeletal plans," using the difference selected. 4. The skeletal plan will be adapted to the specific problem environment, with hierarchical planning proceeding as deeply or shallowly as is desired by the system user. , 5. The completed design will be tested in a forward direction for completeness and consistency by an evaluation system. 6. If a successful design cannot be found from the feature selected in (2), then either a new feature will be selected from the difference list, ora generalization selected from a tree of structural features will be used. For example, if the selected feature was the. exact base sequence ATTGC, a generalization might be made to ‘known base sequence," The following major components are needed for this exper iment design method: 1. A problem analysis preprocessor which recognizes key features in the nucleic acid structures, nicks, gaps, hairpins and the like, and then compares these Structural features to find major differences between candidates in a discrimination experiment. This program will organize features into a hierarchy of importance for experiment design using a rule-based system for analysis and classification. 2. A tree structure for ordering structural features and providing links between features and the classes to which they belong. The links will point. upward as generality links to a more general class or downward as specificity links to amore specific one. The highest links will be to the complete class of 18 1.3 SO, October 27, 1977 structural features, the lowest ones to individual specific features, Intermediate levels will consist of important subclasses, e.g. "poly base sequences” which would point downward to "poly A," "poly T," "poly G," and "poly C," and upward to "AT/GC ratio", which would itself point upward to known base sequences." This tree will provide entry into genetic strategies classified by the features to which they specifically pertain. Failure to find knowledge about what to do with a specific feature will cause the system to search for knowledge about the subclass of features immediately above the specific feature in the tree, 3. A cohesive system for hierarchical planning in the domain of molecular genetics--selecting strategies as described above and refining them downward to specific laboratory tools. This of course involves the usual need for error recovery and backtracking facilities. 4. A system for evaluating completed designs at any level of generality to be sure the plan asa whole "fits" togethers, i.e. a forward-working system for plan evaluation. These components will be integrated into the basic MOLGEN representations framework of the Units system described above. The rules describing the problem analysis preprocessor will be individual units, aS will the descriptions of structural features which combine to form the tree structure of the feature hierarchy. Individual dynamic planning states will be represented as units within the MOLGEN system. After the experiment design system becomes operational for discrimination/analysis type experiments, we intend to adapt it to experiment-planning for synthesis experiments where the goal is to produce some desired DNA structure from available starting materials. 1.4 An Example of the Genetic Utility o£ Automated Experiment Design A good way to illustrate the potential utility of computer assistance in experiment design is to show how some recently published work from another laboratory might have been represented in a MOLGEN formulation, albeit human intelligence was the instrument. The work in question achieved the cloning of the gene for rat insulin ina bacterial plasmid vector [38]. The major goal of the experiment was the transfer of a gene coding for insulin from the rat to the common 19 1.4 October 27, 1977 intestinal bacterium, E. coli. An important subtask of the experiment was: given samples of two different linear DNA structures with "sticky" ends, produce acircular structure containing one molecule of each. The difficulty of this problem lies in the number of competing processes. Both structures can self-circularize, and many different linear and circular monomers can be produced. See Appendix III for further details. Previous attempts to cope with the problem were based the kinetic theory of ligation [11]. Using a model of the process based on concentration and molecular weight of the structures, one varied various experimental parameters to maximize the amount of the desired product. The new idea of the article was based on a different strategy--try to eliminate competing processes. This led to a method for modifying the sticky ends of the two structures so that they could no longer self-circularize. The particular method chosen was a simple biochemical step. The solution was retrospectively self-evident, but in fact, it was missed by many geneticists who had previously examined the problem (or related others). An intelligent experiment design system with the above mentioned heuristic probably would not have overlooked the solution. 2 Research Plans The bulk of this application comprises details of research strategies for the first two years of the renewal period. It is of course more difficult to forecast over longer periods; indeed there is every likelihood that unforeseen difficulties and opportunities will intervene to offer changes of perspective. However, enough progress may have been made to enable us to move from the initial stepping stones, and our current plan for years 3 and 4 is as follows: We will by then have invested substantial effort (mainly through the cooperation of investigators with direct support for molecular genetics research) in building and maintaining the knowledge base in the specified domain. We believe that this investment should be exploited before substantial further efforts are made to expand the domain ~~ e.g. in more biological aspects of genetics -- although a few easy opportunities will surely present themselves. Instead our emphasis will be on the evolution of MOLGEN to a hypothesis-oriented system (like DENDRAL). Typical questions at that level would be: from the data given, what are plausible hypotheses for the structure of a sample of DNA; as well as, what experimental steps should be pursued to verify the hypothesis. Intermediate steps have already been mentioned: the elaboration of hypotheses in the course of debugging, and the extension of MOLGEN from a sharp binary discrimination (between two 20 October 27, 1977 stated alternatives) to a corroborative mode (a given hypothesis versus all plausible alternatives). This effort will require a good deal of work on the planning components and on rules of plausible induction, and relatively little on the knowledge base of molecular genetics per se. For that reason it should have the greater value for extensions to other domains. 2.1 Building and Maintaining the Genetics Knowledge Base One important building block for MOLGEN's success is the creation and updating of a base of knowledge about genetics, This provides both a useful reference source for the user planning his own experiments, anda body of core data on which automated experiment planning strategies can be tested and evaluated. The knowledge acquisition and representation ideas have been delineated in detail previously. This section describes advantages of the resources available here to support this work, The strength of the Stanford community of biologists, biochemists, and geneticists offers a unique opportunity for collaboration in building a substantial body of knowledge about on- going research projects. Many of these projects may prove to be suitable sub-domains for initial development of the knowledge base, and may provide test domains for work on problems of knowledge acquisition and knowledge base management. They offer the added advantages of insuring that this work is solidly grounded in real-world experiments. . Updating and checking the knowledge base with respect to the information obtained from our collaborators will be the responsibility of the graduate student and post-doctoral fellow in genetics. On oppornunistic occasions they will also experimentally validate those experiment plans generated by the system which are relevant to current work in the lab. This has already been done in the case of ligation kinetics, where theoretical predictions of plasmid self-circularization were compared to electron-microscopic empirical tests (Section 1.2.2.1). One useful side-effect of this process that we anticipate is that the process of formalizing knowledge about the domain may help to organize what is currently an informal body of knowledge, and in doing so may even uncover gaps in our current store of knowledge ahout the field. 21 2.2 October 27, 1977 2.2 Recognizing and Abstracting and Saving Successful Plans As noted above, our efforts to create an experiment-design system center around the concept of a large knowledge base containing task-specific information. One approach to augmenting this knowledge base is via interactive knowledge acquisition as discussed in Section 1.2.1.3. A second form of knowledge-base improvement is based on giving the system the ability to save successful experiment designs it has generated. This involves two major functions: recognizing when a plan is worth saving, and abstracting a plan so that it is applicable to a wider range of problems than just the specific one which prompted its creation. For example, consider the design MOLGEN might produce for the problem of ligating two genes. An initial attempt would be made to produce "sticky" ends by cutting both genes with a single restriction enzyme. Suppose, however, that no restriction enzyme which satisfied this criteria could be found. One possible solution would be to cut the DNA near one gene with restriction enzyme A, and near the other with a different restriction enzyme B, then join the two segments by means of a small piece of artificially created DNA which had been cut on one end by A and the other by B. This general idea -- the concept of a “molecular adapter" [35] -- is very useful for problems of this sort. While recognizing and abstracting the relevant ideas from the specific experiment design above is difficult, the ability to do this would be an important form of knowledge base augmentation. Automating the function of recognizing when a plan should be saved will rely on several interrelated factors: 1. Was the plana "good" one, i.e. does it solve a problem with reasonable efficiency, cost, safety, etc.? This measure of "goodness" will be difficult for the system to judge alone. The system will rank alternate plans according to its heuristics which involve these factors, but an absolute measure of plan quality will be a very difficult measure. Until the knowledge base includes adequate metrics for the "goodness" of plans in a variety of subfields, we imagine this judgment will be mostly up to the user. 2. If the plan solves a problem for which other plans already exist in the knowledge base, is the plana significant improvement? Making this judgment involves use of the same "goodness" metrics discussed above as well as an analysis of why the new plan was not simply a copy of the old. If the only difference between the two plans is that the new one takes account of some detail of the environment for a specific problem, then it probably isn't worth saving. Again, we imagine this judgment will, at least initially, be mostly up to the user, 22 nan 2.2 October 27, 1977 3. is the cost of saving the plan less than the cost of regenerating it each time the same problem arises? Empirical criteria applicable to making the tradeoff decision might include CPU time spent in generating the plan, and, if the plan evolved directly froma previously saved one, the number of differences between the two plans. Another important measure is some judgment about how often the plan will be useful in the future. The more generally useful a plan, the more important it is to keep it around. 4. Can the plan be abstracted to more general purposes? In the example given above, this would involve recognizing that the concept of a "molecular adapter" would be useful any time the goal was to join two pieces of DNA which did not have convenient restriction sites in conmon, The problem of abstracting a plan that has been selected for saving has been considered before, in the robot planning domain, in the work on MACROPS in the STRIPS system [15]. STRIPS generalized successful plans in the following manner. Plans had variables which were bound to objects in the robot world. The idea was to "unbind" these variables as much as possible, with certain constraints imposed by the nature of the particular plan. For example, if one step of a plan told the robot to go to a block and the next step said pick up a block, the block in these two steps must be the same one. Initial references to specific doors, blocks, rooms, etc. were generalized to “any door," "any block," “any room,", etc., but took into account constraints of the sort noted above. We will begin by employing much the same generalization process. For the example plan given above, the specific restriction enzymes could be generalized to any two distinct restriction enzymes, and the two genes to any two DNA sequences lacking a common restriction site. This generalization process should be a natural consequence of the hierarchical nature of our knowledge base. The taxonomical classification of DNA structural features discussed above (Section 1,3.4) could be used to generalize plan utility~--e.g. if a plan solves a problem for nicks, maybe it can be generalized to solving that problem for the immediately more general parent of nicks, DNA excisions. A plan which was useful for recognizing a specific base sequence might be generalized to one which was useful for all "known base sequences." The generalization of variables within a plan will be done by moving up generality links (see Section 1.2.1.1) in the knowleige base. A particular exonuclease .could be parameterized to mean any exonuclease--the parent of al] specific exonucleases in the knowledge base. One difficulty lies in knowing how far it is possible to go up the generality links in the knowledge base before losing plan utility. A specific enzyme could be unbound to refer to "any enzyme," but that 23 2.2 October 27, 1977 would yield a plan which said "pick any enzyme", and would probably be far too general. The problem lies in detecting the important (i.e. important to this particular plan) features of each object used in the plan, and trying to retain those features while generalizing out all “irrelevant details." In the plan discussed above, the important feature of the chosen restriction enzymes is that they cut at specific Sites, not that they were derived from some particular organism or that they operate optimally ata certain temperature. Note that such information (the degree of relevance of the features) will be available from the hierarchical planning phase, since it will have been used to choose among competing laboratory "tools". In this case, for example, restriction enzymes were chosen because they cut at specific sites. So, in unbinding the notion of a particular restriction enzyme, if the knowledge of the importance of specificity is maintained, progress up the generality links will end at “restriction enzyme." 2.2.1 A Casebook of Unsolved Problems A related area of investigation -- recognizing novel combinations of laboratory techniques -- is based on the observation that new tools are continuously being developed. Sometimes a particularly useful combination of tools is available. for quite some time before it is recognized as such, as inthe use of alkaline phosphatase to inhibit self-ligation mentioned in Section 1.4. One approach for doing this is to keep on file a casebook of important or unsolved genetic problems. Periodically as new laboratory techniques and planning strategies are added to the knowledge base, we will run the planning program on the test cases. The process which we would like to model is embodied in the situation where a scientist, after hearing of a new laboratory technique, recognizes a laboratory problem for which the technique could be profitably applied. Thus, the first step of the process is to select a problem (from the casebook) after being presented with a new technique. Then the existing planning methods in the system would be applied on the problem again with the new laboratory technique encoded in the knowledge base. The next phase is deciding whether a new solution using the new technique offers any advantage over previous solutions. This would make use of the work described in Section 2.2 for abstracting and evaluating plans. In summary, this process will make use of other developments in the MOLGEN project for applying newly discovered techniques and for evaluating the plans produced. These methods will be applied to form a discovery process by augmenting them with a casebook of experiments and a selection process for picking experiments. 24 2.3 October 27, 1977 2.3 Understanding Exper imental Discrepancies by Hypothesis Formation Plans for molecular genetics experiments have something in common with computer programs, summer vacation trips, and almost any kind of plan devised -- they don't always work. Although published reports about experiments usually say little about the techniques which failed, our analysis of the actual development of experiments has shown us that debugging is an essential and anbegral part of successful laboratory experimentation [14]. Pioneering work in the debugging of computer programs [37] used a process of program development and correction using a knowledge base about bugs, Sussman has advocated creating a rough version of a program on a first pass followed by local corrections by a gallery of critics. In Sussman's work, the ignoring of detail during the first pass is a source of power for the approach. In experiment debugging, we find an additional reason for attending to details during a second pass ~—- the knowledge necessary for deciding which of the details are in difficulty is not available at the time the experiment is designed. For example, many assumptions about the input samples are made when the experiment is designed and the validity of these assumptions cannot practically be tested until the experiment is performed. Section 2.3.1 describes several distinct sources of error in experiment design. Section 2.3.4 describes a number of research issues, 2.3.1 Sources of Bugs The complexity of most laboratory techniques is such that there are many ways for an experiment to go awry. The following diagram will be used to categorize the sources of bugs. 25 2.3 October 27, 1977 Real Laboratory Step Parameters V Measured -> | Laboratory Step | -> Measured <-- Input. no Output | State State | | | | Laboratory Model -- MOLGEN Simulated Laboratory Compare | Planned Step Parameters | | | V | Hypothesised 0 ser Computed <-- Input -> | Laboratory Step | —-> (Expected) SEACC nt rere erent Output State In this diagram, we assume that a complete plan for a laboratory experiment has been developed. In this diagram, we will focus ona single step in the experiment. The upper part of the diagram corresponds to the step as carried out in the laboratory -~ illustrating that a laboratory sample undergoes processing in a laboratory step resulting in a changed sample. Measurements are made on both the input and output samples. (4) The lower part of the diagram corresponds to a model of the laboratory process. This may be either an informal model in the experimenter's thoughts or a formal model in the MOLGEN knowledge base. The model transforms a hypothesized version of the sample using a process model of the laboratory step to yield a model of the expected results of the laboratory step. If at some point in an experiment the measurements predicted by the model fail to correspond to actual measurements in’ the laboratory, a "bug" has been encountered in the experiment. Referring back to the diagram, the source of the bug may be any of the following: l. The hypothesized input sample may be inaccurate. 2. The planned parameters for the laboratory step might be not correspond to the actual parameters used. (4) Measurements are examples of laboratory steps, subject to the same descriptions and errors as other laboratory steps. This diagram has been simplified by leaving implicit the measurement step. 26 2.3 October 27, 1977 3. The process model for the laboratory step could be inaccurate. This inaccuracy could be in either the laboratory step shown explicitly or in one of the implicit measurement steps. 4, The error may be traceable to any of the above sources in any of the previous steps in the experiment. 2.3.2 Knowledge about Debugging A Molgen debugging system must be able to generate and test hypotheses about bugs from any of the sources mentioned above. The first source of bugs mentioned above -- an inaccurate hypothesis about an input state -- is a common source of difficulty in experiments. Knowledge about DNA structures, a major part of an input state inmost experiments, is almost never complete. For example, there is a danger of minor damage to DNA structures in almost any laboratory step so that nicks, gaps, and various forms of erosion tend to appear in these structures in the course of an experiment. Whereas experiment planning derives much of its power by ignoring such details, experiment debugging is likely to derive its power by scrupulous attention to these possibilities when problems arise. A discrepancy in the parameters of a laboratory step is similar in principle toa discrepancy in the input state. There are always many possible sources of error here: Was the pipette sterile? Were there nuclease contaminants? Was the pH controlled properly in the buffer? If a simple error of this kind can explain the discrepancy in the experiment, there is probably little need to lock farther. A more difficult source of error occurs when the model of the laboratory process is inaccurate or incomplete. Since most of the debugging which we have observed involves the other sources of bugs, we expect to put most of our effort there. We have, however, some modest ideas uSing analogical reasoning for extending and correcting process models. Suppose that we suspect that knowledge about enzyme A is incomplete, and suppose also that we know that enzyme A is similar to enzyme B (a well-understood enzyme). If the operation of enzyme B depends on the concentration of a magnesium ion and the model for enzyme A does not mention this ion, a reasonable question in the face of a failure in using enzyme A might be whether the model for enzyme A is deficient in this aspect. While very simple, this approach may prove to be adequate to provide useful hints. 2.3.3 A Debugging Example This section gives an example of debugging a step in a genetics 27 2.3 October 27, 1977 exper iment. (5) The example illustrates two models of genetic laboratory steps and shows the generation and testing of two hypotheses about abug. After the fairly lengthly example of debugging which follows, the research issues involved in this approach are presented in Section 2.3.4. 2.3.3.1 Generating Debugging Hypotheses As stated above, experiment design must usually be done with incomplete knowledge. In this example, the sample consisted of uniform DNA molecules from a bacteriophage. A laboratory step was proposed to cut the DNA molecules into smaller segments so that a desired gene (Thy) would be located on a smaller segment for later manipulation. A restriction enzyme (EcoRI) was chosen on the basis of availability in the laboratory to perform the cutting. It was not known (a) whether the Thy gene had a restriction site for EcoRI (i.e. whether the enzyme would cut the gene) or (b) what the size would be of the segment carrying the Thy gene. It was essential for later steps in the experiment that the Thy gene be located intact on an appropriately Sized segment of DNA. To test whether the Thy gene was functional after cutting the DNA with the enzyme, a checkpoint was used involving "transformation". Transformation is a process by which bacteria can incorporate DNA from their growth medium. In this case, conditions were established so that the bacteria could survive only if they were transformed by a functional Thy gene. The bacterium used in the transformation test was Thy- B. subtilis. Lacking the ability to synthesize thymidine itself, this bacterium can normally grow only if the medium supplies the thymidine. During transformation, if bacteria can incorporate a piece of the digested DNA carrying the Thy gene, the “transformed" Be subtilis will be able to grow on a medium lacking thymidine. The uncut bacteriophage DNA is capable of transforming B. subtilis in this manner, and it was assumed that the cut DNA would function similarly. The unexpected observation (bug) was that although the cut DNA was indeed capable of transforming the B. subtilis, it did so with a greatly impaired efficiency. A fix waS needed or the subsequent experimental steps could not be performed. At this point, two hypotheses were suggested to explain the low efficiency. 1. The EcoRI enzyme cut the gene -- damaging its transforming ability. (5) The example is from the beginning of the experiment described in [14]. 28 Ldeesedl 2.3 October 27, 1977 2.. Transforming activity decreases if the DNA segments are too small. The first hypothesis derives from fact that a gene which has been modified will function with an impaired efficiency. In this scenario, if the B. subtilis has been transformed with a damaged Thy gene, it will still grow on a thymidine deficient medium, but not as effectively. The Thy gene could be damaged if it has an EcoRI site. The second hypothesis was suggested from the fact that the length of DNA fragments (and other structural features) are known to influence the transformation process (6). In this scenario, if the segment containing the Thy gene is too short, some part of the transformation mechanism would operate less effectively and the gene would fail to be incorporated. Both of these hypotheses may be derived. by analysis of the process models for digestion (7) and transformation. These models are given below, (6) See [28] for a detailed discussion of transformation. (7) “Digestion” is a technical term referring to the cutting action of an enzyme. 29 2.3 October 27, 1977 Partial Models for Digestion and Transformation (Digestion Model) EKREKRERER KEK REE RRR RK RRR RERERERERERERRRERAKRERERREEREKREREKK Sufficient Enzyme Concentration --> no \ | yes | | Sufficient Reaction Time? —> no \I | yes | | | oo (Complete Digestion) (Partial Digestion) | |. | | Cut DNA at EVERY Cut DNA at SOME of restriction site the restriction sites | | |-----< nano / | . Break DNA into 2 segments at the affected sites. (Each segment terminated with characteristic sequence.) | | Has a gene been cut ? --> yes \ | no | | gene disabled | |--< < | KEE EREKKEKKRKREERERREEREERERERKERERREREREREERKREREREKE RRR +b FF Ob Oe Oe HOO eb Fb OOO OOOO 30 2.3 October 27, 1977 (Transformation Model) KEK KEKREKRIKER ERIK ERR IKE REI E RRA EER ERE KRREAEKERRAEREKREKRKKREKKKRERE DNA pieces toc small? --> Yes (decrease efficiency) | no | | < < | | DNA circular? --> no (Linear pieces subject to | yes exonuclease degradation.) | (decrease efficiency) | | | | | Medium contains All essential nutrients ? -—--> Yes (Grow) { no | (Some nutrients are missing.) Is there a gene to synthesize the missing nutrient? -—> yes -------- \ {| no | | | | Is the gene | intact? | no | | yes | | | (No Grow) (Growth (Grow) | impaired) + ee Fe Oe HF HO OH OO OO OOOOH HH OH HO OF OOOO KEKEEKKERE EKER ERR ERERERERRE EERE REKERERER ERR RRR REE RERER EKER EEK 2.3.3.2 Testing the Debugging Hypotheses Once a hypothesis’ has been offered to explain a discrepancy in an experiment, it is often possible to propose a measurement which can validate it. If the hypothesized change to the model of the input state is adequate to account for the discrepancy in the output state, that is suggestive that the source of the bug has been found. Other times, a hypothesis about molecular damage may Serve to suggest an alteration in the experiment or a way to fix the damage. Tf more than one hypothesis for the cause of failure is found, it is necessary to determine which of the bugs is the most likely. In 31 2.3 ; October 27, 1977 this exper iment the following knowledge was used in discounting the first hypothesized bug: Test for restriction site: If a gene can be: shown to be functional at all after its DNA has been completely digested by a restriction enzyme, the enzyme probably does not cut the gene. , Test for Completeness of Digestion: If no change is observed in the restriction pattern (in electrophoresis) for DNA after a ten- fold increase in digestion time and enzyme concentration, the DNA may be assumed to be digested to completion. (8) The completeness of digestion was demonstrated in the manner suggested, indicating that, Since there was a measurable amount of transformation activity, the difficulty lay in the size of the fragments after Eco RI digestion. Attention now goes to the process of finding a fix for the problem. This phase of the debugging process is essentially an application of the experiment planning part of MOLGEN. In the current experiment, methods were sought which would allow the DNA to be cleaved, but in such a manner that the segment containing the Thy gene would be left on a larger fragment. ‘Two possibilities were considered for this -- selection of an alternate restriction enzyme and a “partial" digestion in place of a complete digestion by EcoRI. The models which were used for generating the bug hypotheses may also be used to generate the alternative of partial digestion. In particular we can derive from the models that (1) the average piece will be larger if digestion is partial] instead of complete, and (2) this may enhance the transformation step. 2.3.4 Research issues The approach to debugging illustrated above relies on the use (8) Although this rule was cited in this form in [12], some exceptions to it are generally recognized. In the first place, there are inherent resolution limits which prevent some changes in restriction patterns from being observable in electrophoresis. Secondly, sometimes restriction sites can be covered by trace proteins. 32 2.3 October 27, 1977 of checkpoints in an experiment. The judicious selection of checkpoints is an important aspect of experiments and illustrates that some of the interplay between experiment designing and debugging can be anticipated. The checkpoint above indicated that no further steps should be taken in the experiment until the check (transforming capability) was passed. Experiment design involves the use of other kinds of checkpoints -- e.g. "controls" which serve as checks on experimental artifacts. Selection and placement of checkpoints in experiments depends on many things -- e.g. the time and expense and sensitivity of the check. We anticipate that the development of the debugging system will lead to a number of additions to the experiment design system for handling the selection and placement of checkpoints. The example and discussion above have illustrated some major research issues involving the generation and testing of potential bugs. One issue 1s how should the list of possible "bugs" be pruned when it is too large to test experimentally. Several sources of bugs have been illustrated above. A means must be found for focusing attention, i.e. determining which are the most plausible sources of error. Representation of the debugging process will probably involve considerable expansion of the knowledge base. In addition, the models for debugging seem to involve use of more detailed knowledge than is available for experiment planning -- implying some extensions to our modeling of genetic objects and processes as well. . Finally, one research issue lies in the source and organization of the models, such as those for digestion and transformation, illustrated above. Initially, these will be hand-crafted entities whose contents will be dictated by the range of experiment bugs encountered. Eventually, however, we hope to be able to exploit the commonalities arising in several models, to develop a more basic encoding of the information. For example, both digestion and ligation could be characterized as instances of “chemical reactions". Stored with this concept would be an indication that "degree of completeness" was a relevant parameter in describing its outcome. This eliminates the necessity of encoding such information redundantly, and provides a more intuitively appealing organization. 2.4 Reasoning by Analogy At the end of the current grant period, MOLGEN will bea working planning system. The knowledge base will include a library of process schema expressing problem solving technigues and domain operations, and a library of skeletal plans expressing general solutions to particular problems. The planning system will have techniques for hierarchical problem solving. This system provides an excellent environment for exploring various types of analogical reasoning. 33 2.4 October 27, 1977 There is a long intellectual tradition in philosophy of use and analysis of the concept of analogy. The tradition begins as early as Plato and Aristotle, In fact, some of the most famous and memorable passages in Plato's Dialogues hinge upon a brilliant and imaginative use of analogies. We have in mind especially the famous image of man in the caves able to observe only shadow as an analogy to the problem of perceiving the true character of eternal forms. During the long and technical phase of philosophy identified with the Middle Ages, the concept of analogy achieved a central importance as part of the extensive discussions: of analyzing and knowing the properties of God. The conceptual difficulty was the recognition that it is not necessarily cognitively possible for human beings, themselves, to know the unbounded powers, knowledge, etc., of the deity. Consequently there developed a long tradition of attempting to argue by analogy from known properties of humans to what should be the properties of an omniscient and omnipotent deity. This conceptual literature on analogy is still a vigorous one, and publications can be found as recent as the last decade. Another stream of thought that has given a good deal of attention to the concept of analogy is the philosophical analysis of the problem of induction. A distinguished and diverse set of philosophers, including Hume, Kant, and John Stuart Mill, have all had important things to say about reasoning by analogy. The development of an analogy requires the specification of the analogous concepts and a description of the manner in which the concepts are analogous. Once enough of a description is given to establish the analogy, the description is extended to derive new attributes of one of the concepts. Unfortunately, philosophers and psychologists have not given precise descriptions of the process of analogical reasoning. Precision is necessary in order to use analogical reasoning as a component of a computer problem solving system. Polya began a new approach to analogy in his book, Induction and Analogy in Mathematics. His examples and informal discussions have served as a useful starting place for recent computer science investigations. The computer science work of Evans [13], Kling [23], and Brown [4] are major contributions to the effort of defining analogy precisely . . The aspects of analogical reasoning we intend to explore within the context of MOLGEN include: (1) Given a new problem specification and a set of problem specifications how can a problem analogous to the new problem be selected from the set. (2) Given a problem specification and an analogous 34 2.4 October 27, 1977 problem specification for which there is a known solution, how can the analogy be used to solve the new problem, (3) What representations of problems and solutions ere facilitative for analogical reasoning (that is, for solving problems 1 and 2)? (4) Problems 1, 2, and 3 viewed from the planning environment: given a planning situation and a library of planning processes, find and instantiate an analogous schema from the library to build a planning process applicable to the given situation. (5) Integrate problem solving by analogy into the planning system as one of the possible tools for solving subproblems. We will briefly discuss each of these aspects of analogical reasoning. Our purpose in this research is two-fold. Foremost, we want to analyze some of the current approaches to analogical reasoning in the context of a major problem solving system. Secondly, we want to add an analogical reasoning component to the production version of MOLGEN, hopefully extending the problem solving ability of the system. 2.4.1 Forming Analogies Finding a problem analogous to a given problem could require an explosively large search. A purely syntactic approach to analogy formation which attempts to map the objects, concepts, functions and relations of one problem specification into another only reduces the search significantly if the problems are identical up to a _ set of substitutions. Even then, there may be a large number of mappings to test in order to discover the analogy. For this reason, we believe a knowledge of the semantics of the domain is essential to the establishment of analogies and their subsequent use in problem solving. This implies identifying the functional relationship of the concepts of the problem specification to the problem as a whole. These functional relationships must be preserved by any analogy mapping created. Brown {4] gives a systematic method for creating an analogy map based on syntax including the ability to map meary relations to mt+k-ary relations for small k. A beginning attempt at identifying functional relationships has been made for genetic features in the context of binary discrimination problem specifications in Section 1.3.4. In this case, the implied functional relationship is that the identified feature is the only key concept in the problem. The analogy is either an identity or a generalization map on the feature. 35 2.4 October 27, 1977 This work will involve the creation of techniques for recognizing functional relationships and using them to restrict the search for an analogy map. It should be noted that this process of discovering an analogous problem does not require any specific technique for using the analogy to solve the original problem. However, a measure of how "good" an analogy is will be necessary. 2.4.2 Using Analogy to Solve Problems Once two problems are claimed to be analogous, the way we use this information to solve the original problem depends on the nature of the analogy and the representation used to store the solution to the known problem. If the analogy map is an isomorphism such that the two problem solutions are identical up to a set of substitutions, then we can apply the analogy map to the solution of the stored problem to obtain a solution to the original problem. The map must be extended to include any concepts, functions, relations occuring in the solutions that did not occur in the problem description. More commonly, we will not obtain a correct solution by simply applying an extension of the analogy map to the stored solution. For example, the analogy may indicate that the problems have identical general plans, that they require the same change of representation or the same type of reasoning (see [23] for a general discussion of types of analogies). In all of these cases, the result of applying and extending the analogy map to the stored solution may not give a correct solution to the new problem. In the context of a general problem solving system there are several approaches to explore. The first approach uses the analogy to constrain the general problem solver. That is, instead of creating a solution by applying the analogy map, the analogy is used to limit the choice of representations and transformations which the problem solver can use. Another approach would set up new subproblems to be solved as the analogy is extended to the stored solution. If the subproblems can be solved, the result of applying the analogy map to the stored solution along with solutions to the subproblems gives an accurate solution to the original problem. A third approach is one suggested by Manna [9]. The analogy map is used -to transform the stored solution into the incorrect new solution. Now the process is one of modifying and debugging the solution to satisfy the problem specification. (See [37] for a discussion of debugging techniques.) Research problems include: Classification of the types of analogies for which each approach is best: A strategy for determining when to attempt the solution of a subproblem through analogy rather than the other problem solving techniques of the system; A comparative evaluation within the framework of a single system of the interactions between representations and the approaches listed above for obtaining correct solutions from the analogy. The next section describes several possible representations, 36 2.4 . October 27, 1977 2.4.3 Representation of Solutions Several different representations have been used in research on program synthesis by analogy including: program schemata [9]; concrete programs with input/output specification with/without a list of the transformations which produced the concrete program (unpublished report by R.Moll, University of Massachusetts and J. Ulrich, U. of New Mexico); concrete programs with associated intentional plans and justifications [4]. In this section we briefly indicate that the MOLGEN knowledge can be modified to represent solutions in similar manners. 2.4.3.1 Program Schemata 'We have already indicated the relationship between program schemata and process schemata in Section 1.3.3.1. The program schema is a generalized version of the solution to the programming problem. We can also create generalized versions of the solutions to experiment design problems. , The MOLGEN knowledge base has many plans which are almost schemata, namely the skeletal plans of Section 1.3.4. and the right hand sides of refinement rules. Also, Section 2.2 proposes techniques for the identification of new skeletal plans and their addition to the knowledge base. These plans can be extended to become program schema. We first will associate with each plan an input/output problem specification. We generate intermediate world state descriptions by running the plan on the input world state. A generalization process, starting with the I/O specifications and proceding to include each world state in the plan will create world state descriptions in terms of abstrect symbols with restrictions on the substitutions allowed for the abstract symbols. This process will generate preconditions which restrict the refinement and specification of the generalized transformations specified by the original plan. Representing stored solutions as program schemata could have two advantages. First, the myriad of details present in a primitive description of an experiment procedure are suppressed. Our intuition is that these details are dependent on the specific transformations used in the procedure and thus would interfere with the creation of the analogous solution. Second, the analogous solution could be verified at the abstract level of the program schema. The general problem solving system could refine the schema using the restrictions generated in forming the analogy. Thus analogy would be used to guide the problem solver quickly to a workable plan without having to consider details, 37 2.4 October 27, 1977 2.4.3.2 Solutions as Concrete Programs There are many systems which create analogies from a concrete program with I/O specifications. We are not building a library of such experiment procedures in the present grant period. However such a library will be created in the renewal period for studying analogical reasoning. After completion of a planning run, the user can specify that the experiment procedure created is to be saved. Saving the planning transformations which created the solution is simple since the system keeps such a record in the planning network. ‘Thus both representations can easily be used, . As explained above, it is our intuition that the detail of the primitive procedures (concrete programs) will interfere with the analogy formation. However, this intuition will be tested. A different use of the analogy is intended when the planning transformations are stored with the solution. Now to create the analogous solution, one applies the transformations (modified by the analogy map where necessary) to the original problem specification. Again, the resulting solution may not be correct and thus may need a modification/debug cycle. 2.4.3.3 Programs/Intentional Plans/Justifications Brown [4] has suggested that solutions (programs) should have three components: the ccde, an intentional plan, and a justification, There are transformations which create code from plan, justification from plan and so on. Each component is important in his analogy system. We will develop a verification, or justification, language for experiment procedures. At the present time MOLGEN does not include the use of intentional information in the description of a rule, However, the names of the generalized transformations have been serving this purpose. We suggest the intentional information be made an explicit part of the transformation's representation. This is particularly important in planning processes. It is not always possible to deduce the process intention from the process itself. For example, a planning process might check for two conditions which the process creator knew were indicative of a certain situation. If the situation itself were never mentioned in the process, the intention of the process creator could not always be deduced from the process, The experiment procedure representing the final product of the planning process could be annotated to indicate the planning program's intentions in creating the steps of the plan. lf one considers the hierarchy of abstract transformations created during the planning process as a net indicating various levels of generalized transformations, then each level of the net specifies an intentional plan for the following level. Initially we will concentrate on creating intentional plans for primitive and 38 2.4 October 27, 1977 generalized transformations and program schemata. Then we will extend the concept to include all process units. 2.4.4 Creating Context Dependent Planning Processes So far, we have been limiting the analogical reasonirg to finding and using analogies between problem specifications. However, many of the same ideas can be applied to the design of a component which examines the planning state and the current goals, selects a planning process schema whose input/output specifications are analogous to the present planning context and creates an instantiation of the schema, The major contribution of this work would be the identification of the functional relationships among the planning concepts represented in the planning state. In fact, this search for analogous planning contexts turns one of the problem solving techniques back on the problem solving process itself and allows the system to modify its process knowledge base dynamically. We view this asa difficult problem which awaits the further specification of the planning state representation for more detailed approaches, 2.4.5 Integration of Analogical Reasoning and MOLGEN We have indicated above that if our analysis of diverse methods of analogical reasoning shows one to be advantageous in the MOLGEN context, we would add this technique as a component of the production system. This involves the creation of process units describing application criteria for the analogy package. The analogy packages themselves will be designed within the framework of the current MOLGEN representation paradigm and thus will be compatible with the rest of the system. As with other general AI problem solving techniques, the iain difficulty will be in accurately specifying the appropriate contexts for application. There will be some adjustment if the successful representation method is a variant of one currently in use. However, the extensions we have mentioned are additions to the information represented rather than major modifications of the representation itself. 2.5 | Performance Evaluation and Improvement of AI Knowledge- Based Systems Overview: Performance Evaluation as an AI Problem Solving Task The objective of this part of the proposed research is to investigate methods of automating the process of measuring, evaluating, and improving MOLGEN's performance. 39 2.5 . October 27, 1977 The work is motivated by the belief that, for AI systems that attempt to solve real world problems effectively, it is just as important to have a representation of how knowledge is used during problem solving, as it is to have representations of how that knowledge is organized. For example, inefficiencies in testing rules, excessive backtracking, too frequent or infrequent access of specific items in the knowledge base, adequate but consistently suboptimal answers to problems, may all be symptoms of a mismatch between the user's conception of the domain and the structure of the domain required for the efficient solution of problems under consideration. Performance evaluation tools can be used to detect these performance problems and, used properly, can give the user a clearer understanding of the domain. Guided by its measurements of performance, the system may be able to agsist the user in proposing and testing changes to reflect this understanding. A sophisticated system would use its own performance statistics to propose changes that would better reflect current usage. The work is also motivated by the conviction that performance knowledge is essential for conveying a program's scope and limitations to a user, who can then use the program more intelligently. Thus, a second motivation is to increase.the user's knowledge of the system's capabilities. Experience with Rhowledge based systems has shown that one of the biggest investments of time and effort made by a starting user is in finding out the scope and limitations of the program. Buchanan and Smith [5] has an excellent discussion of this problem in the use of the CONGEN program [6]. CONGEN is a sophisticated generator of chemical structures under user-provided constraints. Users of CONGEN have found that manual assistance is insufficient to acquaint them with the sorts of constraints which the program can accept. In trying to make the system easier for the expert to use, it is also important where possible to give the user some idea of how much of the problem has been solved, how much remains to be done, and where the system has been spending most of its effort. CONGEN and SECS [40] let the user see current problem status by means of special user commands or dynamic displays of the proplem solving graph. While not all of these measurements will be meaningful for every type of problem solving, giving the user at minimum a description of what the system is working most on will greatly increase understanding of how it attacks problems of different kinds. Thus, abasic goal of incorporating performance evaluation tools into ATI problem solvers should be to provide descriptions of the distribution of system effort -- what information was accessed, what rules were invoked. 2.5.1 Creating a Knowledge Base for Performance Diagnosis and Correction We take the point of view in this research that evaluating a system's performance and suggesting improvements is itself an AI 40 2.5 October 27, 1977 problem solving task. The diagnosis of system performance problems can be viewed aS a problem of hypothesis formation and verification. Elementary actions of the system can be viewed as "Signals" that must be interpreted and summarized to form patterns of behavior, or system "symptoms", Correction will be based on user definitions of what changes should be made in the system following the detection of specific symptoms, An important part of the research will be to formulate knowledge about measurement, evaluation, and improvement in terms of rules, The rules in this "performance evaluation" knowledge base will perform a variety of functions. They will: 1, interpret elementary system events as hypotheses about system behavior. 2. initiate and focus additional measurements on the basis of proposed hypotheses about behavior. 3. suggest modifications in the organization of the system's knowledge base or in its operators and strategies as a consequence of aé_ sufficiently confirmed diagnosis. A sophisticated system should help the user define different Sets of these rules for different system performance goals. That is, the user should be able to specify what behavior to measure and what modifications to make depending upon the importance which he assigns to varicus criteria of system performance. ‘These criteria include efficiency, quality of answer, plausibility of processing sequence, and clarity of the knowledge base. As in medicine, there would be no explicit definition of "normal" system behavior. Instead, the user's assignment of performance priorities would effectively define “healthy” system behavior, by designating the circumstances under which modifications should be made (in other words, by defining “unhealthy” behavior). The effects of a performance evaluation "meta system" acting upon a problem solving "object system" should be visible in several ways. The object system-should work more efficiently, and on a wider range of problems, getting improved answers. The user should get feedback about how the system works, where it spends most of its effort, and how each new rule or object affects system performance. The guidance by the system of acquisition of new rules and objects from the user should reflect its experience with how existing rules and objects affect performance. Perhaps most important, the user should become better informed about the connections among various sources of knowledge in his own domain, at least for the set of problems confronted by the problem solving system. ~ Al 2.5 October 27, 1977 This investigation will primarily be concerned with how well the knowledge base of an AI problem solving system is organized and used. The emphasis will be on such factors as accessing and grouping information about domain objects, and retrieving, testing, and invoking domain rules, There are, of course, other important determinants of system performance, such as the selection of the best internal representation for a data structure, or the selection of the most efficient method of searching a data file. However, the main concern here is to assist users in organizing domain knowledge for effective problem solving in conjunction with the AI system. Therefore, we will concentrate on the aspects of the system that might be modifiable directly by the domain expert. Proposed Research Steps The specific steps of the research will be: 1. Identify the elementary events of rule based problem solving systems which may affect performance. 2. Establish simple measurement techniques for detecting these events. 3. Develop interpretations of these events as higher level hypotheses about system behavior. 4, Formulate these interpretations as rules which map events into descriptions of behavior. 5. Formulate inverse mappings: rules that initiate additional measurements on the basis of behavior hypotheses suggested by existing data. Establish measurement methods as needed. 6. Identify what changes might be made to improve various kinds of performance. 7. Investigate how to use top level descriptions of behavior to suggest system modifications, and formulate these as modification rules. It is hoped that these steps will have visible positive effects on system performance and user knowledgability. In addition, it is hoped that this research opens the way to: (1) expressing desired system performance in the form of sets of modification rules, and (2) using past evaluations to estimate the impact of new knowledge and to guide the acquisition process. Performance Criteria 42 2.5 . October 27, 1977 The whole point of measucing what the system is doing is to improve it with respect to established performance criteria. A major aim of the research is to make it easy for a user to express different criteria and as a result to get different sets of responses to object system symotoms. Here are some examples of performance criteria and how they might differ in the changes they recommend. 1. Get an adequate answer with a minimum of effort, The most important factor is efficiency. 2. Get the best answer and don't overlook any possibilities. This is a conservative system. 3. Keep the knowledge base as simple as possible. One motivation would be the ease of understanding how the system gets its answers. 4, Follow a sensible line of processing. Perhaps it is most important that the solution method resemble the designer's idea of how a human solves the problem. Each of these criteria dominates the Gesign of at least one major AI problem solving program. The eventual aim of this line of research should be to understand the behavior of AI problem solvers well enough so that, with system help, a user could specify precise performance criteria which the system would translate automatically into a set of measurement and correction rules. Short of this, it will still be helpful to give the user part of these capabilities. The user should be able to write correction rules, he should be given feedback on their effects, and he should be able to designate alternative sets of correction rules to correspond to different performance goals. However much of the process of creating therapies is automated, the practical product of establishing criteria will be a set of rules giving correction recommendations for detected symptoms. Two brief examples will illustrate the differences in recommended changes when different criteria are in force: 1. An operator fails to achieve its desired effect fairly often, The choice is whether to add a check for the upsetting condition or to rely on backtracking. If the performance criterion is "“etficiency”, the recommendation might be to add the check. If the criterion is "simplest adequate knowledge base", where "simplest" includes fewest operator preconditions, the recommendation might be to leave the operator as is. 2. The criterion of "sensible order of processing" might emphasize pursuing a Single search path as long as its 43 2.5 October 27, 1977 heuristic rating is above some minimum. By contrast, a “best answer" criterion might demand that the most promising node be expanded at each step. Operator selection strategies would surely be modified differently to meet these two criteria. 3 Significance 3.1 Significance to Computer Science The proposed research brings together many themes of recent artificial intelligence work. The task area of molecular genetics is richer and more complex than other tasks in which these themes were originally developed. Therefore, we view MOLGEN as research on extensions of currently known methods as well as on integration and application of those methods. In the Introduction we listed several of the specific issues we believe are critical for developing reasening programs that aid scientists. We mention only in passing the fundamental issues of representing, managing, and acquiring knowledge for a reasoning program because we take these to be inescapable in AI research. While we have no radical discoveries in these areas, we have tried to state clearly in the proposal how we approach these “knowledge engineering" issues. 3.1.1 Reasoning by Abstraction The most highly developed program to exploit severa] levels of abstraction in its reasoning [32] works in simple domains with few facts and relations. This pioneering work will be refined and extended by the work on MOLGEN so that reasoning in complex domains can be guided by scientific knowledge at many levels, The basic knowledge with which MCLGEN solves problems has been organized from the start in a hierarchy that reflects successively more detailed descriptions of objects and operators. The proposed work on diagnosing failure in designs for experiments operates on the premise that one major cause of failure is using general knowledge when more detailed specifications are required. On the other hand, the programs that plan the experiments gain their power from omitting details. We also propose to use the abstraction hierarchy to improve the knowledge base in light of experience. Again, the general descriptions 44 3.1 October 27, 1977 of successful experiments will be the ones MOLGEN will be able to apply to new problems, and the ones of most interest to researchers. 3,1.2 Strategy Knowledge As the knowledge base of facts and inference rules increases in a program, it is necessary to find efficient means of reducing the amount of computation at every reasoning step. The most sophisticated AI technique for controlling a reasoning program is to encode and use strategy knowledge to guide the program into the most useful facts and rules, without considering the others. The Teiresias system [7], developed at Stanford, demonstrates the power of encoding strategy knowledae in rules and using it to guide the invocation of domain specific knowledge. In MOLGEN, the need for strategies to guide the processes of designing experiments and diagnosing failures will be acute due to the amount of potentially relevant information at each step. 3.1.3 Integration of Diverse Sources of Knowledge Reasoning about complex preblems requires integrating information from more than one source. Problem solving is not neatly compartmentalized into independent packages of relevant material. © the contrary, expert problem solvers know how to use information from many sources about many different aspects of the problem. The HEARSAY programs for speech understanding are among the best known examples of bringing multiple "experts" into a common problem solving process [22]. There, each expert contributes what it can to a current best hypothesis with little communication among the experts themselves. In nearly every aspect of the MOLGEN program, there are several ways of viewing problems, each with its sources of information and problem solving procedures. We are therefore looking for general mechanisms that enable different experts to cooperate on problems of various kinds. For example, both the experiment planning and the debugging systems have to call on experts with knowledge of different instruments and experimental procedures. The experts may not have a common vocabulary, yet they must be able to contribute to the problem solving at different levels of abstraction and about different parts of the problem, 45 3.1 a October 27, 1977 3.1.4 Interaction between Search and Simulation Heuristic search requires strong evaluation functions to judge the plausibility of branches leading to complete hypotheses, However, in molecular biology many of the answers about plausibility of alternatives are not known a priori, but can only be known by experiment. Since laboratory experiments are typically very expensive, MOLGEN includes simulation models that can predict the time-course of an arbitrary experiment and thus can give some measure of the plausibility of that experimental procedure. Several items mentioned above, including planning experiments, debugging and reasoning by analogy, will need to exploit the program's ability to simulate some aspects of experiments it is reasoning about. Complete simulations, of course, are also expensive, so we need to resolve issues about control of the simulation depending on the kinds of answers sought by the heuristic program. 3.1.5 Reformulating Available Methods Each time our research group (9) has built another large AI program, we have learned more about how to do it better and faster next time. For example, the production rule interpreter in Heuristic DENDRAL (for special-purpose rules) became the general rule interpreter of MYCIN. One of the significant products of MOLGEN research will be the sets of ideas and programs for encoding and manipulating large amounts of knowledge about a scientific discipline. We have transferred some parts of the MOLGEN Units package to another project interested in building a knowledge base about AI methods and techniques. Making the tools used here available for use in new programs is an important aspect of our work, end is generally important for cumulation of knowledge in the AI field, In order to do this we must reformulate the methods so they are more generally applicable and more readily combined in diverse ways. 3.2 Significance to the Conduct of Experimental Science and to Science Policy The exposition of a balanced view of the potentials of ATI for practical applications in science faces many hazards. Enthusiasts make unlimited claims whose eventual realization is hard to disprove -- except that it is hard to say how long is ‘eventual'. It would be difficult (though increasingly possible) to justify the investment in (9): The MOLGEN project is part of a larger group known as the Heuristic Programming Project. 3,2 October 27, 1977 particular projects in terms of their utility for the object discipline; we believe, for example, that the DENDRAL project had to be assessed as a pathfinder, rather than for its specific utility to mass spectrometrists working today. The art has progressed to the point where MOLGEN may be expected to be at the margin -- that is to be the agency of concrete discoveries within a decade, advances that will compete in value with those achieved from comparable investments in existing tools in the biochemical laboratory. In suggesting new forms of working assistance, we do not imply that the creative imagination of scientists will be mimicked or displaced by AI programs over a broad domain of fact and insight: certainly not within our own immediate ambitions. However, we have intentionally chosen an experimental field, part of which is characterized by combinatorially elaborate contingency trees, some rigor of inference, and a fairly limited frame of relevant world-knowledge. These are precisely the conditions where programs can be expected to be of some help to human intellect, which thrives on the converse. A reexamination of the process of Science may also be important to bolster and defend basic science at a policy level. The very justification for basic research is under critical, often even hostile scrutiny. Many quarters are asking such questions as "How much of the science progress of the past 30 years can be attributed to advances in knowledge connected with federally-supported research?" “Are our institutional arrangements and patterns of funding really the most appropriate for the most efficient 'transfer of technology' from the basic laboratory to useful applications?" Less often raised by external critics is, "To what extent does the present system support the most fundamental innovations within science itself; or does it inevitably focus overwhelming support on the most obvious, transparent questions and discourage more revolutionary kinds of inguiry?” Many of our colleagues reply to these attacks with well- intentioned promises and manifest good faith. Nevertheless, it is easy to show that many short-term advances have arisen from the most pragmatic kinds of investigation: empirical screening for antibiotics or antidiuretics has undoubtedly generated more life-saving therapeutic products than the most sophisticated molecular biology, up to the present moment, Indeed, salt-water, intelligently administered, has been one of the great life-savers of the recent era! It would be tragic to undermine the enormous long range potential of basic insight without a deeper analysis of the process by which knowledge and insight move from basic science into applied problems; and we just might find some ways to improve the system without wrecking it! It would be premature to claim that computer programs per se will soon be delegated the major responsibility for "systematic identification of relevant knowledge", although they can already play a very helpful role in assisting human intelligence to correlate bibliographic data, and in other ways. However, the very process of implementing an “applied philosophy of science", which is the principal fore-work of developing a domain for the application of knowledge-based 47 3.2 October 27, 1977 AI, is exactly the kind of formal systematization needed for constructive efforts to facilitate technology transfer. Although our substantive efforts are mostly concerned with the "micro-problems" of scientific inference, there may be more important treasures in a macro-perspective on the integration of scientific specialties. Improved systematization of scientific knowledge, should be an important side effect. of these investigations in knowledge- engineering; and this may lead in turn to the recognition of remedial rents in the overall fabric. For example, it is dismaying to reflect on the delay of 35 years, from Beadle and -Tatum's discovery of nutritional mutants in Neurospora, before similar ideas were applied to the biochemistry of human heart disease -- our most serious health problem by far. Is there no way to accelerate the benefits of such fundamental research? We will not get analytically persuasive or policywise sound determinations of such questions without more attention to the underlying process of scientific inquiry than unselfconscious scientists are wont to indulge. These ideological implications of an ‘applied philosophy of science’ are complemented by some of the practical technologies of AI work. Our own research is greatly facilitated by access to the SUMEX- AIM system. This comprises not only a computational facility, but a national community of mutually interested investigators bound by effective computer-data communications. The development of formal representations of experimental science adds to the effectiveness with which the scientific community can enter into informed criticism of other's work, at the level of strategies of discovery and proof as well as in the exchange of laboratory data. 48 3.2 October 27, 1977 4 Budget. Budget to National Science Foundation MOLGEN: A Computer Science Application to Molecular Genetics Professors Edward A. Feigenbaum and Joshua Lederberg, Principal Investigators June 1, 1978 - May 31, 1980 Total Budget 6-1-78/5-31-79 6-1-69/5-51-80 6-1-78/5-31.-80 Salaries and. Wages Senior Personnel idward A. fcigenbaun, Co-Principal Investigator 5% time Academic Year; 1 month Summer FRE: 1.45 months 5,095. 5 ug, 10,542. Joshua Lederberg, Co-Principal Investigator 5% time, Calendar Year; FYE: .45 months iy 6 gb Bruce G. Buchanan, Adjunct Professor . 25% time, Calendar Year; FIN: 3.00 months 8, 68h. 9,291. 17,975. Other Staff fark Stefik: Student Research Assistant 100% time, Summer Quarter; FTE: 3.00 months 2,862. Ph.D. Research Associate 75% time, effective 9-1-78 Year One FTE; 6.75 months 11,250. Year Two FTE: 9.00 months 15, 84h, 29,956. Peter Friedland: Student Research Assistant 100% time, Sumer Quarter; FIE: 3.00 months 2, 862. Ph.D. Research Associate ‘ 100% time, effective 9-1-78 Year One FIE: 9.00 months 15,000. Year Two FTE: 12.00 months 21,125. 48, 987. Student Research Assistant, Computer Science, to be named 50% time, Academic Year 100% time, Sumer Quarter FTE: 7.50 months , 75587. 7, 038. 15,225. Post-Doctoral Scholar, Genetics, to be named 100% time, Calendar Year FTE: 12.00 months 14, 630. 15,508. 30,138. Secretarial Assistance 20% time, Calendar Year FTE; 2.10 months 2,095. 2,2h5. he3h0. Subtotal, Salaries $ 69, 863. $ 77,300. $ 147,165. 49 Staff Benefits 19.0% 9-1-77/8-31-78 20.3% 9-1-78/8-31-79 21.6% 9-1-79/8-51-80 Permanent Equipment Computer Terminal, Datamedia, including modem. Expendable Supplies and Equipment Travel Domestic: professional meetings and trips to collaborate in New Mexico Publications Costs Computer Costs - All services provided by SUMEX computer facility Other Costs Communications (telephones) Subtotal, Direct Costs Indirect Costs, Excluding Permanent Equipment - 58% Total Budget Total Budget 6-1-78/5-31-79 6-1-79/5-31-80 6-1-78/5-31-80 1,000. 700. 720. 89,975- 50,542. $ 140,517. 49a 16, hue. 1,000. 1,000. 800. 900. 97, hho. 56,517. $ 153,959. 50, 39h. 2,835. 1,875. 2,000. 1,500. p i, 650. 187, 417. 107,059. BUDGET NOTES 1. Staff salaries Mark Stefik and Peter Friedland, currently Computer Science Ph.D. thesis students, have been working with the MOLGEN project since its inception. They have invested very heavily in personal time and effort, and the MOLGEN project has invested resources, to bring them to the point of being highly informed "bridge" scientists between Computer Science and Genetics. A high return on this investment, from both the scientific and economic viewpoints, will be obtained by retaining these young scientists for two more years as post-doctoral researchers. Each is expected to complete his Pn.D. thesis in August,1978, and will tnerefore be paid at pre-doctoral Salary rates for the summer of 1978, Thereafter, or upon completion of the Pn.D., they will be paid at "new Ph.D." salaries, i.e. approximately what a new assistant professor would be paid (annualized). Stefik is budgeted at 75% effort under the assumption that other support can be found for the remaining 25% of his time. Thus, he will be on the project 100%, but paid only 75% from this budget. 2. Permanent equipment We are requesting one more terminal of the "standard" type that is used for 1200 baud access to SUMEX, i.e. the Datamedia. MOLGEN is a compute-intensive project. In the continuation period, we anticipate that the chief bottleneck to effective computer use will not be computer access or cycles but terminal access. The SUMEX facility does not supply terminals to individual projects, but expects each project to supply its own needs. MOLGEN has two terminals now, but one more will be needed as the project expands in effort, scope, and Size. This need is modest considering the fact that all other computer support is supplied to the project at no charge. 3. Phone charges These include not only charges for ordinary project business (small part) but also charges for phone line communication to the computer (large part). 4, Travel Funds for domestic travel are needed to support travel to occasional professional conferences in Computer Science, particularly the annual conference of the SUMEX-AIM community (the conference supports some of this; individual projects pick up the remainder); and to support travel to collaborate with Professor Martin and her group at the University of New Mexico. 49b October 27, 1977 5 Resources Professors Edward Feigenbaum, Bruce Buchanan and Nancy Martin (New Mexico) will direct the computer science research of the MOLGEN continuation project. The principal project staff members at Stanford will be Peter Friedland and Mark Stefik. Both will complete their Ph.D. theses on MOLGEN early in the continuation period, and both have agreed to stay on for the MOLGEN continuation. An additional computer science student will be taken on, hopefully leading to another MOLGEN project Ph.D. thesis. Molecular genetics knowledge, expertise, insights, techniques, and experimental heuristics will be provided by the researchers in Professor Joshua Lederberg's laboratory at Stanford, including graduate student Jerry Feitelson and a post-doctoral research fellow to replace Dr. S. D. Ehrlich. Professor Lederberg himself will provide substantial amounts of time on a regular basis for directing the project from the genetics viewpoint. Offices for the MOLGEN project will be provided within the Stanford Heuristic Programming Project so as to foster interaction and exchange of ideas with workers on similar projects. Active projects within the Heuristic Programming Project include: 1) DENDRAL, a knowledge-based system for the analysis of organic compounds from spectrometric data. 2) META-DENDRAL, a system for inducing rules of mass spectrometry and n.m.rc. Spectroscopy from instrument data. 3) MYCIN, a system for the diagnosis and treatment of infectious disease. Approximately thirty workers including faculty, research associates, and graduate students are involved among the projects. All of these projects are active in the design of intelligent systems for specific domains of science and medicine providing sources of problems and insight concerning complex reasoning processes, There has been considerable synergy among the various projects. 4) PUFF, a MYCIN-like system for diagnosis of pulmonary function disorders. 5) CRYSALIS, a system for inferring the structure of proteins from electron density maps derived from x-ray crystallographic data. 6) VM, a heuristic process control system for assisting physicians in the management of a breathing-assistance machine in the intensive care units of hospitals. 50 October 27, 1977 7) AGE, an attempt to build: a software package of AI techniques and methods for problem solving and hypothesis formation; and its associated user-interface. The superb computing facilities of the NIH-supported SUMEX-AIM timesharing facility will be available at no charge to this project. The SUMEX-AIM facility, with Prof. Lederberg as principal investigator, is a national resource for the application of artificial intelligence techniques to problems in biology and medicine. Resources to be provided will include all CPU-time and storage required. Those involved at Stanford will be operating through hard-wired or dial-up equipment to the SUMEX PDP-10, while those at the University of New Mexico will access the system through either the ARPA network or TYMNET. The SUMEX-AIM facility is a powerful interactive computing system open to a national community. Interlisp and other high level languages are available and supported by a large system staff. Many convenient text editors for developing programs are provided. The TENEX operating system supports flexible file handling and sophisticated storage management for a highly interactive computing environment. October 27, 1977 Appendix I GLOSSARY AGE Stanford project ("Attempt to GEneralize") to build a general package of AI methods. AI Artificial Intelligence AI "toolbox" A set of AI concepts embodied in programs that are general enough to be used to construct problem solving programs in many different domains. alkaline phosphatase An enzyme that removes terminal phosphates from nucleic acids, and whose optimum pH is alkaline. AT-rich DNA DNA containing a high proportion of Adenine + Thymine base pairs. Because of the Watson-Crick base pairing rules, A/T=1 and G/c=l, but (A+T) /(G4C) ratios can can vary widely between DNA of different species and even different regions of the same DNA molecule. B. subtilis A common soil bacterium often used in genetic experiments. bacteriophage Viruses that multiply in bacteria. base sequence A string of nucleotides in a nucleic acid. bottom-up process A program that puts together inferences from data without the benefit of global expectations and goals. CONGEN Constrained Generator of molecular structures for DENDRAL, dalton A unit of mass equal to that of a single hydrogen atom. data-driven procedure bottom-up process demon a procedure in a_ program that is triggered by an event, aS opposed to being executed in the "normal" execution of a sequential program. 52 October 27, 1977 denaturation The loss of the native configuration of a macromolecule resulting, for example, from heat treatment, extreme pH changes, chemical treatment, or other denaturing agents. It is usually accompanied by the separation of strands (in DNA) and the loss of biological activity. DENDRAL Heuristic program for generating and testing organic molecular structures/as candidate explanations of empirical data. digestion With reference to enzymes, implies the cleaving of chemical bonds in the target molecule. For example, exonucleases "erode" or remove terminal nucleotides, restriction enzymes cut at the internal recognition sequences. dimer A concatenated DNA structure consisting of two identical constituents. discrimination experiment A series of experimental steps designed to conclude whether structures are identical or not. DNA Deoxyribonucleic acid. A polymer of deoxyribonucleotides (see nucleotide definition). Can exist as double or single strands. The genetic material of all cells and the central molecule in molecular genetics. domain-specific critic A procedure which applies specific EcoRI EDNA electron genetics knowledge to problem solving, as opposed to general problem-solving knowledge. A restriction enzyme isolated from a strain of Ee coli that cleaves DNA at site-specific regions along the molecule. Its recognition site is 5'- GAATTC-3', The DNA-structure-editor for MOLGEN. microscopy Abbreviated EM. A high-resolution technique for visualizing material that uses beams of electrons instead of light rays. Resolutions of about 10°-7 cm are posible with biological materials, electrophoresis An experimental technique used to separate, purify, and measure the molecular weight 53 October 27, 1977 of molecules having an electric charge in solution. endonuclease An enzyme that cuts DNA backbone chains internally. enzyme Protein molecule capable of catalyzing a specific chemical reaction. E. coli A common intestinal bacterium: the most intensively studied organism except for man. event-driven procedure demon exonuclease An enzyme that digests DNA from the ends of strands. experiment planning An activity characterized by the production of a sequence of experimental steps to achieve a goal. focus rules Focus of attention procedures. Items of knowledge that guide a program to the most relevant parts of the problem or the most useful subroutines. gaps An internal feature of double-stranded DNA which is aregion of unpaired nucleotides due to the excision of a string on one strand. HEARSAY AI program written at Carnegie-Mellon University to understand spoken English. Integrates inferences made by multiple experts. hierarchical planning AI techniques refined by Sacerdoti which uses a hierarchy of descriptions to plan an efficient problem solution procedure, inspector Domain-specific critic INTERLISP A powerful extension of the LISP programming language. KRL-0 A programming language (knowledge representation language) developed at Stanford and Xerox, Palo Alto Research Center. ligase ’ An enzyme capable of covalently joining parts of, or entire DNA molecules together. ligation The enzymatic joining together of DNA molecules. 54 linear DNA meta~rules October 27, 1977 Double stranded DNA that is not covalently closed at its termini. Rules for a program that mentions domain-specific rules, i.e, to prune or reorder the set of rules relevant for problem solving in specific contexts. molecular adapter A chemically synthesized segment of DNA MOLGEN monomer MYCIN nicks nuclease nucleotide pH phosphodiester that is utilized to join together DNA molecules which do not have complementary termini for ligation. Computer program for reasoning in molecular genetics. Main subject of this proposal. A single DNA molecule (or nucleotide) that has not undergone polymerization (viz. a unit character capable of assembly into a string). Medical diagnosis and therapy recommendation program developed at Stanford. A local interruption in the phosphodiester backbone of DNA. No genetic information is missing due to this structural anomaly. An enzyme which breaks chemical bonds in the DNA phosphodiester backbone. Consists of endonucleases and exonucleases. The building blocks of DNA consisting of a purine (Adenine or Guanine) or a pyrimidine (Thymine or Cytosine) linked to a deoxyribose sugar with a phosphate group also linked to adjacent sugar. Adjacent nucleotides are linked together through a phosphate group and a hydroxyl group on the sugar component (see phosphodiester). The negative logarithm of the effective hydrogen ion concentration or hydrogen ion activity in gram equivalents per liter. Used in expressing both acidity and alkalinity on a scale whose values run from 0 to 14; 7 representing neutrality, less than 7 increasing acidity, >7 increasing alkalinity. DNA exists in native form between pH values of 5 and 12, . The chemical link between adjacent nucleotides. The following diagram of its structure was drawn using CONGEN [6]: plan schema October 27, 1977 . / . Oo oO \ \/ 0 O P \/ /\\ P o 8600 //\ / Oo O- CH2 \ c-C / | CH \ Cc-C / BASE A sketch of a procedure describing the plan for an experiment in abstract, general terms. planning islands Partial solutions toa problem found bya planning rules plasmid poly-A region poly-c poly-G poly-T planning program. "Stepping stones" to a complete solution. Procedures or items of knowledge that aida program in constructing a problem solving plan. Extrachromosomal DNA molecules which are double stranded, circular, and supercoiled. They range in size from about 5*10°6 daltons to near 10°78. Small plasmids can exist in many (more than 50) copies per cell while large ones are maintained at one or two. They are often used as vectors for amplifying and transferring DNA from one organism to another. (sequence) A homopolymeric sequence of adenine nucleotides. Implies a poly-T region on the complementary strand. Homopolymeric cytidine nucleotides. Homopolymeric guanine mucleotides. Homopolymeric thymine nucleoties. 56 October 27, 1977 polymerase . Enzymes that are catalysts for nucleic acid chain growth, pre-conditions Premise clauses of conditional sentences that must be satisfied before the consequent actions are taken. production rules Conditional sentences used to encode inferential knowledge for a program. prototype . The type of unit created for representing information about general concepts. Features are defined by slots associated wih the prototype. restriction enzyme Site-specific endonucleases used frequently in molecular genetic manipulations. Allow previously impossible experiments to be performed due to their ability to cleave DNA at reproducible locations allowing rearrangements within and between molecules. RNA Ribonucleic acid. Typically single stranded, is a polymer of ribonucleotides connected by phosphodiester bonds. schema/rule schema/program schema An abstract, generalized representation of a concept or program. In MOLGEN, program schemata (or rule schemata) are represented as Units with slots defined for important features. SECS Chemical synthesis plarning program developed by . Prof, Todd Wipke (U.C. Santa Cruz). self-circularization Ligation of the ends of the same DNA resulting in a circular, covalently closed molecule. self-ligation Ligation of a DNA molecule to itself, resulting in in a circular molecule. Catalyzed by ligase. sequencing experiment A technique to determine the order of nucleotides in a strand of. DNA, slots Pre-defined features of objects for which values are sought. SMALLTALK Display~oriented programming language developed at Xerox, Palo Alto Research Center. | 57 sticky ends STRIPS SUMEX-AIM Teiresias TENE Thy gene October 27, 1977 A condition of partial single-strandedness at the termini of DNA molecules, allowing base pairing in that region. Restriction enzymes often leave sticky ends, greatly facilitating the rearrangements of DNA, Robot plnning program developed at SRI. NIH-sponsored computer resource for applications of artificial intelligence in medicine. AI program that acquires inference rules for MYCIN and guides MYCIN reasoning. Operating system for the DEC KI-10 system running at the SUMEX-AIM facility. A gene coding for the enzyme, thymidylate synthetase. This enzyme is crucial in enabling a bacterium possessing it to produce thymidine, a constituent of DNA, top-down process A program that works from general principles, Units vector world states 3'~end/ 5'-end testing data against expectations and goals, often working by dividing complex problems into simpler ones. Basic element of representation in MOLGEN. Units are organized in a hierarchy to facilitate the representation of class-subclass and prototype- instance relationships. Units are used for representing processes as well as concepts. A self-propagating DNA molecule that can be used to link DNA sequences of interest. Vectors can be one of several replicating plasmid or bacteriophage DNAs. Representation of the state of an experiment at any given time. The “world" for the program is the limited set of objects and operations relevant to a specific experiment. Related to the direction of the phosphodiester bonds in the backbone of DNA molecules. Each Strand thus has one 3' end and one 5' end. 58 October 27, 1977 Appendix IT EDNA -- The Editor for DNA The following is an actual session with the MOLGEN knowledge acquisition system recorded from SUMEX. Comments preceded by a semicolon have been inserted to clarify some aspects of the dialog. @ue :UE is the name of the Unit ;Editor. Here it is being called (Version 4-OCT-77 08:56:16) sfrom TENEX Welcome to the MOLGEN Unit Editor. Type ? anytime for assistance. The symbol : indicates that the editor is waiting for your input. Two characters are enough for command recognition. You may type ahead responses for a command. Name of Network: jerry ;Jerry is the name of an yexisting Knowledge Base on :file. :create testl root specialization 7A new unit TESTI is created. Give a value for the DESCR slot ;UE asks for documentation, Text Editor te: Test unit to demonstrate the DNA structure editor te: done :User indicates he is done swith documentation. Do you want to see what slots have already been filled? yes DESCR: (U) from ROOT Test unit to demonstrate the DNA structure editor MODIFIER: (U) from ROOT STEFIK CREATOR: (U) from ROOT STEFIK MODIFIED: (U) from ROOT 6-OCT-77 10:22:03 CREATED: (U) from ROOT 6-OCT-77 10:22:03 *Note that the system has rautomatically recorded the zauthor, date and time of the snew unit. You can now create new slots or edit old ones. When through type DONE EDIT: create substrate Datatype: dna ;DNA is a datatype. Role: r :"Role” controls transmission sof the value in the ssubstrate slot if we make 59 Is it a dynamic slot? n DNA Editor Copy or Create anew? create Segment Type: ? October 27, 1977 pspecializations of TESTI. ;Since the datatype is DNA, ;we get the DNA editor. 7A "2" may always be typed to tell the system to clarify swhat it expects for a ;response. Choose one of the following Segment Types. Type Description LE Length Segment. Indicates number of nucleotides in a region. BA Base Segment. Indicates actual Base Sequence. SI Cut Site for enzyme. Segment Type: length Length: ? sAnother "?"™ Indicate number of nucleotides as in the following examples: You Type Meaning 5 5 nucleotides 100 200 1K 1.3K R 1K 1.3K Length: 2.5k edna: print DNA Printer (Version 19-SEP-77 ) 1L (2500) edna: insert 1 3' bases attacg edna: print 1 234567 (2500) AT TACG edna: mirror 4 to 7 edna: print 1 234 5 (2500) ATT A AN Q~ 60 Between 100 and 200 nucleotides. Between 1000 and 1300 nucleotides. (K=1000) No Spaces! Same aS above except RNA instead of DNA. , ;The initial segment is now ;Sspecified. We may now issue yany legal EDNA command. ;EDNA presents its structures ;pictorially. Segments are rreferenced by number. :"Mirror" means to add a sparallel DNA strand. 7EDNA knows about (Renumbering 11) October 27, 1977 ;complementary bases. sBreak a bond. A Break ;command can specify 5', 3', 7or H bonds. :Segment 11 is depressed sto indicate the broken bond. 7A similar notation is used ;to indicate Hairpin loops. ;EDNA can "undo" any of its structure changing commands ;User is finished editing sthis structure. He returns to sthe slot editor. STEFIK STEFIK 6-OCT-77 10:22:09 6~-OCT-77 10:22:03 A T GC 11109 8 edna: break 4h edna: print 1 2345 67 (2500) ATTA CG / —— ~-T GC A 1098 il edna: connect 7 3' 8 edna: print 1 2345 67 (2500) ATTA C G8 | /---~--| —--T GC A 109 8 11 edna: undo CONNECT undone. edna: done EDIT: print all DESCR: (U) from ROOT Test unit to demonstrate the DNA structure editor MODIFIER: (U) from ROOT CREATOR: (U) from ROOT MODIFIED: (U) from ROOT CREATED: (U) from ROOT SUBSTRATE: (R) *Top* 1 2345 67 (2500) ATTA CG \— ——~7T GC A 1098 61 EDIT: done :done Save JERRY? no Bye 62 October 27, 1977 ;User is done with this unit. ;User is done with this sknowledge base. 7He doesn't save his changes rbecause this was just a ;demonstration. October 27, 1977 Appendix III A Genetic Planning Example This section is intended to extend the range of genetic examples for which MOLGEN is envisioned being applied. In particular, the recent cloning of the rat insulin gene in E. coli [38] has been achieved using a simple additional step to the usual experimental protocol, It is asserted that the genesis of this efficiency—improving step can be found in the relatively simple application of knowledge about enzyme properties and DNA ligation kinetics. The basic experimental outline is seen below, modified from [38]. It closely follows the 'classical' recombinant DNA methodology, with a few additional steps. The one we wish to focus on most closely in this discussion is the application of Alkaline Phosphatase to the plasmid vector after cutting of the plasmid by the restriction enzyme HindIII, ---~-----(T)y 5! : | Alkaline Digestion | - 3! | w----——== (T)y 5! | Reverse transcriptase 63 { Sl Nuclease Tomer (AZ 38 pwn manenn n= (T)Z 5! | | T4 DNA Ligase | + | 5' CCAAGCTTGG 3! | 3' GGPICGAACC 5! | October 27, 1977 plasmid HindIII site --//-- / -/f--- \ II Hl 1] {| HI | || L| \ / | HindIIt | restriction . | enzyme 5' CCAAGCTTGG---~~--—- (A) zCCAAGCTTGG 3! 5' paAGcTT-----— A 3! 3' GGTICGAACC-—--~—--—- (T) zGGTTCGAACC 5! 3" A-----TICGAp 5' | | | HindIII restriction enzyme | ALKALINE | | PHOSPHATASE 5' pAGCTTGG--------- (A) 2CCA 3! 5' ohAGCTT-—--A 3! 3’ ACC--------- (T) zGGTICGAp 5! 3' A-—---TICGAoh 5!' | | \ | \ / \ / \ 7s \/ | T4 DNA Ligase | * ~-AAGCTTGG-------—= (A) zCCAAGCTT—— / --TICGAACC—--~----- (A) zGGPICGAA-- \ || 7 cDNA l | {| I [| || li plasmid DNA lI \ / 64 October 27, 1977 The following steps were carried out. First, insulin messenger RNA was purified from B cells in the rat pancreas. This was reverse transcribed into a hybrid DNA/RNA structure by the use of avian myeloblastosis virus (AMV) reverse transcriptase and the RNA selectively degraded by raising the pH. A double-stranded DNA form was synthesized by incubating this with deoxynuclecside triphosphates and the AMV reverse transcriptase (a DNA polymerase could have been used). The hairpin at the end of the molecules and any non~base paired regions were removed with the single-strand specific nuclease Sl. The resulting molecular structure is termed. cDNA, or copy DNA, because it should contain the precise genetic information contained in the gene coding for the insulin messenger RNA. ‘This is the in-vitro synthesized segment that is to be cloned in bacterial recipients for amplification and analysis. A recently developed technigue for ligating chemically synthesized restriction site linkers (adapters) [35] to cDNA was used in order to produce cDNA molecules with cohesive termini after digestion with a restriction endonuclease enzyme. Ligating the resulting cDNA to plasmid DNA cut with the identical restriction enzyme would create a recombinant plasmid which could then be cloned ina suitable bacterial host. Specifically, a decamer linker containing a Site for HindIII was covalently joined to the ends of the cDNA with T4 DNA ligase, and then cleaved with HindIII; pMB9, a 3.5 million dalton plasmid conferring tetracycline resistance with a single site for HindIII, was also cut with the same endonuclease, Tne usual procedure would he to now straightforwardly ligate these two molecules together creating the desired recombinant molecule. Kinetic theory [11] suggests that in order to insure ligation of most of the cDNA molecules to plasmid DNA, it is necessary to add a molar excess Of plasmid DNA. However this would result in the majority of the plasmids simply self-circularizing without an insert of cDNA, and thus most the transformed cells would contain only pMB9 and not the desired recombinant plasmids. Here is where the novel step of removing terminal phosphates on the plasmid was generated. Several sources of knowledge need be brought to bear in order to understand the basis. of this new optimization step. Firstly, we need to know that alkaline phosphatase removes 5' terminal phosphates from the HindIII endonuclease-generated ends of the plasmid. Secondly, knowledge about the requirenent of the T4 ligase for a phosphate end configuration allows us to infer that rewoving the phosphate ends prevents self-Ligation of the plasmid DNA, Thirdly, the kinetic theory of ligation [1]] combined with a rule that says, "In a process that involves two or more competing components, you can optimize one process by inhibiting the other (s)", 65 October 27, 1977 should tell us that circle formation is now dependent on the insertion of a DNA fragment containing 5'-phosphorylated termini: the cDNA, Finally, since transformation is directly linked to the DNA source, the one step inference is: "Only recombinant plasmids will transform the recipient bacteria”. A side effect that needs to be dealt with is the fact that the recombinant plasmids generated after phosphatase and ligase treatments will have two nicks, represented as asterisks in the figure, and that this has no known effect on transformation efficiency. The application of alkaline phosphatese to remove terminal phosphates from a restriction enzyme-cleaved vector (e.g. plasmid) to eliminate self-ligation is a novelty that “should" have been obvious to any investigator working in this field. In fact, three or four years passed before Ullrich et. al. utilized it. One can only speculate as to the reasons why. However two related responses arise in this context, First, there are a very large number of DNA reagents available to the investigator (enzymes, chemical and separative technigues) so the number of possible combinations are vast. Secondly, especially with a well focused goal such as the Jigation optimization step discussed above, people tend to think along relatively stereotyped paths, e.g. previously ceveloped protocols. A computer system, such as MOLGEN, with a complex knowledge base and a good? set of heuristic rules, will be likely to uncover novel applications of well known tools, precisely along the lines of the example just presented. 66 October 27, 1977 should tell us that circle formation is now dependent on the insertion of a DNA fragment containing 5'-phosphorylated termini: the cDNA, Finally, since transformation is directly linked to the DNA source, the one step inference is: "Only recombinant plasmids will transform the recipient bacteria". A side effect that needs to be dealt with is the fact that the recombinant plasmids generated after phosphatase and ligase treatments will have two nicks, represented as asterisks in the figure, and that this has no known effect on transformation efficiency. The application of alkaline phosphatesé to remove terminal phosphates from a_ restriction enzyme~-cleaved vector (e.g. plasmid) to eliminate self-ligation is a novelty that "should" have been obvious to any investigator working in this field. In fact, three or four years passed before Ullrich et. al. utilized it. One can only speculate as to the reasons why. However two related responses arise in this context. First, there are a very large number of DNA reagents available to the investigator (enzymes, chemical and separative technigues) so the number of possible combinations are vast. Secondly, especially with a well focused goal such as the ligation optimization step discussed above, people tend to think along relatively stereotyped paths, e.g. previously developed protocols. A computer system, such as MOLGEN, with a complex knowledge base and a good set of heuristic rules, will be likely to uncover novel applications of well know tools, precisely along the lines of the example just presented. 66 October 27, 1977 References Bobrow D.G. and Winograd T., Experience with KRL-0: One Cycle of a Knowledge Representation Language, 5IJCAI:213-222 (1977) Bobrow D.G. and Winograd T., An Overview of KRL, a Knowledge Representation Language, Cognitive Science 1 (1977) Brachman R.J., What's in a Concept: Structural Foundations for Semantic Networks, BBN Report No. 3433 (1976) Brown R., Use of Analogy To Achieve New Expertise, Artificial Intelligence Laboratory, MIT, AI-TR-403 (1977) Buchanan, B.G. and Dennis Smith, "Computer Assisted Chemical Reasoning", in Proceedings of the III International Conference on Computers in Chemical Research, Education and Technology, Plenum Publishing, pp. 388-408 (1977) Carhart, R.E., Smith, D., Brown, H., and Djerallis, C., Applications of artificial intelligence for chemical inference. XVII. An approach to computer-assisted elucidation of molecular structure J. Am. Chem. Soc. 97:5755-5762 (1975). Davis R., Applications of Meta Level Knowledge to the Construction, Maintenance and Use of Large Knowledge Bases, Stanford Computer Science Department Report No. STAN-CS~76-552 (1977) Davis R., Generalized Procedure Calling and Content-Directed Invocation, SIGPLAN Notices, 12:45-54 (1977) Dershowitz N. and Manna Z., The Evolution of Programs: A System for Automatic Program Modification, Stanford University Artificial Intelligence Laboratory (1976) 67 10, ll. 12, 13. 14, 15. 16. 17. 18. 19, October 27, 1977 Dijkstra E.W., Notes on Structured Programming, in Dahl,0O., Dijkstra, E.W. and Hoare, C.A.R. in Structured Programming, Academic Press (1972) Dugaiczyk, A., Boyer, H. and Goodman, H., Ligation of EcoR1l Endonuclease- generated DNA Fragments into Linear and Circular Structures, J. Mol. Bio. 96:171-184 (1975) Ehrlich §.D., Bursztyn-Pettegrew H., Stroynowski I. and Lederberg J., Expression of the thymidylate synthetase gene of the Bacillus subtilis bacteriophage Phi-3-T in Escherichia coli, Proceedings of the National Academy of Sciences USA, 73:4145-4149 (1976) Evans 7.G., A Program for the Solution of Geometric Analogy Intelligence Test Questions, in First Order Mathematical Logic, Margaris, A., Blaisdell Publishing Company, Waltham, Mass. (1967) Feitelson J.S. and Stefik M.J., A Case Study of the Reasoning in a Genetics Experiment, Heuristic Programming Project Working Paper 77-18, Computer Science Department, Stanford University (1977) Fikes R.E. and Nilsson N.J., Learning and Executing Generalized Robot. Plans, Artificial Intelligence, 3:251~288 (1972) Fikes R.E. and Nilsson N.Jd., STRIPS: A New Approach to Applications of Theorem Proving to Problem Solving, Artificial Intelligence 2:189-208 (1971) Feigenbaum E.A., The Art of Artificial Intelligence: I, Themes and Case Studies of Knowledge Engineering, 5IJCAI:1014-1029 (1977) Gerhart S.L., What Goes Down Should Also Come Up: Some Issues about Abstraction, 5th Texas Conference on Computing Systems (1976) Gerhart S. L., Knowledge About Programs: A Model and Case Study, Proc. of Intl. Conf. on Reliable Software, Los Angeles (1975) 68 20. 21. 22, 23. 24, 25. 27. 28. 29. 30. 31. October 27, 1977 Goldberg A. and Kay A., SMALLTALK-72 Instruction Manual, Xerox Corporation (1976) Goldstein I.P. and Roberts R.B., NUDGE, A Knowledge-based Scheduling Program, 5IJCAI: 257-263 (1977) Hayes-Roth F, and Lesser V.R., Focus of Attention in the Hearsay- II Speech Understanding System, 5IJCAI:27-35 (1977) Kling R.E,, Reasoning By Analogy With Applications To Heuristic Problem Solving: A Case Study, Stanford Computer Science Department Report No. CS-216 (1971) Martin N., Friedland P., King J. amd Stefik M., Knowledge Base Management for Experiment Planning in Molecular Genetics, 5IJCAT: 882-887 (1977) McDermott D., Vocabularies for Problem Solver State Descriptions, SIJCAI:229-234 (1977) Miller M.L. and Goldstein I.P., Structured Planning and Debugging, SIJCAI:773-779 (1977) Minsky M.A., A Framework for Representing Knowledge, in Winston P (ed) The Psychology of Computer Vision, New York: McGraw-Hill -(1975) Notani, N.K. and Setlow, J.K., Mechanism of Bacterial Tranformation and Transduction, in Progress in Nucleic Acid Research and Molecular Biology 14:39-100 (1974) Platt J.R., Strong Inference, Science 146:347 (1964) Rieger C. and Grinberg M., The Declarative Representation and Procedural Simulation of Causality in Physical Mechanisms, DIJCAI: 250-256 (1977) Roberts, R.J, CRC Critical Reviews of Biochemistry 4 :123-164 (1976) 69 32. 33. 34, 35. 36. 37. 38. 39. 40, 4l. October 27, 1977 Sacerdoti E.D,.,. The Nonlinear Nature of Plans, 4IJCAI,:206-214 (1975) Sacerdoti E.D., Planning in a Hierachy of Abstraction Spaces, 3IJCAI: 412-422 (1973) Schaffner K., Logic of Discovery and Justification in Regulatory Genetics, Stud. Hist. Phil. Sci. 4:349-385 (1974) Scheller R.H., Dickerson R.E., Boyer H.W., Riggs A.D. and Itakura K., Chemical Synthesis of Restriction Enzyme Recognition Sites Useful for Cloning, Science 196:177-180 (1977) Stefik M.J. end Martin N., A Review of Knowledge Based Problem Solving as A Basis for a Genetics Experiment Designing System, Stanford Computer Science Department. Report No. STAN-CS-77-596 (1977) Sussman G.J., The Virtuous Nature of Bugs, Proceedings of the AISB Summer Conference (1974) Ullrich A., Shine J., Chirgwin J., Pictet R., Tischer E., Rutter W.J., and Goodman H.M., Rat Insulin Genes: Construction of Plasmids Containing the Coding Sequences, Science 196:1313-1319 (1977) Walker D.E., Paxton W.H., Grosz B.J., Hendrix G.G., Robinson A.E., Robinson J.J., Slocum J., Procedures for Integrating Knowledge in a Speech Understanding System, 5IJCAI:36-42 (1977) Wipke W.T., Computer-Assisted Three Dimensional Synthetic Analysis, in Computer Representation and Manipulation of Chemical Information, Wipke W.T. et.al. (eds.) John Wiley (1974) Woods B., What's in a Link?, in Bobrow and Collins (eds.) Representation and Understanding (1975) 70