MONA LISA: A Framework for Reasoning About Gel Electrophoresis of Nucleic Acids Submitted to\the Hawaii International Conference on System Sciences Motifs in Biology: Analysis of Ambiguous Data Biotechnology Computing Track Michael Cookj Michiel Noordewierf{ Joshua Lederberg} {The Rockefeller University tRutgers University This work was supported by DARPA grant #MDA972-91-J-1008 ARPA order #8145/02 Introduction We are developing MONA LISA, a knowledge-based system in the domain of gel electrophoresis. The purpose of this paper is to introduce this problem to computer scientists, and discuss our work to date on representations and algorithms for some aspects of this area. A branch of modern experimental molecular biology attempts to deter- mine the behaviour of organisms by observing the alteration of the information- containing molecules DNA and RNA. A principle tool for the characterization of these nucleic acids is their migration pattern in an gel under the influence of an electric field. Many experimental results, however, lend themselves to multiple interpretation. We therefore have found it useful to formally enu- merate the possible results, each of which corresponds to a hypothesis about the events which have transpired in the experiment. The goal of the MONA LISA system is to generate and evaluate hy- potheses about in molecular biology experiments. This is accomplished in two steps: 1. DNA atid RNA sequences which are the plausible products of the ex- periment are systematically generated, 2. The predicted gel electrophoretic behaviour of the putative molecules is compared with the actual gel results to eliminate hypotheses. Input to the program is a set of descriptions defining: e nucleic acid molecules e protein factors e experimental conditions e gel electrophoresis parameters and results Output from the program is set of hypotheses, where a hypothesis 1s defined to be an assignment of DNA and RNA molecules to bands which are observed on the gel. The existence of a computational system to examine gel-electrophoretic experiments is expected to be of utility to a biologist because it addresses a problem which arises continuously in the laboratory. Although the generation of a specific assignment of nucleic acid molecules to bands is not difficult, the number of potential such assignments grows combinatorially, and is therefore difficult to exhaustively enumerate manually. The interpretation of gel electrophoresis data is a good challenge for knowledge-based programming. First, its achievement would be useful to researchers — it is not a toy problem. Second, its solution requires further research into important areas such as knowledge representation and qualita- tive reasoning. Gel Electrophoresis Gel electrophoresis is one of the most widely used techniques today in molec- ular biology. It is based on the fact that most biological macromolecules are electrically charged and will therefore move in an electric field. “This prop- erty can be used to determine molecular weights, to distinguish molecules by virtue of their net charge or shape,...,and to separate different molecular species quantitatively” [5]. The many applications of gel electrophoresis include: DNA sequenc- ing, Southern transfers, restriction mapping, separating DNA by molecular weight, or by shape due to conformation (e.g. supercoiling), detection of re- combinant plasmids in a cloning experiment, analysis of unknown mixtures, peptide analysis, etc. Many different types of gels are in use: capillary gel electrophoresis, pulsed field, temperature gradient, 2D agarose, 2D polyacry- lamide, transverse gradient, and denaturing gradient gel electrophoresis, to name a few (for an overview of electrophoresis see [4, 2, 1]). The detailed theory of electrophoresis is “highly complicated and at present incomplete” [5]. Therefore, in practice, researchers use empirically deter- mined heuristics to establish the conditions used in gel experiments, as well as in their interpretation. The problem of interpreting a gel is not a “well- formed” problem. The exact goal varies from one context to another, depend- ing on the level of resolution of the data and the goals of the experiment. Also, the problem states are not discrete, and the operators used to move between states are not obvious. At this point, gel data is often not that well quantified. For instance, the total amount of material involved in the experiment may not be known, and so the obvious constraint imposed by the conservation of mass is not available. Extra material which cannot be accounted for is often simply ignored. However, the technique of gel electrophoresis is remarkably useful to biol- ogists, and there is a body of knowledge that qualifies as expertise. For these reasons we decided that gel clectrophoresis experiments are a good domain in which to build an expert system for use by researchers to assist in the design and interpretation of gel experiments. Because of the interests in our laboratory, we have focused on one-dimensional (separation) gels with nucleic acids. The main other category of gels is 2-D gels, and some work has been done in the automatic scanning, matching, and interpretation of such gels [7, 9]. We are more interested in automatic reasoning and knowledge represen- tation than image processing and data analysis, although in a complete gel system all these functions would be integrated. Example In our lab a significant amount of time is spent reasoning about gels. Often a gel is run in order to confirm that an experiment has produced the expected result. For example, if four different species of DNA are expected as the result of a certain procedure, one would expect four bands to appear on a gel. If only three bands appear, the question “what happened to the fourth band?” naturally arises. Several explanations are possible, and they are generated by members of the lab, and discussed for plausibility. Each such explanation is a candidate hypothesis which often suggests follow up experiments to test it. This is a situation in which the systematic enumeration of possibilities based on a knowledge base of facts and heuristic rules could be useful to the researcher. Expert Systems Because of the focus on generating hypotheses to fit data, we have been very much influenced by the DENDRAL paradigm, Plan-Generate-Test [3, 8]. DENDRAL was an early expert system designed to interpret mass spectroscopy experiments. DENDRAL’s task of inferring the structure of a molecule from its mass spectrum is analogically replaced by that of inferring the sect of molecular species loaded into a gel from the pattern on the gel. However, a gel is run in many different contexts and this distinguishes it from the situation DENDRAL handles, which covers a standardized instrumental paradigm. Thus, the knowledge base for a gel system is richer and more diverse. Outline of Paper We present a data structure for representing gels, in order to concretize the sub-class of gels we are considering. Our basic model of experiments involving gels is: an experiment E is performed on an analyte N, resulting in a set of molecular species S; these species are run on a gel G; which is then interpreted as a set of bands B. Diagrammatically, E:N-45§4G XB This structures our discussion, and suggests a general framework for rea- soning about nucleic acid gels. Each arrow in the above sequence suggests a different point of view on the problem. 1. The passage from analyte to a set of molecular species is modelled by rules of the form: Reagent: Nucleic Acid —> Products The reagent could be an enzyme, or possibly null. We are building an “enzymatic production system” consisting of such rules, which is discussed in the section on a language for nucleic acids operations. 2. The passage from a set of molecular species to a migration pattern on a gel is a step involving the theory of gel electrophoresis, which is little understood, and not directly addressed in this paper. The expertise in this domain can be modeled by rules of the form: Nucleic Acid x Gel —+ Migration Distance Different types of nucleic acid and gel parameters result in different migration behaviours, some of which are at best empirically known, many of which are not. Some very basic heuristics from this domain have informed our hy- pothesis generation algorithm, and as we pull more rules through the knowledge acquisition bottleneck, they will be used ina way discussed below. The most basic rule of thumb is one we have named “Mono- tonicity:” Rule of Monotonicity: If nucleic acid A is longer than nucleic acid B, it will migrate more slowly. This rule has many exceptions and ramifications which form much of the lore in this domain. 3. Usually the kinds of gels we are studying are described as a series of “bands” - discrete areas of co-migrating material which often but not always consist of homogeneous molecules. The passage from the gel to this more abstract description is accomplished by eye as an act of perception. Our current approach is to take the bands as a given and reason about them, but we believe that a gel can be scanned, and in most cases, bands isolated which correspond to what is perceived (in difficult cases, a band can be resolved by running a gel under different conditions). Thus, in this paper, steps 2 and 3 are collapsed into one step, S5—B In this context, we present a generator of hypotheses, where a hypothe- sis is defined to be an assignment of species to bands, that is a function S —— B. In addition, we present a scoring function which ranks hy- potheses in order of likelihood. What is a Gel? This section describes a structure for representing a gel experiment. A gel experiment is: {G, P} 1. G = Global gel parameters concentration of matrix (polyacrylamide, agarose, etc.) physical dimensions of gel applied voltage length of run goal (purify, analyze, separate, etc.) 2. P=A set of lanes, each with the following structure: lane = (experimental conditions, data) where experimental conditions is a vector of the ingredients that have been loaded into the lane, and data is a set of values representing the amount of material at distances d,,d2,d3,...,d, from the well. A hypothetical example is shown in Figure 1: in the diagram each col- umn is a lane, and the global parameters are at top. Also, in each data value we simply indicate the presence or absence of a band, rather than any quantitative amount. The main point of this example is as follows: often lanes of one experiment are compared. They form what Simon calls a “data-cluster”. We want to relate differences in the experimental conditions to differences in the data, in a manner similar to Simon’s BACON [6]. Many gel interpretations are based on a comparison of lanes in one gel, since differences in the gel material from gel to gel makes it difficult to compare one gel to another. Very often there is a marker lane, which contains a material whose migration characteristics are well known, and this lane is used to calibrate the parameters of the gel in order to interpret the other lanes. GEL #1 - effects of UV radiation PAGE 7% 15 cm x 20 cm V = 1000 t = 2 hrs. goal = compare Experimental Conditions: Mgt+, + + + - - - RNAp, [x] [x] [x] [x] [x] [x] pBS DNA, [y] [yl] fy] ly] [y] [yl UV Light, + + + + + + Time, 5 10 20 30 40 60 DATA: Figure 1: Hypothetical Gel Experiment A Little Language for Nucleic Acids Research As previously described, the passage from analyte to a set of molecular species is modelled by rules of the form: Reagent: Nucleic Acid —+ Products The reagent can be an enzyme, or possibly null. Our “enzymatic production system” consists of such rules, which models the common transformations applied to nucleic acids by biological and physical reagents. For example, we have included rules which describe the results of application of a restriction enzyme or a DNA polymerase. The antecedents of these rules match data structures which describe in- dividual species of DNA and RNA molecules. To facilitate the use of such tules, we describe an enhanced string language for representing nucleic acids. The language includes conventions for representing single stranded molecules, double stranded molecules, RNA, DNA, and RNA/DNA hybrids; for distin- guishing between the two strands of a double stranded molecule, and for keeping track of the 5’ to 3’ orientation of a sequence. Our formalism sup- ports operations representing the action of basic enzymes used in genetic engineering. We have implemented a parser in PROLOG for this syntax. - The “full” representation is a unambiguous representation of a double stranded DNA molecule, for instance: 5’ - gaattcaaa - 3? 3’ - cttaag... - 5? The dots are place holders, indicating the absence of nucleotides. It is implied that both strands are covalently bonded, and hydrogen bonded with each otlier. Our goal is to write down rules to describe nucleic acid experiments in a one-dimensional way which is easily understandable, and reflects the informal way we describe these situations at lab meetings, but is formal enough to allow automatic reasoning and the establishing of provable properties. First we give the conventions for representing molecules, and then we give some rules describing enzymatic reactions on nucleic acid molecules. 10 Convention 1: The left to right direction always represents 5’ to 3’. Convention 2: Lower-case characters refer to ssDNA. Convention 3: Upper-case characters refer to dsDNA Convention 4: Characters in quotes are literal nucleotide specifications: 7 AGC’, ?paag? Convention 5: Characters outside quotes are variables specifying strings of nucleotides. Convention 6: The complement operator is “*’, and refers to the biological complement of a sequence, i.e. the sequence of the complementary strand (if the molecule is double stranded). It can only refer to single stranded sequence. Examples illustrating the first five conventions: 7AGC’? <==> 5?-age-3? 3° -tcg-5? ?age? <==> 6? -age-3? “age? <==> 5?’-tcg-3? Convention 6: A caret “ * ” or underscore “ _” following any expression indicates that the lower-case characters in the expression are on the upper or lower strand, respectively, (e.g. b” , ’gaa’., "AAggg’” ). Some examples: *AAgeg’” <==> 6’-aaggg-3? 3°-tt -5? *AAgge?_ <==> 5?-aaccec-3’ 3’?-tt -5? 11 >gaa?_ ==> 5’-ctt-3? *gaa’_ = <==> ~? gaa?’* Convention 7: Nucleotides within a string are indexed by optional paren- theses following the string variable: ?AAGCTTG’ (4,7) <==> 5?-cttg-3? 3’-gaac-5’ Convention 8: (Convention 1 revisited) Sequences are written in a “canonical” 5’ to 3’ direction. Single stranded regions are written as the sequence they would be if paired on the “up- per” strand: > AAggeg’~ <==> 5’ -aageg-3? 3?-tt -5? 7aattC?_ <==> 5e- c-3? 3’-ttaag-5’ Convention 9: A DNA molecule is specified as one or more segments separated by commas within square brackets: [ R, ’GAATTC’, S ] Convention 10: An RNA molecule is specified as one or more segments separated by commas within curly brackets: { R, ’GAATTC’, S } Convention 11: A DNA/RNA hybrid molecule is specified as a post-fix notation on a segment within square or curly brackets indicating the composition of one of the strands (modifying the nucleic acid type specified: 12 { °GAATTC’:D } <==> 5?’-gaattc-3’ DNA 5’-cttaag-3’ RNA { °GAATTC’:d }) <==> 5?’-gaattc-3’ RNA & 5’-cttaag-3’ DNA [ ?GAATTC’:R J] <==> 5’-gaattc-3’ RNA & 5’-cttaag-3? DNA [ ?GAATTC?:r ] <==> S’-gaattc-3’ DNA & 5’~cttaag-3’ RNA A molecule which mixes DNA and RNA on the same backbone can be specified as above for hybrid molecules: DNA RNA [ X, ¥:R ] <==> 5? -xxxxxxyyyyyy-3? 3? -xxxxxxyyyyyy-5’ These eleven conventions allow the representation of a wide variety of nucleic acid molecules. Rules for Enzymatic Manipulation of Nucleic Acid Molecules Utilizing the above conventions, we can describe the actions of enzymatic agents on DNA and RNA. The general approach is to match the antecedent of a rule to a description of a set of nucleic acid molecules, binding the sequence and structural properties to variables in the left hand side of the rule. The rule then acts as a production, to create the description of a product set of molecules. Rules have the form: antecedent molecules —+ consequent molecules 13 Examples of rules are found in Figure 2. Hypotheses about the behaviour of processes on informational molecules in vitro are thus confirmed or denied by examining the creation or modification of nucleic acids. We note that there are a number of relevant biological processes which are not addressed yet, but should be handled by straightforward extension of the rule syntax we have described. These include nicking reactions and circular molecules. Other extensions will include more unusual conformational states of DNA and RNA, such as supercoiling, or the formation of triplex molecules and non-canonical hybrids, which are coming under increasing scrutiny from the molecular biology community. 14 EcoR1 endonuclease: [R, °GAATTC’, S] —+[R, ’Gaatt’_] + [’aattC’? , S$] DNA polymerase (progressive): { A, b_(1,n) ] —_ [ A, B(1), b(n,2) ] DNA polymerase (complete): [ A, b-] — [A,B] DNA Ligase (sticky ended molecules): [X,s-]+[s° ,Y]—[X,S,Y] RNA Polymerase (intermediate state): [X,P,¥]—[x,p’,y° ]+[x pp. ¥:R] RNA Polymerase (final state): [X,P, Y] — [X,P, Y]+{R} DNA Ligase (blunt ended molecules): DNA ligase: [X] + [Y] —> [X,Y] Reverse Transcriptase: {x} — {X:d} — [x] — [X] Exonuclease: [X(1,n)] —_ [X(1,n-1)] Annealing: an [x] + [ x] — [X] Figure 2: Rules for Enzymatic Manipulation of Nucleic Acids 15 An Hypothesis Generator for the Assignment of Molecular Species to Bands We consider experiments with the following structure: 1. Run a reaction involving nucleic acids, resulting in a set of molecular species S = $1, S2,..., Sk. bo Load the resulting material in a well in a gel, and electrophorese. 3. Visualize the material in the gel, by autoradiography, staining, or some other procedure. Schematically, we have BE: N—-S— >G—B- where £ is the experiment, N the analyte, S is the set of molecular species resulting, G is the gel, and B is a set of bands that are perceived. Given the experimental results G and the input data to the experiment E we want computer assisted hypothesis formation about the content of the gel. We attack a very simple case first, one in which we “know” S and B. That is, we assume a well-defined set of distinct molecules, of differing molecular weights, S. We also assume a well-defined set of distinct “bands” on a gel, B = by, be,...,6,. In this context, a “hypothesis” is an assignment of species to bands, that is, a function f: S — B. In this simplified setting, we can reason about the set of all hypotheses, and generate systematically a reasonably constrained subset of them. The first observation is that if all mappings f : S — B. are considered, the set of hypotheses is k”. As an illustration of how reasonable constraints can dramatically prune the search space, notice that if k = n, and we focus attention on one to one functions, the size of the resulting set is n!. (This corresponds to assuming that each species of molecule can only appear in one band, and to assuming that two species of molecules do not co-migrate. These assumptions do not always hold, but are not unreasonable.) 16 To further cut down the size of the search space, we impose the further constraint of monotonicity, that is, we assume that S is sorted by size, that B is sorted by migration distance from the well at the top of the lane, and that mappings from S to B are monotonic. In this case, with k = n, there is exactly one hypothesis that fits the data, the unique 1 to 1 function f : S > B that maps decreasing weights into faster migrating bands. We consider next the situation in which the number of bands and the number of molecular species differ. There are two cases to consider: 1. |S| < |B| — less species than bands; 2. [S| > |B| — more species than bands; In each case, we would like a generator of hypotheses, where each hy- pothesis satisfies the monotonicity requirement. (A situation which did not satisfy this condition is a good candidate for what is loosely termed “anoma- lous migration.”) Before we analyze the general case, an example of case 1) and case 2) should clarify the discussion. Example: 5 bands, 3 species: There are 5 mappings. Note that if the species are assumed to occupy 3 bands 1, 3, and 5, then B2 can be hypothesized to be material from either B1 or B3; and B4 can be hypothesized to be material from either B3 or B5. EXAMPLE: 3 bands, 5 species: 17 S10 wenne-- =e e+ +--+ $2. ------- $3 0 -enn--- =~ SB wnennn- wee There are ; mappings. Note that if the bands are assumed to be occu- pied by species 1, 3, and 5, then S2 can be hypothesized to co-migrate with either S1 or $3; and S4 can be hypothesized to co-migrate with either $3 or 55. In case 1), there are ( * | ~ n* one to one functions from § to B, the k number of ways of choosing which k bands are hit by elements of S. After the target bands are chosen, one must still account for the remaining bands. For each such “remainder” there are two possibilities, within the constraints of monotonicity: either it is material from the band above it, or it is material from the band below it. If there is no band above it, then we assume it is from the band below, and if there is no band below it we assume it is from the band above it. If the remaining bands are all interior (not the top band or the bottom band), the number of hypotheses is: (3 )#(—me2 This formula can be easily adjusted for a case in which a remainder band is at an extreme position on the gel. In case 2), there are ( . functions which map onto B, this being the number of ways of choosing n species that have been collapsed into neighbor- ing bands. A further complication is the question as to which neighboring bands they have collapsed to. This is a question of which bands have co- migrated with which (again, within the constraints of monotonicity). This situation is entirely analogous to the above, and the formula is the same. 18 To generate all the hypotheses associating molecular species with bands within the above framework, we can first generate a mapping, and then for each mapping, generate the 2 * (n — k) assignment of missing bands, or missing species. So the first question at hand is: find an algorithm to systematically generate all subsets of k elements in an n element set. We present an algorithm in the next section. Algorithm for the Generation of Ali Subsets of Size k in a Set of Size n Recall the recursion relation: n n—1 n—-1 (h)=(22r)#("2") We use this to generate all n bit numbers with exactly k bits equal to 1. Once we have done this, it is clear how to associate this with subsets of an n element set. | Let T = {n bit. numbers with exactly & bits turned on} The observation used is simply that this set consists of two subsets, odd numbers whose last bit is 1; and even numbers, whose last bit is 0. The first set has k — 1 of its first n — 1 bits turned on, the second set has k of its first yy turned on. Thus if we define O and E by: ' @ O ={n—1 bit numbers with exactly k — 1 bits turned on } e £ ={n—-1 bit numbers with exactly k bits turned on} Then if T ={n bit numbers with exactly k bits turned on} T={2*O0+1}U {2% E} Now we must specify the base of the recursion, which we do as follows: 19 For ( np we return the number whose binary representation is n 1’s. For ( ” ), we return the n bit number all of whose bits are off, i.e., zero. 0 This algorithm has been coded in LISP, and can be used to generate all constrained hypotheses in the above described context of gel experiments. It is reasonable to discuss at this point the various heuristics that can be used to rank these hypotheses. For instance, middle bands are often given more weight than bands at either end of the gel. Also, more intense bands are given often given more weight. However, it is interesting that impor- tant discoveries have been made by focussing on faint bands - for example, ribozymes, and the reverse transcriptase activity of Taq polymerase. 20 Ranking of Hypotheses One of the very interesting aspects of this project is a chance to study multiple levels of interacting hypotheses. A typical gel discussion might have the following “hypothesis structure”: At a top level, there is a hypothesis about the migration of nucleic acids, for example: e Hypothesis 1: IfI plot the migration of nucleic acids of known molecular weights on semi-log paper (weights against distance), the curve can be fitted with a cubic polynomial. Given migration distances for the unknown material, I can then use this curve to estimate its molecular weight. Remark: This hypothesis is open to question, because there is always the possibility of anomalous migration, due to some condition that has not yet been documented. Given this working hypothesis, working hypotheses about the existence of species loaded into the wells are formed: e Hypothesis 2: Species s1,$2, and s3 have resulted from the experiment performed and are present in the gel. e Hypothesis 3: The above species have molecular weights of wi, w2, and wg respectively. Remark: The second two hypotheses are also often rethought during the course of a discussion. Finally, in the context of the above hypotheses, hypotheses about the as- sociation of species with bands may be formed, which is the level of discussion addressed in the previous section. However, the existence of these multiple levels of hypotheses, the way they interact in practice, and the way they are modified and adjusted in the course of a typical discussion among experts, is, ultimately, the complex knowledge structure we hope to formalize. In this section we discuss only the last mentioned level of hypothesis formation, and present one possible measure of the “likelihood” of such an hypothesis. 21 For an hypothesis which takes the form of a list of pairs: (Weight;, Distance;),i =1,...,k with descending weights and ascending distances, we define a vector consist- ing of ratios of successive differences as follows: i= log(wi41) — log(wi) =1k-1 dizi — d; Then define the ”variation” of a hypothesis as the maximum distance between these ratios: mar 1,9| Ri— R; | In the absence of anomalous migration, a mapping from weights to bands is therefore more likely if it has less variation, i.e. the best hypothesis is gotten by choosing the vector with the least variation. An example should clarify this proposed rule for ranking hypotheses. Example: Given DNA fragments of 10, 20 and 30 base pairs, and a gel with bands at 2cm, 4cm, 4.1cm, and 6cm from the origin, there are 4 \ = 4 3 hypotheses about which bands contain which species: | 1) (30 bp, 2 cm) (20 bp, 4 cm) (10 bp, 6 cm) R = (5,5) max = 0 2) (30 bp, 2 cm) (20 bp, 4.1 cm) (10 bp, 6 cm) 22 R = (10/2.1,10/1.9) = (4.76,5.26) max = .50 3) (30 bp, 4 cm) (20 bp, 4.1 cm) (10 bp, 6 cm) R (10/.1,10/1.9) = (100, 5.26) max = 94.74 " 4) (30 bp, 2 cm) (20 bp, 4 cm) (10 bp, 4.1 cm) w u (10/2,10/.1) = (5, 100) max = 95 Thus, the ranking in this case is: 1. is the most likely 2. is the next most likely 3. is the next most likely 4. is the least likely. Of course, in cases for which there is no good guess as to the sizes of the fragments, this rule is not applicable. This area is complex, and is a focus of our current research. We anticipate finding different methods for ranking hypotheses, depending on the granu- larity of the data, the confidence factors associated with various data, and the particular goal of the experiment at hand. 23 The above discussion was based on knowing S and B. This is often not a fully realistic assumption in the world of gel electrophoresis. In reality, the process of going from an experiment to a set of molecular species is fraught with unknowns; and this aspect of modeling is addressed in our “enzymatic production system”. Summary The process of thinking about gels as we have observed it, exhibits the fol- lowing pattern: 1. Look at the gel, G, and discern its significant features: its bands, B, their intensity, thickness, and number, areas of smear, and any anoma- lies. 2. Consider the experiment, and hypothesize a set of species that are expected to appear. 3. Generate hypotheses about the association between molecular species and bands - and rank them according to “expert” knowledge. 4. Often, rethink the expected species, generating new hypotheses in the light of discussion; and rethink the description of the bands in the gel. 5. Finally, most gel discussions end with a suggestion for what experiment or experiments would be valuable to perform next, in order to resolve remaining ambiguities. Once a set of species, S and a set of bands, B, are postulated, hypotheses about their possible associations h : S —> B are enumerated with a simple generator, and ranked according to user imposed heuristics and criteria. 24 References [1] Gel electrophoresis of nucleic acids: a practical approach. Oxford: IRL Press, 1982. [2] Anthony T Andrews. Electrophoresis: theory, techniques, and biochemical and clinical applications. Oxford: Clarendon Press, 1981. [3] Bruce G. Buchanan, G. L. Sutherland, and Edward A. Feigenbaum. Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry, volume 4, pages 209-254. Edinburgh University Press, 1969. [4] Andreas Chrambach. The practice of quantitative gel electrophoresis. Deerfield Beach FL: VCH Publishers, 1985. [5] David Michael Freifelder. Physical biochemistry: applications to biochem- istry and molecular biology. San Francisco: W. H. Freeman, 1976. [6] Pat Langley, Herbert A. Simon, Gary L. Bradshaw, and Jan M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Pro- cesses. MIT Press, Cambridge, Massachussetts. . P.F. Lemkin and L.E. Lipkin. GELLAB: A computer system for 2d gel electrophoresis analysis I,II. Computers and Biomedical Research, 14:355- , 380. [8] Robert Lindsay, Bruce G. Buchanan, Edward A. Feigenbaum, and Joshua Lederberg. Applications of artificial intelligence for organic chemistry: the DENDRAL Project. New York: McGraw-Hill, 1980. [9] Mark J. Miller. Computer analysis of two-dimensional gels: Semi- automatic matching. Clinical Chemistry, 28(4):867-875, 1982. 25