Section 9.1.5 MYCIN Project constraints. We are just beginning work, in conjunction with IBM Scientific Labs, to develop an EMYCIN consultation package for electronic fault diagnosis. GUIDON A plan for further development of GUIDON is described in terms of a partial ordering of research problems. Improving the student model will receive priority. interruption/assistance/evaluation teaching strategies / \ / \ / \ dialogue planning \ | \ I \ | \ case selection \ | \ \ | \n rer nse tr ccceseH student model case differences/ genetic epistemology Implementation of the strategical methods is now proceeding. There are several tasks (corresponding to the managerial and operational considerations) organized hierarchically. These tasks will be expressed in rule form (if then ). Structural knowledge will serve to hook these domain independent Strategical rules into a particular rule set like MYCIN's. This will involve adding a taxonomic problem classification to the knowledge base and regrouping rules and parameters according to this classification, Besides using the strategical model for guiding a dialogue with a Student, we are investigating the possibility of reconfiguring MYCIN's rule set so that the strategy rules direct a consultation. The result will be a knowledge base of rules and parameters, just like MYCIN's, that does hypothesis formation with focusing by the same backward chaining interpreter we have always used. Even without this Step, by formalizing (on paper) a strategical model in terms of production rules, we are led to conclude that it is the exhaustive, depth-first character of MYCIN's search that is different from hypothesis formation, not backward chaining. The Strategical rules are meta-rules that modify MYCIN's search. Subgoaling by backward chaining of rules is compatible with both depth-first search and hypothesis formation. Missing knowledge aside, we find that many of MYCIN’s rules are too detailed to be learned by people. We find that people just don't think about the fine-line, statistically-based distinctions that MYCIN rules record. We have developed a way to encode what an expert actually knows by Privileged Communication 201 E. A. Feigenbaum MYCIN Project Section 9.1.5. overlaying qualifications on top of MYCIN's rules. This takes the form of a functional statement (e.g., csf-protein is proportional to intensity and duration of iltness) and ranges of discrimination ( <100 means viral: >250 means chronic or bacterial; otherwise "it could be anything"). These Summary statements capture what the student should learn; they will be used in quizzes based on the rules, as well as for selecting cases. In a related development, we are trying to record aphorisms and mnemonics that experts use for remembering strategical and mechanistic principles, e.g., "when you hear hoof beats think of horses, not zebras" and "csf glucose is low for bacterial meningitis because bacteria eat the glucose for food" (this is wrong, but physicians remember it and generally don't realize or care that it is wrong!). We find that causal knowledge in our domain serves as a cue for remembering associations; actual diagnosis generally occurs at a level higher than causal mechanism. ONCOCIN In the three months remaining in the current year, we expect to have completed the PASCAL interface program that will respond to the special keypad on the Datamedia terminal. We also intend to codify the rules for one more chemotherapy protocol (probably oat cell carcinoma of the lung) in order to verify the generality and flexibility of the representation scheme we have devised. In the coming year, our plans include the following: (1) To develop the software protocols for achieving communication between the PASCAL interface program and the INTERLISP reasoning program. (2) To coordinate the printing routines needed to produce hardcopy flowsheets, patient summaries, and encounter sheets. (3) To install the new terminal and hard copy device in the Oncology Day Care Center for final testing and debugging. (4) To begin offering the ONCOCIN system for use by oncology faculty and fellows in the chemotherapy clinics (three mornings per week) in which most of the lymphoma patients receive their treatment. (5) To codify and implement additional protocols contingent upon adequate progress with the steps outline above. Throughout this work we shall continue to relate the requirements of the system we are devetoping to the underlying artificial intelligence methodologies. We are convinced that the basic science frontiers of AI are best explored in the.context of systems for real world use; thus ONCOCIN Serves as a vehicle for developing an improved understanding of the issues that underlie other forms of knowledge engineering. E. A. Feigenbaum 202 Privileged Communication Section 9.1.5 MYCIN Project B. Requirements for Continued SUMEX Use All the work we are doing (EMYCIN, GUIDON, ONCOCIN, pilus continued use of the original MYCIN program) is totally dependent on continued use of the SUMEX resource. The programs all make assumptions regarding the computing environment in which they operate, and the ONCOCIN design in particular depends upon proximity to the 20/20 which will enable us to use a 9660 baud interface. Most of us use SUMEX as the only comsuter on which we work. In addition, we have long appreciated the benefits of GUEST and network access to the programs we are developing. SUMEX greatly enhances our ability to obtain feedback from interested physicians and computer scientists around the country. Network access has also permitted high quality formal demonstrations of our work both from around the United States and from sites abroad (e.g., Japan, Sweden, Great Britain). C. Requirements for Additional Computing Resources The recent acquisition of the 20/20 by SUMEX has been crucial to the growth of our research work, both to insure high quality demonstrations and to enable us to develop a system such as ONCOCIN for real-world use in a clinical setting. As we continue to develop systems that are potentially. useful as stand-alone packages (e.g., an exportable EMYCIN), additional small computers would be particularly valuable resources. It is not yet clear which machines are optimal for the LISP-based applications we are developing, and an opportunity to test our systems on several small-to- medium machines would be invaluable and in keeping with our desire to move some of the AIM products into a community of service users. As we have mentioned, the response time on the main machine continues to be a major problem during the daytime hours, and is beginning to be limiting on occasion in the evenings as well. Any acquisitions that would provide additional cycles or permit off-loading of some users from the PDP- 10 would significantly benefit the SUMEX research community. The continued growth of our research project, with MYCIN space still required, GUIDON growing, and ONCOCIN now a new and large system, has resulted in some moderate problems with disk allocation as well. We have managed to shuffle allocations reasonably effectively until now, but there is no longer much flexibility and an additional allocation of approximately 2500 pages would greatly relieve the pressure. D. Recommendations for Future Community and Resource Development We have two principal recommendations for new SUMEX developments. First, the acquisition of several small machines, linked to the main processor through the ethernet, and each able to run INTERLISP, would allow important experiments in bringing the more mature AIM systems closer to being exportable for use outside of strict research environments, Privileged Communication 203 E. A. Feigenbaum MYCIN Project Section 9.1.5 Second, we propose the formal establishment of a mechanism for providing hardware and communications equipment for SUMEX demonstrations at a distance. There are beginning to be enough invitations for the older AIM Systems to be shown at meetings and to funding agencies, that a dedicated system of demonstration equipment and personnel seems appropriate at this time. E. A. Feigenbaum 204 Privileged Communication Section 9.1.6 Protein Structure Project 9.1.6 Protein Structure Project Protein Structure Modeling Project Prof. E. Feigenbaum and Mr. Allan J. Terry Department of Computer Science Stanford University I. Summary of Research Program A. Technical goals The goals of the protein structure modeling project are to 1) identify critical tasks in protein structure elucidation which may benefit by the application of AI problem-solving techniques, and 2) design and implement programs to perform those tasks. We have identified two principal areas which are of practical and theoretical interest to both protein crystallographers and computer scientists working in AI. The first is the problem of interpreting a three-dimensional electron density map. The second is the problem of determining a plausible structure in the absence of phase information normally inferred from experimental isomorphous replacement data. Current emphasis is on the implementation of a program for interpreting electron density maps (EDM's). B. Medical relevance and collaboration The biomedical relevance of protein crystallography has been wel] stated in an excellent textbook on the subject (Blundell & Johnson, Protein Crystallography, Academic Press, 1976): "Protein Crystallography is the application of the techniques of X-ray diffraction ... to crystals of one of the most important classes of biological molecules, the proteins. ... It is known that the diverse biological functions of these complex molecules are determined by and are dependent upon their three-dimensional structure and upon the ability of these structures to respond to other molecules by changes in shape. At the present time X-ray analysis of protein crystals forms the only method by which detailed structural information (in terms of the spatial coordinates of the atoms) may be obtained. The results of these analyses have provided firm structural evidence which, together with biochemical and chemical studies, immediately suggests proposals concerning the molecular basis of biological activity.” The project involves a collaboration between computer scientists-at Stanford University and crystallographers at Oak Ridge National Privileged Communication 205 E. A. Feigenbaum Protein Structure Project Section 9.1.6 Laboratories (Dr. Carrol] Johnson), the University of California at San Francisco (Dr. Robert Langridge), and the University of California at San Diego (under the direction of Prof. Joseph Kraut). Our principal collaborator at UCSD is Dr. Stephan Freer. C. Progress summary We have completed a major cycle of design review and program reorganization, resulting in the system described in publication number three below. The system now has a completely rule-based control structure proceeding from strategy rules, to a set of task rules, ending with individual knowledge sources. This new design seems powerful and flexible enough to provide the basis of a useful EDM interpretation system for protein structure determination. After building the control structure we wanted, we have worked on building up the knowledge base. Large chunks of knowledge are called "tasks"; we have completed the Initialization task, implemented a tracing task, and implemented a task to split group toeholds. Further details of these tasks and their content can be found in publication number three. We have also continued our efforts to improve the power of our data representations. Towards this end we have implemented a new preprocessor. to assign functional labels to segments. This program consists of heuristics that attempt to capture the knowledge a human uses when he visually examines a skeletonized EDM. We find the use of labeled segments greatly aids the main CRYSALIS program by allowing rules to be written in terms much closer to those which humans use rather than the language in which the EDM skeleton is defined, Finally, we are compiling documentation on the system and the knowledge it embodies. These documents should be sufficiently complete so that we, or other groups, will have little difficulty picking up where we leave off. We also feel that explicit documentation of our model-building heuristics will be useful to the crystallographic community as it provides a new viewpoint, complementary to traditional crystallographic methods. The work currently in progress can be characterized as additions to the knowledge base and work on new data representations. Whereas the previously-implemented tracing task attempts to grow an "island of certainty” in the hypothesis in a non-directed manner, we are now working on a task that specifically tries to link two such islands. In addition to this new task, we are augmenting the system's tracing knowledge to deal with small sidechains that seldom appear in the data. The final addition to the knowledge base is an effort to incorporate some notion of Stereochemistry and the constraints on three dimensional structure it provides. This will be useful in the matching of features and in the prediction of secondary structure. The last item of work in progress is an attempt to design a data representation that captures volume information. Current representations such as the skeleton preserve topology but do not preserve shape. With the inclusion of volume information, we should be able to capture much of the expert's knowledge of shape and form that presently goes unused. E. A. Feigenbaum 206 Privileged Communication Section 9.1.6 Protein Structure Project D. List of Publications ‘1) Robert S. Engelmore and H. Penny Nii, "A Knowledge-Based System for the Interpretation of Protein X-Ray Crystallographic Data," Heuristic Programming Project Memo HPP-77-2, January, 1977. (Alternate identification: STAN-CS-77-589) 2) E.A. Feigenbaum, R.S. Engelmore, C.K. Johnson, "A Correlation Between Crystallographic Computing and Artificial Intelligence," in Acta Crystallographica, A33:13, (1977). (Alternate identification: HPP-77- 15) 3) Robert Engelmore and Allan Terry, “Structure and Function of the CRYSALIS System", Proc. GIJCAI, 1979. pp250-256 (Alternative identification: HPP-79-16) 4) R. S. Engelmore, A. Terry, S. T. Freer, and C. K. Johnson, "A Knowledge- Based System for Interpreting Protein Electron Density Maps", Abstracts of Amer. Crystallographic Ass. 7,1 (1979) p38 E. Funding status Grant title: The Automation of Scientific Inference: Heuristic Computing Applied to Protein Crystallography Principal Investigator: Prof. Edward A. Feigenbaum Funding Agency: National Science Foundation Grant identification number: MCS 79-33666 Term of award: December 1, 1979 through November 31, 1981 Amount of award: $35,318 (direct costs only) II. Interaction with the SUMEX-AIM resource A. Collaborations The protein structure modeling project has been a collaborative effort since its inception, involving co-workers at Stanford and UCSD (and, more recently, at Oak Ridge and UCSF). The SUMEX facility has provided a focus for the communication of knowledge, programs and data. Without the special facilities provided by SUMEX the research would be seriously impeded. Computer networking has been especially effective in facilitating the transfer of information. For example, the more traditional computational analyses of the UCSD crystallographic data are made at the CDC 7600 facility at Berkeley. As the processed data, specifically the EDM's and their Fourier transforms, become available, they are transferred to SUMEX via the FIP facility of the ARPA net, with a minimum of fuss. (Unfortunately, other methods of data transfer are often necessary as well Privileged Communication 207 E. A. Feigenbaum Protein Structure Project . Section 9.1.6 -- see below.) Programs developed at SUMEX, or transferred to SUMEX from other laboratories, are shared directly among the collaborators. Indeed, with some of the programs which have originated at UCSD and elsewhere, our off-campus collaborators frequently find it easier to use the SUMEX versions because of the interactive computing environment and ease of access. Advice, progress reports, new ideas, general information, etc. are communicated via the message and/or bulletin board facilities. B. Interaction with other SUMEX-AIM projects Our interactions with other SUMEX-AIM projects have been mostly in the form of personal contacts. We have strong ties to the MYCIN, AGE and MOLGEN projects and keep abreast of research in those areas on a regular basis through informal discussions. The SUMEX~AIM workshops provide an excellent opportunity to survey all the projects in the community. Common research themes, e.g. knowledge-based systems, as well] as alternate problem-solving methodologies were particularly valuable to share. C. Critique of Resource Services The SUMEX facility provides a wide spectrum of computing services which are genuinely useful to our project -- message handling, file management, Interlisp, Fortran and text editors come immediately to mind. Moreover, the staff, particularly the operators, are to be commended for their willingness to help solve special problems (e.g., reading tapes) or providing extra service (e.g. immediate retrieval of an archived file). We would also like to commend the staff for its extensive help in setting up a Jink between SUMEX and Dr. Langridge's group at UCSF. Such cooperative behavior is rare in computer centers. There are several facilities we wish to single out as particularly useful in furthering our research goals. Since the members of the project are physically distant, the MSG program is very useful. Similarly, the file system, the ARCHIVE facility, and the general ease of getting backup files from the operator greatly aid our efforts at coordinating the efforts of collaborators using many large data sets and programs. The crystallographers in the project find SUMEX to be a friendly environment which allows them to do their work with a minimum of dealing with operating system details. It has become increasingly evident, however, that as CRYSALIS expands, the facility cannot provide enough machine cycles during prime time to support the implementation and debugging of new features. For example, our segqment-labeling preprocessor requires about an hour of machine time per 100 residues of protein (this is typically five to eight hours of terminal time during working hours) even when the Lisp code is compiled. E. A. Feigenbaum 208 Privileged Communication Section 9.1.6 Protein Structure Project III. Use of SUMEX during the remaining grant period (8/79 - 7/81) A. Long-range goals Our short term goals are to build up the knowledge base to the point where it can solve a small, known protein from “live” data. This will probably entail the implementation of about a dozen tasks. By this point we should also have a package of data-reduction programs Suitable for export to interested crystallographers. Our Jong range goais are the exploitation of the rule-based control Structure for investigating alternative problem-solving strategies, the investigation of modes of explanation of the program's reasoning steps, and the expansion and generalization of the system to cover a wider range of input data. B. Justification for continued use of SUMEX We feel that SUMEX is the ideal vehicle for further research on CRYSALIS. While some of our work is numerical in nature and uses such facilities as FORTRAN, our main interest is in artificial intelligence. Besides being an expert system of use to the crystallographic community, CRYSALIS is an exploration of the general signal processing problem. We are vitally concerned with issues such as proper architecture for using a wide variety of heuristics effectively and hypothesis formation when both data and model are poor. The utility of our work to the AI community is partially demonstrated by the development of the AGE project, an extension of Ms. Nii's early work on CRYSALIS. This project progresses by the collaboration of several physically- Separated groups. SUMEX provides a unique resource, an electronic community of researchers in our field, through the many systems such as net mail, country-wide access, and community workshops. We feel that CRYSALIS would not be possible outside of such a community. C. Needs and plans for other computing resources Our major need for other computing resources is for graphical display of our data and results. This need will be met by use of Dr. Langridge's Evans and Sutherland Picture System at UCSF and Dr. Johnson's raster-based graphics system at ORNL. The major impediment is SUMEX’s current inability to support data transfer to other machines at more than 1200 baud. We are attempting to link SUMEX to UCSF by using FTP over the ARPAnet to the LBL machine and then use an existing link from LBL to UCSF. D. Recommendations for future community and resource development There are two recommendations we wish to make, the first and most important is to expand the computing power available to SUMEX users. CRYSALIS is an inherently-large problem. Proteins contain hundreds, to thousands of atoms which means large hypothesis structures, large quantities of data, and a compute-bound inference program. As the system grows to maturity, we expect increasingly serious problems with address space limitations and with machine cycle availability. Privileged Communication 209 E. A. Feigenbaum Protein Structure Project Section 9.1.6 The second recommendation is that SUMEX develop some relatively inexpensive file transfer facility for machines not on the ARPAnet. Software for this already exists in the form of the TTYFTP program (or possible future programs like it, but in a more portable language), the development needed is in hardware and in the TENEX operating system so that transfer rates greater than 1200 baud can be achieved. We are motivated to recommend this not only by our own need for such a facility, but also by the belief that it would aid other collaborations involving SUMEX and outside computers (the SECS project for example), and aid in the dissemination of useful programs from the research setting of SUMEX to user laboratories. E. A. Feigenbaum 210 Privileged Communication Section 9.1.7 RX Project 9.1.7 RX Project The RX Project: Deriving Medical Knowledge from Time-Oriented Clinical Databases Robert L. Blum, M.D. Division of Clinical Pharmacology Department of Internal Medicine Stanford School of Medicine Gio C. M. Wiederhold, Ph.D. Departments of Computer Science and Electrical Engineering Stanford University I. Summary of Research Program I.A. Technical goals: Introduction: Medical and Computer Science Goals The objective of the RX Project is to develop a medical information System capable of accurately deriving knowledge of the course and consequences of treatment of chronic diseases from a large collection of stored patient records. Computerized clinical databases and automated medical records systems have been under development throughout the world for at least a decade. Among the earliest of these endeavors was the ARAMIS Project, (American Rheumatism Association Medical Information System) under development at Stanford by Dr, James Fries and his colleagues since 1967. A prototype ambulatory records system was generalized in the early 1970's by Prof. Gio Wiederhold and Stephen Weyl in the form of a Time-Oriented Database (TOD) System. The TOD System, run on the IBM 370/3033 at the Stanford Center for Information Processing (SCIP), now supports the ARAMIS Project as well as a host of other chronic disease databases which store patient data gathered at many institutions nation-wide. At the present time ARAMIS contains records of over 10,000 patients with a variety of rheumatologic diagnoses. Over 30,000 patient visits have been recorded, accounting for 20,000 patient-years of observation. The fundamental objective of ARAMIS, the other TOD research groups, and all other clinical data bank researchers is to use the raw data which has been gathered by clinical observation in order to Study the evolution and medical management of chronic diseases. Unfortunately, the process of reliably deriving knowledge from raw data has proven to be refractory to existing techniques because of problems stemming from the complexity of disease, therapy, and outcome definitions; the complexity of time relationships; complex causal relationships creating strong sources of bias; and problems of missing and outlying data. Privileged Communication 211 E. A. Feigenbaum RX Project Section 9.1.7. A major objective of the RX Project is to explore the utility of symbolic computational methods and knowledge-based techniques at solving this problem of accurate knowledge inference from non-randomized, non- protocol patient records. A central component of RX is a knowledge base of medicine and statistics, organized as a hierarchy or taxonomic tree consisting of nodes with attached data and procedures. Nodes representing diseases and therapeutic regimens contain procedures which use a variety of time-dependent predicates to label patient records in the database, facilitating the retrieval of time-intervals of interest in the records. The database is then inverted so that each node or object in the knowledge base contains pointers to all time-intervals during which its definition is satisfied. Nodes in the knowledge base also contain lists of other nodes which are causally related. These functional dependencies are used to infer causal pathways among nodes for purposes of selecting confounding variables which need to be controlled for in the study of a specific hypothesis. Causal pathways may also be used in an exploratory mode to discover new hypotheses, To study a particular causal hypothesis the knowledge base also contains information on the applicability of various statistical procedures and procedures for applying them. I.B. Medical Relevance and Collaboration As a test bed for system development our focus of attention has been on the records of patients with systemic lupus erythematosus (SLE) contained in the Stanford portion of the ARAMIS Data Bank. SLE is a chronic rheumatologic disease with a broad spectrum of manifestations which can lead to death in the third decade of life. With many perplexing diagnostic and therapeutic dilemmas, it is a disease of considerable medical interest, In the future we anticipate possible collaborations with other project users of the TOD System such as the National Stroke Data Bank, the Northern California Oncology Group, and the Stanford Divisions of Oncology and of Radiation Therapy. The RX Project is a new research effort only in existence for about a year, and, hence the project is very much in a developmental stage. The primary issues being addressed at this stage are those concerned with the specifics of knowledge representation and flow of control, rather than with the testing of specific hypotheses in chronic disease management. We believe that this research project is broadly applicable to the entire gamut of chronic diseases which constitute the bulk of morbidity and mortality in the United States. Consider five major diagnostic categories which are responsible for approximately two thirds of the two million deaths per year in the United States: myocardial infarction, stroke, cancer, hypertension, and diabetes. Therapy for each of these diagnoses is fraught with controversy concerning the balance of benefits versus costs. £. A. Feigenbaum 212 Priviteged Communication Section 9.1.7 RX Project 1) Myocardial Infarction: Indications for and efficacy of coronary artery bypass graft vs. medical management alone. Indications for long-term antiarrhythmics ... long-term anticoagulants. Benefits of cholesterol-lowering diets, exercise, etc. 2) Stroke: Efficacy of long-term anti-platelet agents, long-term anticoagulation. Indications for revascularization. 3) Cancer: Relative efficacy of radiation therapy, chemotherapy, surgical excision - singly or in combination. Optimal frequency of screening procedures. Prophylactic therapy. 4) Hypertension: Indications for therapy. Efficacy versus adverse effects of chronic antihypertensive drugs. Role of various diagnostic tests such as renal arteriography in work-up. 5) Diabetes: Influence of insulin administration on microvascular complications. Role of oral hypoglycemics. Despite the expenditure of billions of dollars over recent years for randomized controlled trials (RCT's) designed to answer these and other questions, answers have been slow in coming. RCT's are expensive of funds and personnel. The therapeutic questions in clinical medicine are too numerous for each to be addressed by its own series of RCT's. On the other hand, the data regularly gathered in patient records in the course of the normal performance of health care delivery is a rich and largely underutilized resource. The ease of accessibility and manipulation of these data afforded by computerized clinical data banks holds out the possibility of a major new resource for acquiring knowledge on the evolution and therapy of chronic diseases. The goal of the research which we are pursuing on SUMEX is to increase the reliability of knowledge derived from clinical data banks with the hope of providing a new tool for augmenting knowledge of diseases and therapies as a supplement to knowledge derived from formal prospective clinical trials. Furthermore, the incorporation of knowledge from both clinical data banks and other sources into a uniform knowledge base should increase the ease of access by individual clinicians to this knowledge and thereby facilitate both the practice of medicine as well as the investigation of human disease processes. Highlights of Research Progress 1 July 1979 to 1 April 1980 Our predominant objective was to detail the overall conceptual framework for the knowledge base and to develop the extensive computational machinery necessary for retrieving, analyzing, and displaying defined time- intervals within patient records. Privileged Communication 213 E. A. Feigenbaum RX Project Section 9.1.7 The RX Knowledge Base (KB): The central component of RX is a knowledge base of medicine and Statistics, organized as a frame-based, taxonomic tree consisting of units with attached data and procedures, Units representing diseases and therapies contain procedures which use a variety of time-dependent predicates to label the patient records, facilitating the retrieval of time~intervals of interest in the records. Other units representing Statistical techniques are used to map hypotheses onto study designs and event dafinitions. Implementing the algorithms and data structures of this AG was Gane of the major tasks of the current year. At the current time the RX KB contains about 200 units of which 75 contain definitions and other relevant information pertaining to disease courses, effects of drugs, lab values, etc. This information compromises a small subset of medical knowledge dealing with some of the signs and symptoms of systemic lupus erythematosus (SLE) as well as the effects and indications of some drugs used for this disease. Other units contain machine-readable knowledge of statistical techniques needed for testing entered hypotheses. There are approximately 40 time-dependent functions used to map from the database values onto defined units. The entire RX system currently contains approximately 250 INTERLISP functions accounting for 75 disk pages of code. The KB is about 30 disk pages. One disk page = 512 words * 36 bits per word. Also one disk page = approx, 1.5 typed pages on 8.5 by 11.5 inch paper. Statistical Interfaces: Once the relevant episodes have been defined and retrieved from the database they must be analyzed statistically. In order to do this we use the SPSS package (Statistical Package for the Social Sciences) available on SUMEX. A collection of RX programs create SPSS "source decks" containing card images of the appropriate commands along with the extracted data. RX then calls the operating system and runs SPSS on the source file, The human-readable listing is then searched for important results which are automatically extracted and interpreted. Time-Oriented Graphics Package: This package enables data on an individual patient to be graphed over time, either linearly by visit or by calendar time with a "telescoping" capability. The program overlays graphs of both point data and data represented as episodes. Study Editor: Dr. Jerrold Kaplan, a research associate affiliated with the project, has implemented an additional package of programs which display to the clinician user those decisions which have been made by the knowledge base concerning which statistical techniques are to be employed, which variables are to be controlled for, and which time intervals are to be excluded. This affords the user with a means for seeing a sketch of the study plan before it is executed, and enables him to modify that plan. E. A. Feigenbaum 214 Privileged Communication Section 9.1.7 RX Project Clinical Study: The Effect of Prednisone on Cholesterol As a testbed for the prototype system we have been investigating the hypothesis that the steroid, prednisone, produces a significant elevation of plasma cholesterol. To test this hypothesis, the records of 50 patients with systemic lupus erythematosus (SLE) were transferred from the ARAMIS Database to SUMEX. Of these patients, 18 were found to have five or more cholesterol determinations and to have had sufficient variance in their prednisone regimens to be testable. The KB is used to elaborate a complex causal model for the prednisone/cholesterol hypothesis which is tested using a hierarchical multiple regression method with time-lagged values. The KB is used to determine sources of possible bias and to control for those variables in the regression or to eliminate corresponding time- intervals from records. An empirical Bayes method is used to average the estimated effects in patients with varying amounts of data. The result, a highly statistically significant elevation of cholesterol by prednisone, will be submitted for publication during the coming year. Research In Progress Much work remains to be done in expanding the system software and in expanding the knowledge base. Current work is addressed to increasing the flexibility of the time-segmentation functions and enriching the data Structures which encode relationships among objects. We are trying to make increasingly general the class of medical hypotheses which the system can analyze automatically. This requires incorporating knowledge of additional statistical methods into the KB and the development of expanded capabilities for interfacing RX to on-line Statistical packages. We are also attempting to generalize our algorithms for selecting the set variables which may potentially confound a given hypothesis. As a means for testing and expanding the system's capabilities we intend to perform several specific studies of importance in the management of the rheumatic diseases. Our study of the effect of prednisone on cholesterol was mentioned above. Other studies now being planned include the effect of chronic aspirin ingestion on liver function in rheumatoid arthritis, the specific incidence of infectious complications of steroids as a function of dose and duration, and the utility of various autoantibodies in the prediction of flares of SLE as compared to the utility of other indicators. Finally, we are developing a methodology for discovering hypotheses of interest in the database using a heuristically guided search of large matrices of simple and partial correlation coefficients. Publications Blum, Robert L.; Wiederhold, Gio: Inferring Knowledge from Clinical Data Banks Utilizing Techniques from Artificial Intelligence. Proc. of The 2nd Annual Symp. on Computer Applications in Medical Care, pp. 303 to 307, IEEE, Washington, D.C., November 5-9, 1978 Privileged Communication 215 E. A. Feigenbaum RX Project Section 9.1.7 Blum, Robert L.: Automating the Study of Clinical Hypotheses on a Time- Oriented Data Base: The RX Project. Submitted for publication to MEDINFO80, Tokyo, Japan, Oct. 1980 Wiederhold, Gio: Databases in Healthcare. To be published in a compendium series on Technology in Healthcare, sponsored by the Healthcare Technology Center, Univ. of Missouri, Columbia, Mo., also available as Stanford CS Report 80-790 Funding Support Status 1) A Computer-Based System for Advising Physicians on Clinical Therapeutics Robert L. Bium, M.D.: Awardee Post-Doctoral Research Fellowship in Clinical Pharmacology Pharmaceutical Manufacturers' Association Foundation Total award: $32,500 (direct) Term: July 1, 1978 to June 30, 1980 2) Integrating Medical Knowledge and Clinical Data Banks Robert L. Blum, M.D.: Principal Investigator National Library of Medicine, New Investigator Award Total award: $90,000 (direct) Term: July 1, 1979 to June 30, 1982 3) Integrating Medical Knowledge and Clinical Data Banks Gio C. M. Wiederhold, Ph.D.: Principal Investigator National Center for Health Services Research, Small Grants Total award: $35,000 (direct) Term: April 1, 1979 to March 31, 1981 IIT. INTERACTIONS WITH THE SUMEX-AIM RESOURCE II.A. Collaborations Since our project is new, we do not yet have public versions of the programs. There is, however, a large sphere of collaboration which we expect in the future. Once the RX program is developed, we would anticipate collaboration with all of the ARAMIS project sites in the further development of a knowledge base pertaining to the chronic arthritides. The ARAMIS Project at SCIP is used by a number of institutions around the country via commercial leased lines to store and process their data. These institutions include the University of California School of Medicine, San Francisco and Los Angeles; The Phoenix Arthritis Center, Phoenix; The University of Cincinnati School of Medicine; The University of Pittsburgh School of Medicine; Kansas University; and The University of Saskatchewan. All of the rheumatologists at these sites have closely collaborated with the development of ARAMIS, and their interest in and use of the RX project is anticipated. We hasten to mention that we do not expect SUMEX to support the active use of RX as an on-going service to this extensive network af arthritis centers, but we would like to be able to allow the national centers to participate in the development of the arthritis knowledge base and to test that knowledge base on their own clinical data banks. E. A. Feigenbaum 216 Privileged Communication Section 9.1.7 RX Project B. Interactions with Other SUMEX-AIM Projects Several of the concepts incorporated into the design of the RX Project have been inspired by other SUMEX-AIM Projects. The RX knowledge base is similar to the Units Package of the MOLGEN PROJECT. The production rule inference mechanism used by us is similar to that in the MYCIN Project. Several programs developed by the MYCIN group are regularly used by RX. These include disk hash file facilities, text editing facilities, and miscellaneous LISP functions. Regular communication on programming details is facilitated by the on-line mail system. C. Critique of Resource Management: The SUMEX KI-10 has been severely overloaded for at least a year. Working in LISP is impossible during the day and is even difficult at times which were formerly low utilization times. This has forced us to rely increasingly on other local computation facilities. The SUMEX resource management, per se, has always been accessible and cooperative in trying to provide our project with adequate resources subject to prevailing constraints, ITI. RESEARCH PLANS The overall goal of the RX Project is to develop a computerized medical information system capable of accurately extracting medical knowledge pertaining to the therapy and evolution of chronic diseases from a database consisting of a collection of stored patient records. Goals for the year August, 1980 to July, 1981 have been detailed in section IC. above on research in progress. To summarize that section, our main short-term goal is to generalize and refine our methods for labeling and retrieving time-intervals or episodes from individual patient records and to generalize the class of hypotheses which the system is capable of analyzing. This requires further refinements in RX's algorithms for choosing and controlling for variables which may potentially confound an hypothesis of interest. Long-Range Goals: August, 1981 to July, 1986 There are two inter-related long-range goals of the RX Project: 1) automatic discovery of knowledge in a large time-oriented database and 2) provision of assistance to a clinician who is interested in testing a specific hypothesis. These tasks overlap to the extent that some of the algorithms used for discovery are also used in the process of testing an hypothesis. We hope to make these algorithms sufficiently robust that they will work over a broad range of hypotheses and over a broad spectrum of data distributions in the patient records. Privileged Communication 217 E. A. Feigenbaum RX Project Section 9.1.7. Justification for Continued Use of SUMEX Computerized clinical data banks possess great potential as tools for assessing the efficacy of new diagnostic and therapeutic modalities, for monitoring the quality of health care delivery, and for support of basic medical research. Because of this potential, many clinical data banks have recently been developed throughout the United States. However, once the initial problems of data acquisition, storage, and retrieval have been dealt with, there remains a set of comnlex problems inherent in the task of accurately inferring medical knowledge from a collection of observations in patient records. These probiems cancera the complexity of disease and outcome definitions, the complexity of time relationships, potential biases in compared subsets, and missing and outlying data. The major problem of medical data banking is in the reliable inference of medical knowledge from primary observational data. We see in the RX Project a method of solution to this problem through the utilization of knowledge engineering techniques from artificial intelligence. The RX Project, in providing this solution, will provide an important conceptual and technologic link to a large community of medical research groups involved in the treatment and study of the chronic arthritides throughout the United States and Canada, who are presently using the ARAMIS Data Bank through the SCIP facility via TELENET. Beyond the arthritis centers which we have mentioned in this report, the TOD (Time-Oriented Data Base) User Group involves a broad range of university and community medical institutions involved in the treatment of cancer, stroke, cardiovascular disease, nephrologic disease, and others. Through the RX Project, the opportunity will be provided to foster national collaborations with these research groups and to provide a major arena in which to demonstrate the utility of artificial intelligence to clinical medicine, SUMEX as a Resource To discuss SUMEX as a resource for program development, one need only compare it to the environment provided by our other resource, the IBM 370/168 installation at SCIP - the major computing resource at Stanford. Of the programs which we use daily on SUMEX -INTERLISP, MSG, TVEDIT, BBD, LINK- there is nothing even approaching equivalence on the 370, despite its huge user community. These programs greatly facilitate communication with other researchers in the SUMEX community, documentation of our programs, and the rapid interactive development of the programs themselves. The development of a program involving extensive symbolic processing and as large and complex as RX at the SCIP facility, would require a staff many times as large as ours. The SUMEX environment greatly increases the productive potential of a research group such as ours to the point where a large project like RX becomes feasible. E. A. Feigenbaum 218 Privileged Communication Section 9.1.7 RX Project Computation resources required by RX: Disk Allocation: RX requires the use of two large data files which need to be kept on- line: the patient database (DB) and the knowledge base (KB). In the course of testing a hypothesis several other files are used: inverted files, source files for statistical processing, LISP SYSOUT files, etc. Our current total disk allocation of 1500 pages for all RX group members has been just adequate. In the future, with anticipated expansions in numbers of patients and size of the KB, we intend to request an increase of our total allocation to 2000 pages. Programs: RX is written in INTER-LISP. To increase our useable address space, we actually use a stripped-down version prepared by William VanMelle of the MYCIN Project. To run statistical data RX calls SPSS in an inferior fork. The text editor, TVEDIT, is also called from an inferior exec fork. Other Computational Resources It is clear that the scope of potential application of the RX Project is large. Within the term of the SUMEX-AIM grant projected through July, 1986, we anticipate the involvement of several of the national ARAMIS collaborating institutions in developing and testing arthritis knowledge bases which reflect their own patient populations and therapeutic biases. The current SUMEX machine configuration will not be able to support this national interaction because the central processors of the KI-10 are already taxed to the limit. Ours is among the SUMEX groups which would greatly benefit by the addition of one or more PDP-10 compatible machines, which could provide support to our anticipated national user community. Another resource which would be highly desirable is a faster and more reliable means for transferring data interactively between SUMEX and the SCIP IBM 370. Our current method utilizes a 2400 baud line with transmission from SCIP to SUMEX only, and is fraught with a high error rate. The addition of a reliable local network facility would greatly facilitate our ability to transfer patient files from SCIP to SUMEX and to transfer statistical source matrices back to SCIP to be run on that machine. D. Recommendations for Resource Development: SUMEX is heavily loaded everyday and almost every evening. Program research is next to impossible during those periods. Program development would be greatly facilitated by the addition of any resources which lessened this loading: upgrading the current machine to a KL or adding core to decrease page swapping. Privileged Communication 219 E. A. Feigenbaum National AIM Projects Section 9,2 9.2 National AIM Projects The following group of projects is formally approved for access to the AIM aliquot of the SUMEX-AIM resource or the Rutgers-AIM resource. Their access is based on review by the AIM Advisory Group and approval by the AIM Executive Committee. E. A. Feigenbaum 220 Privileged Communication Section 9.2.1 Acquisition of Cognitive Procedures (ACT). 9.2.1 Acquisition of Cognitive Procedures (ACT) Acquisition of Cognitive Procedures (ACT) Dr. John Anderson Carnegie-Mellon University I. Summary of Research Program A. Project Rationale: To develop a production system that will serve as an interpreter of the active portion of an associative network. To model a range of cognitive tasks including memory tasks, inferential reasoning, language processing, and problem solving. To develop an induction system capable of acquiring cognitive procedures with a special emphasis on language acquisition and problem-solving skills. B. Medical relevance and collaboration: 1. The ACT model is a general model of cognition. It provides a useful. model of the development of and performance of the sorts of decision making that occur in medicine. 2. The ACT model also represents basic work in AI. It is in part an attempt to develop a self-organizing intelligent system. As such it is relevant to the goal of development of intelligent artificial aids in medicine. We have been evolving a collaborative relationship with James Greeno and Allan Lesgold at the University of Pittsburgh. They are applying ACT to modeting the acquisition of reading and problem solving skills. We have made ACT a guest system within SUMEX. ACT is currently at the state where it can be shipped to other INTERLISP facilities. We have received a number of inquiries about the ACT system. ACT is a system in a continual state of development but we periodically freeze versions of ACT which we maintain and make available to the national AI community. C. Highlights of Research Progress: This last year has seen developments in two main directions. We are completing developing and documenting a system (ACTF) that is capable of a relatively rich variety of cognitive learning and we are completing an application to the modelling of the acquisition of proof skills in high- school students. , Our ACTF system is a production system that operates in a semantic network data base. Our learning work has been focused on ways of increasing the power of production systems for performing various tasks. One class of learning mechanisms concern what we call knowledge compilation. This involves automatic mechanisms for creating productions Privileged Communication 221 E. A. Feigenbaum Acquisition of Cognitive Procedures (ACT) Section 9.2.1 that directly perform behavior that formerly required interpretative processing of knowledge in the semantic network. These compilation mechanisms also model the process by which human experts develop special purpose procedures to deal with the different types of problems that occur in their domain of expertise. Another class of learning mechanisms are concerned with tuning existing procedures so that they apply more appropriately. There are various mechanisms concerned with extending or generalizing the range of application of a procedure. In the past year we have been working at reducing these different generalization processes to a common partial matching process. In addition to generalization, tuning occurs in the ACTF system by means of discrimination and composition. Discrimination is a process for restricting the range of applicability of a production. Composition attempts to build macro-operators out of a series of productions. The third direction of our learning work has been concerned with developing a flexible strength-based set of conflict resolution rules. Here we are concerned with modelling the gradual improvement seen in human cognitive skills and also providing the system with the resilience so that it can recover from noise and changes in environmental contingencies. A manual has been under construction describing these changes. We plan to have a final version of the ACTF system by the end of May and the manual should be finished by the end of the summer. We have been applying this theory in detail to a simulation of how Students acquire proof skills in geometry. We have a more or less thorough analysis of how students learn new postulates of geometry; initially use these postulates in an interpretative fashion, integrating them with prior knowledge; how they compile special purpose procedures that directly apply this knowledge to proof generation; and how these procedures become tuned with practice. This application has provided strong evidence for most of the learning developments in the ACT system. It has also forced us to develop formalisms for how planning and problem-solving should be structured within a production-system framework. D. List of project publications: [1] Anderson, J.R. Language, Memory, and Thought. Hillsdale, N.Jd.: L. Eribaum, Assoc., 1976. [2] Kline, P.J. & Anderson, J.R. The ACTE User's Manual, 1976. [3] Anderson, J.R., Kline, P. & Lewis, C. Language processing by production systems. In P. Carpenter and M, Just (Eds.). Cognitive Processes in Comprehension. L. Erlbaum Assoc., 1977. [4] Anderson, J.R. Induction of augmented transition networks. Cognitive science, 1977, 125-157. E. A. Feigenbaum 222 Privileged Communication Section 9.2.1 Acquisition of Cognitive Procedures (ACT) [5] Anderson, J.R. & Kline, P. Design of a production system. Paper presented at the Workshop on Pattern-Directed Inference Systems, Hawaii, May 23-27, 1977. [6] Anderson, J.R. Computer simulation of a language acquisition system: A second report. In D. LaBerge and S.J. Samuels (Eds.). Perception and Comprehension. Hillsdale, N.J.: L. Erlbaum Assoc., 1978. [7] Anderson, J.R., Kline, P.J., & Beasley, C.M. A theory of the acquisition of cognitive skills. In G.H. Bower (Ed.). Learning and Motivation, Vol. 13. New York: Academic Press, 1979. [8] Anderson, J.R., Kline, P.J., & Beasley, C.M. Complex Learning. In R. Snow, P.A. Frederico, & W. Montague (Eds.). Aptitude, Learning, -an Instruction: Cognitive Processes Analyses. Hillsdale, N.J.: Lawrence Erlbaum Assoc., 1980. [9] Anderson, J.R. & Kline, P.J. A Jearning system and its psychological implications. To appear in the Proceedings of the Sixth International Joint Conference on Artificial Intelligence, 1979. [10] Reder, L.M. & Anderson, J.R. Use of thematic information to speed search of semantic nets. Proceedings of the Sixth International Joint Conference on Artificial Intelligence, 1979, 708-710. [11] Neves, D.M. & Anderson, J.R. Becoming expert at a cognitive skill. To appear in J.R. Anderson (Ed.), Cognitive Skills and their Acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1981. [12] Anderson, J.R., Greeno, J.G., Kline, P.J., & Neves, D.M. Learning to Plan in Geometry. To appear in J.R. Anderson (Ed.), Cognitive Skills and their Acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1981. E. Funding Support: A Model for Procedural Learning, John R. Anderson, Principal Investigator, Office of Naval Research (N00014-77-C-0242) $175,000 September 1, 1978 - September 30, 1980 II. Interaction With the SUMEX-AIM Resource A. & B. Collaborations, interactions, and sharing of programs via SUMEX. We have received and answered many inquiries about the ACT system over the ARPANET. This involves sending documentations, papers, and copies of programs, The most extensive collaboration has been with Greeno and Lesgold who are also on SUMEX (see the report of the Simulation of Comprehension Processes project). There is an ongoing effort to assist them in their research. Feedback from their work is helping us with system design. Privileged Communication 223 E. A. Feigenbaum Acquisition of Cognitive Procedures (ACT) . Section 9.2.1 We find the SUMEX-AIM workshops (those that we could manage to attend) ideal vehicles for updating ourselves on the field and for getting to talk to colleagues about aspects of their work of importance to us. Due to memory space problems encountered by ACT we expect that soon we will need to make use of the smaller version of INTERLISP developed at SUMEX for use in the CONGEN program. C. Critique of resource management. The SUMEX-AIM resource has been well suited for the needs of our project. We have made the most extensive use of the INTERLISP facilities and the facilities for communication on the ARPANET. We have found the SUMEX personnel extremely helpful both in terms of responding to our immediate emergencies and in providing advice helpful to the long-range progress of the project. Despite the fact that we are not located at Stanford, we have not encountered any serious difficulties in using the SUMEX system; in fact, there are real advantages in being in the Eastern time zone where we can take advantage of the low load on the system during the morning hours. We have been able to get a great deal of work done during these hours and try to save our computer-intensive work for this time. Two location changes by the ACT project (from Michigan to Yale in the summer of 1976 and from Yale to Carnegie-Mellon in the summer of 1978) have demonstrated another advantage of working on SUMEX: In both cases we were back to work on SUMEX the day after our arrival. III. Research Plans (8/80-7/86) A. Project goats and plans: Our long-range goals are: (1) Continued development of the ACT System; (2) Application of the system to modeling of various cognitive processes; (3) Dissemination of the ACT system to the national AI community. Our more immediate goals (for the next year or two) involve application of the ACTF system, whose development we have finished, to three domains. First, we hope to complete the development of a simulation of geometry learning in the system. Second, we are starting to embark on an effort to model the acquisition of programming skills in LISP. This will serve as another test of the ideas that we have developed in geometry about learning and planning. The third application will be the modelling first language acquisition. This is a more radical departure from our work in problem-solving and so will provide a rather different test of the learning theory. E. A. Feigenbaum 224 Privileged Communication Section 9.2.1 Acquisition of Cognitive Procedures (ACT) B. Justification for continued use of SUMEX: Our goal for the ACT system is that it should serve as a ready-made "programming language" available to members of the cognitive science community for assembling psychologically-accurate simulations of a wide range of cognitive processes. Our intention and ability to provide such a resource justifies our use of the SUMEX facility. This facility is designed expressly for the purpose of developing and supporting such national AI resources and is, in this regard, clearly superior to the facilities we have available locally from the Carnegie-Mellon computer science department. Among the most important SUMEX advantages are the availability of INTERLISP on a machine accessible by either the ARPANET or TYMNET and the existence of a GUEST login. It appears that, at least for the time being, ACT has no hope of being a national resource unless it resides at SUMEX and, given the local unavailability of a network- accessible INTERLISP, it would even be very difficult to shift any Significant portion of our development work from SUMEX to CMU. C. Needs and plans for other computational resources Carnegie-Mellon's plans to begin upgrading its PDP-10 hardware to emerging state-of-the-art machines (VAX, LISP machines, etc.) promises to provide a excellent resource eventually, and we hope to have access to that resource as it develops. However, given that a considerable amount of software development will be required, a sophisticated LISP system such as INTERLISP is not likely to be available on this hardware in the near future. D. Comments and suggestions for future resource goals: We are beginning to feel squeezed by various limitations of the SUMEX facility. The problem of peak load is quite serious. We have also been Struggling with the address limitations of the current INTERLISP which is made more grievous by the amount of space INTERLISP requires. The computation time and address space limitations have meant that we have not been able to pursue certain projects that we would have otherwise. We applaud any efforts to increased computational power, to increase the address space of INTERLISP (e.g. VAXes), or to create significantly more space efficient versions of INTERLISP., Privileged Communication 225 E. A. Feigenbaum