DETAILED PROGRESS REPORT Section 1.3.2.12

Figure 12. Average Diurnal Loading (3/77): Percent Overhead
Total Day (Low= 24.4, Ave= 46.7, High= 63.9)
Prime Time (Low= 26.3, Ave= 52.5, High= 63.9)
Non Prime Time (Low= 24.4, Ave= 39.5, High= 50.3)
[Plot: percent overhead versus time of day (PAC TIME, hours 0 to 24)]

Figure 13. Average Diurnal Loading (3/77): Balance Set - Jobs in Core
Total Day (Low= .7, Ave= 2.4, High= 4.9)
Prime Time (Low= .7, Ave= 3.1, High= 4.9)
Non Prime Time (Low= .8, Ave= 1.6, High= 2.8)
[Plot: jobs in core versus time of day (PAC TIME, hours 0 to 24)]

Privileged Communication 57 J. Lederberg

Figure 14. Average Diurnal Loading (3/77): Runnable Jobs
Total Day (Low= .7), Prime Time (Low= .7), Non Prime Time (Low= .8); the Ave and High values are illegible in the source.
[Plot: runnable jobs versus time of day (PAC TIME)]

1.3.2.13 NETWORK USAGE STATISTICS

NETWORK USAGE PLOTS

The plots in Figure 15 show the major billing components for SUMEX-AIM TYMNET usage. These include the total connect time for terminals coming into SUMEX and the total number of characters transmitted over the net. The ratio of characters received at SUMEX to characters sent to the terminal is about 1:12 over our period of usage. Also shown for recent months is a plot of ARPANET connect time, which tracks the corresponding data for TYMNET usage fairly closely. No data for "character" transmission is available for ARPANET since file transfers and terminal traffic use different byte sizes and these data are not resolved and maintained for the ARPANET.

Figure 15. TYMNET and ARPANET Usage Data
[Plots: Connect Time (Hrs) by month for TYMNET and ARPANET, and Characters Transmitted (scaled by a power of ten) by month for TYMNET, 1974 through 1977]

1.3.2.14 PUBLICATIONS

The following publications by the SUMEX staff include papers describing the SUMEX-AIM resource and on-going research as well as documentation of system and program developments. Publications for individual collaborating projects are detailed in their respective reports (see Section 6 on page 44 in Book II).
[1] Carhart, R.E., Johnson, S.M., Smith, D.H., Buchanan, B.G., Dromey, R.G., and Lederberg, J., "Networking and a Collaborative Research Community: a Case Study Using the DENDRAL Programs", ACS Symposium Series, Number 19, COMPUTER NETWORKING AND CHEMISTRY, Peter Lykos (Editor), 1975.

[2] Levinthal, E.C., Carhart, R.E., Johnson, S.M., and Lederberg, J., "When Computers Talk to Computers", Industrial Research, November 1975.

[3] Wilcox, C.R., "MAINSAIL - A Machine-Independent Programming System", Proceedings of the DEC Users Society, Vol 2, No 4, Spring 1976.

Mr. Clark Wilcox also chaired the session on "Languages for Portability" at the DECUS DECsystem10 Spring '76 Symposium.

In addition, as reported earlier, a substantial effort has gone into developing, upgrading, and extending documentation about the SUMEX-AIM resource, the SUMEX-TENEX system, the many subsystems available to users, and MAINSAIL. These efforts include a number of major documents (such as SOS, PUB, and TENEX-SAIL manuals) as well as a much larger number of document upgrades, user information and introductory notes, an ARPANET Resource Handbook entry, and policy guidelines (see Appendix VI, and Appendix VII in Book ITI). Publications for individual user projects are summarized in the respective reports (see Section 6 in Book II).

1.3.2.15 RESOURCE STAFFING HISTORY

PROFESSIONAL PERSONNEL (YEARS 01-04)

Name                    Title of Position
Lederberg, Joshua       Principal Investigator
Rindfleisch, Thomas     Facility Manager
Levinthal, Elliott      AIM Liaison
Cower, Richard          System Programmer
Crossland, James        System Programmer
Gilmurray, Frank        System Programmer
Heathman, Michael       System Programmer
Lieb, James             System Programmer
Reiss, Steven           System Programmer
Sweer, Andrew           System Programmer
Tucker, Robert          System Programmer
Schulz, Rainer          System Programmer - IMSSS
Roberts, Ronald         System Programmer - IMSSS
  (ditto row, illegible in the source)
Smith, Robert           System Programmer - IMSSS
Quam, Lynn              Syst. Prog. - Cardiology
Johnson, Suzanne        Applications Programmer
Snito, Nancy            Applications Programmer
Kahler, Richard         User Consultant
Jackson, Phillip        User Support Specialist
Wilcox, Clark           Syst. Prog. - Res. Asst.
Veizades, Nicholas      Electronics Engineer - IRL
Nozaki, Thomas          Electronics Engineer - IRL

[The two remaining columns were separated from the rows above in the source. They are listed here in source order; the source gives 22 effort values and 23 appointment periods, so the row-by-row pairing is not fully recoverable.]

% of Effort (*): 10, 100, 22, 100, 100, 100, 100, 100, 100, 100, 100, 61, 50, 52, 50, 50, 100, 100, 100, 100, 63, 50

Period of Appointment:
10/1/73 - present
10/1/73 - present
12/1/73 - present
6/24/74 - 6/15/77
8/6/74 - 1/16/76
6/1/77 (tent. start)
10/1/73 - 8/15/75
7/1/74 - 11/14/75
10/1/73 - 7/31/74
1/19/76 - present
6/1/77 (tent. start)
2/1/74 - present
2/1/74 - 7/31/74
5/1/75 - 7/31/75
5/1/75 - 7/31/75
3/1/76 - 5/31/76
7/22/74 - present
3/25/74 - 8/20/76
12/1/75 - present
11/18/74 - 7/28/75
3/25/74 - present
10/1/73 - present
5/1/74 - present

(*) The figures shown give the % of effort during the respective periods of employment.

2 SPECIFIC AIMS

The following outlines the specific objectives of the SUMEX-AIM resource during the follow-on five year period. Note that these objectives cover only the resource nucleus; objectives for individual collaborating projects are discussed in their respective reports (see Section 6 on page 41 in Book II). We break our research aims into the categories 1) resource operations, 2) training and education, and 3) core research.
2.1 RESOURCE OPERATIONS AIMS

The broad objectives remain to provide an effective computing facility with extensive network access to support the community of projects developing AI applications in medicine. This goal includes the limited dissemination of these programs to outside research groups to provide the necessary feedback from actual research applications for effective program development. Specific aims include:

1) Continue the building of a community of projects applying AI techniques to medical problems, including improving mechanisms for inter- and intra-group collaborations and communications. We plan to extend the existing AIM community management structure to accommodate justified growth in computing resources at other sites, including a close collaboration between nodes on such a "resource network" and a meaningful division of responsibilities and regional expertise. To minimize administrative barriers to the community-oriented goals of SUMEX-AIM, we plan to retain the current user funding arrangements; user projects will fund their own manpower and local needs and will actively contribute their special expertise to the SUMEX-AIM community in return for an allocation of computing resources under the control of the AIM management committee structure. There will be no "fee for service" charges for community members. While AI is our defining theme, we may entertain exceptional applications justified by some other unique feature of SUMEX-AIM essential for important biomedical research.

2) Provide an effective computing resource to support the development and research dissemination of large and complex computer programs for a broad range of medical AI applications. This will include the continued development and refinement of the existing resource and the development and implementation of a plan for the upgrade of current hardware to the emerging next generation when justified by community, technical, and economic advantages.
3) Provide effective and geographically accessible network communication facilities to the SUMEX-AIM community for effective remote collaborations and to allow external users to experiment with available AI programs. We also plan to demonstrate the utility of network communications for scientific collaboration, in selected cases which do not interfere with our primary mission, to groups in other areas of computer science related to medicine. The ONET collaboration (see the Rutgers Resource progress report on page 144) illustrates the value of these facilities apart from the AI programs themselves.

2.2 TRAINING AND EDUCATION AIMS

Our goals during the follow-on period for assisting new and established users of the SUMEX-AIM resource are a continuation of those adopted for the first grant term. Collaborating projects will provide their own manpower and expertise for the development and dissemination of their AI programs. The SUMEX resource will provide community-wide support and will work to make resource goals and AI performance programs known and available to appropriate medical scientists. Specific aims include:

1) Provide documentation and assistance in interfacing users to resource facilities and programs. We will continue to exploit particular areas of expertise within the community for developing pilot efforts in new application areas.

2) Continue to allocate "collaborative linkage" funds to qualifying new and pilot projects to provide for communications and terminal support pending formal approval and funding of their projects. These funds are allocated in cooperation with the AIM Executive Committee reviews of prospective user projects.

3) Provide support for a "visiting scientist" position to allow prospective qualified SUMEX-AIM project investigators or users to spend a term in close contact with on-going research work.
The selection of appropriate candidates for this rotating position would be made in cooperation with the AIM Executive Committee.

4) Continue to support AIM Workshop activities in collaboration with the Rutgers Computers in Biomedicine resource.

2.3 CORE RESEARCH AIMS

Our core research efforts will emphasize the generalization and documentation of tools and techniques available for AI research and applications and the examination of alternative approaches for implementing and exporting large and complex AI performance programs. These efforts will be important community-wide to facilitate the investigation of new application areas and to meet the demand, beyond SUMEX-AIM capacity, for external users to be able to run developed AI programs conveniently. Fortunately, we have independent funding from various agencies for research activities that overlap the core-research opportunity, e.g., CONGEN, MOLGEN, Heuristic Programming Project, and DENDRAL mass spectrometry. Specific aims include:

1) Continue to encourage community efforts at organizing and developing AI techniques by supporting projects such as the AI Handbook, special language developments (e.g., KRL), and other projects community members may propose to contribute.

2) Explore the generalizations of AI tools for knowledge acquisition, representation, and utilization; reasoning in the presence of uncertainty; strategy planning; and explanations of reasoning pathways. This effort will attempt to extract and generalize some of the best concepts and functional capabilities developed in the context of particular projects (e.g., DENDRAL, MYCIN, MOLGEN, etc.). The objective is to evolve a body of software packages that can be used to build future knowledge-based systems more efficaciously and to explore other medical AI applications.
3) Explore AI software implementation and export mechanisms such as network communication systems, machine-independent languages, and special purpose computer systems. This will include the continued development of the MAINSAIL system and the investigation of microprogrammable machines specialized for target languages or satellite general purpose machines capable of running existing systems.

Even the present level of computer capacity is not sufficient to meet the demands of a number of our projects. The DENDRAL CONGEN program is a good example where the potential for effective application to real biochemical structure determination problems is close but it simply takes too long to run problems that are really interesting. Therefore new approaches to computing are needed that may involve parallel processing, multiple small machines, or new developments from commercial vendors such as very much cheaper analogs of the PDP-10 that could be run in a more nearly dedicated mode.

3 METHODS OF PROCEDURE

This section details our plans for SUMEX-AIM goals during the next five year period. As indicated earlier, objectives and plans for individual collaborating projects are discussed in Section 6 on page 41 (see Book II).

In general SUMEX-AIM will retain its community orientation in formulating and implementing a resource for AI research in medicine. We have had good success at integrating the tools and expertise of on-going active research efforts where possible and building on these where extensions or innovations are necessary. This orientation has proved to be an effective way to build the current facility and community and we expect it to be equally productive during the next period. We have assembled a growing community of projects which contribute to SUMEX-AIM resource goals and have at the same time come to depend on SUMEX for computing support and as a means of interacting with collaborators.
We plan to continue our commitment to providing effective support to this community of projects.

This opportunistic approach also places constraints on synchronizing particular advances with our community needs. We are presently facing demands for increased computing resources as well as for effective methods for exporting mature AI performance programs. At the same time a new generation of hardware and firmware systems is just becoming available. These will have a large impact as a means to meet our goals, providing economic and technical advantages while minimizing redesign and reprogramming requirements. The announcement of a new generation of general purpose machines that might run AI software using existing operating systems and language support with substantially reduced capital investment is anticipated to be one to two years off. Such systems could be used to export software packages intact or to incrementally augment central resources like SUMEX. A similar situation exists for special purpose microprogrammable machines which can be tailored to particular language needs for increased throughput and efficiency. We aim to respond in a timely fashion to take advantage of this emerging technology, but until concrete details are publicly available, we can only describe our basic objectives and general design possibilities.

Thus the following description of research plans concentrates on software issues in planning for assimilation of the new technologies, with the expectation that hardware announcements one to two years hence will impel careful reconsiderations of our strategies. Detailed budgets for computing hardware conversions are only approximate pending more detailed information on pricing. Our approach is to describe the research concept and gross estimated funding required, for review of these objectives at this time.
We will further refine and elaborate the details of these plans during the first one to two years of the grant and submit them through the AIM Executive and Advisory Committees and the NIH Biotechnology Resources Program Office for approval prior to implementation.

3.1 RESOURCE OPERATIONS PLANS

3.1.1 SYSTEM HARDWARE AND MONITOR PLANS

As discussed in the progress section and supported by collaborating project reports, we have implemented an effective computing resource to support AI applications to medical research. We have augmented the present system to increase its effective capacity as far as we economically can to meet community needs. We do not propose any substantial changes either in scope of the existing resource or in its capacity.

Other members of our community have proposals pending for other regional centers which may be justified on their own merits and the needs of the AIM community. We support the development of such regional expertise and specialization where justified, which may allow a more coherent adaptation of a particular facility's resources to the needs of a subset of the AIM community. For example, a substantial group of biochemical structure analysis projects has grown up (DENDRAL, Chemical Synthesis Project, Protein Structure Project, and Molecular Genetics Project) as well as a group of medical diagnostic projects (MYCIN, Rutgers ONET, and INTERNIST as well as several pilot efforts). If regionalization becomes indicated, AIM facilities could be reoriented to serve the special needs of these research and target communities via separate systems, while maintaining close administrative and informational ties. We cannot predict the funding support such new facilities might receive but we will cooperate fully in getting them started and in assuring effective management for the benefit of the overall AIM community.
Our own facility has operated at capacity since early in our present grant term owing to the continuing maturing of on-going projects and the recruitment of new users, despite the periodic augmentation. As indicated earlier, our present hardware cannot be augmented further without upgrades to major mainframe and memory components. This should be done only after optimizing with respect to available new systems which are scheduled for announcement in the next year or so. There have been a number of recent relevant announcements but these machines have not yet been of a capacity or economic advantage to warrant immediate upgrade (indeed our decision to develop the dual KI-10 processor system was made on the basis of optimum cost-effectiveness within current technology and budgets). Furthermore, these systems are being sold packaged with relatively expensive memory and file storage, and future releases may allow a more cost-effective mix of components from multiple vendors.

Our hardware design is now approximately five to six years old and will be twelve years old by the end of the follow-on 5 year grant term. The economics and technical performance of the newer systems, the evolving software gaps from inherent backward incompatibilities, and the reliability and maintainability of our existing equipment will pose new opportunities and problems. They may point to a strong rationale for an upgrade of the SUMEX-AIM system to meet the needs of the AI community we are supporting. The costs of this new generation of hardware will represent a progressively smaller part of the overall effort, compared to human resource inputs, especially if user participation is fairly weighted.

The TOPS-20 system DEC is currently marketing is derived from TENEX, but DEC has already made changes which cause incompatibilities with earlier systems.
Many of these are in the direction of improved system performance (file system redundancy, system call enhancements, etc.) while others are of less obvious value (file naming conventions, message file formats, etc.). Whatever the reason, DEC's TOPS-20 system will likely dominate future system purchases and will increasingly diverge from ours. This causes a larger burden in our pursuit of software sharing and will affect the ease with which we can cooperate with other potential AIM network nodes. To avoid effective isolation, we will have to maintain effective compatibility. DEC has no plans for making TOPS-20 run on KI-10's and it is not likely others will undertake this within the currently strict licensing restrictions and DEC's motivations to sell KL-10's.

Our apparent alternatives are to upgrade to some KL-"n" system when this product line matures and fills out so a proper choice can be made, or to progressively modify our current system to remain as compatible as possible. A hardware conversion would likely cost at least $500,000 (based on current prices, but presumably much less as time passes) while system modifications for compatibility will entail 1-2 additional people per year in software effort. The cost of the latter approach must also include a measure of user community investment to circumvent unavoidable residual incompatibilities. The choice for optimum return will depend on the timing of major price declines for a given hardware capability, and on the way that cognate facilities evolve and participate in sharing software burdens. We do not expect these trade-offs to be clear before 1979.

We tentatively propose to expend the man-effort required to maintain compatibility between our existing system and TOPS-20 so long as this remains tenable. We budget initially one person for this purpose and add an additional programmer at the middle of the grant term.
If this approach proves too costly and ineffective, we may propose reallocating these funds for a hardware conversion. Such a contingency would be thoroughly reviewed with AIM management committees and the NIH-BRP before finalizing a plan or requesting additional funding.

In the meantime we plan to reevaluate the performance of our existing system to wring out any remaining inefficiencies for more effective community support. The dual processor system has stabilized nicely and with the memory augmentation we are implementing, we will have addressed all of the obvious sources of inefficiency. We will re-review the detailed operation of the facility to try to uncover remaining areas of cleanup. Recent measurements show that a high percentage of available time (80-90% in one recent test) is spent in various system routines which provide the rich set of monitor calls available through the TENEX system. It is therefore important to optimize the efficiency of the most widely used calls. We also plan as part of this investigation to examine alternative strategies for managing memory allocations to running jobs. This will include attempting to minimize paging overhead by preloading job working sets, so that swapping I/O is better utilized and overlapped with other activities rather than waiting for page faults to read in pages on demand. We will also consider giving programs some control over working set definition.

3.1.2 COMMUNICATION NETWORK PLANS

Networks remain centrally important to the research goals of SUMEX-AIM. We have had good success at meeting the geographical needs of the community during the early phases through our ARPANET and TYMNET connections. The major problems focus on terminal interaction delays through relatively slow or congested network facilities.
In the next year or so TYMNET will be announcing their upgraded network (TYMNET II) which may offer additional advantages for our community such as higher terminal speeds, more dynamic terminal routing, and inter-host communications. If additional AIM servers are implemented, it will be important to coordinate their network access with that of SUMEX for effective user interactions and system collaborations. During this same period ARPANET may be undergoing similar redesigns and possible further specialization to defense needs. In parallel, the TELENET facilities are evolving rapidly, and whereas they offer a symmetric service for file transfer and terminal traffic, character delays are currently too high to warrant connecting immediately.

We expect to retain our present connections over the early phases of the follow-on grant and to evaluate new upgrades as they become available. The specific goals for this upgrade will be improved terminal support and effective file transfer mechanisms available community-wide, particularly to interact with other AIM nodes.

3.1.3 SOFTWARE SUPPORT PLANS

We will continue to maintain the system, language, and utility support software on our system at the most current release levels, including up-to-date documentation. We will also be extending the facilities available to users where appropriate, drawing upon other community developments where possible. We rely heavily on the needs of the user community to direct system software development efforts.

Two specific areas we plan to pursue are extensions to the bulletin board system and improved facilities for managing and organizing collections of related information such as program libraries and documentation, bulletin board or message files, collections of user profile information, etc.
Bulletin board extensions will include improved facilities for searching for relevant information, associating a given bulletin with multiple topic labels, and more effectively apprising users of new information of interest. We are also examining extensions of the TENEX file system syntax and design to allow better logical organization and access to groups of file information. This may include facilities to define a hierarchical data structure, a "file system within a file", to name and manipulate logically related but independent pieces of information. A number of programs use ad hoc directories to access segments of information. We would hope to better standardize and improve such tools.

3.1.4 COMMUNITY MANAGEMENT PLANS

We plan to retain the current management structure that has worked out well for the recruitment and review of new projects and the guiding of resource policy formation. We expect the Executive and Advisory Committees to play a continuing important role in advising on priorities for facility evolution and on-going community development efforts such as MAINSAIL in addition to their recruitment efforts. The composition of the Executive Committee will grow as needed to assure representation of major user groups and medical and computer science applications areas. The Advisory Group membership rotates, with each member serving one to two years, and spans both medical and computer science research expertise. We expect to maintain this policy.

The AIM workshops under the Rutgers resource have served a valuable function in bringing community members and prospective users together. We will continue to support this effort in terms of the Stanford community participation and providing a computing base for workshop demonstrations and communications.
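As an illustration of the "file system within a file" idea raised under Section 3.1.3, the following minimal sketch packs several named pieces of information into a single container with an internal directory. It is an illustration only and makes no claim about the TENEX design: the container layout (a 4-byte length, a JSON directory, then the data region) and the function names pack_container, read_entry, and list_entries are all hypothetical choices made here for brevity.

```python
import json
import struct


def pack_container(entries):
    """Pack {path: bytes} into one byte string.

    Hypothetical layout: 4-byte big-endian directory length,
    JSON directory {path: [offset, length]}, then the data region.
    """
    directory = {}
    blob = bytearray()
    for path in sorted(entries):
        data = entries[path]
        directory[path] = [len(blob), len(data)]  # position inside data region
        blob.extend(data)
    dir_bytes = json.dumps(directory).encode("utf-8")
    return struct.pack(">I", len(dir_bytes)) + dir_bytes + bytes(blob)


def read_directory(container):
    """Return (directory, offset at which the data region starts)."""
    (dir_len,) = struct.unpack(">I", container[:4])
    directory = json.loads(container[4:4 + dir_len].decode("utf-8"))
    return directory, 4 + dir_len


def read_entry(container, path):
    """Fetch one named piece without touching the others."""
    directory, base = read_directory(container)
    offset, length = directory[path]
    return container[base + offset:base + offset + length]


def list_entries(container, prefix=""):
    """List names under a hierarchical prefix such as 'docs/'."""
    directory, _ = read_directory(container)
    return sorted(p for p in directory if p.startswith(prefix))
```

A program library or a set of help texts could then be addressed by hierarchical names such as "docs/help/login" inside one file, in place of the ad hoc directories mentioned above.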
3.2 TRAINING AND EDUCATION PLANS

We have an on-going commitment, within the constraints of our staff size, to maintain a high level of documentation of the evolving software support on the SUMEX-AIM system and to provide user help facilities such as the HELP and Bulletin Board systems. These latter aids are the best way we can assist resource users to find the information they need when they need it to solve access problems. Since much of our community is geographically remote from our machine, these on-line aids are indispensable for self help. We will also provide on-line personal assistance to users within the capacity of available staff through the SNDMSG and LINK facilities.

We allocate funds in our budget to continue the "collaborative linkage" support initiated during the first term of the SUMEX-AIM grant. These funds are allocated under Executive Committee authorization for terminal and communications support to help get new users and pilot projects started.

We also have requested support for a "visiting scientist" position which will allow selected prospective investigators to gain first hand experience by visiting on-going projects such as those at Stanford. We feel this can serve an important role in catalyzing the development of new application areas and in disseminating the AI programs and techniques developed within the SUMEX-AIM community. The selection of appropriate individuals will be coordinated with the AIM committees as well.

Finally, we will continue to actively support the AIM workshop series in terms of planning assistance, participation in program presentations and discussions, and providing a computing base for AI program demonstrations and experimentation.

3.3 CORE RESEARCH PLANS

3.3.1 GENERALIZATION OF AI TECHNIQUES

The SUMEX-AIM facilities have made it possible to explore many of the frontiers of Artificial Intelligence research within the context of specific systems of medical relevance. Among those issues are the acquisition, representation and utilization of knowledge (both formal and judgmental), reasoning under uncertainty, explanation of a program's reasoning steps, and strategy planning. During the next period we wish to extract some of the best concepts and programming techniques from the specific programming systems, demonstrate their generality by incorporating them into other working programs, and design and implement packages which can be used to construct other high performance, knowledge-based systems.

The five projects described below are proposed as basic core research in support of the various AIM community projects applying the techniques of AI research to biomedical problems. References for this material can be found on page 76. Because these projects are extensions of on-going work, we are able to generalize from existing programs without requesting support for maintenance or development of the programs themselves. This is another example of the synergistic community interactions of the SUMEX-AIM resource.

3.3.1.1 DESIGN OF KNOWLEDGE-BASED CONSULTATION SYSTEMS

Objective

Recent work has suggested that one key to the creation of intelligent systems is the incorporation in programs of large amounts of task-specific knowledge. We intend to develop (i) methods of using large stores of expert knowledge as a foundation for computer-based reasoning, and (ii) methods of facilitating the knowledge transfer from human experts to computer programs.
We believe that this will lead to principles that may help turn the art of building large systems into more of a science, and thus aid other investigators who are building large knowledge-based systems. To do this, we will work on a number of problems involving knowledge representation, accumulation, management, and use, in the context of a software "laboratory" designed to facilitate the construction and use of large knowledge bases.

Motivation

Some of the earliest work in artificial intelligence centered around the attempts to create generalized problem solvers. Work on programs like GPS [Newell72] and theorem proving [Nilsson71], for instance, was inspired by the apparent generality of human intelligence and motivated by the belief that it might prove possible to develop a single program applicable to all (or most) problems. While this early work demonstrated that there was a large body of useful general purpose techniques (such as problem decomposition into subgoals, and heuristic search in its many forms), these techniques did not by themselves offer sufficient power for high performance. Recent work has instead focussed on the incorporation of large amounts of task-specific knowledge in what have been called "knowledge-based" systems. Rather than non-specific problem solving power, knowledge-based systems have emphasized high performance based on the accumulation of large amounts of knowledge about a single domain.

A second successful focus in work on intelligent systems has been the emphasis on the utility of solving "real world" problems, rather than artificial problems fabricated in simplified domains. This is motivated by the belief that artificial problems may prove in the long run to be more a diversion than a foundation for further work, and by the belief that the field has developed sufficiently to provide techniques that can aid working scientists.
While artificial problems may serve to isolate and illustrate selected aspects of a task, solutions developed for those selected aspects often do not generalize well to the complete problem.

There are numerous current examples of successful systems embodying both of these trends, systems which apply task-specific knowledge to real world problems. They include efforts at symbolic manipulation of algebraic expressions [Macsyma74], speech understanding [Lesser74], chemical inference [Buchanan71], and interactive consultants in a few specific areas [Pople75, Shortliffe75].

While all of these systems display an encouraging level of performance, however, two fundamental problems remain. First, assembling the knowledge base for each of these is a difficult, continuous task that has in most cases extended over several years. Second, the result of this effort is typically a system with an impressive level of performance, but only within a sharply limited domain of application. High performance has been achieved at the cost of generality and man-years of work in knowledge base construction.

But if programs require large stores of knowledge for high performance, can we take a step back and discover powerful and broadly applicable techniques for accomplishing this transfer of knowledge? That is, can we discover ways of facilitating the communication, management and use of large amounts of task-specific knowledge? The result would be an intelligent system whose generality arose from access to the appropriate human experts, and whose power was based on the store of knowledge it acquired from them.

Two central themes of the proposed work are facilitating knowledge base construction and improving the generality of the reasoning programs that use the knowledge base.
We intend to employ a computer system based on broadly applicable techniques for knowledge encoding and use, and couple it with powerful techniques for accomplishing the transfer of knowledge from human experts to computer programs.

The foundation for the computer system will be provided by the domain independent core of the Mycin system [Shortliffe75, Davis77]. This will be the basis for a software "laboratory" in which we can examine the relevant issues of knowledge representation, accumulation, management, and use. By setting this work in the context of a specific, existing body of software, a number of very general issues become focussed into specific questions. Since the program that constitutes our "laboratory" has been demonstrated to have a strong degree of domain independence, the results of this work will be widely applicable.

This should produce a new form of generality. Unlike GPS, we do not offer one program which can solve problems in any domain. Rather, we offer the foundation for a system, along with a methodology for instantiating that system in any one specific domain. The foundation and methodology provide a framework for the expression, management, and use of domain specific knowledge, to make this instantiation task a reasonable one. It is there in the foundation and the methodology that our generality lies, not in the final performance program which results.

3.3.1.2 ATTEMPT TO GENERALIZE (AGE) PACKAGE

The objective of this research is to isolate inference, control and representation techniques from previous knowledge-based programs; reprogram them for domain independence; write a rule-based interface that will help a user understand what the package offers and how to use the modules; and make the package available to SUMEX users, other research groups engaged in knowledge-based systems development, and the general scientific community.
Detailed Discussion:

The goal of this new effort is to construct a computer program to facilitate the building of knowledge-based systems. The design and implementation of the program will be based primarily on the experience gained in building knowledge-based systems at the Heuristic Programming Project in the last decade. The programs that have been built are: DENDRAL [Buchanan71], meta-DENDRAL [Buchanan72], MYCIN [Shortliffe76], AM [Lenat76], HASP [Nii77], Protein Structure Modeler [Engelmore77], and MOLGEN [Stefik77] (the latter two currently under development). Initially, the AGE program will embody methods used in our programs. However, the long-range objective is to integrate methods and techniques developed at other A.I. laboratories.

The final product is to be a collection of useful "building-block" subprograms, combined with a knowledge-based front-end that will assist a user in constructing knowledge-based programs. It is hoped that AGE can speed up this process and facilitate transfer of the technology by: (1) packaging common AI software tools so that they do not need to be reprogrammed for every problem; and (2) helping people who are not knowledge-engineering specialists to write knowledge-based programs.

Two Specific Research Activities of the AGE Effort are:

1. The isolation of techniques used in knowledge-based systems. It has always been difficult to determine if a particular problem-solving method used in a knowledge-based program is "special" to a particular domain or whether it generalizes easily to other domains. In the currently existing knowledge-based programs the domain-specific knowledge and the manipulation of such knowledge using AI techniques are often so closely coupled that it is difficult to make use of the programs for other domains. We need to isolate the AI techniques that are general to determine precisely the conditions for their use.

2.
Guiding users in the application of these techniques. Once the various techniques are isolated and programmed for use, an "intelligent front end" is needed to guide users in their application. Initially, we assume that the user understands AI techniques and knows what he wants to do, but that he does not understand how to use the AGE program to accomplish his task. The program at this stage of the development will need to have the basic tools coupled with a package to guide the user in applying these tools. A longer-range interest involves helping the user determine what techniques are applicable to his task. That is, we assume that the user does not understand the necessary techniques of writing knowledge-based programs.

Some questions to be posed are: What are the criteria for determining if a particular application is suited to a particular problem-solving framework? How do you decide the best way to represent knowledge for a given problem? There are some smaller, but by no means trivial, questions which also need answering. Is there a "best way" to write production rules which would apply to many task domains? Is there a data representation that would cover many tasks? What is the best way to handle differences in the ability of the users of the AGE program?

Research Plan:

The AGE program will be developed along two separate fronts, both of which are divided into incremental development stages. The first of these fronts is the development of the ability to help build many different types of knowledge-based programs (the "generality" front). The second front is the development of "intelligence" in the interaction between the user and the AGE program; i.e. moving from dialogues on "how to use the tools in AGE" to "what tools to use" (the "how-to-what" dialogue front). The proposed development plan contains the following stages:

a.
Generality: The development of a program package that will enable the user to build "HASP-like" knowledge-based programs characterized by the integration of multiple sources of knowledge, multi-level representation of solution hypotheses, opportunistic problem-solving methods, and explanation capability of the reasoning steps. The HASP-like paradigm has been used to solve problems of interpreting large amounts of digitized physical signals, but can also be extended to problems of processing large amounts of symbolic data.
Dialogue: The development of dialogue to show the user how to utilize the packaged components in AGE to build HASP-like programs. The interactive capability will be limited to: specifying how to build multi-level hypothesis structure; how to write production rules to represent domain knowledge; and how to use various techniques available for opportunistic hypothesis formation.

b. Generality: Supplement the ability to build HASP-like programs with a capability to build MYCIN-like goal oriented programs.
Dialogue: Same level of dialogue capability with additional ability to discuss how to chain rules and how to specify the necessary parameters for the context tree.

c. Generality: Same level as for b., i.e. ability to build HASP-like, MYCIN-like or combination of HASP- and MYCIN-like knowledge-based programs.
Dialogue: Begin to extract from the user some key characteristics of the task, and using that information begin to suggest appropriate knowledge representation and problem-solving techniques for the user’s task. This interactive capability will be limited to the generality level at this point in the AGE development.

d. Test phase: Test the usefulness of the AGE system by developing an application program in some task domain.
(a) An application program will be chosen from among on-going program development efforts within our own project or within the SUMEX-AIM community. An application will be chosen whose primary task is that of interpreting large amounts of symbolic data or described signal data.
(b) Collect specific knowledge needed for the application program and begin to develop the program using the AGE system.

3.3.1.3 PLAN PACKAGE

The PLAN package is oriented toward the representation of plans-of-action and toward an expert’s knowledge of the best problem solving strategies to employ in his domain. A feature of the package is its ability to make inferences on components of planning and strategy rules so that new plans and strategies can be constructed readily from previous ones. The representation will allow the manipulation of various "levels of detail" of plans and strategies. The package will be made available as previously mentioned in connection with AGE.

Detailed Discussion:

Before starting a technical presentation of the ideas for the Plan Package, it is worth highlighting some of the issues which motivate its development.

a. How can a variety of types of domain actions be accommodated in a knowledge base?
b. How can a variety of types of strategy and control knowledge be incorporated in a knowledge base?
c. How can a variety of types of problem solving states be expressed and manipulated by the system?
d. How should plans be represented?
e. How can the problem statements for a variety of types of problems be acquired?
f. How does the expression and representation of problem solving states relate to the expression of the domain and strategy knowledge?

The Plan Package consists of two major entities -- the Planning Network and the Strategy Package. The Planning Network is a set of software which manages the representation of the plans created during the problem solving process.
When a problem is acquired from a user, it is represented as an initial planning network. Problem solving takes place as the active strategy rules manipulate the planning network to create solutions. The Strategy Package itself is discussed in the next section.

Since the planning state knowledge is important for the expression of strategy in the Plan Package, it is worthwhile exploring briefly the nature of this knowledge. It is useful to consider the planning network as being composed of three parallel planes -- the solution plane, the planning plane, and the focus plane. These planes contain (1) the solution steps (domain rule applications) and world states, (2) the planning and design steps and (3) the focus of attention knowledge respectively. All three planes of the network are built dynamically during the problem solving process. Different types of nodes in the network correspond to the different components of the problem solving process.

A number of issues have been raised about the management of strategy knowledge.

a. How should strategies be expressed?
b. How can strategy information be assimilated so that the system will use it appropriately when designing or explaining solutions?
c. How can a knowledge based system assist a domain expert in structuring and expressing his ideas about strategy?

Means-ends analysis is one of the simplest ideas in the current stock of methods for problem solving. As such, it should exist as a standard strategy in a strategy package of artificial intelligence techniques to be used as needed. The current state of artificial intelligence, where a researcher must re-code Means-ends analysis any time he wishes to use it, is akin to a carpenter forging a new hammer for each job. One approach for making an instance of Means-ends analysis available as a tool would be to provide a packaged program which accepts arguments for the various components of Means-ends analysis (e.g. a difference table, difference function, etc.).
The alternative being proposed here is a system which uses schemata to drive the strategy acquisition process and which can guide a user through the details. The goal is to create a supportive environment for the painless testing of fairly high level strategies. Such a system should be able to draw on its knowledge base to provide assistance in casting a problem into a Means-ends framework.

In summary, other systems have stumbled over the expression of more complex forms of domain and strategy rules and have been limited to solving a single kind of problem. We propose extending this work by developing what we have termed the Plan Package. The Plan Package consists of two major components -- a schema-based representation for the problem-solving states termed the Planning Network and a schema-based representation for domain rules and strategies termed the Strategy Package. The Planning Network will provide a representation for a variety of types of problem solving so that the problem solving system will be able to solve more than one type of problem. The Strategy Package will provide a set of standard artificial intelligence strategies in the form of schemata, which may be instantiated into strategy rules when they are supplied with the particulars of domain knowledge. These schemata will facilitate the acquisition of tailored strategies by guiding a user a step at a time through the particulars of the acquisition process.

The Plan Package will be developed and tested in the domain of molecular genetics as part of the MOLGEN project. It will be further developed and extended to other domains as a test for generality as part of the AGE project.
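
As a point of reference for the discussion above, the "packaged program" approach to Means-ends analysis, one that takes a difference function and a difference table as arguments, might be sketched as follows. This is a minimal illustration in Python and not the proposed schema-driven Strategy Package. The names `Operator`, `missing_facts` and `means_ends`, the representation of states as sets of facts, and the toy tea-making domain are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A domain action: what it needs, what it makes true, what it undoes."""
    name: str
    preconditions: frozenset
    additions: frozenset
    deletions: frozenset = frozenset()


def missing_facts(state, goal):
    """Difference function: goal facts that are not yet true in `state`."""
    return sorted(goal - state)


def means_ends(state, goal, difference_fn, difference_table, depth=10, max_steps=50):
    """Return (final_state, plan), or None if no plan is found within the limits."""
    plan = []
    for _ in range(max_steps):
        differences = difference_fn(state, goal)
        if not differences:
            return state, plan            # nothing left to reduce
        if depth == 0:
            return None                   # subgoal nesting limit reached
        # Look up operators relevant to the first outstanding difference.
        for op in difference_table.get(differences[0], []):
            # Subgoal: first make the operator's preconditions true.
            sub = means_ends(state, op.preconditions, difference_fn,
                             difference_table, depth - 1, max_steps)
            if sub is not None:
                sub_state, sub_plan = sub
                state = (sub_state - op.deletions) | op.additions
                plan = plan + sub_plan + [op.name]
                break
        else:
            return None                   # no operator reduces this difference
    return None                           # step limit reached


# Hypothetical toy domain, used only to exercise the sketch.
fill = Operator("fill_kettle", frozenset(), frozenset({"kettle_full"}))
boil = Operator("boil_water", frozenset({"kettle_full"}), frozenset({"water_hot"}))
brew = Operator("brew_tea", frozenset({"water_hot", "have_teabag"}),
                frozenset({"tea_made"}))
difference_table = {"kettle_full": [fill], "water_hot": [boil], "tea_made": [brew]}

final_state, plan = means_ends(frozenset({"have_teabag"}), frozenset({"tea_made"}),
                               missing_facts, difference_table)
print(plan)  # ['fill_kettle', 'boil_water', 'brew_tea']
```

In this sketch the caller must assemble the difference table and difference function by hand. Guiding a user through that step is the part the schema-based approach is intended to take over.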
3.3.1.4 HEURISTIC KNOWLEDGE ACQUISITION

Automatic Rule Formation Methods

Given a body of data from which rules are to be formed, together with a basic approach to rule induction, there remains a range of ways in which the data may be utilized, which differ in the degree of parallelism involved in the examination of instances. At one extreme are methods in which rules are formed and refined in a sequence of steps, each step involving the examination of one new instance. At the other extreme are methods which involve a single-pass rule formation process, using all available data. There are, of course, many intermediate possibilities.

We propose to investigate, within the Meta-DENDRAL framework, whether some of these methods are optimal in the sense of yielding rules of comparatively high quality with the expenditure of comparatively little computing effort. It is hoped that the investigation will lead us to some general insights concerning the optimal utilization of data in automatic rule formation.

Research Plan:

a. Develop and implement one or more procedures for updating an evolving set of rules on the basis of newly examined data. These procedures will make use of existing capabilities of the RULEGEN and RULEMOD programs, and will make possible the implementation of a variety of schemes for data utilization, as described above.
b. Select and implement a representative subset of the class of data utilization schemes indicated above, and test their performance in the application area of mass spectrometry.
c. Describe in a technical report these experiments, their results, and the lessons learned.

Rule Acquisition via Dialogue

Since large stores of knowledge appear to be required for high performance, the process of accumulating that information should be made as easy as possible.
The fundamental question here is, how can we make it easy for the expert to tell the system what he knows about the domain? Some initial steps in this direction are described in [Davis76], which reports on the use of what has been labelled "meta-level knowledge" as a basis for establishing communication between the system and an expert. In the simplest terms, meta-level knowledge refers to giving the system the ability to "know what it knows", and can support a wide range of useful abilities.

The basic approach developed there relies on the notion of knowledge acquisition in the context of a shortcoming in the knowledge base. That is, rather than simply asking an expert to "explain all he knows about the field", we allow him to challenge the system with difficult problems and observe its behavior. If he indicates at some point that the system has made a mistake, there is available a large amount of contextual information which can aid in the process of knowledge explication and communication. Thus rather than asking "What is there to know about this domain?", we can say "Here is a problem on which you claim the system made a mistake. Here is the knowledge it used to reach its answer. Now WHAT IS IT THAT YOU KNOW AND THE SYSTEM DOESN’T that allows you to avoid making that mistake?"

This appears to be an effective approach to the problem, since it creates a well defined context, allowing the expert to focus his attempt to describe his knowledge of the domain, and provides the system with a set of expectations about the content of the new knowledge it is going to receive. Both of these offer significant advantages in helping to build up the knowledge base.

Working from this foundation, we plan to extend these ideas to provide a powerful system for knowledge acquisition. Currently, for example, the scope of the context is limited to a particular error in the knowledge base during a particular session with the expert.
It ought to be extended to provide a wider perspective, so that the system could form more sophisticated expectations about a particular tutor, thereby making communication between them more effective. Thus rather than forming expectations concerning only the shortcoming presently under examination, for example, the system might be able to consider also the past several shortcomings, in an attempt to detect a broader "theme" in the knowledge it was acquiring.

There ought also to be more effective control over its use of context. The system is currently too "single-minded", in that it holds tenaciously to any expectations it may have formed. There should be a way of indicating to the system that it has formed incorrect assumptions, and that it should "sit back and observe" for a while until it can get "reoriented".

Dealing with large knowledge bases also requires a range of auxiliary capabilities that assist the expert in keeping track of and organizing his work. Together these constitute a "scratch pad" of sorts that allows him to annotate his new additions, mark existing rules that may need further work, or perhaps examine selected parts of the knowledge base to find areas that may presently be weak. All of these should be aimed at making it possible for the expert to extend his work over several sessions without loss of continuity, and to keep track of both changes that are required and work that has been done, no matter how large the knowledge base may eventually grow to be.

3.3.1.5 GENERAL EXPLANATION SYSTEM

The function of an explanation capability is to permit the user or builder of a knowledge based system to determine:

1. in general, how the system solves problems or uses information;
2. retrospectively, how the system solved a particular problem;
3. interactively, how and why the system came up with its current answers.
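
The retrospective and interactive questions in the list above can be pictured as queries over a record of the system's rule applications. The following Python sketch is only an illustration under that assumption. The `TraceEntry` record, the `how` and `why` functions, and the rule and fact names are hypothetical, and none of this describes the explanation code of MYCIN or any other existing program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    """One recorded rule application: which rule fired, on what, concluding what."""
    rule: str
    premises: tuple
    conclusion: str


def how(trace, fact):
    """How was `fact` established? List the rule applications that concluded it."""
    return [f"{e.conclusion} was concluded by {e.rule} from {', '.join(e.premises)}"
            for e in trace if e.conclusion == fact]


def why(trace, fact):
    """Why did `fact` matter? List the rule applications that used it as a premise."""
    return [f"{fact} was used by {e.rule} to conclude {e.conclusion}"
            for e in trace if fact in e.premises]


# Hypothetical trace of a two-step rule chain.
trace = [
    TraceEntry("RULE1", ("finding_a", "finding_b"), "hypothesis_x"),
    TraceEntry("RULE2", ("hypothesis_x",), "recommendation_y"),
]

print(how(trace, "hypothesis_x"))
print(why(trace, "hypothesis_x"))
```

An empty answer from `how` marks a fact that no recorded rule concluded, such as an input supplied by the user. The first kind of question in the list, how the system works in general, would need a description of the problem solving process in addition to a trace of one run.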
The success of the explanation capability for the MYCIN rule based system indicates the usefulness of this capability in debugging the system and in making it easier for a user to learn and believe the system’s operations. To make it easier to build explanation capabilities for future knowledge based systems, including systems whose knowledge is embedded in procedures, we intend to construct a system which will provide explanations for a wide class of problem solvers. Given the appropriate trace of a program’s decisions and states, and a model of its problem solving process, it should be possible to answer a variety of well constrained but informative questions about program operation, in general or in a specific run.

The aim of this research is to determine what sorts of traces and process models are needed to support selected types of explanations in several classes of knowledge based problem solvers. When the requirements for a class are determined, we intend to implement a general explanation facility to provide the selected explanations for programs in that class. Such a facility should be made useful for several classes of problem solver.

The steps of the research will include:

1. Choose the types of problem solvers to which the explanation system will be applied;
2. Select example knowledge based systems of each class (e.g. protein structure modelling as an example of event/model driven hypothesis formation systems);
3. For each system selected, determine questions to be asked, and what information, such as traces and process descriptions, is needed to answer them;