Progress Report

SUMEX-AIM RESOURCE PROGRESS REPORT, YEAR 06

This annual report covers work performed under NIH Biotechnology Resources Program grant RR-785 supporting the Stanford University Medical EXperimental computer (SUMEX) research resource for applications of Artificial Intelligence in Medicine (AIM). It spans the year from May 1978 through April 1979.

2 Resource Operations

2.1 Progress

2.1.1 Resource Summary and Goals

The SUMEX-AIM project is a national computer resource with a dual mission: a) the promotion of applications of computer science research in artificial intelligence (AI) to biological and medical problems and b) the demonstration of computer resource sharing within a national community of health research projects. The SUMEX-AIM resource is located physically in the Stanford University Medical School and is administered jointly under the Stanford Departments of Genetics and Computer Science. SUMEX-AIM serves as a nucleus for a community of medical AI projects at universities around the country. SUMEX provides computing facilities tuned to the needs of AI research and communication tools to facilitate remote access, inter- and intra-group contacts, and the demonstration of developing computer programs to biomedical research collaborators.

Overview of AI Research

Artificial Intelligence research is that part of Computer Science concerned with symbol manipulation processes that produce intelligent action (1). By "intelligent action" is meant an act or decision that is goal-oriented, is arrived at by an understandable chain of symbolic analysis and reasoning steps, and utilizes knowledge of the world to inform and guide the reasoning. Some scientists view the performance of complex symbolic reasoning tasks by computer programs as the sine qua non for artificial intelligence, but this is necessarily a limited view. Another view unifies AI research with the rest of computer science. It is a simplification, but worthy of consideration. The potential uses of computers by people to accomplish tasks can be "one-dimensionalized" into a spectrum representing the nature of the instructions that must be given the computer to do its job; call it the WHAT-TO-HOW spectrum. At the HOW extreme of the spectrum, the user supplies his intelligence to instruct the machine precisely HOW to do his job, step-by-step. Progress in computer science may be seen as steps away from that extreme "HOW" point on the spectrum: the familiar panoply of assembly languages, subroutine libraries, compilers, extensible languages, etc. illustrates this trend.

(1) For recent reviews to give some perspective on the current state of AI, see: (i) Boden, M., "Artificial Intelligence and Natural Man," Basic Books, New York, 1977; (ii) Feigenbaum, E.A., "The Art of Artificial Intelligence: Themes and Case Studies of Knowledge Engineering," Proceedings of the Fifth International Joint Conference on Artificial Intelligence, 1977; (iii) Winston, P.H., "Artificial Intelligence," Addison-Wesley Publishing Co., 1977; and (iv) Nilsson, N.J., "Artificial Intelligence," Information Processing 74, North-Holland Pub. Co. (1975). An additional overview of research areas and techniques in AI is being developed as an "Artificial Intelligence Handbook" under Professor E. A. Feigenbaum by computer science students at Stanford (see page 130 for a status report and Appendix I for a current outline).
At the other extreme of the spectrum, the user describes WHAT he wishes the computer to do for him to solve a problem. He wants to communicate WHAT is to be done without having to lay out in detail all necessary subgoals for adequate performance, yet with a reasonable assurance that he is addressing an intelligent agent that is using knowledge of his world to understand his intent, complain about or fill in his vagueness, make specific his abstractions, correct his errors, discover appropriate subgoals, and ultimately translate WHAT he wants done into detailed processing steps that define HOW it shall be done by a real computer. The user wants to provide this specification of WHAT to do in a language that is comfortable to him and the problem domain (perhaps English) and via communication modes that are convenient for him (including perhaps speech or pictures).

The research activity aimed at creating computer programs that act as "intelligent agents" near the WHAT end of the WHAT-TO-HOW spectrum can be viewed as the long-range goal of AI research. Historically, AI research has been the primary vehicle for progress toward this objective, although a substantial part of the applied side of computer research and development has related goals, albeit with an often fragmented approach. Unfortunately, workers in other scientific disciplines are generally unaware of the role, the goals, and the progress of AI research.

Currently authorized projects in the SUMEX community are concerned in some way with the design of "intelligent agents" applied to biomedical research. The tangible objective of this approach is the development of computer programs that, using formal and informal knowledge bases together with mechanized hypothesis formation and problem solving procedures, will be more general and effective consultative tools for the clinician and medical scientist. The systematic search potential of computerized hypothesis formation and knowledge base utilization, constrained where appropriate by heuristic rules, empirical data, or interactions with the user, has already produced promising results in areas such as chemical structure elucidation and synthesis, diagnostic consultation, and modeling of psychological processes. Needless to say, much is yet to be learned in the process of fashioning a coherent scientific discipline out of the assemblage of personal intuitions, mathematical procedures, and emerging theoretical structure of the "analysis of analysis" and of problem solving. State-of-the-art programs are far more narrowly specialized and inflexible than the corresponding aspects of human intelligence they emulate; however, in special domains they may be of comparable or greater power, e.g., in the solution of formal problems in organic chemistry or in the integral calculus.

Resource Sharing Goals

An equally important function of the SUMEX-AIM resource is an exploration of the use of computer communications as a means for interactions and sharing between geographically remote research groups engaged in biomedical computer science research. This facet of scientific interaction is becoming increasingly important with the explosion of complex information sources and the regional specialization of groups and facilities that might be shared by remote researchers (2). Our community building role is based upon the current state of computer communications technology.
While far from perfected, these developing capabilities offer highly desirable latitude for collaborative linkages, both within a given research project and among them. Several of the active projects on SUMEX are based upon the collaboration of computer and medical scientists at geographically separate institutions, separate both from each other and from the computer resource. The network experiment also enables diverse projects to interact more directly and to facilitate selective demonstrations of available programs to physicians, scientists, and students. Even in their current developing state, communication facilities enable effective access to the rather specialized SUMEX computing environment from a great many areas of the United States (and, to a more limited extent, from Canada, Europe, and other international locations). In a similar way, the network connections have made possible close collaborations in the development and maintenance of system software with other facilities.

Synopsis of Last Year's Progress

As we complete year 06, the first year of our recent 3-year continuation grant, we can report substantial further progress in the overall mission of the SUMEX-AIM resource. We have continued the refinement of an effective set of hardware and software tools to support the development of large, complex AI programs for medical research and to facilitate communications and interactions between user groups. We have worked to maintain high scientific standards and AI relevance for projects using the SUMEX-AIM resource and have actively sought new application areas and projects for the community. Many projects are built around the communications network facilities we have assembled, bringing together medical and computer science collaborators from remote institutions and making their research programs available to still other remote users. As discussed in the sections describing the individual projects, a number of the computer programs under development by these groups are maturing into tools increasingly useful to the respective research communities. The demand for production-level use of these programs has surpassed the capacity of the present SUMEX facility and we have been investigating the general issues of how such software systems can be moved from SUMEX and supported in production environments.

(2) A recent perspective on the scientific and financial aspects of technological resource sharing can be found in Coulter, C. L., "Research Instrument Sharing," Science, Vol. 201, No. 4354, August 4, 1978.

A number of significant events and accomplishments affecting the SUMEX-AIM resource occurred during the past year:

1) On July 1, 1978, Professor Edward Feigenbaum, chairman of the Stanford Department of Computer Science, assumed the role of SUMEX Principal Investigator following Professor Joshua Lederberg's installation as president of The Rockefeller University. We have smoothly completed the management transition and the SUMEX-AIM project and community continue to operate with the same high level of vitality. Professor Lederberg continues to maintain close ties with SUMEX activities as chairman of the SUMEX-AIM Executive Committee. Professor Stanley Cohen, Dr. Lederberg's successor as chairman of the Stanford Department of Genetics, assists in the coordination of project activities with medical research.
2) We have continued development of the SUMEX facility hardware and software systems to enhance throughput and to better control the allocation of resources. We also completed installation and evaluation of a connection to TELENET as an alternate source of communications services for our community.

3) A first version of the AGE system, partially supported under the SUMEX core research effort, has been completed. It uses the "blackboard model" for coordinating multiple expert sources of knowledge for the solution of problems. This system provides the general control structure and an interactive facility for implementing representations of expert knowledge sources and is being used experimentally by one of the new SUMEX-AIM projects to design a program for modeling aspects of human cognition.

4) We successfully completed the design and a demonstration of the MAINSAIL language system as a tool for software portability. A common compiler, code generators, and runtime support for TENEX, TOPS-10, TOPS-20, RT-11, RSX-11, and UNIX have been developed as part of this demonstration system, and numerous application programs have been written by collaborating research groups. Further work past this demonstration phase will be done independently of SUMEX through a private company being formed to continue the development, dissemination, and maintenance of MAINSAIL.

5) We have completed plans for a satellite machine that will be able to support more operational demonstrations of mature AI programs and help alleviate system congestion for on-going program development. A proposal for acquiring a DEC 2020 system meeting our requirements is pending approval by the NIH-BRP. We have also assisted the DENDRAL project in planning an independent system suitable for further development and export of chemical structure elucidation programs into the biochemical community.

6) The progress of SUMEX-AIM user projects in the development of their respective programs is reported by the individual investigators. We have worked hard to meet their needs and are grateful for their expressed appreciation.

2.1.2 Technical Progress

The following material covers SUMEX-AIM resource activities over the past year in greater detail. These sections outline accomplishments in the context of the resource staff and the resource management. Details of the progress and plans for our external collaborator projects are presented in Section 4 beginning on page 64.

2.1.2.1 Facility Hardware Development

Over the past year, the SUMEX KI-10 configuration, shown in Figure 1, has changed little and continues to operate effectively within its capacity limitations. We completed the procurement of the Systems Concepts SA-10 channel adapter, including all parts outstanding as of the last report. This subsystem, with the Calcomp disks and tapes, has functioned very reliably over the past year.

Our primary new facility hardware development efforts this year have been directed at:

1) Selection of a satellite processor to allow more operational demonstrations of mature AI programs and to ease loading congestion.

2) Planning for the integration of the satellite machine into the KI-10 facility.

3) Implementing local communication line control facilities to make more efficient use of available scanner ports.

These are discussed in more detail below.

Loading Background

The SUMEX-AIM facility has been operating at capacity in terms of prime-time computing load for the past several years, as documented in our previous annual reports.
In spite of implementing a number of strategic facility augmentations over the years, we have not been able to satisfy the computing demands of our community. This condition has constrained the growth of the AIM community and our ability to bring AI programs nearing operational status in contact with potential external user communities while continuing to support on-going program development efforts. We have taken active steps to transfer prime time interactive loading to evening and night hours as much as possible, including shifting personnel schedules (particularly for Stanford-based projects). We have also implemented tools to control the fair allocation of CPU resources between various user communities and projects and have encouraged jobs not requiring intimate user interaction to run during off hours using batch job facilities. Despite these efforts, our prime time loading has remained at saturation. Perhaps the most significant effect of the resulting poor response time is the deterrence of interactions with medical and other professional collaborators experimenting with available AI programs, whose schedules cannot be adjusted to meet computer loading patterns. This has hampered the more extensive testing of mature programs such as INTERNIST, MYCIN, CONGEN, SECS, and PUFF.

This continuing saturation brought about serious discussion about the scope of computing needs of the AIM community and possible justification of additional PDP-10 scale machines to be added to the AIM network. Several specific proposals were submitted for additional user nodes. Only one of those has been approved to date, for a DEC 2050 system at Rutgers University which was brought on-line late in the summer of 1978. A small part of that machine's capacity is available now to support AIM community needs outside of Rutgers.

From the SUMEX viewpoint, we have attempted to do everything feasible and economically justified within available budgets to maximize the use of the existing hardware for productive work. We have effectively exhausted available avenues for augmenting the current KI-10 machines. Some advantage would be gained by additional core memory, but we do not feel the improvement would be sufficient to justify the investment at this time. An upgrade to a more capable KL-10 system is beyond our budget limitations and may be premature in any case in light of projected developments in new machine architectures outlined in Appendix II.

As discussed in our renewal application for this grant term, an alternative approach to meet community computing needs is to explore the use of smaller, less expensive machines as satellites to the KI-TENEX system. Such systems have been under active development during recent years and could have several advantages including:

1) A relatively small investment in capital equipment is required for each incremental augmentation.

2) Possible closer location to individual research groups, thereby allowing better human engineering of user interfaces by using higher speed communication lines and display technology.

3) Improved allocation flexibility, by having to satisfy fewer simultaneous scheduling constraints and by being more easily dedicatable to operational demonstrations.

One disadvantage of this approach is that each such machine would have a lower capacity and it would be difficult to aggregate such dispersed capacity when needed for a single computing-intensive task.
This suggests the continuing need for a spectrum of machine configurations, from small "personalized" machines to large centralized resources. Nevertheless, we feel the capacity of available small machines is sufficient to support several simultaneous users and warrants serious consideration both as a means for incrementally augmenting the SUMEX resource and for dispersing computing power as justified to individual user groups. Based on the Council approval of this approach in our renewal application, our plans for acquiring such a satellite machine and for integrating it into the KI-10 system with a local network are described below.

It should also be noted that we have encouraged projects with specific needs for more operational demonstration or export of programs to consider acquiring their own machines in order to preserve SUMEX resources for new program development and for support of projects unable to justify their own machine currently. The DENDRAL project has proposed a VAX machine for such a purpose that would be integrated into the SUMEX facility but dedicated to support of the DENDRAL biomolecular characterization community. The choice of VAX was made to provide the best match with machines increasingly available in a biochemistry laboratory environment and able to run the programs being developed by DENDRAL (including CONGEN, recently converted from INTERLISP to BCPL). At the same time, the choice of VAX is advantageous to SUMEX in that it would give us experience with that machine, in line with current projections that VAX will become the "standard" DEC computing product and that the ARPANET AI community will implement a VAX INTERLISP system (see Appendix II).

Satellite Machine Selection

Over the past year we have spent considerable effort evaluating strategies and alternatives for implementing the planned satellite machine. The key requirement for any such machine to meet pressing community needs is that it be software-compatible with the existing INTERLISP and basic monitor functions available on the SUMEX KI-10 systems and the Rutgers DEC-2050. This will allow programs, written for the most part in INTERLISP, to move easily from development stages to demonstration trials and back with a minimum of reprogramming. A second requirement is that the system be inexpensive, in order to minimize initial capital outlay and to allow other groups to purchase similar systems for their own needs.

As detailed in Appendix II, we have been in a period of transition in computing technology. More compact and inexpensive yet powerful machines have become available, and new directions in machine architecture are being adopted emphasizing large address spaces and improved instruction sets for user program support. In several years, we expect the PDP-10/20 architecture to begin to be replaced by larger address space and more cost-effective systems (most likely VAX). We do not expect even early versions of these new systems that support INTERLISP to be available for at least two years, however. Thus, in order to meet the immediate needs of the SUMEX-AIM community, we feel the best approach is to acquire a PDP-10-compatible system as soon as possible. There are two alternative systems available that meet our requirements for a satellite machine within budget limitations: the DEC 2020 and the Foonly F2.
We have evaluated both of these candidate machines (see Appendix II) and have run benchmarks on the 2020 (the only one of the two machines with a fully working system in the field). These data, shown in Figure 3, compare 2020 responsiveness under load against single- and dual-processor KI-10 systems. As can be seen, the 2020 is a bit more than half the speed of a single KI-10 and can be expected to support up to three active LISP users simultaneously. This upper bound is limited principally by page swapping capacity. Based on published specifications, we expect the Foonly F2 would perform comparably.

We feel that the DEC 2020 is the more advantageous solution. A used 2020 is deliverable almost immediately at a major discount from list price (pricing details have been submitted separately). It is known to be reliable, runs a monitor compatible with INTERLISP and the most current DEC software, and will be maintainable by DEC for many years. It will also likely retain a better resale value in future years. Whereas the F2 is potentially more cost-effective (its quoted purchase price is below that of the discounted 2020), it has a highly uncertain delivery schedule and no performance track record. It also has no assurance of routine maintenance, vendor support, or resale value. In the long term, we feel these uncertainties and the extra in-house effort that would be required to maintain and support the F2 offset its initial price advantage. Thus, the DEC 2020 is the better choice to provide an immediate, effective, and reliable solution to SUMEX-AIM community computing needs.

Based on benchmark performance and needs for integrating a 2020 system into the SUMEX facility, we have proposed the following configuration for the machine:

    2020 processor and console
    512K words of memory
    1 200-Mbyte disk drive (RP-06)
    16 asynchronous communication lines
    TU-45 tape drive
    TOPS-20 software

A proposal is pending with NIH/BRP to approve purchase of this machine.

Satellite Machine Integration

The introduction of satellite machines into the SUMEX facility raises important issues about how best to integrate such systems with the existing machines. We seek to minimize duplication of peripheral equipment and interdependence among machines that would increase failure modes. We also require high-speed intermachine file transfer capabilities and terminal access arrangements allowing a user to connect flexibly to any machine of choice in the resource.

The initial design of the SUMEX system was that of a "star" topology centered on the KI-10 processors. In this configuration, all peripheral equipment and terminal ports were connected directly to the KI-10 busses. With the addition of a satellite machine, a unique focus no longer exists and some pieces of equipment need to be able to "connect" to more than one host. For example, a user coming into SUMEX over TYMNET will want to be able to make a selection of which machine he connects to. Another TYMNET user may want to make another choice of machine, and so the TYMNET interface needs to be able to connect to any of the hosts. This could be accomplished by creating separate interfaces for each of the hosts to the TYMNET, each with a different address. Besides being expensive to duplicate such interfaces, it would be inconvenient for a user to reconnect his terminal from one host to another. He would have to break his existing connection and go through another connect/login process to get to another machine.
Since we want to facilitate user movement between various machines in the SUMEX resource, this process needs to be as simple as possible; in fact, a user may have jobs running simultaneously on more than one machine at a time. Similarly, we need to be able to quickly transfer files between any two machines in the resource, connect common peripheral devices (e.g., printer or plotter) to any machine desiring to use them, and allow any host to access other remote resources such as Stanford campus printers or terminal clusters. If we were to establish direct connections pairwise between machines and devices, the number of such connections would go up quadratically with the number of devices.

A more effective solution lies in the implementation of a local network in which all devices (host CPU's, peripheral devices, network gateways, etc.) are tied to a common communications medium and can thereby establish logical connections as needed between any pair of nodes. Such network systems have been under development for a number of years, taking on various topological configurations and control structures depending on bandwidth requirements and interdevice distances. A very attractive design for a highly localized system configuration, from the viewpoint of simplicity, reliability, and bandwidth, is the Ethernet, which has been under development for several years at Xerox Palo Alto Research Center (3). The simplest form of Ethernet interconnection for a facility like SUMEX would be a single bus shared by all devices (see Figure 2).

The Ethernet utilizes a fully distributed control structure in that each device connected to the net can independently decide to send a message to any other device on the net depending on the functions it is actively performing. Of course, decisions about which devices need to communicate with each other at a given time and what the precise message content is are determined by higher level system activities and requests, for example to implement a file transfer, mail forwarding, teletype connection, printer output, etc. As long as the net is not in use and only one device at a time is attempting to transmit, no problem occurs. The sending device transmits its packet of information, which contains a destination address, packet type designator, and error detection codes. All other devices on the net continuously "listen" to what is being sent, and the one assigned the appropriate destination address picks up the packet, acknowledges its receipt, and processes it. If the packet address is garbled by errors or no device with the appropriate address exists, the sender "times out" and decides how to proceed based on the higher level function being performed. Packets are kept short relative to network bandwidth so that a given device cannot "hog" the net.

However, if two or more devices decide to transmit over the shared medium at the same time, a "collision" occurs, and a mechanism must exist to detect the collision and to select one of the contending devices to go first. Since this contention arbitration is the fundamental characteristic of the control structure of such nets, they are commonly called "contention" networks. In the Ethernet, a collision is detected by each sending device listening to what is being transmitted on the bus. If a transmission is already in progress, the device waits until the net is quiet for a period before starting to send.
When it does transmit, it continues to listen to what is going over the communications line and compares that data with what it is sending. If a disagreement is detected, the device assumes that some other device has started to transmit at the same time and aborts its transmission. A time window exists between the start of a transmission and when all devices can be assumed to know that a transmission is in progress. This interval is given by the speed of the net and the distance between the sending node and its most distant neighbor. If a collision is detected, the net is "jammed" with noise for a period such that all devices know a collision has occurred, and then each sending device waits a random period of time before beginning retransmission. This random delay is what sequences devices so that a deadlock of successive collisions is avoided (4).

(3) See Metcalfe, R. M. and Boggs, D. R., "Ethernet: Distributed Packet Switching for Local Computer Networks," Comm. ACM, Vol. 19, No. 7, July 1976.

More complex networks can be created with several Ethernets by having one of the nodes on the network be a "gateway" that knows how to communicate with another Ethernet or some other external network. These gateways can translate between packet conventions used in the Ethernet and those used in the ARPANET, TYMNET, TELENET, etc. Xerox has implemented internally an extensive set of Ethernets with interconnections between them and with other external networks. These local networks operate at 5-10 Mbits/sec over distances of about 1 kilometer and perform well in terms of efficient use of the transmission medium and low latency between deciding to transmit and being able to get access to the medium (5).

The Stanford Computer Science Department will be one of three recipients of grants from Xerox that will include Ethernet connection hardware. Since the Computer Science Department systems are integrally connected with a major user group on SUMEX (the Heuristic Programming Project) and since the Ethernet design is ideal for the integration of new satellite machines with the existing SUMEX facility, we have chosen it as the model for our planned facility changes. The proposed new topological design is shown in Figure 2 and will include creating new interfaces for each host machine, the TYMNET, the local teletype scanner, other peripheral devices, and a gateway to other local networks (e.g., the Computer Science Department machine and planned terminal clusters).
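To make the contention scheme just described more concrete, the following is a brief illustrative sketch, written in modern Python purely for exposition, of the listen-before-transmit, collision-detect, and random-backoff cycle for a single sending device. The names, the three network primitives, and the exponentially growing backoff range are all invented for the example; they are not drawn from the Xerox Ethernet implementation.

    import random
    import time

    def send_packet(net, packet, max_attempts=16, slot_time=0.001):
        """Illustrative sketch of one device's send loop on a shared contention bus.

        `net` is assumed (for this example only) to expose three primitives:
          net.carrier_present() -- True while some device is transmitting
          net.transmit(packet)  -- put the packet on the bus and return what was
                                   actually heard on the bus during transmission
          net.jam()             -- briefly flood the bus so every device notices
                                   that a collision occurred
        """
        for attempt in range(max_attempts):
            # Defer: wait until the net has been quiet before starting to send.
            while net.carrier_present():
                time.sleep(slot_time)

            # Transmit while listening; compare what was sent with what was heard.
            heard = net.transmit(packet)
            if heard == packet:
                return True        # no disagreement, hence no collision

            # A disagreement means another device started at the same time.
            net.jam()              # ensure all devices know a collision occurred

            # Random backoff sequences the contending senders so that a
            # deadlock of successive collisions is avoided.
            delay_slots = random.randint(0, 2 ** min(attempt + 1, 10) - 1)
            time.sleep(delay_slots * slot_time)

        return False               # give up; higher-level software decides what to do next

A real interface implements this sequence in hardware and microcode rather than in software; the sketch only mirrors the order of decisions described in the text.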
(4) A similar type of local network called CHAOSNET has been under development at MIT. It differs from Ethernet in that it uses delay counters to sequence colliding devices. The delay for each sender is determined by counting down, at a prespecified rate, the arithmetic difference in node address between the last successful transmission and the prospective sender. Thus, by selecting node addresses corresponding roughly to the physical position of a node on the net, proper interleaving can be achieved to arbitrate collisions.

(5) See Shoch, J. F. and Hupp, J. A., "Performance of an Ethernet Local Network -- A Preliminary Report," Proceedings of the Local Area Communications Network Symposium, Boston, May 1979.

Communications Hardware Development

A final area of hardware development concerns communications. We have implemented line disconnect control hardware on local telephone lines similar to what exists logically for our network connections. Previously we were unable to detect when carrier dropped on phone connections, for example when a user hung up without logging out or was accidentally disconnected during a session. This left his job hanging, so that the next person dialing up on that line would automatically be connected to the earlier job, resulting in possible privacy or security loss. The system now receives a hardware interrupt when a line drops and, if the job that was on that line is still active, the job is detached so it can be picked up and continued. Conversely, when a user logs out, we do an automatic disconnect on his phone line so that our incoming rotaries are not congested with unused, hoarded phone connections.

We are also developing a switch to allow more effective use of the 64 available teletype scanner ports. We typically have about 40-50 jobs on the system during peak loads (mid-afternoon), of which 10 are detached, 10 come from network or pseudo-teletype connections, 10 come from local dialup connections, and 15 come from leased or hard-line connections. With this mix the 64 scanner ports on the system are adequate. However, high speed displays or leased lines require dedicated ports whether or not they are in use, and thus the scanner is overloaded with fixed line assignments, many of which are not in simultaneous use. We have looked at the economics of adding another scanner versus making it possible to switch available scanner ports to active lines, and the switch is the more cost-effective. A microprocessor-based switch is now being installed and tested that will allow us to selectively connect 32 scanner ports to any of 64 dedicated lines.

Figure 1. SUMEX-AIM Computer Configuration (5/79). (Block diagram of the dual KI-10 processors and memories, SA-10 channel adapter, Calcomp disk and tape subsystems, ARPANET IMP and TYMNET interfaces, teletype scanner, and other peripherals.)

Figure 2. Planned Intermachine Connections via ETHERNET. (Proposed topology linking the KI-TENEX system, the proposed SUMEX 2020 and DENDRAL/SUMEX machines, the TYMNET and ARPANET interfaces, the local TTY scanner, shared peripherals, and a gateway to Stanford campus and Computer Science Department facilities.)

Figure 3. DEC KI-10 Versus 2020 Performance Under Load. (Elapsed time per KI-10 CPU minute versus load average for a dual KI-10 with 512K, a single KI-10 with 256K, and a 2020 with 384K.) For each of the three machine configurations, two graphs are given.
The lower graph shows performance for small, CPU-intensive jobs and the upper graph shows performance for large, page-fault-intensive jobs. These curves bound the expected performance for typical user jobs. It is assumed that a KI-10 averages about 1.7 times the speed of a 2020.

2.1.2.2 System Software Development

Our system software work this past year has concentrated on several areas, including system changes reflecting hardware development projects, correcting various system bugs, improved community loading controls, and implementing new features for better user community support.

Hardware Implementation

System work was required to enable the installation of the TELENET equipment (see Section 2.1.2.3) and the local communication line control hardware. We implemented "Xon/Xoff" facilities for the TELENET interface so that all terminals could run at an effective 1200 baud rate, with output flow controlled by appropriate network "backpressure" commands when buffers fill for slower terminals. These changes were completed in the fall when the final evaluation of TELENET took place and significantly smoothed network output flow over what had been available before. Servers were also implemented to handle the interrupt and I/O bus interfaces for the line disconnect control hardware and for the hardline switch interface. The switch interface is still in the process of being debugged.

Monitor Bug Fixes and Improvements

We found a number of subtle bugs in the system this past year that had been causing periodic problems in hung jobs or crashes. By now, all of the "obvious" bugs have been located, and so those remaining are much more elusive, occurring infrequently or only after a long chain of rare events that is difficult to reconstruct. Examples of fixes include problems in DBMP, the program that periodically migrates altered file pages from core to refresh the disk image of those pages. Two bugs existed: one that caused infrequent error logging calls to mishandle the stack, and one that overlooked certain pages under the assumption that future core garbage collections would take care of them. This latter bug caused relatively frequent file errors during crashes or when taking the system down, because the overlooked pages were never refreshed on disk by core garbage collection once the system halted. We have had a significantly more reliable file system during crashes as a result of this fix.

Several bug fixes were made in the ARPANET code having to do with the handling of special control packets when aborting partially created connections and the release of connections after transmission errors had occurred. We also found a bug in the fork manipulation code that caused jobs to hang occasionally when multiple fork manipulations were going on simultaneously. These hangs resulted when two forks were attempting to examine the job fork structure data base, one got interrupted in progress, and the other made changes that altered information in the tables that the first fork expected to remain as set up when it was interrupted.

A number of additional improvements were made to upgrade various monitor routines and JSYS's to conform with TENEX 1.34, to checksum monitor code as loaded to detect I/O errors or memory problems, to make the console teletype of the second processor available for use, and to improve operational procedures for taking crash dumps and reloading the system.
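The fork-manipulation hang described above is an instance of a familiar read-modify-write race. The following is a purely illustrative sketch, in modern Python with invented names (the actual fix lives in the TENEX monitor code, not in anything resembling this), of the general remedy: making each examination or update of the shared fork tables atomic with respect to the other.

    import threading

    # Hypothetical stand-in for the shared job fork structure data base.
    fork_table = {}
    fork_table_lock = threading.Lock()

    def examine_fork(job_id):
        """Read a job's fork entries while holding the lock, so a concurrent
        update cannot change the tables out from under us mid-examination."""
        with fork_table_lock:
            return list(fork_table.get(job_id, []))

    def update_fork(job_id, new_entries):
        """Alter the tables only while holding the same lock, so a fork that
        was interrupted mid-examination never sees a half-changed structure."""
        with fork_table_lock:
            fork_table[job_id] = new_entries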
System Loading Controls

We previously reported on the system load controls we have implemented to allocate available system capacity effectively among projects and users according to Executive Committee guidelines. These include:

1) A "soft" CPU percentage control, assisted by a program which adjusts user percentages for the scheduler based on the dynamic loading of the system. This allocation control structure uses the scheduler's five-queue system, which ranks processes according to their degree of interactiveness (CPU time between requests for teletype input). Processes in the highly interactive queues (text editing, etc.) are scheduled at highest priority without consideration of allocation percentages. If no processes are runnable from these queues, more CPU-bound queues are scanned and processes are selected for running based on how much of their allocated time has been consumed during a given allocation control cycle time (currently 100 seconds). This system is not a reservation system in that it does not guarantee a given user some percentage of the system. It allocates cycles preferentially, trading off a priori allocations against actual demand, but does not waste cycles.

2) An overload control mechanism that operates during peak loading periods to limit the number of active processes on the system to those that can be reasonably supported with acceptable response time. This avoids slaving all users to their terminals waiting inefficiently for the machine cycles they need to get useful work done when there are not enough to go around. Each project receives a pro rata share of the active slots the system can accommodate. Rather than allow many users to vie unproductively for each project's slots (as in a pie-slice system), we ask selected users within each group to restrict their use for periods of 20 minutes so that those remaining can work effectively within the project aliquot. Allocation of active slots is made on the basis of relative community and project percentage allocations (assigned by the AIM Executive Committee). Within each project, slots are allocated either on a round-robin basis or taking into account optional project priorities among users. Under overload conditions, active jobs outside of the available slots are asked to slow down, thereby holding the load within tolerable limits. If such jobs do not voluntarily cooperate, they may be forced to comply.

This system has been in operation for the past year and has operated quite well. We have continued to place no load limiting controls on the national AIM community projects, however, since they have historically consumed less than their allocated quota. Stanford users and staff have adapted their expectations of system response and find it more productive to coordinate their time on the machine with others in their project so as to work on a more lightly loaded system. Indeed, as can be seen from the loading data in Figure 10, the peak load average has been held to an average of 5.5 - 6.0, whereas total CPU time consumption, shown in Figure 8, has continued to rise.

Several problems were noted in the loading control system that required improvements in monitor functions this past year:

1) Users frequently wanted to designate a job as low priority or "background" so that it would run only when the system is lightly loaded and "go to sleep" otherwise.
2) Scheduled demonstration jobs were receiving no advantage in performance over other jobs, other than that due to holding the load average down. A scheme was needed to cause demo jobs always to be scheduled preferentially.

3) Forcible control of uncooperative jobs was initially implemented by detaching them or logging them out in extreme cases. This could cause loss of important work, and a less destructive yet effective mechanism was needed.

4) A loophole for uncooperative jobs existed that would bypass controls with good probability. If more than one user were asked to slow down at a given time, one of those jobs could refuse to cooperate and continue intensive computing while the others slowed down. Frequently, the load reduction from cooperating jobs was enough to remove the overload condition during common, local bursts of usage. Thus, with the overload gone, the uncooperative user could continue without ever having slowed down.

To improve the control system, we implemented two new scheduler control functions. First, a job can be designated to run out of a given queue no matter how much CPU time it wants to consume. This allows demo jobs always to be scheduled out of the highest priority queue, assuring a better service level. It also allows background jobs to be scheduled always from the low priority queues so they only run if nothing else is to be done. Second, a job can be stopped for a specified period of time without ever being scheduled. This function allows uncooperative jobs to be slowed for a large percentage of time (currently a maximum of 97.5%) when their load must be reduced forcibly, but does not do any other damage to the operation of such jobs that could result in lost work. These new features have substantially improved the effectiveness of the overload control system. The loophole for uncooperative jobs was plugged by noting whether jobs requested to stop make any attempt to cooperate during the assigned grace period. If there is no change in their rate of CPU time consumption, the grace period is shortened so they will be forcibly stopped before more cooperative users stop and remove the overload.

Other Enhancements

We have made improvements in SUMEX system software in numerous other areas, including the EXECutive program, the BSYS system for file archiving and retrieving, the printer spoolers, the CHECKDSK program for verifying file system integrity, system diagnostic programs, a monitor crash analysis program, and many smaller utility extensions and bug fixes. We have updated the EXEC to be compatible with the latest version running at other TENEX sites, incorporating the extensions we have made locally. The BSYS program has been updated to the latest version available from BBN, using their system for file restoration automation. Several bugs in the improved CHECKDSK program for verifying file system integrity have been fixed, and improvements have been made to give users a better idea of file names that might have been lost during a crash. Improved crash and system analysis programs have been developed to assist in sorting through the complex interlinked monitor tables when unraveling a core dump to determine the cause of a crash. These include several display programs to observe the dynamic operation of individual job structures or the ARPANET. These tools have been invaluable in tracking down the difficult bugs that remain in the system.
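As a rough illustration of the pro rata slot allocation used by the overload control mechanism described earlier in this section, the following hypothetical sketch, written in modern Python with invented project names and numbers rather than the actual scheduler code, divides a fixed number of active slots among projects according to their percentage allocations.

    def allocate_slots(total_slots, project_percentages):
        """Divide the active slots pro rata among projects.

        `project_percentages` maps a project name to its allocation percentage
        (assumed to sum to 100).  Fractional slots are handed out by largest
        remainder so that every available slot is used.
        """
        shares = {p: total_slots * pct / 100.0 for p, pct in project_percentages.items()}
        slots = {p: int(s) for p, s in shares.items()}
        leftover = total_slots - sum(slots.values())
        # Give any remaining slots to the projects with the largest fractional parts.
        for p in sorted(shares, key=lambda p: shares[p] - slots[p], reverse=True)[:leftover]:
            slots[p] += 1
        return slots

    # Example with invented numbers: 20 active slots split among three projects.
    print(allocate_slots(20, {"Project A": 50, "Project B": 30, "Project C": 20}))
    # -> {'Project A': 10, 'Project B': 6, 'Project C': 4}

Within each project's share, the report's round-robin or priority ordering among users would then decide which particular jobs occupy the slots.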
2.1.2.3 Network Communication Facilities

A highly important aspect of the SUMEX system is effective communication with remote users. In addition to the economic arguments for terminal access, networking offers other advantages for shared computing. These include improved inter-user communications, more effective software sharing, uniform user access to multiple machines and special purpose resources, convenient file transfers, more effective backup, and co-processing between remote machines. Until this past year, we have based our remote communication services on two networks - TYMNET and ARPANET. These were the only networks existing at the start of the project which allowed foreign host access. A third commercial network system, TELENET, is now competitively operational and offers a growing selection of services. During this report period we established an experimental connection to TELENET to evaluate its technical and economic advantages relative to our existing connections. The results of this experiment are reported below.

Users asked to accept a remote computer as if it were next door will use a local telephone call to the computer as a standard of comparison. Current network terminal facilities do not quite accomplish the illusion of a local call. Data loss is not a problem in most network communications - in fact, with the more extensive error checking schemes, data integrity is higher than for a long distance phone link. On the other hand, networking relies upon shared community use of telephone lines to procure widespread geographical coverage at substantially reduced cost. However, unless enough total line capacity is provided to meet peak loads, substantial queueing and traffic jams result in the loss of terminal responsiveness. Limited responsiveness for character-oriented TENEX interactions continues to be a problem for network users.

TYMNET:

TYMNET provides broad geographic coverage for terminal access to SUMEX, spanning the country and also increasingly accessible from foreign countries (see Figure 4 on page 21). Technical aspects of our connection to TYMNET have remained unchanged this past year and have continued to operate reliably. The total use of TYMNET dropped during the TELENET experimental connection (see Figure 14) but is now increasing again since the TELENET service was dropped.

TYMNET has made few technical changes to their network that affect us, other than to broaden geographical coverage. The previous network delay problems are still apparent, although better cross-country trunks into New York and New England are available, improving service there. TYMNET is still primarily a terminal network designed to route users to an appropriate host, and more general services such as outbound connections originated from a host or interhost connections are only done on an experimental basis. This presumably reflects the lack of current economic justification for these services among the predominantly commercial users of the network. Whereas TYMNET is developing interfaces meeting X.25 protocol standards, the internal workings of the network will likely remain the same, namely, constructing fixed logical circuits for the duration of a connection and multiplexing characters in packets over each link between network nodes from any users sharing that link as part of their logical circuit. We have continued to purchase TYMNET services through the NLM contract with TYMNET, Inc.
Because of current tariff provisions, there is no longer an economic advantage to this based on usage volume; SUMEX charges are computed on its usage volume alone and not on the aggregate volume with NLM's contribution to achieve a lower rate. A new tariff provision, based on "dedicated port" pricing, is advantageous to us, though. This allows purchase of a number of logical network ports at the host for a fixed cost per month, independent of connect time or number of characters transmitted. Based on previous usage data, SUMEX could save approximately $1,000 per month in service charges by taking advantage of this charging scheme. We will continue to work closely with NIH-BRP and NLM to achieve the most cost-effective purchase of these services.

ARPANET:

We continue our advantageous connection to the Department of Defense's ARPANET, now managed by the Defense Communications Agency (DCA). Current ARPANET geographical and logical maps are shown in Figure 5 and Figure 6 on page 22. Consistent with agreements with ARPA and DCA, we are enforcing a policy that restricts the use of ARPANET to users who have affiliations with DoD-supported contractors and to system/software interchange with cooperating network sites. We have maintained good working relationships with other sites on the ARPANET for system backup and software interchange. Such day-to-day working interactions with remote facilities would not be possible without the integrated file transfer, communication, and terminal handling capabilities unique to the ARPANET. The ARPANET is also key to maintaining on-going intellectual contacts between SUMEX projects such as the Stanford Heuristic Programming Project, which is authorized to use the net, and other active AI research groups in the ARPANET community.

TELENET:

We recognize the importance of effective, economical communication facilities for SUMEX-AIM users and are continuously looking for ways to improve our existing facilities. During the past year, based on the approval of the AIM Executive Committee and the NIH-BRP, we established an experimental connection to the TELENET network to evaluate its performance for support of the SUMEX-AIM community (see Figure 7 on page 24 for an illustration of the current geographic coverage of TELENET). Our connection was via a TP-2200 interface with 12 asynchronous lines to the SUMEX host and one 4800 baud line connecting to the network proper. TELENET has many attractive features in terms of a symmetry analogous to that of the ARPANET for terminal traffic and file transfers and, being a commercial network, it does not have the access restrictions of the ARPANET. Its tariff schedule also affords lower costs than TYMNET for comparable service volume.

However, despite system changes we made to optimize TELENET performance (Xon/Xoff facilities to improve traffic flow), users felt a substantial degradation in service when using TELENET as opposed to TYMNET. We insisted that users use TELENET whenever possible between November 1978 and May 1979 to maximize user accommodation, so that problems arising from differences in access conventions would not cloud judgements of the services. Complaints included poor node reliability, intolerable delays in response, uneven flow of terminal output, and poor operational management of the network in keeping users informed of network and host status. From the system viewpoint at SUMEX, we detected similar problems.
We received ineffective system engineering support in trying to tune network parameters to optimize performance for our user community and poor or erroneous feedback about network failures and problem resolution. In practice, TELENET offered no service advantages over TYMNET, since no file transfer connections above 1200 baud are currently allowed, no facilities to control local versus remote echoing exist, and no electronic mail system exists to facilitate communication between network operations staff and host nodes. Also, company financial problems portend substantial delays in remedying these problems. Because of grant budget limitations, we were forced to decide between the TYMNET and TELENET connections - only one could be afforded. Based on the distinct user preference expressed for TYMNET, we decided to terminate the TELENET connection as of May 1, 1979. We will continue to monitor TELENET developments (and those of other potential national network servers, e.g., AT&T, IBM, and Xerox) and may recommend a reevaluation of an alternative source for network services in the future.

Figure 4. TYMNET Network Map.

Figure 5. ARPANET Geographic Map, March 1979. (Names shown are IMP names, not necessarily host names; the map does not show ARPA's experimental satellite connections.)

Figure 6. ARPANET Logical Map, March 1979. (Names shown are IMP names, not necessarily host names; host configurations are as supplied by the Network Information Center.)

Figure 7. The TELENET Network.
2.1.2.4 System Reliability and Backup

System reliability has been very good on average, with several periods of particular hardware or software problems. The table below shows monthly system reloads and downtime for the past year. It should be noted that the number of system reloads is greater than the actual number of system crashes, since two or more reloads may have to be done within minutes of each other after a crash to repair file damage or to diagnose the cause of failure.

                   1978                            1979
                   MAY JUN JUL AUG SEP OCT NOV DEC JAN FEB MAR APR

    RELOADS
      Hardware       6   8   5   6   8  10   1   4   2   2   6   4
      Software       0   0   4   5   9   9   5   3   9   4   7  10
      Environmental  3   0   1   0   1   0   0   0   1   0   1   0
      Unknown Cause  7   4   1   4   5   5   1   1   0   0   1   1
      Totals        16  12  11  15  23  24   7   8  12   6  15  15

    DOWNTIME (Hrs)
      Unscheduled   36  22  33  37  28  37   3  14   8  16  17  14
      Scheduled     38  34  22  25  20  31  30  20  22  17  33  16
      Totals (Hrs)  74  56  55  62  48  68  33  34  30  33  50  30

    TABLE 1. System Reliability by Month

During the year, we encountered several hardware problems that caused temporary increases in the number of crashes. These were very intermittent problems that were difficult to isolate and account for the increased number of reloads during September and October 1978 and again in March and April of 1979. Several problems resulted from oxidation of electrical contacts, and we might expect an increase in such age-related failures as the system gets older. Probably the most serious hardware failure was a head crash on one of the swapping disks. A rubber diaphragm burst, forcing one set of heads to contact a platter. The debris from that crash then spread to the other surfaces and caused those heads to crash. We expect repairs to be complete by early July. This may forecast other problems caused by aging of rubber parts in the swapping disks, and we will take steps to replace these if need be before another failure results.

We have had an on-going effort to increase software reliability and have fixed a number of bugs that have been perennial causes of crashes or file loss at system shut-down. Some of these fixes have required setting system stops to get appropriate dumps to analyze the problem causes, and thereby also temporarily increased the number of crashes.
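As a quick way of reading Table 1, the downtime figures can be converted into an approximate monthly availability. The short sketch below is illustrative only; it assumes round-the-clock (24 hour per day) operation, which the table itself does not state.

    # Total downtime (scheduled plus unscheduled), in hours, from Table 1.
    downtime_hours = {
        "May 1978": 74, "Jun 1978": 56, "Jul 1978": 55, "Aug 1978": 62,
        "Sep 1978": 48, "Oct 1978": 68, "Nov 1978": 33, "Dec 1978": 34,
        "Jan 1979": 30, "Feb 1979": 33, "Mar 1979": 50, "Apr 1979": 30,
    }
    days_in_month = {
        "May 1978": 31, "Jun 1978": 30, "Jul 1978": 31, "Aug 1978": 31,
        "Sep 1978": 30, "Oct 1978": 31, "Nov 1978": 30, "Dec 1978": 31,
        "Jan 1979": 31, "Feb 1979": 28, "Mar 1979": 31, "Apr 1979": 30,
    }

    for month, down in downtime_hours.items():
        total = 24 * days_in_month[month]            # assumes round-the-clock operation
        availability = 100.0 * (total - down) / total
        print(f"{month}: {availability:.1f}% up")    # e.g. "May 1978: 90.1% up"

On that assumption, availability ranged from roughly 90% in the worst months (May and October 1978) to about 96% in the best.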