151 An additional factor under very heavy loads at present is the overflow of paging traffic to the moving head disks which occurs occasionally as we do not have the more intelligent fixed head storage manager operating yet. When swapping moves to the disks, two things happen. First, the time to move a page in or out increases by a factor of 4 or 5 and second, the disk channel is tied up more often causing contention between swapping and other disk input/output activities. This may cause particularly long load times for very large programs. This latter problem should be remedied by software changes to allocate fixed head space to active pages and to move dormant ones to moving head storage. From a system design point of view, the remaining response problems are affected by two system resources; memory size and CPU power, Adding more memory would allow more of the runnable jobs to be resident at a given time, thereby giving more continuous attention to more jobs. Increased memory would move the break point in the response curve caused by swapping toward a higher load average. Already though (based on the above comments), at a load average of 2-3 the response is degrading. Since we are running quite efficiently now (15-20% overhead), allowing more jobs to be in. core to compete for CPU power would not effectively solve the response problem. It would have the beneficial effect of "linearizing" the degradation at loads above 4, A second approach would be to increase the inherent processor speed so that jobs would get finished sooner and leave the runnable state more quickly because they are actually completed rather than having their aliquot timed out. This would reduce the load average with the same number of users because jobs would be waiting for user input more of the time. As noted in the report, the experience in up- grading the IMSSS KA-10 to a KI-10 dropped the load average from around 15 to about 5-10. Added CPU power would not change the location of the swapping break on the response vs load average curve, it reduces the load average on the existing curve. It also has the effect of reducing the number of swapping cycles a job has to go through to complete execution when swapping does set in. In the long term both augmentations would likely be desirable. There is an optimum balance between CPU power and system memory size. That balance is reached for interactive jobs when the memory holds enough runnable jobs that when the processor is distributed over all of these jobs, just acceptable user response is achieved. Additional jobs would cause swapping which incurs a more steeply rising response penalty per added job than if more memory were available, but the response would become unacceptable in either case. On the basis of current data and since only one of these resources could be upgraded within the present budget, we feel the CPU augmentation is the better choice. It would alleviate the present bottleneck because it improves the user-perceived response time by improving the actual running time of his job. In addition at lower load averages, it would allow more complex programs to become interactive because they would run faster. 152 For these reasons as the SUMEX-AIM system becomes progressively more heavily loaded with the addition of new users and collaborators, we feel the CPU will be the next most critical resource of the facility. We are currently working on a preliminary plan (see page 51) proposing to allocate available funds within the council-approved award levels to up-grade the present KI-~10 CPU. We expect this plan to be refined in the next few months at which time we will submit it for AIM Executive Committee and NIH review. 153 APPENDIX D PDP-11 SAIL Design Summary SAILEX --- SAIL EXperimental compiler The Portability Problem The number and availability of computers has expanded greatly in recent years, Accompanying this expansion has been a proliferation of programming languages and software systems. Usually, each computer has its own assembly language, and supports one or more high-level languages. The software is written in assembly language, and hence can execute only on the host computer. The many common functions performed by systems software are often obscured by implementations peculiar to a given computer. Assemblers, compilers, text editors, and linkers require separate manuals for each computer. Languages such as FORTRAN, ALGOL and BASIC have restrictions, extensions, and usually many undocumented idiosyncrasies. Operating systems have widely differing capabilities, though the underlying computers may not be so diverse, Programs often have dependencies on the available environment, or employ peculiarities of a particular computer model. Such computer-specific programming limits a program’s portability, i.e. its ability to execute on more than one computer. Some programs are unlikely candidates for portability, e.g. a program which communicates with a one-of-a-kind device. Many programs, however, could be written so that only minimal changes, if any, would be needed to allow execution on another computer. Since assembly languages are designed for specific computers, portable programs must be written in a higher level language, i.e. a language which does not concern itself with the physical makeup of the computer. Machine- independent languages focus attention on the problem to be solved, rather than the implementation of the solution. The M x N problem involves M languages which are to be executable on any of N computers. Each language could have a compiler for each machine, resulting in M * N compilers. Alternately, there could be M compilers which trunslate the languages into a single intermediate code, and N compilers which translate the intermediate code into target code, for a total of M + N compilers. Further reduction in the number of compilers depends on similarities in the languages or target computers. In what follows we shall be concerned with the compilation of a single language for many computers, a reduction of the M x N problem to the 1 x N problem. A distinction snould be made between "language portability" and "program portability." By language portability we mean a language which can be compiled into code for many computers. Program portability is the ability of a particular program to be executed on many computers, 154 Language Portability A number of new developments in computing will have a great impact on language portability. The growing literature on programming techniques is making programmers more aware of general techniques, and more willing to build on the work of others. Computer networks will lead to large-scale sharing of programs, and the desire to execute programs on one computer which were developed on another. Experience on the various computers available over a network should lead to increased awareness of machine-dependent aspects of a program, and an effort to write programs usable by other computers on the network. Memory capacity, available peripherals, even instruction sets sometimes vary little from computer to computer. Computer families with differing options between models are becoming commonplace. A Single language could be compiled into code tailor-made for a particular model, so that programs written in the language could be independent of the target machine. Internal organizations are often comparable, as evidenced by the common general register --- base/displacement architecture. Increases in speed and memory help alleviate inefficiencies caused by not using a language specifically written for a computer. Thus the feasibility of and motivation for language portability seem well established. There is a need for easy compilation of programs for different computers. Each compilation should fit the program to the target machine, making as much use as possible of the target instruction set. The compilation should not spend a great deal of time determining the characteristics of the target machine. Decisions concerning code-generation strategy should be made once, then incorporated into the compiler. The development of compilers for new machines should be an integral part of the compiler system, with as much work as possible done automatically. A simple yet flexible manner of specifying target machines should be available. The compiler itself should be written in the language being implemented, so that it can execute on computers for which it can generate code. Program Portability Program portability is more difficult to obtain than language portability. A program written in a portable language can be compiled into code for many computers, but execution of the code may produce various results on the different computers. For example, such a program could have dependencies on word size, numeric representation, character representation, memory size, addressability, or the file system. Mathematical routines implemented on computers with differing numeric precision are diffieult to port without rather elaborate precautions. Programs which must interact with the operating system, or otherwise "liberate" themselves from the programming language, must be given special attention to minimize such machine-dependence, 155 Portability should be considered in the initial design of a program. A little inefficiency may be acceptable to make a program portable. The resulting implementation is often more efficient, more flexible, and more easily understood and maintained. When machine- dependence is felt necessary, it should be isolated and well documented as such. Many programs which may seem to be machine- dependent can be made portable by judicious design. For example, text editors often have many dependencies on the operating system and file system. Yet the operating and file systems, and the editors, may have very similar capabilities. Thus the dependencies are not inherent to text editing, and an editor capable of running on all computers could be designed with little loss in efficiency. SAILEX --- A Machine-independent Compiling System A machine-independent compiling system is being developed for a subset of the language SAIL, the Stanford Artificial Intelligence Language. SAIL is described in MEMO AIM-204 of the Stanford Artificial Intelligence Laboratory, Stanford University. Changes made to SAIL for use on the TENEX operating system are described in TENEX SAIL, Technical Report No. 248 of IMSSS (Institute for Mathematical Studies in the Social Sciences), Stanford University. SAIL was originally designed for implementation on a PDP-10 computer, but much of the language is machine-independent. Some parts considered too machine-dependent have been eliminated from SAILEX, and some features have been added to enhance the power of SAIL (e.g. double precision). SAIL is certainly not an ideal language for portable programming, but has some characteristics which make it a suitable experimental language for this purpose. A compiler already exists for SAIL, so that the SAILEX system (which is written in SAIL) can be compiled and executed without hand translation to some other implemented language. The language is powerful enough to give a good test of the feasibility of compiling for many computers. A rather large number of users already exist, and there is great interest in using SAIL on machines other that the PDP-10. A means of specifying the semantics of a target machine has been designed to meet the previously discussed criteria. This specification includes: 1, External procedures to be available without declaration. For example, double precision numerical routines may be available on some machines. 2. Registers, and register classes. Examples of register classes are integer registers, floating point registers, and index registers, 156 3. Information needed for storage allocation. For example, the number of addresses per integer. 4, Switches governing operation of the resulting compiler. For example, can the symbol table be kept in memory; is readable intermediate code desired. 5. Code generators. A language has been designed for specifying code generation (the compiler outputs assembly language, not binary code). This language gives the code generators the appearance of the target code being specified. The full power of SAIL can be used to write the code generators, but due to the built-in capabilities of the generator language, such power will probably never be necessary. The compiler makes almost no assumptions about the target machine. Extensive procedures for manipulating registers and their contents are available, but are used only if invoked by a generator. Much of the work of code generation has been found to be machine- independent; only the specifics of a target machine need be determined. The generators are responsible for requesting loading of registers for instructions which require register addressing, most local optimization, choice of instruction sequence, addressing format, and allocation of storage. The compiler does all the bookkeeping tasks such as remembering what is in the registers; loading, storing, clearing and marking registers as requested; searching for the "optimal" free register; remembering what operands have been used, for later allocation; parameter passing schemes; symbol table maintenance; and file manipulation. The semantic specification of a target machine is used to create a compiler specifically for that machine. Extensive conditional compilation insures that those parts of the compiler not necessary for a particular machine will not be present. Compilers have been created for the PDP-11/45, PDP-11/40, PDP-10, IBM-SYSTEM/360, VARIAN-6201, and NOVA. More computers are being considered, e.g. CDC family, INTERDATA, BURROUGHS family, DATAPOINT, and SIGMA, The system will continually be generalized as these machines are examined. Specification of a new machine without radical departures from all previously specified should be straightforward. The PDP-11/40 was specified in two hours, but the specification of the PDP-11/45, which had already been specified, was used as a template. The VARIAN machine took two days, but this included learning the instruction set (the code has not been carefully checked). In general, a poor implementation can be produced very quickly, a compiler can be generated, and the resulting code checked. This indicates how the code ean be improved, and the process starts again. A compiler can be completely created from the semantic specification in a matter of minutes (depending, of course, on the speed of the computer being used). A manual describing how to specify a target machine, with extensive examples from those already described, will be produced. Familiarity with SAIL and the specification manual should be 157 sufficient to produce a compiler for a new target machine with little trouble. The code produced is not globally optimized, but otherwise rather good. For example, SAILEX appears to produce 20-30% less code for the PDP-10 than the currently available SAIL compiler. Comparison with the code generated by the FORTRAN compiler on the PDP-11 and an "equivalent" SAIL program indicates a reduction in size of about 30%. Such measurements are not precise, and are given only to indicate that SAILEX does not output inefficient code, Care has been taken to insure that the SAIL compiler is portable. By compiling the compiler, a SAIL compiler can be created which runs on the target machine (a runtime system must be written for the target machine). This has been done for the PDP-11/45, s0 that the SAIL compiler is available on a PDP-11/45 (also on a PDP-10), (The SAILEX compiler, together with the runtime system and symbol table space, takes about 20K words on the PDP-11/45 (the runtime system is a library). About 2K more words are needed for string space.) Future areas to be considered are: 1, Final testing of the PDP-11/45 runtime system, and preparation for its export to other sites. 2. Specification of more machines, and resulting generalization of the code generation scheme. 3. Removal of any "hidden" machine dependencies in the compiler system. 4, Implementation of more of SAIL (e.g. full macro facility, records and references). 5. Machine independent global code optimization. The compiler has been designed to facilitate code optimization within a variable size window about the intermediate code. This is not yet done. 6. Design of a machine independent runtime system. The PDP=11/45 runtime system is written in PDP-11/45 assembly language for execution under DOS, version 9. Much of the design could in fact be abstracted from this setting. This abstraction might be viewed as a blueprint from which runtime systems for other machines could be developed, or it might be possible to actually write most of the system in SAIL, and directly export it. 158 APPENDIX E Subsystems and Documentation Directories Nancy Smith December 1974 (updated April 1975) The sources of available documentation for these programs will be abbreviated as follows: TUG Tenex User’s Guide (1975 edition) DUH DEC Users Handbook DAL DEC Assembly Language Handbook DML DEC Mathematical Languages Handbook HC a hard-copy manual for the language OL on-line documentation which can be found by @DIR programname.* . The following extensions are used on the directory: «MANUAL complete usually fairly long manual -HELP or .HLP shorter summary, list of commands, etc, . SUPPLEMENT on-line supplement to hard-copy doc » UPDATE list of updates by date » SAMPLE sample program or output See A-LIST-OF-ALL~AVAILABLE=-DOCUMENTS, INFO for complete details on these documents including where and how to order them. Many of the major programs also have a programname.BBD file where messages about new developments, bugs, hints for using the program ete. are sent. These files can be read by any of the mail reading programs (READMAIL, RD, or BANANARD), New programs or new versions of old programs will be put on for a trial period. The file NEW-SYSTEMS.INFO which is a message file will have a message about each program available, The doe for these new programs will also be kept on the directory. These new programs will not be included in the list of programs given here, The EXEC command @HELP prints a file with general help information. SUBSYS DESCRIPTION DOC SF SS A eh Se SS Sa cm OO mT Dr eS OD cn we ee eee we oe ee ens em 2S IDES makes files for multi-colums and/or 2-sided listing OL ACCESS ADDMSG AID AIFAIL BAIL BACKUP BANANARD BASIC BCPL BINCOM BLIS10 BLIS1 1 BLISS BUDGET BYE CALENDAR CAM CCL COP YM CREF CRSREF DCHANGE DCHECK DDT DED DELOLD DELVER DIABLO DIRNUM DO DONE DROP DTACOP DUMPER EE ES EXTR F40 FAIL FED FILCHK FILCOM FILDMP FILES FORTRA FREQ FRKCOM FTP FUDGE2 GETDMP GRIPE HELP 159 gives a list of subsys’s currently available to GUESTs appends a msg to a specified file algebraic interpretive dialog conversational lang. HC assembly lang. - early version of FAIL from SU-AI OL,HC preliminary version of SAIL debugger (on ) OL short term file loss protection OL msg reading program (many extra features) OL conversational programming lang. (DEC version) OL,DML,TUG compiler writing and systems programming lang. HC binary comparison of files (now replaced by FILCOM) DAL compiler for system implementation (DEC version) OL,HC,TUG BLISS for the PDP11 compiler for system implementation (TENEXized) OL,HC(DEC) budget management program (especially proposals) OL @BYE same as @BREAK (LINKS) calendar management and reminder system OL,TUG the compare and merge program of SOUP see SOUP.MANUAL concise command language OL , DUH reading/writing DECtapes OL, TUG cross-reference assembly listing OL,DAL TENEX cross-referencing program (outfile_infile(s)) character set conversion for "foreign" tapes OL reads blocks of file into core & calls DDT to examine OL debugger (single-stepping added at IMSSS) OL, TUG, DAL text-editor (designed for TENEX) OL deletes files by cutoff date of last access OL deletes excess versions of files TUG prints final copy of PUB-produced documents on DIABLO OL translates directory name to number for DEC programs OL creates or appends a line to a reminder file OL deletes a line from a reminder file OL Similar to DELVER, deletes oldest and 2nd newest on #.# DECtape to DECtape copy reads/writes magnetic tapes @EE runs program on your directory as ephemeral @ES runs program on as ephemeral "EXTRactor"™ processes MACRO/FAIL source files to produce .FAI listing of labels defined FORTRAN IV (see also FORTRAN.HELP and OL ,TUG, DML LISP-FORTRAN-INTERFACE.HELP ) assembly language (BBN version of FAIL) OL ,HC e also JSYS manual & JSYS.INFO) the final edit program of SOUP see SOUP, MANUAL checks SAIL programs for loader incompatibilities OL complete file comparison package OL,DAL,TUG dumps files in variety of formats OL multiple to multiple copies, renames, protections FORTRAN10(version 1A) (see also FORTRAN. HELP) ranks words in text file according to frequency compares an address space with address space of file TUG ARPANET file transfers updates/manipulates files containing rel programs loads into core .dmp file from SU-AI (SAV only to 677777) type filename to * prompt sends comments or complaints about system to staff TUG prints out short general help file for SUMEX or help for OL ,HC TUG DAL ,TUG HOSTAT IFAIL ILISP IMSSS LD LINK10 LINK11 LINKSTAT LISP LOADER LOADGT MACRO MAILSTAT MANTIS MLAB MULTI MY~- ACCOUNTS NETSTAT NON PCSAMP PIP PIP11 PNTMAK POET PPL PROFIL PUB RD READMAIL RECORD REDUCE RUNFIL RUNOFF SAIL SCAN SEARCH SEGSAV SITBOL SNDMSG SNOBOL SORT SOS SPELL SRCCOM SWITCH SYSIN TABLE TALK TAPCNV TBASIC TCTALK 160 programs prints network site status information TUG assembly language (IMSSS version of FAIL) OL,HC UC Irvine LISP (extension of LISP 1.6) OL direct link to IMSSS prints SYSTAT-like info including msg if MAIL WAITING DEC loader OL,DAL,TUG linker for PDP11 DOS operating system prints status of IMSSS link INTERLISP-see also LISP-FORTRAN~ INTERFACE, INFO OL , HC (from IMSSS)-see LINK10-LOADER-DIFFERENCES, HELP TUG GT40 standard format loader assembly lang.-see JSYS manual & JSYS. INFO TUG, DAL info on queued mail TUG Fortran debugger mathematical modeling and graphics package OL multiple-fork supervisor~-switches between forks prints user’s valid accountnames prints info on ARPANET status TUG zero-compresses file, options to remove linenumbers, pagemarks, etc. measures the operation of other user programs TUG DEC utilities program OL,DUH transfers PDP11 DOS DECtapes to/from TENEX files OL converts underlines to suitable format for LPT: OL text editor designed for TENEX use OL an interactive extensible programming lang. TUG gives freq of execution of SAIL statements OL , HC document preparation lang. OL mail reading program (BANANARD is better) TUG mail reading program " 7 TUG for pseudo-ttys, typescript of job, detaching OL from running job symbolic algebraic language OL uses file instead of tty for input commands TUG document~-preparation language (DEC not BBN version) OL ALGOL~like lang.-see also LEAP. MANUAL OL ,HC scans multi-directories for a variety of file info OL searches multi-text files for English words or SAIL OL identifiers, can be used with TV editor reads .shr & .low files to produce TENEX .sav OL compiler version of SNOBOL OL message sender OL,TUG string-processing programming lang. OL,HC stand alone COBOL column-oriented text file sorter OL ,TUG text editor OL spelling checker/corrector for text files (not TENEX) OL compares text files TUG Switches the format of a reminder file OL executes LISP SYSOUT’s OL creates conversion tables for DCHANGE used with LINK command to eliminate need for ;‘s reads card image file processed by MTACPY TUG TENEXized version of DARTMOUTH BASIC OL teleconferencing over ARPANET OL TECO TMERGE TODAY TRITAP TTYTRB TTYTST TV TVFIX TYMS TAT TYPBIN TYPREL WATCH WATCH. IMS WHAT WHO WHOIS XED XT Z 161 text editor (see TENEX TECO manual) merges specified text pages from files into new file lists the contents of today’s reminder file processes magtapes from XEROX, IMSSS, BBN used to report terminal line problems prints test patterns for diagnosing terminal text editor for TEC and DATAMEDIA displays restores bad TV files (see TV.MANUAL) (for TYMNET lines only) gives measure of current efficiency of TYMNET transmission does an octal dump of a packed file analyzes contents of .REL files continuous on-line monitoring of system activity TUG IMSSS version of WATCH lists the contents of a reminder file prints SYSTAT-like information OL, TUG OL OL OL TUG TUG OL TUG TUG OL looks up username & prints name/address info on user OL text~editor (used with BANANARD) reformats and prints text file logs jobs off from inferior forks OL OL 162 DIRECTORY LISTING - The following is a listing of the directory which contains most of the on-line formal documentation about the system and subsystem, 18=@MAY-75 16:06:22 2SIDES. HELP; 1 A-GENERAL.HELP;11 A-GUIDE-TO-TENEX-USER “S-GUIDE. INFO; 1 A-LIST-OF-AVAILABLE-DOCUMENTATION. INFO;7 A~SURVEY-OF -THE-DEC-HANDBOOKS. INFO;9 ACCOUNT~NAME-USAGE. INFO; 1 ALL~SUBSYS “S~AVAILABLE-AT-SUMEX. INFO ;4 BACKUP .HELP; 1 BAIL.HELP;2 BANANARD.MANUAL:2 UPDATE; 3 BASIC. HLP;1 BBN-PROGRAM-VERSION-NUMBER~STANDARDS. INFO; 1 BLIS10.HLP;2 BLISS.HELP; 1 BSYS.MANUAL;2 BUDGET. SAMPLE; 2, 1 MANUAL $5 ,4, 3 CALENDAR.MANUAL; 1 CCL.HELP;4 - UPDATE; 1 CHECKDSK.HELP 32 CHESS .HELP; 1 CLEAN.HELP;1 COPYM.HELP;2 CREF.HLP;1 - UPDATE; 1 DCHANG .HLP;1 DCHANGE. MANUAL: 1 DCHECK .HELP; 1 DDT.SUPPLEMENT; 1 DEC-HANDBOOK-GLOSSARY-UPDATE. INFO; 1 DEC/TENEX-COMMAND-EQUIVALENTS. INFO; 1 DED.MANUAL; 1 DELOLD .HELP;1 DESCR IPTION-OF-SUMEX-AIM=PROJECTS . INFO;3 DIABLO. HELP; 3 DIRNUM.HELP;1 DO .HELP;4 DUMP .INFO; 1 EDIT. INFO; 1 EDITOR~PROGRAM-INTERFACE. INFO; 1 FAIL. MANUAL; 1 »HELP;4 163 FILCHK.HELP;1 FILCOM.HLP;4 FILDMP .HELP;2 FILEX.UPDATE; 1 »DOC;1 FORDDT.DOC; 1 »-HLP; 1 FORTRA.HLP; 1 FORTRAN.HELP3;2 HOW-TO-UPDATE-DOC. INFO;3 IDDT.HELP; 1 ILISP.MANUAL; 1 INTERROGATE.HELP;4 INTRO-TO-SUMEX-AIM-TENEX. INFO 34 LEAP .MANUAL; 1 LINK10.HLP; 1 -DOC;1 LINK10-LOADER-DIFFERENCES.HELP;1 LISP.HELP; 3 .UPDATE; 1 LISP=FORTRAN~+INTERFACE.HELP;2 LIST. HELP; 3 MACRO.HLP; 1 -DOC; 1 MINI-DUMP.LISTING;1 MLAB. HLP;2 NAIVE. PUB; 1 NSOS .SUPPLEMENT; 1 . MANUAL; 1 - INTRO; 1 OVERVIEW-OF -COMPUTER=SYSTEM. INFO; 1 PIP.HLP3;1 - SUPPLEMENT; 1 . UPDATE; 1 PIP11.HELP;1 PNTMAK .HELP;1 POET.HELP;1 . MANUAL; 1 PROFIL.UPDATE32 PROJECTS~AND-ASSOCIATED-USERS. INFO;8 PUB.MANUAL;2 -HELP 34 -UPDATE;9 PUB-MANUAL. PUB; 10 RECORD,MANUAL; 1 REDUCE .MANUAL; 1 RUNOFF .MANUAL;2 SAIL.HELP;1 . SUPPLEMENT; 2 -UPDATE; 3 . TENEX-SUPPLEMENT; 1 - BEGIN-MANUAL; 1 SAMPLE .PUB; 1 SCAN.HELP;1 SEARCH. MANUAL;3 . LNFO;3 164 SEGSAV.HELP; 1 SETTING-UP-NEW-USER-DIRECTORIES. INFO; 1 SITBOL HELP; 1 SNDMSG.HELP; 6 SNOBOL. MANUAL; 1 SORT.HLP;1 SOS .UPDATE;6 -HELP;4 «MANUAL; 1 SOUP .HLP;1 » MANUAL; 1 SPELL. MANUAL ;9 SUMEX-JSYS “S. INFO; 1 SYSIN.HELP;1 SYSTEM-SCHEDULE. INFO; 3 TBASIC. HELP; 2 .SAMPLE; 1 . MANUAL; 4 TCTALK.DOC; 1 TECO.SAMPLE; 1 - COMMANDS; 1 «HELP; 4 -TEXT131 »TEXT2;1 -OUMMARY; 1 TELNET. INFO; 1 TENEX-EXEC-MANUAL-UPDATE. INFO;5 TERMINAL-LINKING. INFO; 1 TMERGE.HELP;3 TRITAP.HELP;1 TV .UPDATE;8 « MANUAL ;6 TV-STRINGS.PMAP; 1 TYMNET-INSTRUCTIONS. INFO; 1 USER~NAME~ADDRESS-PHONE. INFO; 29 WHOIS.HELP;1 XED.HLP3;1 «MANUAL; 1 XT .HELP;1 142 FILES, 1332 PAGES 165 DIRECTORY LISTING - The following is a listing of the directory which is a repository of informal or transient information about the system, subsystems, current events, and items of interest. 18-MAY-75 16:06:42 12-MAR-75.SYSLTR; 1 ARPANET.BBD; 1 ASCII. BBD; 1 BASIC. BED; 1 BULLETINS.BBD; 1 CALENDAR. BBD;4,3,1 COMPATIBILITY. BBD; 1 CONSTANTS (PHYSICAL-OR-CHEMICAL ). BBD; 1 DIURNAL-LOADING~WEEKDAYS. MAR; 1 3/31/7531 2/17/7531 2/24/7531 3/3/75; 1 - 3/10/75; 1 3/17/7531 .3/24/7531 .APR;1 . FEB; 1 DO .BBD;1 EDIT.BBD; 1 EMPLOYMENT-WANTED. BBD; 1 FILES.BBD; 1 FORTRAN. BBD; 1 GOOD-LISP-USAGE.BBD;2 GOOD-SYSTEM-USAGE .BBD; 1 GUEST-LIST.BBD;2 IN-WATS.BBD; 1 KWIC .BBD; 1 LIBRARY~SAIL. BBD; 1 LINK10. BBD; 4 LINKING. BBD; 1 LISP. BBD; 2 LIST.BBD; 74 LOGIN-CMD. BBD; 1 LOGIN-MESSAGES.BBD;2 MACRO. BED; 1 MEETINGS.BBD; 1 MLAB. BBD; 1 NEW-EXEC. INFO; 1 OLD-SYSTEM-MESSAGES. BBD; 1 °DP11-GT40. BBD; 1 SOS ITIONS-AVAILABLE.BBD; 1 PROTECTION. BBD; 1 nECORD. BBD; 1 166 SAIL. BBD; 1 . s1 SEARCH.BBD; 1 SNDMSG-READMAIL, BBD; 1 SORT.BBD; 1 SOS .BBD; 1 SPELL. BBD; 1 SYSTEM-MESSAGES, BBD; 1 TECC.BBD; 1 TENEX. BBD; 1 TESTIMONIALS .BBD; 1 TV .BBD;1 TV-STRINGS.PMAP; 1 TYMNET.BBD; 1 60 FILES, 149 PAGES 167 APPENDIX F Networking and Collaborative Research ~ DENDRAL Project The following is a preprint of a paper to be presented at the 170th meeting of the American Chemical Society in Chicago during August 1975 ~- the symposium title is "Computer Networking in Chemistry". This paper will appear in the proceedings and reflects well the orientation and activities of the SUMEX-AIM resource and its collaborating projects (DENDRAL in this case). NETWORKING AND A COLLABORATIVE RESEARCH COMMUNITY: A CASE STUDY USING THE DENDRAL PROGRAMS. Raymond E. Carhart*, Suzanne M. Johnson, Dennis H. Smith, Bruce G. Buchanan, R. Geoffrey Dromey, and Joshua Lederberg. Departments of *Computer Science, Genetics, and Chemistry, Stanford University, Stanford, California, 94305, Computer Science is one of the newest, but also one of the least "cumulative" of the sciences. Gordon (1) has recently pointed out the upsetting disparity between the number of potentially sharable programs in existence and the number which are easily accessible to a given researcher, Although some mechanisms exist for the systematic exchange of program resources, for example the World List of Crystallographic Computer Programs (2), a great deal of programming effort is duplicated among different research groups with common interests. The reasons for this are understandable: these groups are separated by geography, by incompatibilities in computer facilities and by a lack of a means to keep abreast of a a rapidly changing field. The emergence of more economical technologies for data communications provides, in principle, a method for lowering these geographical and operational barriers; for creating, through computer networking, remote sites at which functionally specialized capabilities are concentrated. The SUMEX-AIM (Stanford University Medical EXperimental computer - Artificial Intelligence in Medicine) project is an experiment in reducing this principle to practice, in the specific area of artificial intelligence research applied to health sciences, : The SUMEX-AIM computer facility (3) is a National Shared Computing Resource being developed and operated by Stanford University, in partnership with and with financial support from the Biotechnology Resources Branch of the the Division of Research 168 Resources, National Institutes of Health. It is national in seope in that a major portion of its computing capacity is being made available to authorized research groups throughout the country by means of communications networks. Aside from demonstrating, on Managerial, administrative and technical levels, that such a national computing resource is a viable concept, the primary objective of SUMEX-AIM is the building of a collaborative research community. The aim is to encourage individual participants not only to investigate applications of artificial intelligence in health science, but also to share their programs and discuss their ideas with other researchers, This places a responsibility upon SUMEX-AIM to develop effective means of communication among community members and among the programs they write. It also places responsibility upon those members to design and document programs that readily can be used and understood by others, Another aspect of the SUMEX facility is providing service to individuals whose interest is in using, rather than developing, the available computer programs. Although this is not a primary consideration, it is an essential part of the growth of these Programs. Most of the SUMEX-AIM projects have formed, or are forming, their own user communities which provide valuable "real world" experience. Figure 1 depicts the typical interaction of such a project with its user community, and with other projects. The participation by users in program development is not just restricted to suggestions, but can also include software created by computer- oriented users to satisfy Special needs. In some projects, methods are being considered to further promote this kind of participation. The purposes of this paper are threefold: first, to indicate the range of research projects currently active at SUMEX; second, to describe in detail one of these projects, DENDRAL, which is of particular interest to chemists; and third, to discuss some problems and possible solutions related to networking and community-building, I. Research Activities at SUMEX-AIM The community of participants in SUMEX-AIM can be divided geographically into local (1. e., Stanford-based) projects and remote projects, and below is given a brief description of the major representatives of each. Communication with the remote projects is accomplished through one or both of the communications networks shown in Figure 2. In most cases, connection with SUMEX-AIM from these remote sites involves only a local telephone call to the nearest network "node", The SUMEX-AIM system is itself undergoing constant improvement which deserves to be called research, and thus a third section is included here to represent system developments. 169 Remote projects The Rutgers project. Originating from Rutgers University are several research efforts designed to introduce advanced methods in computer science - particularly in artificial intelligence and interactive data base systems - into specific areas of biomedical research. One such effort involves the development of computer-based consultation systems for diseases of the eye, specifically the establishment of a national network of collaborators for diagnosis and recommendations for treatment of glaucoma by computer. Another project concerns the BELIEVER program, which represents a theory of how people arrive at an interpretation of the social actions of others. SUMEX-AIM provides an excellent medium for collaboration in the development and testing of this theory. The Rutgers project includes, in addition, several fundamental studies in artificial intelligence and system design, which provide much of the support needed for the development of such complex systems. The DIALOG project. The DIAgnostic LOGiec project, based at the University of Pittsburgh, is a large scale, computerized medical diagnostic system that makes use of the methods and structures of artificial intelligence. Unlike most other computer diagnostic programs, which are oriented to differential diagnosis in a rather limited area, the DIALOG system has been designed to deal with the general problem of diagnosis in internal medicine and currently accesses a medical data base which encompasses approximately fifty percent of the major diseases in internal medicine. The MISL Project. The Medical Information Systems Laboratory at the University of Illinois (Chicago Circle campus) has been established to explore inferential relationships between analytic data and the natural history of selected eye diseases, both in treated and untreated forms. This project will utilize the SUMEX-AIM resource to build a data base which could then be used as a test bed for the development of clinical decision support algorithms. Distributed Data-Base System for Chronic Diseases. This project, based at the University of Hawaii, seeks to use the SUMEX-AIM facility to establish a resource sharing project for the development of computer systems for consultation and research, and to make these Systems available to clinical facilities from a set of distributed data bases. The radio and satellite links which compose the communication network known as the ALOHANET, in conjunction with the ARPANET, will make these programs available to other Hawaiian islands and to remote areas of the Pacific basin. This project could well have a significantly beneficial effect on the quality of health care delivery in these locations, Modeling of Higher Mental Functions. A project at the University of California at Los angeles is using the SUMEX-AIM facility to construct, test, and validate an improved version of the computer simulation of paranoid processes which has been developed. These simulations have clinical implications for the understanding, treatment, and prevention of paranoid disorders. The current interactive version (PARRY) has been running on SUMEX-AIM and has 170 provided a basis for improvement of the future version’s language recognition capability. Local Projects The Protein Crystallography Project. The Protein Crystallography project involves scientists at two different universities (Stanford and the University of California at San Diego), pooling their respective talents in protein crystallography and computer science, and using the SUMEX-AIM facility as the central repository for programs, data and other information of common interest. The general objective of the project is to apply problem solving techniques, which have emerged from artificial intelligence research, to the well known "phase problem" of x-ray crystallography, in order to determine the three-dimensional structures of proteins. The work is intended to be of practical as well as theoretical value to both computer science (particularly artificial intelligence research) and protein crystallography. The MYCIN project. MYCIN is an evolving computer program that has been developed to assist physician nonspecialists with the selection of therapy for patients with bacterial infections, The project has involved both physicians, with expertise in the clinical pharmacology of bacterial infections, and computer scientists, with interests in artificial intelligence and medical computing. The MYCIN program attempts to model the decision processes of the medical experts. It consists of three closely integrated components: the Consultation System asks questions, makes conclusions, and gives advice; the Explanation System answers questions from the user to justify the program’s advice and explain its methods; and the Rule- Acquisition System permits the user to teach the system new decision rules, or to alter pre-existing rules that are judged to be inadequate or incorrect. The DENDRAL project. This project, being of particular chemical interest, is described in detail in Section II. Through the SUMEX-AIM facility DENDRAL has gained a growing community of production-level users whose experience with the programs is a valuable guide to further development. Although technically users, some members of this community might better be described as collaborators because they have provided SUMEX-AIM with various special-purpose programs which are of interest to other chemists and which extend the usefulness of the DENDRAL programs. SUMEX-AIM System Development Current research activities at SUMEX-AIM are developing along several lines. On a system development level there are ongoing projects designed to make the system more user oriented. Currently, the system can be expected to provide help to the user who is confused about what is expected in response to a certain prompt. A "?" typed by the user, will, in most cases, provide a list of possible responses 171 from which to choose. Also available in response to typing "HELP" to the monitor is a general help description containing pointers to files likely to be of interest to a new user. In an effort to facilitate communication between collaborators, a program called CONFER has been developed to provide an orderly method for multiple participant teletype "conference calls". Basically, the program acts as a character processor for all the terminals linked in the conference, accepting input from only one at a time, and passing it out to the remaining terminals. In this way, the conference, in effect, has a "moderator" terminal, thus allowing for a more orderly transfer of ideas and information. SUMEX-AIM is also aware of the necessity of making its facilities available for trial use by potential users and collaborators. To this end, a GUEST mechanism has been established for persons who wish to have brief, trial access to certain programs they feel may be of value to them, and about which they would like to obtain more knowledge. This provides a convenient mechanism whereby persons, who have been given an appropriate phone number and LOGIN procedure, can dial up SUMEX-AIM and receive actual experience using a program they may only have heard about. Another area of system development currently being explored at SUMEX-AIM is that of creating a comprehensive "bulletin board" facility where users can file "bulletins", that is, messages of interest to the SUMEX-AIM community. The facility will also alert users to new bulletins which are likely to be of interest to them, as determined by individual user-interest profile. II. DENDRAL - CHEMICAL APPLICATIONS OF INTERACTIVE COMPUTING IN A NETWORK ENVIRONMENT The major research interest of the DENDRAL Project at Stanford University is application of artificial intelligence techniques for chemical inference, focusing in particular on molecular structure elucidation. Portions of our research are in the area of combined gas chromatography/high resolution mass spectrometry and include instrumentation and data acquisition hardware and software development. This area is beyond the scope of this report; we focus instead on the concurrent development of programs to assist chemists in various phases of structure elucidation beyond the point of initial data collection. SUMEX~AIM provides the computer support for development and application of these programs. Another aspect of our research is our commitment to share developments among a wider community. We feel that several of our programs are advanced enough to be useful to chemists engaged in related work in mass spectrometry and structure elucidation in general. These programs are written primarily in the programming language INTERLISP, and thus are not easily exportable (exceptions are 172 indicated subsequently). SUMEX-AIM provides a mechanism for allowing others access to the programs without the requirement for any special programming or computer system expertise. The availability of the SUMEX facility over nationwide networks allows remote users to access the programs, in many instances via a local telephone call. Much of the following discussion is preliminary because our programs have only recently been released for outside use. Some announcement of their availability has been made, and other announcements will occur in the near future, through talks, publications in press, demonstrations and informal discussions. Although most of our experience has been with local users, they have been good models of remote users in that their previous exposure to the actual programs and computer systems is minimal. Their experience has been extremely useful in helping us to smooth out clumsy interactions with programs and to locate and fix program bugs. Such polishing is important for programs which may be utilized by users from widely differing backgrounds with respect to computers, networks and time sharing systems. We are in the processes of building a community of remote users. We actively encourage such use for two reasons: 1) we feel the programs are capable of assisting others in solving certain molecular structure problems, and 2) such experience with outside users will be a tremendous assistance in increasing the power of cur programs as the programs are forced to confront new real- world problems. The remainder of this section outlines the programs which are available via SUMEX, the utilization of these programs in helping to solve structure elucidation problems and the limitations we see to their use. We discuss current applications of the programs to our research anc the research of other users to illustrate better the variety of potential applications and to stimulate an interchange of ideas. Where appropriate, we point out current difficulties with the use bota of our programs and of SUMEX. New applications and wider use will certainly change the nature of these problems; we strive to solve current problems, but new ones will always arise to take their place. DENDAAL Programs We have several programs which we employ in dealing with various aspects of problems involving unknown structures. Some of these progratis are exportable, while the remainder are available at SUMEX. The availability of each program is discussed below. Our icitial emphasis in studying applications of artificial intelligence for chemical inference was in the area of mass spectrometry (4-6). This emphasis remains because many of our problems require mass spectrometry as the analytical tool of choice in providing structural information on small quantities of sample. More recently, we have been developing a program (CONGEN, below) directed at more general aspects of structure elucidation. This has extended the secpe of problems for which we can provide computer assistance. We will begin, however, with discussion of the mass spectrometry 173 programs, The examples used in the discussion are characteristic of our current research problems, although we have focused on relatively simple problems to keep the presentation brief. We trace, in what might be chronological terms, the application of the programs to various phases of a structure problem. In this way we hope to illustrate the place of each program in the analysis. We begin by discussing preprocessing of mass spectral data (CLEANUP and MOLION). Subsequent analysis of such data in terms of structure is then covered (PLANNER). The use of CONGEN is discussed for problems which cannot be handled by the previous programs. Finally, we discuss efforts to discover, with the use of the computer, systematics in the behavior of known substances in the mass spectrometer as a means of extending the knowledge of the system for applications in new areas (INTSUM and RULEGEN). Programs for Molecular Structure Problems The first three programs, CLEANUP, the lLibrary-search program and MOLION are in a sense utility programs, but all three play a critical role in processing mass spectral data. Subsequent applications of programs (e.g., PLANNER) for more detailed spectral analysis in terms of structure depend on the successful treatment of the data by CLEANUP and MOLION, while the library search program filters out common spectra which need not undergo a full analysis. The examples used are drawn from our collaboration with persons in the Genetics Research Center at Stanford Hospital. The experimental data which are collected are the results of combined gas chromatographic/low resolution mass spectral (GC/LRMS) analysis of organic components (chemically fractionated and derivatized where necessary) of body fluids, e.g. blood, urine. A typical experiment consists of 500-600 individual mass spectra for each fraction, taken sequentially over time as the various components, largely separated from one another, elute from the gas chromatograph and pass into the mass spectrometer. Each mass spectrum consists of the mass analyzed fragment ions of the component(s) in the mass spectrometer at the time the spectrum was taken. Such spectra are related, indirectly, to the molecular structure of the component(s). CLEANUP(7). The individual mass spectra obtained from fractionated GC/LRMS analysis are quite often poor representations of corresponding spectra taken from pure compounds. They can be contaminated by the presence of additional peaks and/or distortions of the intensities of existing peaks in the spectrum. Fragment ions from either the liquid phase of the GC column or from components incompletely separated by the gas chromatograph are are responsible for the contamination. We have developed a program, referred to here as CLEANUP, which examines all mass spectra in a GC/LRMS run, selects those spectra which contain ions other than background impurities, and remove contributions from background and overlapping components. A spectrum results which compares favorably with the spectrum of a pure component. Biller and Biemann (8) have developed a similar but less powerful program. 174 For example, the CLEANUP program detected components at points marked with a vertical bar in the (partial) plot of total ion current vs. scan number (time), Figure 3. Note that overlapping components were detected under the envelopes of the GC peaks in the region of scans 485-488, 525-529 and 539-552. We focus our attention on the spectrum recorded at sean 492, The raw data, prior to cleanup, are presented in Figure 4 (top). The spectrum resulting from CLEANUP is presented in Figure 4 (bottom). Note that the large ions (e.g., m/e 207, 221 and 315) from background impurities are removed, and that the intensity ratios of peaks at lower masses (e.g., 51 and 77) have been adjusted to reflect their true intensities in the spectrum, The CLEANUP program is capable of detection of quite low-level components in complex mixtures as indicated by some of the areas of the total ion current plot (Figure 3) where components were detected. It is completely general because nothing in the program code is sensitive to the types of compounds analyzed or the characteristics of possible impurities associated with the compounds or from the GC column. Its major limitation is that mass Spectra must be taken repetitively during the course of a GC/MS run. Its performance is enhanced when such spectra are measured closely in time. The program is offered via SUMEX as an adjunct to use of our other programs; it is not offered as a routine service. Because the program is written in FORTRAN, we routinely use it on our data acquisition computer system so as not to burden SUMEX with tasks better done elsewhere, Similarly, we would assist other frequent users to mount the program on their own systems, Latrary Searok With a set of "clean" mass spectra available, the next oroplem is identification of the various components. Over the course of several years, libraries of mass Spectral data have been assembled(9). These libraries oan be very useful in weeding out from a group of spectra those which represent known compounds(10). Clearly, one should spend time on solving the structures of unknown compounds, not on rediscovering old ones. The CLEANUP program provides mass spectra which are of sufficient quality to expect that known compounds would be identified easily from such libraries, 4+ fe The spectra detected by CLEANUP in the region of scans 480 to 580 (Figure 3) were matched against the library of biomedically relevant spectra compiled by S. Markey (National Institutes of Health) and cur extensions to that library (we wish to thank S. Grotch, Jet Propulsion Laboratory, Pasadena, Ca. for providing some of the library matchings), Excellent matches with the library were obtained for scans 492, 496, 509, 519, 529 and 548. The components are indole acetic acid methyl ester, di-n-butylphthalate, caffeine, salicyluric acid methyl ester, methoxyhippuric acid methyl ester and n-C24 hydrocarbon respectively (structures given in Figures 3 and 4). Spectra scans at 485, 487, 525, 530, 536, 539, 554 and 576 did not Match weli with any spectrum in the library and thus must be examined further for structural information. The necessity for preprocessing the data using CLEANUP pricr to library matching is illustrated from indole acetic acid methyl ester (scan 492). The "clean" Spectrum 175 (Figure 4, bottom) was matched to the library spectrum of this compound much better than to that of any other compound. The raw spectrum (Figure 4, top), when compared to the library resulted in eleven other compounds which matched more closely than the correct one. This brief example illustrates the obvious value and limitations of library searching. The most interesting compounds for subsequent analysis are those which are unknown. The fractions of urine extracts are replete with unidentified compounds because of the inadequacy of current library compilations. As new compounds are identified they are, of course, added to the library, so that future analyses need not reinvestigate the same material. We currently perform library searching on our data acquisition and reduction computer systems. We can, if necessary, offer limited library search facilities via SUMEX, However, because commercial facilities are available (e.g., over the GE network), routine Library search service is not available on SUMEX. MOLION(11). At this stage we are left with a collection of mass spectra of unknown compounds. The library search results may have provided some clues as to the type of compound present, e.g., compound class. Structure elucidation now begins in earnest. The key elements in problems of structure elucidation are the molecular weight and empirical formula of a compound. Without these essential data, the structural possibilities are usually too immense to proceed further. Mass spectrometry is frequently used to determine molecular weights and formulae, but there is no guarantee that the mass spectrum of a compound displays an ion corresponding to the intact molecule. For example, many of the derivatives of the amino acid fractions of urine display no molecular ions. When we are given only the mass spectrum (and for GC/MS analysis a mass spectrum may be all that is available) we must somehow predict likely molecular ion candidates. The program MOLION performs this task. Given a mass spectrum, it predicts and ranks likely molecular ion candidates independent of the presence or absence of an ion in the spectrum corresponding to the intact molecule. The published manuscript(11) provides many examples of the performance of the program. The mass spectrum of an example, unknown X, (which we will pursue in more detail below) is given in Figure 5. The results obtained from MOLION are summarized in Table I. The observed ion at m/e 263 is ranked as the most likely candidate.