Operating System

Published on December 2016

OPERATING SYSTEM
TABLE OF CONTENTS

Chapter 1: Introduction to Operating System
• What Is an Operating System?
• History of Operating Systems
• Features
• Examples of Operating Systems

Chapter 2: Operating System Structure
• System Components
• Operating System Services
• System Calls and System Programs
• Layered Approach
• Design Mechanisms and Policies

Chapter 3: Process
• Definition of Process
• Process State
• Process Operations
• Process Control Block
• Process (computing)
• Sub-processes and Multi-threading
• Representation
• Process Management in Multi-tasking Operating Systems
• Processes in Action
• Some Scheduling Disciplines


Chapter 4: Threads
• Threads
• Thread Creation, Manipulation and Synchronization
• User Level Threads and Kernel Level Threads
• Context Switch

Chapter 5: The Central Processing Unit (CPU)
• The Architecture of Mic-1
• Simple Model of a Computer - Part 2
• The Fetch-Decode-Execute Cycle
• Instruction Set
• Microprogram Control versus Hardware Control
• CISC versus RISC
• CPU Scheduling
• CPU/Process Scheduling
• Scheduling Algorithms

Chapter 6: Inter-process Communication
• Critical Section
• Mutual Exclusion
• Proposals for Achieving Mutual Exclusion
• Semaphores

Chapter 7: Deadlock
• Definition
• Deadlock Condition
• Dealing with Deadlock Problem


Chapter 8: Memory Management
• About Memory
• Heap Management
• Using Memory

Chapter 9: Caching and Intro to File Systems
• Introduction to File Systems
• File System Implementation
• An Old Homework Problem
• File Systems
• Files on Disk or CD-ROM
• Memory Mapping Files

Chapter 10: Directories and Security
• Security
• Protection Mechanisms
• Directories
• Hierarchical Directories
• Directory Operations
• Naming Systems
• Security and the File System
• Design Principles
• A Sampling of Protection Mechanisms

Chapter 11: File System Implementation
• The User Interface to Files
• The User Interface to Directories
• Implementing File Systems
• Node
• Software Levels
• Multiplexing and Arm Scheduling

Chapter 12: Networking
• Network
• Basic Concepts
• Other Global Issues


CHAPTER 1: INTRODUCTION TO OPERATING SYSTEM
What Is an Operating System?
An operating system (commonly abbreviated OS or O/S) is an interface between hardware and user; it is responsible for the management and coordination of activities and for the sharing of the limited resources of the computer. The operating system acts as a host for applications that are run on the machine. As a host, one of the purposes of an operating system is to handle the details of the operation of the hardware. This relieves application programs from having to manage these details and makes it easier to write applications. Almost all computers, including handheld computers, desktop computers, supercomputers, and even video game consoles, use an operating system of some type. Some of the oldest models may, however, use an embedded operating system that may be contained on a compact disc or other data storage device.

Operating systems offer a number of services to application programs and users. Applications access these services through application programming interfaces (APIs) or system calls. By invoking these interfaces, the application can request a service from the operating system, pass parameters, and receive the results of the operation. Users may also interact with the operating system through some kind of software user interface (UI), for example by typing commands at a command-line interface (CLI) or by using a graphical user interface (GUI, commonly pronounced "gooey"). For hand-held and desktop computers, the user interface is generally considered part of the operating system. On large multiuser systems such as Unix and Unix-like systems, the user interface is generally implemented as an application program that runs outside the operating system. (Whether the user interface should be included as part of the operating system is a point of contention.)

Common contemporary operating systems include Mac OS, Windows, Linux, BSD, and Solaris. While servers generally run Unix or Unix-like systems, embedded device markets are split among several operating systems.
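The path from application to operating system can be sketched in a few lines. The snippet below is a minimal illustration, using Python's `os` module purely as a convenient wrapper over the underlying system calls; it shows an application requesting services, passing parameters, and receiving results:

```python
import os

# An application requesting operating-system services through an API.
# Each call below passes parameters to the kernel via a system call
# and receives the result of the operation back.

pid = os.getpid()                # ask the kernel for this process's ID
cwd = os.getcwd()                # ask the kernel for the working directory
n = os.write(1, b"hello via a system call\n")   # fd 1 = standard output

print(pid > 0, len(cwd) > 0, n)  # → True True 24
```

The application never touches the hardware itself: each call traps into the kernel, which performs the operation on the program's behalf and returns the result.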


The operating system is the most important program that runs on a computer. Every general-purpose computer must have an operating system to run other programs. Operating systems perform basic tasks such as recognizing input from the keyboard, sending output to the display screen, keeping track of files and directories on the disk, and controlling peripheral devices such as disk drives and printers. For large systems, the operating system has even greater responsibilities and powers. It is like a traffic cop: it makes sure that different programs and users running at the same time do not interfere with each other. The operating system is also responsible for security, ensuring that unauthorized users do not access the system.

Operating systems can be classified as follows:
• Multi-user: allows two or more users to run programs at the same time. Some operating systems permit hundreds or even thousands of concurrent users.
• Multiprocessing: supports running a program on more than one CPU.
• Multitasking: allows more than one program to run concurrently.
• Multithreading: allows different parts of a single program to run concurrently.
• Real-time: responds to input instantly. General-purpose operating systems, such as DOS and UNIX, are not real-time.

Operating systems provide a software platform on top of which other programs, called application programs, can run. The application programs must be written to run on top of a particular operating system. Your choice of operating system therefore determines to a great extent the applications you can run. For PCs, the most popular operating systems are DOS, OS/2, and Windows, but others are available, such as Linux.

As a user, you normally interact with the operating system through a set of commands. For example, the DOS operating system contains commands such as COPY and RENAME for copying files and changing the names of files, respectively. The commands are accepted and executed by a part of the operating system called the command processor or command-line interpreter. Graphical user interfaces allow you to enter commands by pointing and clicking at objects that appear on the screen.
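The command-processor idea can be sketched in a few lines. The toy interpreter below is purely illustrative (real command processors add built-in commands, wildcards, and batch files, and a Unix-like system is assumed for the quoting rules): it parses a command line and asks the operating system to execute it as a new process.

```python
import shlex
import subprocess
import sys

def run_command(line: str) -> int:
    """Parse a command line and execute it as a new process."""
    argv = shlex.split(line)   # 'copy a.txt b.txt' -> ['copy', 'a.txt', 'b.txt']
    if not argv:
        return 0               # empty line: nothing to do
    # Hand the parsed command to the operating system to run.
    return subprocess.run(argv).returncode

# Launch a program the way a shell would (here, another Python interpreter).
status = run_command(f'"{sys.executable}" -c "print(6 * 7)"')  # prints 42
```

A real shell repeats this parse-and-execute step in a loop, reading one command line after another.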

History of Operating Systems
The history of computer operating systems recapitulates, to a degree, the recent history of computer hardware. Operating systems (OSes) provide a set of functions needed and used by most application programs on a computer, and the necessary linkages for the control and synchronization of the computer's hardware. On the first computers, without an operating system, every program needed the full hardware specification to run correctly and perform standard tasks, and its own drivers for peripheral devices like printers and card readers. The growing complexity of hardware and application programs eventually made operating systems a necessity.


Background

Early computers lacked any form of operating system. The user had sole use of the machine and would arrive armed with program and data, often on punched paper tape. The program would be loaded into the machine, and the machine would be set to work until the program completed or crashed. Programs could generally be debugged via a front panel using switches and lights. It is said that Alan Turing was a master of this on the early Manchester Mark 1 machine, and he was already deriving the primitive conception of an operating system from the principles of the universal Turing machine.

Later machines came with libraries of support code which would be linked to the user's program to assist in operations such as input and output. This was the genesis of the modern-day operating system. However, machines still ran a single job at a time; at Cambridge University in England the job queue was at one time a washing line from which tapes were hung, with different colored clothes-pegs to indicate job priority.

As machines became more powerful, the time to run programs diminished, and the time to hand off the equipment became very large by comparison. Accounting for and paying for machine usage moved on from checking the wall clock to automatic logging by the computer. Run queues evolved from a literal queue of people at the door to a heap of media on a jobs-waiting table, or batches of punch-cards stacked one on top of the other in the reader, until the machine itself was able to select and sequence which magnetic tape drives were online. Where program developers had originally had access to run their own jobs on the machine, they were supplanted by dedicated machine operators who looked after the well-being and maintenance of the machine and were less and less concerned with implementing tasks manually.
When commercially available computer centers were faced with the implications of data lost through tampering or operational errors, equipment vendors were put under pressure to enhance the runtime libraries to prevent misuse of system resources. Automated monitoring was needed, not just for CPU usage but for counting pages printed, cards punched, cards read, and disk storage used, and for signaling when operator intervention was required by jobs such as changing magnetic tapes. All these features were building up towards the repertoire of a fully capable operating system.

Eventually the runtime libraries became an amalgamated program that was started before the first customer job, and could read in the customer job, control its execution, clean up after it, record its usage, and immediately go on to process the next job. Significantly, it became possible for programmers to use symbolic program code instead of having to hand-encode binary images, once task-switching allowed a computer to perform translation of a program into binary form before running it. These resident background programs, capable of managing multi-step processes, were often called monitors or monitor-programs before the term OS established itself.

An underlying program offering basic hardware management, software scheduling, and resource monitoring may seem a remote ancestor to the user-oriented OSes of the personal computing era. But there has been a shift in meaning. With the era of commercial computing, more and more "secondary" software was bundled in the OS package, leading eventually to the perception of an OS as a complete user system, with utilities, applications (such as text editors and file managers), and configuration tools, and having an integrated graphical user interface. The true descendant of the early operating systems is what is now called the "kernel".
In technical and development circles the old, restricted sense of an OS persists, because of the continued active development of embedded operating systems for all kinds of devices with a data-processing component, from hand-held gadgets up to industrial robots and real-time control systems, which do not run user applications at the front end. An embedded OS in a device today is not so far removed as one might think from its ancestor of the 1950s. The broader categories of systems and application software are discussed in the computer software article.

The mainframe era

It is generally thought that the first operating system used for real work was GM-NAA I/O, produced in 1956 by General Motors' Research division for its IBM 704. Most other early operating systems for IBM mainframes were also produced by customers. Early operating systems were very diverse, with each vendor or customer producing one or more operating systems specific to their particular mainframe computer. Every operating system, even from the same vendor, could have radically different models of commands, operating procedures, and such facilities as debugging aids. Typically, each time the manufacturer brought out a new machine, there would be a new operating system, and most applications would have to be manually adjusted, recompiled, and retested.

Systems on IBM hardware: This state of affairs continued until the 1960s, when IBM, already a leading hardware vendor, stopped work on existing systems and put all its effort into developing the System/360 series of machines, all of which used the same instruction architecture. IBM intended to develop a single operating system for the new hardware as well: OS/360. The problems encountered in the development of OS/360 are legendary, and are described by Fred Brooks in The Mythical Man-Month, a book that has become a classic of software engineering. Because of performance differences across the hardware range and delays with software development, a whole family of operating systems was introduced instead of a single OS/360. IBM wound up releasing a series of stop-gaps followed by three longer-lived operating systems:

• OS/MFT for mid-range systems. This had one successor, OS/VS1, which was discontinued in the 1980s.
• OS/MVT for large systems. This was similar in most ways to OS/MFT (programs could be ported between the two without being recompiled), but had more sophisticated memory management and a time-sharing facility, TSO. MVT had several successors, including the current z/OS.
• DOS/360 for small System/360 models, with several successors including the current z/VSE. It was significantly different from OS/MFT and OS/MVT.

IBM maintained full compatibility with the past, so that programs developed in the sixties can still run under z/VSE (if developed for DOS/360) or z/OS (if developed for OS/MFT or OS/MVT) with no change.
Other mainframe operating systems: Control Data Corporation developed the SCOPE operating system in the 1960s for batch processing. In cooperation with the University of Minnesota, the KRONOS and later the NOS operating systems were developed during the 1970s, which supported simultaneous batch and timesharing use. Like many commercial timesharing systems, its interface was an extension of the DTSS time-sharing system, one of the pioneering efforts in timesharing and programming languages. In the late 1970s, Control Data and the University of Illinois developed the PLATO system, which used plasma panel displays and long-distance time-sharing networks. PLATO was remarkably innovative for its time; the shared memory model of PLATO's TUTOR programming language allowed applications such as real-time chat and multi-user graphical games.

UNIVAC, the first commercial computer manufacturer, produced a series of EXEC operating systems. Like all early mainframe systems, this was a batch-oriented system that managed magnetic drums, disks, card readers, and line printers. In the 1970s, UNIVAC produced the Real-Time Basic (RTB) system to support large-scale time sharing, also patterned after the Dartmouth BASIC system.

Burroughs Corporation introduced the B5000 in 1961 with the MCP (Master Control Program) operating system. The B5000 was a stack machine designed to exclusively support high-level languages, with no machine language or assembler; indeed, the MCP was the first OS to be written exclusively in a high-level language (ESPOL, a dialect of ALGOL). MCP also introduced many other ground-breaking innovations, such as being the first commercial implementation of virtual memory. MCP is still in use today in the Unisys ClearPath/MCP line of computers.

Project MAC at MIT, working with GE, developed Multics and the General Electric Comprehensive Operating Supervisor (GECOS), which introduced the concept of ringed security privilege levels. After Honeywell acquired GE's computer business, it was renamed the General Comprehensive Operating System (GCOS).

Digital Equipment Corporation developed many operating systems for its various computer lines, including the TOPS-10 and TOPS-20 time-sharing systems for the 36-bit PDP-10 class systems. Prior to the widespread use of UNIX, TOPS-10 was a particularly popular system in universities and in the early ARPANET community.

In the late 1960s through the late 1970s, several hardware capabilities evolved that allowed similar or ported software to run on more than one system. Early systems had utilized microprogramming to implement features on their systems, in order to permit different underlying architectures to appear to be the same as others in a series. In fact, most 360s after the 360/40 (except the 360/165 and 360/168) were microprogrammed implementations. But soon other means of achieving application compatibility were proven to be more significant.

Minicomputers and the rise of UNIX

The beginnings of the UNIX operating system were developed at AT&T Bell Laboratories in the late 1960s. Because it was essentially free in early editions, easily obtainable, and easily modified, it achieved wide acceptance. It also became a requirement within the Bell systems operating companies. Since it was written in the high-level C language, when that language was ported to a new machine architecture, UNIX was also able to be ported. This portability permitted it to become the choice for a second generation of minicomputers and the first generation of workstations. Through widespread use it exemplified the idea of an operating system that was conceptually the same across various hardware platforms. It was still owned by AT&T, however, and that limited its use to groups or corporations who could afford to license it. It became one of the roots of the open-source movement.

Elsewhere, Digital Equipment Corporation created the simple RT-11 system for its 16-bit PDP-11 class machines, and the VMS system for the 32-bit VAX computer. Another system which evolved in this time frame was the Pick operating system. The Pick system was developed and sold by Microdata Corporation, who created the precursors of the system. It is an example of a system which started as a database application support program and graduated to system work.

The case of 8-bit home computers and game consoles

Home computers: Although most small 8-bit home computers of the 1980s, such as the Commodore 64, the Atari 8-bit, the Amstrad CPC, the ZX Spectrum series, and others, could use a disk-loading operating system such as CP/M or GEOS, they could generally work without one. In fact, most if not all of these computers shipped with a built-in BASIC interpreter on ROM, which also served as a crude operating system, allowing minimal file management operations (such as deletion, copying, etc.) to be performed, and sometimes disk formatting, along of course with application loading and execution, which sometimes required a non-trivial command sequence, as with the Commodore 64.

The fact that the majority of these machines were bought for entertainment and educational purposes and were seldom used for more "serious" or business/science-oriented applications partly explains why a "true" operating system was not necessary. Another reason is that they were usually single-task and single-user machines, and shipped with minimal amounts of RAM, usually between 4 and 256 kilobytes (with 64 and 128 being common figures) and 8-bit processors, so an operating system's overhead would likely compromise the performance of the machine without really being necessary. Even the available word processor and integrated software applications were mostly self-contained programs which took over the machine completely, as also did video games.

Game consoles and video games: Since virtually all video game consoles and arcade cabinets designed and built after 1980 were true digital machines (unlike the analog Pong clones and derivatives), some of them carried a minimal form of BIOS or built-in game, such as the ColecoVision, the Sega Master System, and the SNK Neo Geo. There were, however, successful designs where a BIOS was not necessary, such as the Nintendo NES and its clones.
Modern-day game consoles and video games, starting with the PC-Engine, all have a minimal BIOS that also provides some interactive utilities such as memory card management, audio or video CD playback, and copy protection, and sometimes carry libraries for developers to use. Few of these cases, however, would qualify as a "true" operating system. The most notable exceptions are probably the Dreamcast game console, which includes a minimal BIOS like the PlayStation but can load the Windows CE operating system from the game disk, allowing easy porting of games from the PC world, and the Xbox game console, which is little more than a disguised Intel-based PC running a secret, modified version of Microsoft Windows in the background. Furthermore, there are Linux versions that will run on a Dreamcast and later game consoles as well.

Long before that, Sony had released a kind of development kit called the Net Yaroze for its first PlayStation platform, which provided a series of programming and development tools to be used with a normal PC and a specially modified "Black PlayStation" that could be interfaced with a PC and download programs from it. These operations require, in general, a functional OS on both platforms involved.

In general, it can be said that video game consoles and arcade coin-operated machines used at most a built-in BIOS during the 1970s, 1980s, and most of the 1990s, while from the PlayStation era and beyond they started getting more and more sophisticated, to the point of requiring a generic or custom-built OS for aiding in development and expandability.

The personal computer era: Apple, PC/MS/DR-DOS and beyond

The development of microprocessors made inexpensive computing available for the small business and hobbyist, which in turn led to the widespread use of interchangeable hardware components using a common interconnection (such as the S-100, SS-50, Apple II, ISA, and PCI buses) and an increasing need for "standard" operating systems to control them. The most important of the early OSes on these machines was Digital Research's CP/M-80 for the 8080/8085/Z-80 CPUs. It was based on several Digital Equipment Corporation operating systems, mostly for the PDP-11 architecture. Microsoft's first operating system, M-DOS, was designed along many of the PDP-11 features, but for microprocessor-based systems. MS-DOS (or PC-DOS when supplied by IBM) was based originally on CP/M-80. Each of these machines had a small boot program in ROM which loaded the OS itself from disk. The BIOS on the IBM-PC class machines was an extension of this idea, and has accreted more features and functions in the 20 years since the first IBM-PC was introduced in 1981.

The decreasing cost of display equipment and processors made it practical to provide graphical user interfaces for many operating systems, such as the generic X Window System that is provided with many UNIX systems, or other graphical systems such as Microsoft Windows, the RadioShack Color Computer's OS-9 Level II/MultiVue, Commodore's AmigaOS, Apple's Mac OS, or even IBM's OS/2. The original GUI was developed at Xerox Palo Alto Research Center in the early '70s (the Alto computer system) and imitated by many vendors.

The rise of virtualization

Operating systems originally ran directly on the hardware itself and provided services to applications. With VM/CMS on System/370, IBM introduced the notion of a virtual machine, where the operating system itself runs under the control of a hypervisor instead of being in direct control of the hardware. VMware popularized this technology on personal computers. Over time, the line between virtual machine monitors and operating systems has blurred:

• Hypervisors grew more complex, gaining their own application programming interface, memory management, and file system.
• Virtualization became a key feature of operating systems, as exemplified by Hyper-V in Windows Server 2008 or HP Integrity Virtual Machines in HP-UX.
• In some systems, such as POWER5 and POWER6-based servers from IBM, the hypervisor is no longer optional.
• Applications have been re-designed to run directly on a virtual machine monitor.
• In many ways, virtual machine software today plays the role formerly held by the operating system, including managing the hardware resources (processor, memory, I/O devices), applying scheduling policies, and allowing system administrators to manage the system.

Features
Program execution: The operating system acts as an interface between an application and the hardware. The user interacts with the hardware from "the other side". The operating system is a set of services which simplifies the development of applications. Executing a program involves the creation of a process by the operating system. The kernel creates a process by assigning memory and other resources, establishing a priority for the process (in multi-tasking systems), loading program code into memory, and executing the program. The program then interacts with the user and/or other devices, performing its intended function.
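This sequence can be observed from user space. In the sketch below (using Python's `subprocess` module as a convenient wrapper around the kernel's process-creation services), the kernel creates a child process, loads and runs the program, and returns its output and exit status:

```python
import subprocess
import sys

# Ask the operating system to create a process: the kernel assigns it
# memory and resources, loads the program code, and executes it.
child = subprocess.run(
    [sys.executable, "-c", "print('child process running')"],
    capture_output=True,
    text=True,
)

print(child.stdout.strip())   # → child process running
print(child.returncode)       # → 0 (the program completed normally)
```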

Interrupts: Interrupts are central to operating systems, as they provide an efficient way for the operating system to interact with and react to its environment. The alternative is to have the operating system "watch" the various sources of input for events that require action (polling), which is not a good use of CPU resources. Interrupt-based programming is directly supported by most CPUs. Interrupts provide a computer with a way of automatically running specific code in response to events. Even very basic computers support hardware interrupts and allow the programmer to specify code which may be run when that event takes place.

When an interrupt is received, the computer's hardware automatically suspends whatever program is currently running, saves its status, and runs computer code previously associated with the interrupt. This is analogous to placing a bookmark in a book when someone is interrupted by a phone call, and then taking the call. In modern operating systems, interrupts are handled by the operating system's kernel. Interrupts may come from either the computer's hardware or from the running program.

When a hardware device triggers an interrupt, the operating system's kernel decides how to deal with this event, generally by running some processing code. How much code gets run depends on the priority of the interrupt (for example, a person usually responds to a smoke detector alarm before answering the phone). The processing of hardware interrupts is a task that is usually delegated to software called device drivers, which may be part of the operating system's kernel, part of another program, or both. Device drivers may then relay information to a running program by various means.

A program may also trigger an interrupt to the operating system. If a program wishes to access hardware, for example, it may interrupt the operating system's kernel, which causes control to be passed back to the kernel. The kernel will then process the request.
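A user-space analogue of this mechanism is the Unix signal: a process registers handler code in advance, and when the signal arrives, normal execution is suspended, the handler runs, and execution resumes. A minimal sketch (assuming a POSIX system and Python 3.8+ for `raise_signal`):

```python
import signal

events = []

def on_signal(signum, frame):
    # Handler code, registered in advance and run automatically when
    # the signal is delivered -- like an interrupt service routine.
    events.append(signum)

signal.signal(signal.SIGUSR1, on_signal)   # register the handler

# A program can also raise an interrupt itself (a software interrupt);
# control passes to the handler, then returns here.
signal.raise_signal(signal.SIGUSR1)

print(events == [signal.SIGUSR1])   # → True
```

Real hardware interrupts are handled inside the kernel, but the shape is the same: register code for an event, let normal execution proceed, and have the event divert control to the handler.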
If a program wishes additional resources (or wishes to shed resources), such as memory, it will trigger an interrupt to get the kernel's attention.

Protected mode and supervisor mode: Modern CPUs support dual-mode operation. CPUs with this capability use two modes, protected mode and supervisor mode, which allow certain CPU functions to be controlled and affected only by the operating system kernel. Here, protected mode does not refer specifically to the 80286 (Intel's 16-bit x86 microprocessor) CPU feature, although its protected mode is very similar to it. CPUs might have other modes similar to 80286 protected mode as well, such as the virtual 8086 mode of the 80386 (Intel's 32-bit x86 microprocessor, or i386). However, the term is used here more generally in operating system theory to refer to all modes which limit the capabilities of programs running in that mode, providing things like virtual memory addressing and limiting access to hardware in a manner determined by a program running in supervisor mode. Similar modes have existed in supercomputers, minicomputers, and mainframes, as they are essential to fully supporting UNIX-like multi-user operating systems.

When a computer first starts up, it is automatically running in supervisor mode. The first few programs to run on the computer, being the BIOS, the bootloader, and the operating system, have unlimited access to hardware, and this is required because, by definition, initializing a protected environment can only be done outside of one. However, when the operating system passes control to another program, it can place the CPU into protected mode. In protected mode, programs may have access to a more limited set of the CPU's instructions. A user program may leave protected mode only by triggering an interrupt, causing control to be passed back to the kernel. In this way the operating system can maintain exclusive control over things like access to hardware and memory.

The term "protected mode resource" generally refers to one or more CPU registers which contain information that the running program is not allowed to alter. Attempts to alter these resources generally cause a switch to supervisor mode, where the operating system can deal with the illegal operation the program was attempting (for example, by killing the program).

Memory management: Among other things, a multiprogramming operating system kernel must be responsible for managing all system memory which is currently in use by programs. This ensures that a program does not interfere with memory already used by another program. Since programs time-share, each program must have independent access to memory.

Cooperative memory management, used by many early operating systems, assumes that all programs make voluntary use of the kernel's memory manager and do not exceed their allocated memory. This system of memory management is almost never seen anymore, since programs often contain bugs which can cause them to exceed their allocated memory. If a program fails, it may cause memory used by one or more other programs to be affected or overwritten. Malicious programs or viruses may purposefully alter another program's memory, or may affect the operation of the operating system itself. With cooperative memory management, it takes only one misbehaved program to crash the system.

Memory protection enables the kernel to limit a process's access to the computer's memory. Various methods of memory protection exist, including memory segmentation and paging. All methods require some level of hardware support (such as the 80286 MMU), which doesn't exist in all computers. In both segmentation and paging, certain protected mode registers specify to the CPU what memory addresses it should allow a running program to access. Attempts to access other addresses will trigger an interrupt, which will cause the CPU to re-enter supervisor mode, placing the kernel in charge. This is called a segmentation violation, or Seg-V for short. Since it is difficult to assign a meaningful result to such an operation, and because it is usually a sign of a misbehaving program, the kernel will generally resort to terminating the offending program, and will report the error.

Windows 3.1-Me had some level of memory protection, but programs could easily circumvent it. Under Windows 9x, all MS-DOS applications ran in supervisor mode, giving them almost unlimited control over the computer. A general protection fault would be produced, indicating that a segmentation violation had occurred; however, the system would often crash anyway.

In most Linux systems, part of the hard disk is reserved for virtual memory when the operating system is installed. This part is known as swap space. Windows systems use a swap file instead of a partition.
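The effect of memory protection can be imitated from user space with a read-only memory mapping. The sketch below is illustrative only: here the Python runtime rejects the write with an exception, whereas a genuine protection violation would trap to the kernel as described above, but the underlying principle of a page marked read-only is the same.

```python
import mmap
import os
import tempfile

# Create a small file and map it into memory read-only.
fd, path = tempfile.mkstemp()
os.write(fd, b"protected page contents")
os.close(fd)

with open(path, "rb") as f:
    mem = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mem[:9])            # → b'protected'  (reading is allowed)
    try:
        mem[0:1] = b"X"       # writing a read-only mapping is refused
    except TypeError:
        print("write rejected")
    mem.close()

os.remove(path)
```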

Virtual memory: The use of virtual memory addressing (such as paging or segmentation) means that the kernel can choose what memory each program may use at any given time, allowing the operating system to use the same memory locations for multiple tasks. If a program tries to access memory that is not in its current range of accessible memory, but nonetheless has been allocated to it, the kernel will be interrupted in the same way as it would be if the program were to exceed its allocated memory (see the section on memory management). Under UNIX this kind of interrupt is referred to as a page fault.
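Demand paging can be glimpsed from user space. In this sketch (assuming a Unix-like system, where Python's `resource` module reports the process's minor page-fault count), mapping a file makes its pages addressable, and the first touch of each page is served by the kernel through exactly this page-fault path:

```python
import mmap
import os
import resource
import tempfile

PAGE = mmap.PAGESIZE

# Build a file eight pages long and map it into our address space.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * (8 * PAGE))
os.close(fd)

with open(path, "rb") as f:
    mem = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    # Touch one byte per page: each first touch may fault, and the
    # kernel maps the page in and resumes us without our noticing.
    total = sum(mem[i] for i in range(0, 8 * PAGE, PAGE))
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    mem.close()

os.remove(path)
print(total == 8 * ord("A"))   # → True (the data arrived intact)
print(after >= before)         # → True (any faults were serviced en route)
```

From the program's point of view nothing special happened; the fault handling is entirely the kernel's business.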


When the kernel detects a page fault, it will generally adjust the virtual memory range of the program which triggered it, granting it access to the memory requested. This gives the kernel discretionary power over where a particular application's memory is stored, or even whether or not it has actually been allocated yet.

In modern operating systems, application memory which is accessed less frequently can be temporarily stored on disk or other media to make that space available for use by other programs. This is called swapping, as an area of memory can be used by multiple programs, and what that memory area contains can be swapped or exchanged on demand.

Multitasking:
Multitasking refers to the running of multiple independent computer programs on the same computer, giving the appearance that it is performing the tasks at the same time. Since most computers can do at most one or two things at one time, this is generally done via time sharing, which means that each program uses a share of the computer's time to execute.

An operating system kernel contains a piece of software called a scheduler, which determines how much time each program will spend executing, and in which order execution control should be passed to programs. Control is passed to a process by the kernel, which allows the program access to the CPU and memory. At a later time, control is returned to the kernel through some mechanism, so that another program may be allowed to use the CPU. This so-called passing of control between the kernel and applications is called a context switch.

An early model which governed the allocation of time to programs was called cooperative multitasking. In this model, when control is passed to a program by the kernel, it may execute for as long as it wants before explicitly returning control to the kernel. This means that a malicious or malfunctioning program may not only prevent any other programs from using the CPU, but can hang the entire system if it enters an infinite loop.
The philosophy governing preemptive multitasking is that of ensuring that all programs are given regular time on the CPU. This implies that all programs must be limited in how much time they are allowed to spend on the CPU without being interrupted. To accomplish this, modern operating system kernels make use of a timed interrupt. A protected mode timer is set by the kernel which triggers a return to supervisor mode after the specified time has elapsed. (See the above sections on Interrupts and Dual Mode Operation.)

On many single user operating systems cooperative multitasking is perfectly adequate, as home computers generally run a small number of well tested programs. Windows NT was the first version of Microsoft Windows which enforced preemptive multitasking, but it didn't reach the home user market until Windows XP (since Windows NT was targeted at professionals).

Kernel preemption:
In recent years, concerns have arisen because of long latencies often associated with some kernel run-times, sometimes on the order of 100 ms or more in systems with monolithic kernels. These latencies often produce noticeable slowness in desktop systems, and can prevent operating systems from performing time-sensitive operations such as audio recording and some communications.

Modern operating systems extend the concepts of application preemption to device drivers and kernel code, so that the operating system has preemptive control over internal run-times as well. Under Windows Vista, the introduction of the Windows Display Driver Model (WDDM) accomplishes this for display drivers, and in Linux the preemptible kernel model introduced in version 2.6 allows all device drivers and some other parts of kernel code to take advantage of preemptive multitasking.
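The timer-driven preemption described above is easiest to see in a toy scheduler. The sketch below simulates round-robin scheduling: each "process" runs for one fixed quantum (standing in for the interval between timer interrupts), then is preempted and sent to the back of the ready queue until its work is done:

```python
from collections import deque

# Toy round-robin scheduler. Each process is (name, remaining_time);
# the quantum models the timer interrupt that forces a context switch
# in preemptive multitasking.
def round_robin(procs, quantum):
    ready = deque(procs)
    order = []                      # the sequence of dispatch decisions
    while ready:
        name, remaining = ready.popleft()
        order.append(name)          # "context switch" to this process
        remaining -= quantum        # it runs until the timer fires
        if remaining > 0:
            ready.append((name, remaining))  # preempted: back of the queue
    return order

schedule = round_robin([("A", 3), ("B", 1), ("C", 2)], quantum=1)
print(schedule)
```

No process can monopolize the CPU: even if "A" had an infinite amount of work, "B" and "C" would still be dispatched regularly, which is precisely what cooperative multitasking cannot guarantee.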


Under Windows prior to Windows Vista, and Linux prior to version 2.6, all driver execution was co-operative, meaning that if a driver entered an infinite loop it would freeze the system.

Disk access and file systems:
Access to files stored on disks is a central feature of all operating systems. Computers store data on disks using files, which are structured in specific ways in order to allow for faster access, higher reliability, and to make better use of the drive's available space. The specific way in which files are stored on a disk is called a file system, and enables files to have names and attributes. It also allows them to be stored in a hierarchy of directories or folders arranged in a directory tree.

Early operating systems generally supported a single type of disk drive and only one kind of file system. Early file systems were limited in their capacity, speed, and in the kinds of file names and directory structures they could use. These limitations often reflected limitations in the operating systems they were designed for, making it very difficult for an operating system to support more than one file system.

While many simpler operating systems support a limited range of options for accessing storage systems, operating systems like UNIX and Linux support a technology known as a virtual file system, or VFS. An operating system like UNIX supports a wide array of storage devices, regardless of their design or file systems, allowing them to be accessed through a common application programming interface (API). This makes it unnecessary for programs to have any knowledge about the device they are accessing. A VFS allows the operating system to provide programs with access to an unlimited number of devices with an infinite variety of file systems installed on them, through the use of specific device drivers and file system drivers.

A connected storage device, such as a hard drive, is accessed through a device driver.
The device driver understands the specific language of the drive and is able to translate that language into a standard language used by the operating system to access all disk drives. On UNIX, this is the language of block devices.

When the kernel has an appropriate device driver in place, it can then access the contents of the disk drive in raw format, which may contain one or more file systems. A file system driver is used to translate the commands used to access each specific file system into a standard set of commands that the operating system can use to talk to all file systems. Programs can then deal with these file systems on the basis of filenames and directories/folders contained within a hierarchical structure. They can create, delete, open, and close files, as well as gather various information about them, including access permissions, size, free space, and creation and modification dates.

Various differences between file systems make supporting all file systems difficult. Allowed characters in file names, case sensitivity, and the presence of various kinds of file attributes make the implementation of a single interface for every file system a daunting task. Operating systems tend to recommend the use of (and so support natively) file systems specifically designed for them; for example, NTFS in Windows, and ext3 and ReiserFS in Linux. However, in practice, third party drivers are usually available to give support for the most widely used file systems in most general-purpose operating systems (for example, NTFS is available in Linux through NTFS-3g, and ext2/3 and ReiserFS are available in Windows through FS-driver and rfstool).
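The uniform file interface just described — create, write, stat, delete, with the device and file system drivers doing the translation underneath — can be sketched with Python's portable wrappers around those system calls:

```python
import os
import stat
import tempfile

# The same small set of calls works on any mounted file system; the
# kernel's file system drivers translate them to the on-disk format.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")                 # create a file and write to it
os.close(fd)
info = os.stat(path)                   # gather size, permissions, timestamps
size = info.st_size
mode = stat.filemode(info.st_mode)     # e.g. "-rw-------"
os.remove(path)                        # delete through the same interface
print(size, mode)
```

The program never needs to know whether the file landed on ext3, NTFS, or a USB key — that indirection is the point of the VFS layer.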

/e;ice dri;er": A device driver is a specific type of computer software developed to allow interaction with hardware devices. Typically this constitutes an interface for communicating with the device through the specific computer bus or communications subsystem that the hardware is connected to providing commands to and/or receiving data from the device and on the other end 15

the re%uisite interfaces to the operating system and software applications. #t is a speciali3ed hardware-dependent computer program which is also operating system specific that enables another program typically an operating system or applications software pac!age or computer program running under the operating system !ernel to interact transparently with a hardware device and usually provides the re%uisite interrupt handling necessary for any necessary asynchronous timedependent hardware interfacing needs. The !ey design goal of device drivers is abstraction. 2very model of hardware (even within the same class of device) is different. 6ewer models also are released by manufacturers that provide more reliable or better performance and these newer models are often controlled differently. 'omputers and their operating systems cannot be e.pected to !now how to control every device both now and in the future. To solve this problem OSes essentially dictate how every type of device should be controlled. The function of the device driver is then to translate these OS mandated function calls into device specific calls. #n theory a new device which is controlled in a new manner should function correctly if a suitable driver is available. This new driver will ensure that the device appears to operate as usual from the operating systems; point of view. Net4or0in : 'urrently most operating systems support a variety of networ!ing protocols hardware and applications for using them. This means that computers running dissimilar operating systems can participate in a common networ! for sharing resources such as computing files printers and scanners using either wired or wireless connections. 6etwor!s can essentially allow a computer;s operating system to access the resources of a remote computer to support the same functions as it could if those resources were connected directly to the local computer. 
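As a minimal illustration of networked resource sharing, the sketch below runs a tiny server and client in one process: the server listens on a numbered port, and the client connects to that port over the loopback interface and receives a reply (port 0 asks the kernel to pick a free port):

```python
import socket
import threading

# Minimal client/server sketch over loopback. The server's port number is
# the "numbered access point" through which clients reach the service.
server = socket.socket()
server.bind(("127.0.0.1", 0))          # port 0: let the kernel choose one
server.listen(1)
port = server.getsockname()[1]

def handle_one_request():
    conn, _ = server.accept()
    conn.sendall(b"hello from server")
    conn.close()

t = threading.Thread(target=handle_one_request)
t.start()

client = socket.create_connection(("127.0.0.1", port))
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply.decode())
```

A real daemon works the same way, except that it runs indefinitely and services many clients, passing hardware requests on to the kernel on their behalf.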
Such sharing ranges from simple communication to using networked file systems, or even sharing another computer's graphics or sound hardware. Some network services allow the resources of a computer to be accessed transparently, such as SSH, which allows networked users direct access to a computer's command line interface.

Client/server networking involves a program on one computer connecting via a network to another computer, called a server. Servers, usually running UNIX or Linux, offer (or host) various services to other network computers and users. These services are usually provided through ports, or numbered access points beyond the server's network address. Each port number is usually associated with a maximum of one running program, which is responsible for handling requests to that port. A daemon, being a user program, can in turn access the local hardware resources of that computer by passing requests to the operating system kernel.

Many operating systems support one or more vendor-specific or open networking protocols as well; for example, SNA on IBM systems, DECnet on systems from Digital Equipment Corporation, and Microsoft-specific protocols (SMB) on Windows. Specific protocols for specific tasks may also be supported, such as NFS for file access. Protocols like ESound, or esd, can be easily extended over the network to provide sound from local applications on a remote system's sound hardware.

Security:
A computer being secure depends on a number of technologies working properly. A modern operating system provides access to a number of resources which are available to software running on the system, and to external devices like networks, via the kernel. The operating system must be capable of distinguishing between requests which should be allowed to be processed and others which should not be processed.
While some systems may simply distinguish between "privileged" and "non-privileged" requests, systems commonly have a form of requester identity, such as a user name. To establish identity there may be a process of authentication. Often a username must be quoted, and each username may have a password. Other methods of authentication, such as magnetic cards or biometric data, might be used instead. In some cases, especially connections from the network, resources may be accessed with no authentication at all (such as reading files over a network share).

Also covered by the concept of requester identity is authorization: the particular services and resources accessible by the requester once logged into a system, tied to either the requester's user account or to the variously configured groups of users to which the requester belongs.

In addition to the allow/disallow model of security, a system with a high level of security will also offer auditing options. These would allow tracking of requests for access to resources (such as "who has been reading this file?").

Internal security, or security from an already running program, is only possible if all possibly harmful requests must be carried out through interrupts to the operating system kernel. If programs can directly access hardware and resources, they cannot be secured. External security involves a request from outside the computer, such as a login at a connected console or some kind of network connection. External requests are often passed through device drivers to the operating system's kernel, where they can be passed on to applications or carried out directly.

Security of operating systems has long been a concern because of highly sensitive data held on computers, both of a commercial and military nature. The United States Government Department of Defense (DoD) created the Trusted Computer System Evaluation Criteria (TCSEC), which is a standard that sets basic requirements for assessing the effectiveness of security. This became of vital importance to operating system makers, because the TCSEC was used to evaluate, classify, and select computer systems being considered for the processing, storage, and retrieval of sensitive or classified information.

Network services include offerings such as file sharing, print services, email, web sites, and file transfer protocols (FTP), most of which can have compromised security.
At the front line of security are hardware devices known as firewalls or intrusion detection/prevention systems. At the operating system level, there are a number of software firewalls available, as well as intrusion detection/prevention systems. Most modern operating systems include a software firewall, which is enabled by default. A software firewall can be configured to allow or deny network traffic to or from a service or application running on the operating system. Therefore, one can install and run an insecure service, such as Telnet or FTP, without being threatened by a security breach, because the firewall denies all traffic trying to connect to the service on that port.

An alternative strategy, and the only sandbox strategy available in systems that do not meet the Popek and Goldberg virtualization requirements, is for the operating system not to run user programs as native code, but instead to either emulate a processor or provide a host for a p-code based system such as Java.

Internal security is especially relevant for multi-user systems: it allows each user of the system to have private files that the other users cannot tamper with or read. Internal security is also vital if auditing is to be of any use, since a program can potentially bypass the operating system, inclusive of bypassing auditing.

Example: Microsoft Windows:
While the Windows 9x series offered the option of having profiles for multiple users, they had no concept of access privileges and did not allow concurrent access, and so were not true multi-user operating systems. In addition, they implemented only partial memory protection. They were accordingly widely criticised for lack of security. The Windows NT series of operating systems, by contrast, are true multi-user and implement absolute memory protection.
However, a lot of the advantages of being a true multi-user operating system were nullified by the fact that, prior to Windows Vista, the first user account created during the setup process was an administrator account, which was also the default for new accounts. Though Windows XP did have limited accounts, the majority of home users did not change to an account type with fewer rights, partially due to the number of programs which unnecessarily required administrator rights, and so most home users ran as administrator all the time.

Windows Vista changes this by introducing a privilege elevation system called User Account Control. When logging in as a standard user, a logon session is created and a token containing only the most basic privileges is assigned. In this way, the new logon session is incapable of making changes that would affect the entire system. When logging in as a user in the Administrators group, two separate tokens are assigned. The first token contains all privileges typically awarded to an administrator, and the second is a restricted token similar to what a standard user would receive. User applications, including the Windows Shell, are then started with the restricted token, resulting in a reduced privilege environment even under an Administrator account. When an application requests higher privileges or "Run as administrator" is clicked, UAC will prompt for confirmation and, if consent is given (including administrator credentials if the account requesting the elevation is not a member of the administrators group), start the process using the unrestricted token.

Example: Linux/Unix:
Linux and UNIX both have two-tier security, which limits any system-wide changes to the root user, a special user account on all UNIX-like systems. While the root user has virtually unlimited permission to affect system changes, programs running as a regular user are limited in where they can save files, what hardware they can access, and so on. In many systems, a user's memory usage, their selection of available programs, their total disk usage or quota, the available range of programs' priority settings, and other functions can also be locked down.
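The per-file side of this two-tier model is enforced by permission bits that the kernel checks on every access. The POSIX-only sketch below clears the owner's write bit on a file and shows the kernel refusing a subsequent write; note that the root user bypasses these checks, which is exactly the root/regular-user distinction described above:

```python
import os
import stat
import tempfile

# Unix permission bits in action (POSIX-only sketch). After chmod to
# r-------- the kernel refuses writes from a regular user; root, by
# contrast, is not subject to this check.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, stat.S_IRUSR)           # r-------- : owner read-only
try:
    with open(path, "w"):
        writable = True                # succeeds only for root
except PermissionError:
    writable = False
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # restore so we can clean up
os.remove(path)
print("writable:", writable)           # False for a regular user, True for root
```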
These restrictions provide the user with plenty of freedom to do what needs to be done, without being able to put any part of the system in jeopardy (barring accidental triggering of system-level bugs) or make sweeping system-wide changes. The user's settings are stored in an area of the computer's file system called the user's home directory, which is also provided as a location where the user may store their work, a concept later adopted by Windows as the 'My Documents' folder.

Should a user have to install software outside of their home directory or make system-wide changes, they must become the root user temporarily, usually with the su or sudo command, which is answered with the computer's root password when prompted. Some systems (such as Ubuntu and its derivatives) are configured by default to allow select users to run programs as the root user via the sudo command, using the user's own password for authentication instead of the system's root password. One is sometimes said to "go root" or "drop to root" when elevating oneself to root access.

File system support in modern operating systems:
Support for file systems is highly varied among modern operating systems, although there are several common file systems which almost all operating systems include support and drivers for.

Solaris:
The Sun Microsystems Solaris operating system, in earlier releases, defaulted to (non-journaled or non-logging) UFS for bootable and supplementary file systems. Solaris (as do most operating systems based upon open standards and/or open source) defaulted to, supported, and extended UFS. Support for other file systems and significant enhancements were added over time, including Veritas Software Corp.'s (journaling) VxFS, Sun Microsystems' (clustering) QFS, Sun Microsystems' (journaling) UFS, and Sun Microsystems' (open source, poolable, 128-bit, compressible, and error-correcting) ZFS. Kernel extensions were added to Solaris to allow for bootable Veritas VxFS operation.
Logging or journaling was added to UFS in Sun's Solaris 7. Releases of Solaris 10, Solaris Express, OpenSolaris, and other open source variants of the Solaris operating system later supported bootable ZFS.

Logical volume management allows for spanning a file system across multiple devices, for the purpose of adding redundancy, capacity, and/or throughput. Legacy environments in Solaris may use Solaris Volume Manager (formerly known as Solstice DiskSuite). Multiple operating systems (including Solaris) may use Veritas Volume Manager. Modern Solaris-based operating systems eclipse the need for volume management by leveraging virtual storage pools in ZFS.

Linux:
Many Linux distributions support some or all of ext2, ext3, ext4, ReiserFS, Reiser4, JFS, XFS, GFS, GFS2, OCFS, OCFS2, and NILFS. The ext file systems, namely ext2, ext3, and ext4, are based on the original Linux file system. Others have been developed by companies to meet their specific needs, by hobbyists, or adapted from UNIX, Microsoft Windows, and other operating systems. Linux has full support for XFS and JFS, along with FAT (the MS-DOS file system) and HFS, which is the primary file system for the Macintosh.

In recent years support for Microsoft Windows NT's NTFS file system has appeared in Linux, and is now comparable to the support available for other native UNIX file systems. ISO 9660 and Universal Disk Format (UDF) are supported, which are standard file systems used on CDs, DVDs, and Blu-ray discs. It is possible to install Linux onto the majority of these file systems. Unlike other operating systems, Linux and UNIX allow any file system to be used regardless of the media it is stored on, whether it is a hard drive, a disc (CD, DVD...), a USB key, or even contained within a file located on another file system.

Microsoft Windows:
Microsoft Windows currently supports NTFS and FAT file systems, along with network file systems shared from other computers, and the ISO 9660 and UDF file systems used for CDs, DVDs, and other optical discs such as Blu-ray.
Under Windows, each file system is usually limited in application to certain media; for example, CDs must use ISO 9660 or UDF, and as of Windows Vista, NTFS is the only file system which the operating system can be installed on. Windows Embedded CE 6.0, Windows Vista Service Pack 1, and Windows Server 2008 support exFAT, a file system more suitable for flash drives.

Mac OS X:
Mac OS X supports HFS+ with journaling as its primary file system. It is derived from the Hierarchical File System of the earlier Mac OS. Mac OS X has facilities to read and write FAT, NTFS (read only, although an open-source cross-platform implementation known as NTFS-3G provides read-write support to the Microsoft Windows NTFS file system for Mac OS X users), UDF, and other file systems, but cannot be installed onto them. Due to its UNIX heritage, Mac OS X now supports virtually all the file systems supported by the UNIX VFS. Recently Apple Inc. started work on porting Sun Microsystems' ZFS file system to Mac OS X, and preliminary support is already available in Mac OS X 10.5.

Special-purpose file systems:
FAT file systems are commonly found on floppy disks, flash memory cards, digital cameras, and many other portable devices because of their relative simplicity. Performance of FAT compares poorly to most other file systems, as it uses overly simplistic data structures, making file operations time-consuming, and it makes poor use of disk space in situations where many small files are present. ISO 9660 and Universal Disk Format are two common formats that target compact discs and DVDs. Mount Rainier is a newer extension to UDF, supported by Linux 2.6 kernels and Windows Vista, that facilitates rewriting to DVDs in the same fashion as has been possible with floppy disks.

Journalized file systems:
File systems may provide journaling, which provides safe recovery in the event of a system crash.
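The recovery guarantee rests on a write-ahead discipline: record the change in a journal before applying it, so that a consistent state can always be rebuilt by replaying the journal. The toy key-value store below (a hypothetical `JournaledStore`, not a real file system driver) sketches that idea:

```python
import json
import os
import tempfile

# Toy write-ahead journal: every update is appended (and fsync'ed) to the
# journal before the in-memory "real" store is touched, so a crash can be
# survived by replaying the journal from the beginning.
class JournaledStore:
    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}

    def set(self, key, value):
        with open(self.journal_path, "a") as j:      # 1. journal first
            j.write(json.dumps({"k": key, "v": value}) + "\n")
            j.flush()
            os.fsync(j.fileno())
        self.data[key] = value                       # 2. then apply

    def recover(self):
        self.data = {}
        with open(self.journal_path) as j:
            for line in j:                           # replay the journal
                record = json.loads(line)
                self.data[record["k"]] = record["v"]

path = tempfile.mkstemp()[1]
store = JournaledStore(path)
store.set("a", 1)
store.set("b", 2)

crashed = JournaledStore(path)                       # simulate a restart
crashed.recover()
print(crashed.data == store.data)
os.unlink(path)
```

Real journaled file systems apply the same ordering to metadata (and sometimes data) blocks, and truncate the journal once updates are known to be safely in place.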
A journaled file system writes some information twice: first to the journal, which is a log of file system operations, then to its proper place in the ordinary file system. Journaling is handled by the file system driver, and keeps track of each operation taking place that changes the contents of the disk. In the event of a crash, the system can recover to a consistent state by replaying a portion of the journal. Many UNIX file systems provide journaling, including ReiserFS, JFS, and ext3.

In contrast, non-journaled file systems typically need to be examined in their entirety by a utility such as fsck or chkdsk for any inconsistencies after an unclean shutdown. Soft updates is an alternative to journaling that avoids the redundant writes by carefully ordering the update operations. Log-structured file systems and ZFS also differ from traditional journaled file systems in that they avoid inconsistencies by always writing new copies of the data, eschewing in-place updates.

Graphical user interfaces:
Most modern computer systems support graphical user interfaces (GUIs), and often include them. In some computer systems, such as the original implementations of Microsoft Windows and the Mac OS, the GUI is integrated into the kernel. While technically a graphical user interface is not an operating system service, incorporating support for one into the operating system kernel can allow the GUI to be more responsive, by reducing the number of context switches required for the GUI to perform its output functions.

Other operating systems are modular, separating the graphics subsystem from the kernel and the operating system. In the 1980s UNIX, VMS, and many others had operating systems that were built this way. Linux and Mac OS X are also built this way. Modern releases of Microsoft Windows, such as Windows Vista, implement a graphics subsystem that is mostly in user-space; however, the graphics drawing routines of versions between Windows NT 4.0 and Windows Server 2003 exist mostly in kernel space. Windows 9x had very little distinction between the interface and the kernel.

Many computer operating systems allow the user to install or create any user interface they desire. The X Window System, in conjunction with GNOME or KDE, is a commonly found setup on most Unix and Unix-like (BSD, Linux, Minix) systems.
A number of Windows shell replacements have been released for Microsoft Windows, which offer alternatives to the included Windows shell, but the shell itself cannot be separated from Windows. Numerous Unix-based GUIs have existed over time, most derived from X11. Competition among the various vendors of Unix (HP, IBM, Sun) led to much fragmentation, though an effort to standardize in the 1990s on COSE and CDE failed for the most part, due to various reasons, and was eventually eclipsed by the widespread adoption of GNOME and KDE. Prior to open source-based toolkits and desktop environments, Motif was the prevalent toolkit/desktop combination (and was the basis upon which CDE was developed).

Graphical user interfaces evolve over time. For example, Windows has modified its user interface almost every time a new major version of Windows is released, and the Mac OS GUI changed dramatically with the introduction of Mac OS X in 1999.

E?a#p(e" o8 Operatin S!"te#"
Micro"o8t 6indo4" 0icrosoft /indows is a family of proprietary operating systems that originated as an add-on to the older 0S-1OS operating system for the #$0 "'. 0odern versions are based on the newer /indows 6T !ernel that was originally intended for OS/9. /indows runs on .HB .HB-BD and #tanium processors. 2arlier versions also ran on the 12' Alpha 0#"S ,airchild (later #ntergraph) 'lipper and "ower"' architectures (some wor! was done to port it to the S"A5' architecture). As of Mune 9AAH 0icrosoft /indows holds a large amount of the worldwide des!top mar!et share. /indows is also used on servers supporting applications such as web servers and database servers. #n recent years 0icrosoft has spent significant mar!eting and research J development 20

money to demonstrate that /indows is capable of running any enterprise application which has resulted in consistent price/performance records (see the T"') and significant acceptance in the enterprise mar!et. The most widely used version of the 0icrosoft /indows family is /indows 7" released on October 9@ 9AA<. #n 6ovember 9AAB after more than five years of development wor! 0icrosoft released /indows Gista a ma:or new operating system version of 0icrosoft /indows family which contains a large number of new features and architectural changes. 'hief amongst these are a new user interface and visual style called /indows Aero a number of new security features such as &ser Account 'ontrol and a few new multimedia applications such as /indows 1G1 0a!er. A server variant based on the same !ernel /indows Server 9AAH was released in early 9AAH. /indows C is currently under development; 0icrosoft has stated that it intends to scope its development to a three-year timeline placing its release sometime after mid-9AA?. )NI9 and )NI9-(i0e operatin "!"te#" Ien Thompson wrote $ mainly based on $'"( which he used to write &ni. based on his e.perience in the 0&(T#'S pro:ect. $ was replaced by ' and &ni. developed into a large comple. family of inter-related operating systems which have been influential in every modern operating system. The &ni.-li!e family is a diverse group of operating systems with several ma:or sub-categories including System G $S1 and (inu.. The name >&6#7> is a trademar! of The Open )roup which licenses it for use with any operating system that has been shown to conform to their definitions. >&ni.-li!e> is commonly used to refer to the large set of operating systems which resemble the original &ni.. &ni.-li!e systems run on a wide variety of machine architectures. They are used heavily for servers in business as well as wor!stations in academic and engineering environments. ,ree software &ni. variants such as )6& (inu. and $S1 are popular in these areas. 
Market share statistics for freely available operating systems are usually inaccurate, since most free operating systems are not purchased, making usage under-represented. On the other hand, market share statistics based on total downloads of free operating systems are often inflated, as there is no economic disincentive to acquire multiple operating systems, so users can download multiple systems, test them, and decide which they like best.

Some Unix variants, like HP's HP-UX and IBM's AIX, are designed to run only on that vendor's hardware. Others, such as Solaris, can run on multiple types of hardware, including x86 servers and PCs. Apple's Mac OS X, a hybrid kernel-based BSD variant derived from NeXTSTEP, Mach, and FreeBSD, has replaced Apple's earlier (non-Unix) Mac OS.

Unix interoperability was sought by establishing the POSIX standard. The POSIX standard can be applied to any operating system, although it was originally created for various Unix variants.

Mac OS X
Mac OS X is a line of proprietary, graphical operating systems developed, marketed, and sold by Apple Inc., the latest of which is pre-loaded on all currently shipping Macintosh computers. Mac OS X is the successor to the original Mac OS, which had been Apple's primary operating system since 1984. Unlike its predecessor, Mac OS X is a UNIX operating system, built on technology that had been developed at NeXT through the second half of the 1980s and up until Apple purchased the company in early 1997.

The operating system was first released in 1999 as Mac OS X Server 1.0, with a desktop-oriented version (Mac OS X v10.0) following in March 2001. Since then, five more distinct "end-user" and "server" editions of Mac OS X have been released, the most recent being Mac OS X v10.5, which was first made available in October 2007. Releases of Mac OS X are named after big cats; Mac OS X v10.5 is usually referred to by Apple and users as "Leopard".

The server edition, Mac OS X Server, is architecturally identical to its desktop counterpart, but usually runs on Apple's line of Macintosh server hardware. Mac OS X Server includes workgroup management and administration software tools that provide simplified access to key network services, including a mail transfer agent, a Samba server, an LDAP server, a domain name server, and others.

P(an 2 Ien Thompson 1ennis 5itchie and 1ouglas 0c#lroy at $ell (abs designed and developed the ' programming language to build the operating system &ni.. "rogrammers at $ell (abs went on to develop "lan ? and #nferno which were engineered for modern distributed environments. "lan ? was designed from the start to be a networ!ed operating system and had graphics built-in unli!e &ni. which added these features to the design later. "lan ? has yet to become as popular as &ni. derivatives but it has an e.panding community of developers. #t is currently released under the (ucent "ublic (icense. #nferno was sold to Gita 6uova =oldings and has been released under a )"(/0#T license. Rea(-ti#e operatin "!"te#" A real-time operating system (5TOS) is a multitas!ing operating system intended for applications with fi.ed deadlines (real-time computing). Such applications include some small embedded systems automobile engine controllers industrial robots spacecraft industrial control and some large-scale computing systems. An early e.ample of a large-scale real-time operating system was Transaction "rocessing ,acility developed by American Airlines and #$0 for the Sabre Airline 5eservations System. E#:edded "!"te#" 2mbedded systems use a variety of dedicated operating systems. #n some cases the >operating system> software is directly lin!ed to the application to produce a monolithic specialpurpose program. #n the simplest embedded systems there is no distinction between the OS and the application. 2mbedded systems that have fi.ed deadlines use a real-time operating system such as G./or!s e'os O67 0ontaGista (inu. and 5T(inu.. Some embedded systems use operating systems such as Symbian OS "alm OS /indows '2 $S1 and (inu. although such operating systems do not support real-time computing. /indows '2 shares similar A"#s to des!top /indows but shares none of des!top /indows; codebase. 5o::! de;e(op#ent 22

Operating system development, or OSDev for short, as a hobby has a large cult-like following. As such, operating systems such as Linux have derived from hobby operating system projects. The design and implementation of an operating system requires skill and determination, and the term can cover anything from a basic "Hello World" boot loader to a fully featured kernel. One classical example of this is the Minix operating system: an OS that was designed by A. S. Tanenbaum as a teaching tool, but was heavily used by hobbyists before Linux eclipsed it in popularity.

Other
Older operating systems which are still used in niche markets include OS/2 from IBM; Mac OS, the non-Unix precursor to Apple's Mac OS X; BeOS; and XTS-300. Some, most notably AmigaOS 4 and RISC OS, continue to be developed as minority platforms for enthusiast communities and specialist applications. OpenVMS, formerly from DEC, is still under active development by Hewlett-Packard. There were a number of operating systems for 8-bit computers, among them Apple's DOS (Disk Operating System) 3.2 and 3.3 for the Apple ][, ProDOS, UCSD, and CP/M, available for various 8- and 16-bit environments. Research and development of new operating systems continues. GNU Hurd is designed to be backwards compatible with Unix, but with enhanced functionality and a microkernel architecture. Singularity is a project at Microsoft Research to develop an operating system with better memory protection, based on the .Net managed code model. Systems development follows the same model used by other software development, which involves maintainers, version control "trees", forks, "patches" and specifications. After the AT&T-Berkeley lawsuit, the new unencumbered systems were based on 4.4BSD, from which the FreeBSD and NetBSD efforts forked to replace missing code after the Unix wars. Recent forks include DragonFly BSD and Darwin, both from BSD Unix.



CHAPTER 2 OPERATING SYSTEM STRUCTURE
System Components
Even though not all systems have the same structure, many modern operating systems share the same goal of supporting the following types of system components.

Process Management
The operating system manages many kinds of activities, ranging from user programs to system programs like the printer spooler, name servers, file server, etc. Each of these activities is encapsulated in a process. A process includes the complete execution context (code, data, PC, registers, OS resources in use, etc.). It is important to note that a process is not a program. A process is only ONE instance of a program in execution; many processes can be running the same program. The five major activities of an operating system in regard to process management are:
• Creation and deletion of user and system processes.
• Suspension and resumption of processes.
• A mechanism for process synchronization.
• A mechanism for process communication.
• A mechanism for deadlock handling.

Main-Memory Management
Primary memory, or main memory, is a large array of words or bytes, where each word or byte has its own address. Main memory provides storage that can be accessed directly by the CPU; that is to say, for a program to be executed, it must be in main memory. The major activities of an operating system in regard to memory management are:
• Keep track of which parts of memory are currently being used, and by whom.
• Decide which processes are to be loaded into memory when memory space becomes available.
• Allocate and deallocate memory space as needed.

File Management
A file is a collection of related information defined by its creator. Computers can store files on disk (secondary storage), which provides long-term storage. Some examples of storage media are magnetic tape, magnetic disk, and optical disk. Each of these media has its own properties, like speed, capacity, data transfer rate, and access methods.

A file system is normally organized into directories to ease its use. These directories may contain files and other directories. The five main major activities of an operating system in regard to file management are:
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.

I/O System Management
The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the device driver knows the peculiarities of the specific device to which it is assigned.

Secondary-Storage Management
Generally speaking, systems have several levels of storage, including primary storage, secondary storage, and cache storage. Instructions and data must be placed in primary storage or cache to be referenced by a running program. Because main memory is too small to accommodate all data and programs, and because its data are lost when power is lost, the computer system must provide secondary storage to back up main memory. Secondary storage consists of tapes, disks, and other media designed to hold information that will eventually be accessed in primary storage. Each level of storage (primary, secondary, cache) is ordinarily divided into bytes or words consisting of a fixed number of bytes. Each location in storage has an address; the set of all addresses available to a program is called an address space. The three major activities of an operating system in regard to secondary storage management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for memory access.

Networking
A distributed system is a collection of processors that do not share memory, peripheral devices, or a clock. The processors communicate with one another through communication lines called a network. The communication-network
design must consider routing and connection strategies, and the problems of contention and security.

Protection System
If a computer system has multiple users and allows the concurrent execution of multiple processes, then the various processes must be protected from one another's activities. Protection refers to a mechanism for controlling the access of programs, processes, or users to the resources defined by a computer system.

Command Interpreter System

A command interpreter is an interface between the operating system and the user. The user gives commands, which are executed by the operating system (usually by turning them into system calls). The main function of a command interpreter is to get and execute the next user-specified command. The command interpreter is usually not part of the kernel, since multiple command interpreters (shells, in UNIX terminology) may be supported by an operating system, and they do not really need to run in kernel mode. There are two main advantages to separating the command interpreter from the kernel. First, if we want to change the way the command interpreter looks, i.e. change its interface, we can do so, because the command interpreter can be modified without touching kernel code. Second, if the command interpreter were part of the kernel, it would be possible for a malicious process to gain access to certain parts of the kernel; to avoid this scenario, it is advantageous to keep the command interpreter separate from the kernel.
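The get-and-execute loop described above can be sketched in a few lines. This is an illustrative sketch, not a real shell: the function names are our own, one built-in (`exit`) is handled by the interpreter itself, and every other command is handed to the operating system to run as a child process.

```python
import shlex
import subprocess

def execute(line):
    """Run one user command; return its exit status."""
    args = shlex.split(line)
    if not args:
        return 0                    # empty line: nothing to do
    if args[0] == "exit":           # a built-in, handled by the shell itself
        raise SystemExit(0)
    # Any other command is turned into requests to the kernel
    # (fork/exec system calls, wrapped here by subprocess).
    return subprocess.run(args).returncode

def shell():
    """The read-and-execute loop of a minimal command interpreter."""
    while True:
        try:
            execute(input("$ "))
        except SystemExit:
            break
```

Because `execute` is ordinary user-level code, swapping in a different interface (prompt, syntax, built-ins) requires no kernel change, which is exactly the first advantage mentioned above.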

Operating Systems Services
Following are the five services provided by an operating system for the convenience of its users.

Program Execution
The purpose of a computer system is to allow the user to execute programs, so the operating system provides an environment where the user can conveniently run programs. The user does not have to worry about memory allocation, multitasking or anything else; these things are taken care of by the operating system. Running a program involves allocating and deallocating memory, and CPU scheduling in the case of multiprocessing. These functions cannot be given to user-level programs, so user-level programs cannot help the user to run programs independently, without help from the operating system.

I/O Operations
Each program requires input and produces output. This involves the use of I/O. The operating system hides from the user the details of the underlying I/O hardware. All the user sees is that the I/O has been performed, without any details. So the operating system, by providing I/O, makes it convenient for users to run programs. For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.

File System Manipulation
The output of a program may need to be written into new files, or input taken from some files. The operating system provides this service. The user does not have to worry about secondary storage management: the user gives a command for reading from or writing to a file and sees his/her task accomplished. Thus operating systems make it easier for user programs to accomplish their tasks. This service involves secondary storage management. The speed of I/O that depends on secondary storage management is critical to the speed of many programs, and hence it is best relegated to the operating system to manage, rather than giving individual users control of it.
It is not difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if these services are left with the operating system.

Communications
There are instances where processes need to communicate with each other to exchange information. It may be between processes running on the same computer or running on different computers. By providing this service, the operating system relieves the user of the worry of passing messages between processes. In cases where the messages need to be passed to processes on other computers through a network, this can be done by user programs. The user program may be customized to the specifics of the hardware through which the message transits, and provide the service interface to the operating system.

Error Detection
An error in one part of the system may cause malfunctioning of the complete system. To avoid such a situation, the operating system constantly monitors the system for errors. This relieves the user of the worry of an error propagating to various parts of the system and causing malfunctioning. This service cannot be allowed to be handled by user programs, because it involves monitoring, and in some cases altering, areas of memory, deallocating the memory of a faulty process, or perhaps relinquishing the CPU of a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs. A user program, if given these privileges, could interfere with the correct (normal) operation of the operating system.

System Calls and System Programs
System calls provide an interface between a process and the operating system. System calls allow user-level processes to request services from the operating system which the process itself is not allowed to perform. In handling the trap, the operating system enters kernel mode, where it has access to privileged instructions, and can perform the desired service on behalf of the user-level process. It is because of the critical nature of these operations that the operating system itself does them every time they are needed. For example, for I/O a process makes a system call telling the operating system to read or write a particular area, and this request is satisfied by the operating system. System programs provide basic functionality to users, so that they do not need to write their own environment for program development (editors, compilers) and program execution (shells). In some sense, they are bundles of useful system calls.
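The read/write example can be made concrete. In the sketch below, Python's `os` module is used only as a convenient, thin wrapper: `os.open`, `os.write` and `os.read` correspond closely to the POSIX open, write and read system calls, and each call traps into the kernel, which performs the privileged I/O on the process's behalf.

```python
import os
import tempfile

# A scratch file for the demonstration (path is arbitrary).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Each os.* call below is a thin wrapper around one system call:
# the process traps into the kernel, which does the privileged work.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)  # open(2)
os.write(fd, b"hello, kernel")                       # write(2)
os.close(fd)                                         # close(2)

fd = os.open(path, os.O_RDONLY)                      # open(2)
data = os.read(fd, 100)                              # read(2)
os.close(fd)                                         # close(2)
print(data.decode())
```

The file descriptor `fd` is the kernel's handle for the open file; the user-level process never touches the disk hardware itself.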

Layered Approach Design
In this case the system is easier to debug and modify, because changes affect only limited portions of the code, and the programmer does not have to know the details of the other layers. Information is also kept only where it is needed and is accessible only in certain ways, so bugs affecting that data are limited to a specific module or layer.

Mechanisms and Policies
The policies specify what is to be done, while the mechanisms specify how it is to be done. For instance, the timer construct used to ensure CPU protection is a mechanism; on the other hand, the decision of how long the timer is set for a particular user is a policy decision. The separation of mechanism and policy is important for providing flexibility to a system. If the interface between mechanism and policy is well defined, a change of policy may affect only a few

parameters. On the other hand, if the interface between these two is vague or not well defined, a change might involve much deeper changes to the system. Once the policy has been decided, it gives the programmer the choice of implementation. Also, the underlying implementation may be changed for a more efficient one without much trouble if the mechanism and policy are well defined. Specifically, separating these two provides flexibility in a variety of ways. First, the same mechanism can be used to implement a variety of policies, so changing the policy might not require the development of a new mechanism, but just a change in parameters for that mechanism, drawn from a library of mechanisms. Second, the mechanism can be changed, for example to increase its efficiency or to move to a new platform, without changing the overall policy.
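The separation can be illustrated with a toy scheduler: the mechanism (remove one process from the ready queue and hand it the CPU) stays fixed, while the policy (which process to pick) is an interchangeable parameter. The names and the two sample policies here are our own illustration, not any real scheduler interface.

```python
def pick_next(ready_queue, policy):
    """Mechanism: remove and return one process from the ready queue.
    The policy, a function deciding WHICH process, is supplied separately."""
    chosen = policy(ready_queue)
    ready_queue.remove(chosen)
    return chosen

# Two interchangeable policies for the same mechanism.
# Each process is a (name, remaining_time) pair.
fifo     = lambda q: q[0]                        # first come, first served
shortest = lambda q: min(q, key=lambda p: p[1])  # shortest job first

queue = [("A", 9), ("B", 2), ("C", 5)]
print(pick_next(list(queue), fifo))      # ('A', 9)
print(pick_next(list(queue), shortest))  # ('B', 2)
```

Changing the scheduling policy here means passing a different function; the mechanism `pick_next` never changes, which is exactly the flexibility the text describes.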


CHAPTER 3 PROCESS

Definition of Process
The term "process" was first used by the designers of MULTICS in the 1960s. Since then, the term process has been used somewhat interchangeably with 'task' or 'job'. The process has been given many definitions, for instance:
• A program in execution.
• An asynchronous activity.
• The 'animated spirit' of a procedure in execution.
• The entity to which processors are assigned.
• The 'dispatchable' unit.

And many more definitions have been given. As we can see from the above, there is no universally agreed-upon definition, but the definition "program in execution" seems to be the most frequently used, and this is the concept we will use in the present study of operating systems. Now that we have agreed upon the definition of process, the question is: what is the relation between process and program? Is it the same beast with a different name, so that when this beast is sleeping (not executing) it is called a program, and when it is executing it becomes a process? Well, to be very precise, a process is not the same as a program. In the following discussion we point out some of the differences between process and program. As we have mentioned earlier, a process is more than program code. A process is an 'active' entity, as opposed to a program, which is considered a 'passive' entity. As we all know, a program is an algorithm expressed in some suitable notation (e.g., a programming language). Being passive, a program is only a part of a process. A process, on the other hand, includes:
• The current value of the Program Counter (PC)
• The contents of the processor's registers
• The values of the variables
• The process stack (SP), which typically contains temporary data such as subroutine parameters, return addresses and temporary variables.
• A data section that contains global variables.
A process is the unit of work in a system.


Process State
The process state consists of everything necessary to resume the process execution if it is somehow put aside temporarily. The process state consists of at least the following:
• Code for the program.
• Program's static data.
• Program's dynamic data.
• Program's procedure call stack.
• Contents of general-purpose registers.
• Contents of the program counter (PC).
• Contents of the program status word (PSW).
• Operating system resources in use.
A process goes through a series of discrete process states.
New state: the process is being created.
Running state: a process is said to be running if it has the CPU, that is, the process is actually using the CPU at that particular instant.
Blocked (or waiting) state: a process is said to be blocked if it is waiting for some event to happen, such as an I/O completion, before it can proceed. Note that such a process is unable to run until some external event happens.
Ready state: a process is said to be ready if it could use the CPU if one were available. A ready-state process is runnable but temporarily stopped from running to let another process run.
Terminated state: the process has finished execution.

Process Operations
Process Creation
In general-purpose systems, some way is needed to create processes as needed during operation. There are four principal events that lead to process creation:
• System initialization.
• Execution of a process-creation system call by a running process.
• A user request to create a new process.
• Initiation of a batch job.
Foreground processes interact with users. Background processes stay in the background, sleeping, but spring to life to handle activity such as email, web pages, printing, and so on; background processes are called daemons. A process may create a new process through a create-process system call such as 'fork', which creates an exact clone of the calling process. When it does so, the creating process is called the parent process and the created one is called the child process. Only

one parent is needed to create a child process. Note that, unlike plants and animals that use sexual reproduction, a process has only one parent. This creation of processes yields a hierarchical structure of processes like the one in the figure: each child has only one parent, but each parent may have many children. After the fork, the two processes, the parent and the child, have the same memory image, the same environment strings and the same open files. Once a process is created, both the parent and the child have their own distinct address spaces; if either process changes a word in its address space, the change is not visible to the other process. Following are some reasons for the creation of a process:
• A user logs on.
• A user starts a program.
• The operating system creates a process to provide a service, e.g., to manage a printer.
• Some program starts another process, e.g., Netscape calls xv to display a picture.

Process Termination
A process terminates when it finishes executing its last statement. Its resources are returned to the system, it is purged from any system lists or tables, and its process control block (PCB) is erased, i.e., the PCB's memory space is returned to a free memory pool. A process usually terminates for one of the following reasons:
• Normal exit: most processes terminate because they have done their job. This call is exit in UNIX.
• Error exit: the process discovers a fatal error; for example, a user tries to compile a program that does not exist.
• Fatal error: an error caused by the process due to a bug in the program, for example executing an illegal instruction, referring to non-existent memory, or dividing by zero.
• Killed by another process: a process executes a system call telling the operating system to terminate some other process. In UNIX this call is kill. In some systems, when a process is killed, all the processes it created are killed as well (UNIX does not work this way).
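On a POSIX system the fork behaviour described above can be observed directly. The sketch below uses Python's `os` module as a thin wrapper over the fork and wait system calls; it is POSIX-only, and the exit status 7 is an arbitrary value chosen for illustration.

```python
import os

def run_child():
    """Fork a child; the parent waits and returns the child's exit status."""
    pid = os.fork()                 # clone the calling process
    if pid == 0:
        # Child: starts with the same memory image as the parent,
        # but a distinct address space from here on.
        os._exit(7)                 # a "normal exit" termination
    # Parent: pid identifies the child; block until it terminates.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

print(run_child())  # 7
```

The parent learns of the child's termination only through the wait system call, which is also how the kernel can finally free the child's PCB.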
Process States
A process goes through a series of discrete process states.
• New state: the process is being created.
• Terminated state: the process has finished execution.
• Blocked (waiting) state: when a process blocks, it does so because logically it cannot continue, typically because it is waiting for input that is not yet available. Formally, a process is said to be blocked if it is waiting for some event to happen (such as an I/O completion) before it can proceed. In this state a process is unable to run until some external event happens.
• Running state: a process is said to be running if it currently has the CPU, that is, it is actually using the CPU at that particular instant.
• Ready state: a process is said to be ready if it could use the CPU if one were available. It is runnable, but temporarily stopped to let another process run.
Logically, the 'Running' and 'Ready' states are similar: in both cases the process is willing to run, only in the 'Ready' state there is temporarily no CPU available for it. The 'Blocked' state is different from the 'Running' and 'Ready' states in that the process cannot run even if a CPU is available.


Process State Transitions
Following are the six (6) possible transitions among the above-mentioned five (5) states.
Transition 1 occurs when a process discovers that it cannot continue. If a running process initiates an I/O operation before its allotted time expires, it voluntarily relinquishes the CPU. This state transition is: Block (process-name): Running → Blocked.
Transition 2 occurs when the scheduler decides that the running process has run long enough and it is time to let another process have CPU time. This state transition is: Time-Run-Out (process-name): Running → Ready.
Transition 3 occurs when all other processes have had their share and it is time for the first process to run again. This state transition is: Dispatch (process-name): Ready → Running.
Transition 4 occurs when the external event for which a process was waiting (such as the arrival of input) happens. This state transition is: Wakeup (process-name): Blocked → Ready.
Transition 5 occurs when the process is created. This state transition is: Admitted (process-name): New → Ready.
Transition 6 occurs when the process has finished execution. This state transition is: Exit (process-name): Running → Terminated.
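The six transitions define exactly which state changes are legal; anything else (e.g. Blocked → Running) is forbidden. A table-driven sketch makes this checkable; the encoding and the event names are our own, not OS code.

```python
# Each legal transition maps (from-state, event) -> to-state.
TRANSITIONS = {
    ("Running", "block"):    "Blocked",     # 1: waits for I/O
    ("Running", "timeout"):  "Ready",       # 2: time slice expired
    ("Ready",   "dispatch"): "Running",     # 3: scheduler picks it
    ("Blocked", "wakeup"):   "Ready",       # 4: awaited event happened
    ("New",     "admit"):    "Ready",       # 5: process created
    ("Running", "exit"):     "Terminated",  # 6: finished execution
}

def step(state, event):
    """Apply one transition; any move not in the table is illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}")

s = step("New", "admit")    # New -> Ready
s = step(s, "dispatch")     # Ready -> Running
s = step(s, "block")        # Running -> Blocked
print(s)  # Blocked
```

Note that a blocked process cannot be dispatched directly: it must first be woken up (transition 4) and compete again from the ready state.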

33

Process Control Block
A process in an operating system is represented by a data structure known as a process control block (PCB) or process descriptor. The PCB contains important information about the specific process, including:
• The current state of the process, i.e., whether it is ready, running, waiting, or whatever.
• A unique identification of the process, in order to track "which is which".
• A pointer to the parent process.
• Similarly, a pointer to the child process (if one exists).
• The priority of the process (a part of the CPU scheduling information).
• Pointers to locate the memory of the process.
• A register save area.
• The processor it is running on.

The PCB is a store that allows the operating system to locate key information about a process. Thus, the PCB is the data structure that defines a process to the operating system.
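A PCB is, concretely, just a record with the fields listed above. As a sketch (the field names and the process table are our own simplification, not any real kernel's layout), it might look like:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PCB:
    """A toy process control block: one record per process, holding
    what the OS needs to find, schedule, and resume the process."""
    pid: int                                       # unique identification
    state: str = "New"                             # ready / running / waiting ...
    parent: Optional[int] = None                   # pointer to parent process
    children: list = field(default_factory=list)   # pointers to child processes
    priority: int = 0                              # CPU-scheduling information
    registers: dict = field(default_factory=dict)  # register save area

# The OS keeps a table of PCBs, indexed by pid.
table = {1: PCB(pid=1, state="Running")}
table[2] = PCB(pid=2, parent=1, priority=5)        # a child of process 1
table[1].children.append(2)
print(table[2].state)  # New
```

Erasing a process at termination then amounts to deleting its entry from the table, returning the record's space to the free pool, as described above.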

Process (computing)
In computing, a process is an instance of a computer program, consisting of one or more threads, that is being sequentially executed by a computer system that has the ability to run several computer programs concurrently. A computer program itself is just a passive collection of instructions, while a process is the actual execution of those instructions. Several processes may be associated with the same program; for example, opening up several instances of the same program often means more than one process is being executed. In the computing world, processes are formally defined by the operating system (OS) running them, and so may differ in detail from one OS to another. A single computer processor executes one or more instructions at a time (per clock cycle), one after the other (this is a simplification; for the full story, see superscalar CPU architecture). To allow users to run several programs at once (e.g., so that processor time is not wasted waiting for input from a resource), single-processor computer systems can perform time-sharing. Time-sharing allows processes to switch between being executed and waiting (to continue) to be executed. In most cases this is done very rapidly, providing the illusion that several processes are executing 'at once'. (This is known as concurrency or multiprogramming.) Using more than one physical processor on a computer permits true simultaneous execution of more than one stream of instructions from different processes, but time-sharing is still typically used to allow more than one process to run at a time. ('Concurrency' is the term generally used to refer to several independent processes sharing a single processor; 'simultaneous' is used to refer to several processes, each with their own processor.) Different processes may share the same set of instructions in memory (to save storage), but this is not known to any one process.
Each execution of the same set of instructions is known as an instance: a completely separate instantiation of the program.


For security and reliability reasons, most modern operating systems prevent direct communication between 'independent' processes, providing strictly mediated and controlled inter-process communication functionality.
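One of the simplest mediated channels on a POSIX system is the pipe: the kernel carries bytes from one process to another, so neither process ever touches the other's address space. A sketch using fork and a pipe (POSIX-only; Python's `os` module is only a thin wrapper here):

```python
import os

def pipe_message(msg: bytes) -> bytes:
    """Send msg from a child process to its parent through a kernel pipe."""
    r, w = os.pipe()          # kernel-mediated channel: (read end, write end)
    pid = os.fork()
    if pid == 0:              # child: write the message and exit
        os.close(r)
        os.write(w, msg)
        os.close(w)
        os._exit(0)
    os.close(w)               # parent: read what the child sent
    data = os.read(r, len(msg))
    os.close(r)
    os.waitpid(pid, 0)        # reap the terminated child
    return data

print(pipe_message(b"hello").decode())  # hello
```

The two processes share no memory; every byte passes through the kernel, which is exactly the "strictly mediated" communication described above.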

Sub-processes and multi-threading
Thread (computer science)
A process may split itself into multiple 'daughter' sub-processes or threads that execute in parallel, running different instructions on much of the same resources and data (or, as noted, the same instructions on logically different resources and data). Multithreading is useful when various 'events' occur in an unpredictable order and should be processed in another order than they occur, for example based on response-time constraints. Multithreading makes it possible for the processing of one event to be temporarily interrupted by an event of higher priority. Multithreading may also result in more efficient CPU time utilization, since the CPU may switch to low-priority tasks while waiting for other events to occur. For example, a word processor could perform a spell check as the user types without "freezing" the application: a high-priority thread could handle user input and update the display, while a low-priority background thread runs the time-consuming spell-checking utility. The entered text is then shown immediately on the screen, while spelling mistakes are indicated or corrected after a longer time. Multithreading also allows a server, such as a web server, to serve requests from several users concurrently, so that requests are not left unheard while the server is busy processing another request. One simple solution to that problem is one thread that puts every incoming request in a queue, and a second thread that processes the requests one by one in a first-come, first-served manner. However, if the processing time is very long for some requests (such as large file requests or requests from users with slow network access data rates), this approach would result in long response times even for requests that do not require long processing time, since they may have to wait in the queue.
One thread per request would reduce the response time substantially for many users, and may reduce the CPU idle time and increase the utilization of CPU and network capacity. When the communication protocol between the client and server is a communication session involving a sequence of several messages and responses in each direction (as in the TCP transport protocol used for web browsing), creating one thread per communication session would reduce the complexity of the program substantially, since each thread is an instance with its own state and variables. In a similar fashion, multi-threading makes it possible for a client, such as a web browser, to communicate efficiently with several servers concurrently. A process that has only one thread is referred to as a single-threaded process, while a process with multiple threads is referred to as a multi-threaded process. Multi-threaded processes have the advantage over multi-process systems that they can perform several tasks concurrently without the extra overhead needed to create a new process and handle synchronized communication between processes. However, single-threaded processes have the advantage of even lower overhead.
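The queue-plus-worker arrangement described above (one thread enqueues incoming requests, a second serves them first-come, first-served) can be sketched in a few lines of threaded code. The request "handler" here is a stand-in for real work, not a real server.

```python
import queue
import threading

requests = queue.Queue()   # thread-safe queue of incoming requests
results = {}

def worker():
    """Second thread: serve queued requests first-come, first-served."""
    while True:
        name = requests.get()
        if name is None:                     # sentinel: no more requests
            break
        results[name] = f"served {name}"     # stand-in for real processing
        requests.task_done()

t = threading.Thread(target=worker)
t.start()

# The first thread (here, the main one) enqueues incoming requests.
for name in ("alice", "bob"):
    requests.put(name)

requests.put(None)   # tell the worker to stop
t.join()
print(results["alice"])  # served alice
```

Because both threads live in one process, they share `results` directly, with no extra inter-process communication, which is the lower overhead the text attributes to threads.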

Representation
In general, a computer system process consists of (or is said to 'own') the following resources:
• An image of the executable machine code associated with a program.



• Memory (typically some region of virtual memory), which includes the executable code, process-specific data (input and output), a call stack (to keep track of active subroutines and/or other events), and a heap to hold intermediate computation data generated during run time.
• Operating system descriptors of resources that are allocated to the process, such as file descriptors (Unix terminology) or handles (Windows), and data sources and sinks.
• Security attributes, such as the process owner and the process's set of permissions (allowable operations).
• Processor state (context), such as the content of registers, physical memory addressing, etc. The state is typically stored in computer registers when the process is executing, and in memory otherwise.


The operating system holds most of this information about active processes in data structures called process control blocks (PCBs). Any subset of the resources, but typically at least the processor state, may be associated with each of the process's threads, in operating systems that support threads or 'daughter' processes. The operating system keeps its processes separated and allocates the resources they need, so that they are less likely to interfere with each other and cause system failures (e.g., deadlock or thrashing). The operating system may also provide mechanisms for inter-process communication, to enable processes to interact in safe and predictable ways.

Process management in multi-tasking operating systems
Process management (computing)
A multitasking* operating system may just switch between processes to give the appearance of many processes executing concurrently or simultaneously, though in fact only one process can be executing at any one time on a single-core CPU (unless using multi-threading or other similar technology). It is usual to associate a single process with a main program, and 'daughter' ('child') processes with any spin-off, parallel processes, which behave like asynchronous subroutines. A process is said to own resources, of which an image of its program (in memory) is one such resource. (Note, however, that in multiprocessing systems many processes may run off of, or share, the same reentrant program at the same location in memory, but each process is said to own its own image of the program.) Processes are often called tasks in embedded operating systems. The sense of 'process' (or task) is 'something that takes up time', as opposed to 'memory', which is 'something that takes up space'. (Historically, the terms 'task' and 'process' were used interchangeably, but the term 'task' seems to be dropping from the computer lexicon.) The above description applies both to processes managed by an operating system and to processes as defined by process calculi. If a process requests something for which it must wait, it will be blocked. When the process is in the blocked state, it is eligible for swapping to disk, but this is transparent in a virtual memory system, where blocks of memory values may really be on disk and not in main memory at any time. Note that even unused portions of active processes/tasks (executing programs) are eligible for

swapping to disk. All parts of an executing program and its data do not have to be in physical memory for the associated process to be active.
*Tasks and processes refer essentially to the same entity, and although they have somewhat different terminological histories, they have come to be used as synonyms. Today the term process is generally preferred over task, except when referring to 'multitasking', since the alternative term, 'multiprocessing', is too easy to confuse with multiprocessor (which is a computer with two or more CPUs).
In the process model, all software on the computer is organized into a number of sequential processes. A process includes the PC, registers, and variables. Conceptually, each process has its own virtual CPU. In reality, the CPU switches back and forth among processes. (The rapid switching back and forth is called multiprogramming.) We are starting with the CPU as a resource, so we need an abstraction of CPU use. We define a process as the OS's representation of a program in execution, so that we can allocate CPU time to it. (Other definitions range from "the thing pointed to by a PCB" to "the animated spirit of a procedure".) Note the difference between a program and a process: the ls program on disk is a program; the ls instance running on a computer is a process.

Process states
The various process states are displayed in a state diagram, with arrows indicating the possible transitions between states. An operating system kernel that allows multitasking needs processes to have certain states. The names of these states are not standardised, but they have similar functionality.
• First, the process is "created": it is loaded from a secondary storage device (hard disk, CD-ROM, ...) into main memory. After that, the process scheduler assigns it the "waiting" state.
• While the process is "waiting", it waits for the scheduler to do a so-called context switch and load the process into the processor. The process state then becomes "running", and the processor executes the process's instructions.
• If a process needs to wait for a resource (for user input, for a file to open, ...), it is assigned the "blocked" state. The process state is changed back to "waiting" when the process no longer needs to wait.
• Once the process finishes execution, or is terminated by the operating system, it is no longer needed. The process is either removed instantly or moved to the "terminated" state; in the latter case, it just waits to be removed from main memory.
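The transitions just listed can be sketched as a small table. This is an illustrative sketch only: the enum names follow the text, but the `legal` function and its transition table are hypothetical, not part of any particular kernel's API.

```cpp
#include <cassert>
#include <set>
#include <utility>

// States as named in the text.
enum State { CREATED, WAITING, RUNNING, BLOCKED, TERMINATED };

// Legal transitions from the text: created -> waiting (loaded into memory),
// waiting -> running (context switch in), running -> waiting (switched out),
// running -> blocked (waits for a resource), blocked -> waiting (resource
// ready), running -> terminated (finished or killed by the OS).
bool legal(State from, State to) {
    static const std::set<std::pair<State, State> > allowed = {
        {CREATED, WAITING}, {WAITING, RUNNING},  {RUNNING, WAITING},
        {RUNNING, BLOCKED}, {BLOCKED, WAITING},  {RUNNING, TERMINATED}};
    return allowed.count(std::make_pair(from, to)) > 0;
}
```

Note that there is no blocked-to-running edge: a blocked process must first return to the waiting (ready) state and be dispatched again.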





Inter-proce"" co##unication /hen processes communicate with each other it is called >#nter-process communication> (#"'). #t is possible for both processes to run even on different machines. The operating system (OS) differ one to another therefore some mediators (called protocols) are needed. 5i"tor!


By the early 1960s, computer control software had evolved from monitor control software (e.g., IBSYS) to executive control software. Computers got "faster", and computer time was still neither "cheap" nor fully used. This made multiprogramming possible and necessary. Multiprogramming means that several programs run "at the same time" (concurrently). At first they ran on a single processor (i.e., a uniprocessor) and shared scarce resources. Multiprogramming is also a basic form of multiprocessing, a much broader term. Programs consist of a sequence of instructions for the processor. A single processor can run only one instruction at a time, so it is impossible to run more programs literally at the same time. A program might need some resource (input, ...) which has a "big" delay, or might start some slow operation (output to a printer, ...). This all leads to the processor being "idle" (unused). To use the processor at all times, the execution of such a program was halted; at that point, a second (or nth) program was started or restarted. The user perceived that the programs ran "at the same time" (hence the term concurrent). Shortly thereafter, the notion of a "program" was expanded to the notion of an "executing program and its context", and the concept of a process was born. This became necessary with the invention of re-entrant code. Threads came somewhat later. However, with the advent of time-sharing, computer networks, multiple-CPU shared-memory computers, etc., the old "multiprogramming" gave way to true multitasking, multiprocessing, and later multithreading.

Proce""e" in Action
At any given moment, a process is in one of several states. (The gender-neutral terms "parent" and "child" are something of an innovation: in the early days of computer science, talk of father and son processes was more common. This tradition worked in reverse at IBM, where processes were female. Because male mammals don't bear young, this is one of the few times where IBM nomenclature is more sensible.) The states are:

• running
• ready
• blocked

The transitions between them are labelled: Dispatch (ready to running), Quantum expired (running to ready), Block for I/O (running to blocked), and I/O completes (blocked to ready). The functions of the states are:

• Running: the process is executing on the processor. Only one process is in this state on a given processor.
• Blocked: the process is waiting for some external event, for example disk I/O.
• Ready: the process is ready to run.


These states may be defined implicitly: a process is in the ready state if it's on the ready queue, or blocked if it's on the blocked queue. Frequently there is more than one ready or blocked queue: there may be multiple ready queues to reflect job priorities, and multiple blocked queues to represent the events for which the processes are waiting. The act of removing one process from the running state and putting another there is called a context switch, because the context (that is, the running environment: the user credentials, open files, etc.) of one process is changed for another. We'll talk more about the details of this in the next lecture, but you should think about what constitutes a process context. Good questions to ask are "why does a process leave the running state?" and "how does the OS pick the process to run?" The answers to those questions make up the subtopic of process scheduling.


There are several kinds of schedulers:

• preemptive
• nonpreemptive
• cooperative
• run-to-completion

Run-to-completion schedulers are the easiest to understand: the process leaves the running state exactly once, when it exits, and never enters the blocked state. Examples are batch systems. Some web servers are conceptually run-to-completion, but because they are usually implemented on systems with a more complex scheduler, their behavior is more complex. Processes in a cooperative multitasking environment tell the OS when to switch them: they explicitly block for I/O, or they specifically give up the CPU to other processes. Examples are Apple's original multitasking system and some Java systems. A preemptive multitasking system interrupts (preempts) a running process if it has had the CPU too long and forces a context switch. UNIX is a preemptive multitasking system. The time a process can keep the CPU is called the system's time quantum. The choice of time quantum can have a profound effect on system performance. Small time quanta give good interactive performance to short interactive jobs (which are likely to block for I/O). Larger quanta are better for long-running CPU-bound jobs, because they do not make as many context switches (which don't move their computations forward). If the time quantum is so small that the system spends more time switching processes than doing useful work, the system is said to be thrashing. Thrashing is a condition we shall see in other subsystems as well; the general definition is a system spending more time on overhead than on useful work.

Proce"" Schedu(in and I#p(e#entation Schedu(in (ast lecture we discussed half of process scheduling when a process gives up the '"&. Today we start with the other half which process is scheduled to ta!e its place. This is our first introduction to scheduling algorithms which will be a repeating topic in the course. Operating systems schedule pages of memory dis! bloc!s and several other things. The algorithms discussed today and variations on them tuned for specific other applications are important tools for your bag of OS design tric!s. /hy not :ust pic! a process at randomL 'ongratulations that are a scheduling discipline random scheduling. #t has the advantages that itUs easy to implement but gives somewhat unpredictable results. 6. $. if you have a homogeneous set of :obs it may be an effective scheduling mechanismV All scheduling mechanisms involve design tradeoffs. The relevant parameters to trade off in process Scheduling includes4 • • • • • • • • • • • 5esponse Time for processes to complete. the OS may want to favor certain types of processes or to minimi3e a statistical property li!e average time #mplementation Time This includes the comple.ity of the algorithm and the maintenance Overhead Time to decide which process to schedule and to collect the data needed to ma!e that selection ,airness To what e.tent are different usersU processes treated differently

Some Scheduling Disciplines
Fir"t-In-Fir"t-Out *FIFO+ and Round Ro:in The ready %ueue is a single ,#,O %ueue where the ne.t process to be run is the one at the front of the %ueue. "rocesses are added to the bac! of the ready %ueue. This is a simple discipline to implement and with e%ual si3ed %uanta on a preemptive scheduling system results in each process getting roughly an e%ual time on the processor. #n the limit i.e. a preemptive system with a %uanta the si3e of one machine instruction and no conte.t switch overhead the discipline is called processor sharing and each of n processes gets </n of the '"& time. As %uanta get larger ,#,O tends to discriminate against short :obs that give up the '"& %uic!ly for #/O while long '"&-bound :obs hold it for their full %uantum. 40

Priority Scheduling

FIFO is egalitarian: all processes are treated equally. It is often reasonable to discriminate between processes based on their relative importance. (The payroll calculations may be more important than my video game.) One method of handling this is to assign each process a priority and run the highest-priority process. (What to do on a tie puts us back at square one: we pick a scheduling policy.) This solves FIFO's problem with interactive jobs in a mixed workload: interactive jobs are given high priority and run whenever there are some; lower-priority CPU-bound jobs share what's left. Particularly aggressive priority schedulers reschedule jobs whenever a job moves on any queue, so interactive jobs are able to run immediately after their I/O completes. The CTSS system in Tanenbaum uses a different quantum at each priority scheduling level. More complex systems have rules about moving processes between priority levels. (Systems that move processes between multiple priorities based on their behavior are sometimes called multilevel feedback queues.)
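The "run the highest-priority process" rule can be sketched as follows. This is an illustrative sketch: the `Proc` struct is hypothetical, and breaking ties by queue order is just one possible policy, as the text notes.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical ready-queue entry: a process id and its priority
// (larger number = more important).
struct Proc {
    int pid;
    int priority;
};

// Pick the ready process with the highest priority.  std::max_element
// returns the *first* of equally large elements, so ties are broken in
// queue (FIFO) order -- a policy choice, not a law.
int pick(const std::vector<Proc>& ready) {
    auto it = std::max_element(
        ready.begin(), ready.end(),
        [](const Proc& a, const Proc& b) { return a.priority < b.priority; });
    return it->pid;
}
```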

Priority Problems: Starvation and Inversion

When processes cooperate in a priority scheduling system, there can be interactions between the processes that confuse the priority system. Consider three processes A, B, and C; A has the highest priority (runs first) and C the lowest, with B having a priority between them. A blocks waiting for C to do something. B will run to completion, even though A, a higher-priority process, could continue if C would run. This is sometimes referred to as a priority inversion. This happens in real systems: the Mars Rover a couple of years ago suffered a failure due to a priority inversion. Starvation is simpler to understand. Imagine our two-level priority system above with an endless, fast stream of interactive jobs: any CPU-bound jobs will never run. The final problem with priority systems is how to determine priorities. They can be statically allocated to each program (ls always runs with some fixed priority) or to each user (root always runs with some fixed priority), or computed on the fly (process aging). All of these have their problems. For every scheduling strategy there is a counter-strategy.

Shortest Job First (SJF)

An important metric of interactive job performance is the response time of the process (the amount of time that the process is in the system, i.e. on some queue). SJF minimizes the average response time for the system. Processes are labelled with their expected processing time, and the shortest one is scheduled first. The problem, of course, is determining those processing times. For batch processes that run frequently, guesses are easy to come by. Other programs have run times that vary widely (e.g., a prime tester runs quickly on even numbers and slowly on primes). In general, the problem of determining run times a priori is impossible.

There is hope, however, in the form of heuristics (that is, algorithms that provide good guesses). The simplest is to use a moving average. An average run time is kept for each program, and after each run of that program it is recomputed as:

estimate = a * (old estimate) + (1 - a) * measurement    (for 0 <= a < 1, a constant)

Moving averages are another powerful tool for your design toolkit.

Process Implementation

The operating system represents a process primarily in a data structure called a Process Control Block (PCB). You'll see Task Control Block (TCB) and other variants. When a process is created, it is allocated a PCB that includes:

• CPU registers
• Pointer to text (program code)
• Pointer to uninitialized data
• Stack pointer
• Program counter
• Pointer to data
• Root directory
• Default file permissions
• Working directory
• Process state
• Exit status
• File descriptors
• Process identifier (pid)
• User identifier (uid)
• Pending signals
• Signal maps
• Other OS-dependent information

These are some of the major elements that make up the process context, although not all of them are directly manipulated on a context switch.

Context Switching

The act of switching from one process to another is somewhat machine-dependent. A general outline is:


• The OS gets control (either because of a timer interrupt or because the process made a system call).
• Operating system processing info is updated (the pointer to the current PCB, etc.).
• Processor state is saved (registers, memory map, floating-point state, etc.).
• This process is replaced on the ready queue, and the next process is selected by the scheduling algorithm.


• The new process's operating system and processor state is restored.
• The new process continues. (To this process, it looks like a blocking call has just returned, or as if an interrupt service routine, not a signal handler, has just returned.)

Context switches must be made as safe and as fast as possible: safe because isolation must be maintained, and fast because any time spent doing them is stolen from processes doing useful work. Linux's well-tuned context-switch code runs in a few microseconds on a high-end Pentium.

Process Creation

There are two main models of process creation: the fork/exec model and the spawn model. On systems that support fork, a new process is created as a copy of the original one, and then explicitly executes (exec) a new program to run. In the spawn model, the new program and arguments are named in the system call; a new process is created, and that program is run directly. Fork is the more flexible model: it allows a program to arbitrarily change the environment of the child process before starting the new program. Typical fork pseudo-code looks like:

if (fork() == 0) {
    /* Child process */
    change standard input
    block signals for timers
    run the new program
} else {
    /* Parent process */
    wait for child to complete
}

Any parameters of the child process's operating environment that must be changed must be included in the parameters to spawn, and spawn will have a standard way of handling them. There are various ways to handle the proliferation of parameters that results; for example, AmigaDOS uses tag lists (linked lists of self-describing parameters) to solve the problem. The steps of process creation are similar for both models. The OS gains control after the fork or spawn system call, and creates and fills a new PCB. Then a new address space (memory) is allocated for the process. Fork creates a copy of the parent address space, and spawn creates a new address space derived from the program. Then the PCB is put on the run list, and the system call returns. An important difference between the two models is that the fork call must create a copy of the parent address space. This can be wasteful if that address space will be deleted and rewritten a few instructions later. One solution to this problem has been a second system call, vfork, that lets the child process use the parent's memory until an exec is made. We'll discuss other systems to mitigate the cost of fork when we talk about memory management.
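The fork model above can be sketched with the standard POSIX calls (this is ordinary Unix fork/waitpid, not the Nachos API; the fixed exit code 7 is an arbitrary stand-in for "run the new program"):

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Sketch of the fork/wait pattern from the pseudo-code above.
int run_child() {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child process: here one would change standard input, block
           signals for timers, then exec() the new program.  We just
           exit with a known status instead. */
        _exit(7);
    }
    /* Parent process: wait for the child to complete. */
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the child calls `_exit` rather than `exit`, so it does not run the parent's atexit handlers or flush inherited stdio buffers a second time.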

CHAPTER 4: THREADS
Thread"
Despite the fact that a thread must execute within a process, the process and its associated threads are different concepts. Processes are used to group resources together; threads are the entities scheduled for execution on the CPU. A thread is a single sequential stream within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution. In many respects, threads are a popular way to improve applications through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel. Like a traditional process (i.e. a process with one thread), a thread can be in any of several states (Running, Blocked, Ready, or Terminated). Each thread has its own stack: since a thread will generally call different procedures, it has a different execution history, which is why each thread needs its own stack. In an operating system that has a thread facility, the basic unit of CPU utilization is a thread. A thread has, or consists of, a program counter (PC), a register set, and a stack space. Threads are not independent of one another the way processes are; as a result, threads share with other threads their code section, data section, and OS resources (also known as the task), such as open files and signals.

Proce"" and Thread" A process is an e.ecution stream in the conte.t of a particular process state. • • An e.ecution stream is a se%uence of instructions. "rocess state determines the effect of the instructions. #t usually includes (but is not restricted to)4 o 5egisters o Stac! o 0emory (global variables and dynamically allocated memory) o Open file tables o Signal management information Iey concept4 processes are separated4 no process can directly affect the state of another process. "rocess is a !ey OS abstraction that users see - the environment you interact with when you use a computer is built up out of processes. • • • The shell you type stuff into is a process. /hen you e.ecute a program you have :ust compiled the OS generates a process to run the program. 8our /// browser is a process.

Organizing system activities around processes has proved to be a useful way of separating out different activities into coherent units. Two concepts: uniprogramming and multiprogramming.

• Uniprogramming: only one process at a time. Typical example: DOS. Problem: users often wish to perform more than one activity at a time (load a remote file while editing a program, for example), and uniprogramming does not allow this. So DOS and other uniprogrammed systems put in things like memory-resident programs that are invoked asynchronously, but still have separation problems. One key problem with DOS is that there is no memory protection: one program may write the memory of another program, causing weird bugs.
• Multiprogramming: multiple processes at a time. Typical of Unix, plus all currently envisioned new operating systems. Allows the system to separate out activities cleanly.



Multiprogramming introduces the resource-sharing problem: which processes get to use the physical resources of the machine, and when? One crucial resource: the CPU. The standard solution is to use preemptive multitasking: the OS runs one process for a while, then takes the CPU away from that process and lets another process run. It must save and restore process state. Key issue: fairness: the OS must ensure that all processes get their fair share of the CPU. How does the OS implement the process abstraction? It uses a context switch to switch from running one process to running another process.

How does the machine implement the context switch? A processor has a limited amount of physical resources; for example, it has only one register set, but every process on the machine has its own set of registers. Solution: save and restore the hardware state on a context switch. The state is saved in the Process Control Block (PCB). What is in the PCB? It depends on the hardware.

• Registers: almost all machines save the registers in the PCB.
• Processor status word.
• What about memory? Most machines allow memory from multiple processes to coexist in the physical memory of the machine. Some may require Memory Management Unit (MMU) changes on a context switch. But some early personal computers switched all of a process's memory out to disk (!!!).
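The save-and-restore of hardware state can be sketched as follows. The `CPU` and `PCB` structs and their field names here are illustrative stand-ins, not taken from any real kernel; real context-switch code is written in assembly against the actual register file.

```cpp
#include <cstring>

// Stand-in for the hardware state visible to one process.
struct CPU {
    unsigned long regs[4];  // general-purpose registers (tiny example set)
    unsigned long psw;      // processor status word
    unsigned long pc;       // program counter
};

// The register-save portion of a Process Control Block.
struct PCB {
    unsigned long regs[4];
    unsigned long psw;
    unsigned long pc;
};

// On a switch out, copy the hardware state into the old process's PCB...
void save_state(const CPU& cpu, PCB& pcb) {
    std::memcpy(pcb.regs, cpu.regs, sizeof pcb.regs);
    pcb.psw = cpu.psw;
    pcb.pc = cpu.pc;
}

// ...and on a switch in, copy the new process's PCB back onto the hardware.
void restore_state(const PCB& pcb, CPU& cpu) {
    std::memcpy(cpu.regs, pcb.regs, sizeof cpu.regs);
    cpu.psw = pcb.psw;
    cpu.pc = pcb.pc;
}
```

A save followed by a restore reproduces the original hardware state exactly, which is what makes the switch transparent to the interrupted process.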

Operating systems are fundamentally event-driven systems: they wait for an event to happen, respond appropriately to the event, then wait for the next event. Examples:

• A user hits a key. The keystroke is echoed on the screen.
• A user program issues a system call to read a file. The operating system figures out which disk blocks to bring in, and generates a request to the disk controller to read the disk blocks into memory. The disk controller finishes reading in the disk blocks and generates an interrupt. The OS moves the read data into the user program and restarts the user program.
• A Mosaic or Netscape user asks for a URL to be retrieved. This eventually generates requests to the OS to send request packets out over the network to a remote WWW server. The OS sends the packets. The response packets come back from the WWW server, interrupting the processor. The OS figures out which process should get the packets, then routes the packets to that process.
• The time-slice timer goes off. The OS must save the state of the current process, choose another process to run, then give the CPU to that process.



When building an event-driven system with several distinct serial activities, threads are a key structuring mechanism of the OS. A thread is again an execution stream, in the context of a thread state. The key difference between processes and threads is that multiple threads share parts of their state: typically, multiple threads are allowed to read and write the same memory. (Recall that no process could directly access the memory of another process.) But each thread still has its own registers. It also has its own stack, but other threads can read and write the stack memory. What is in a thread control block? Typically just registers; nothing needs to be done to the MMU when switching threads, because all threads can access the same memory. Typically, an OS will have a separate thread for each distinct activity. In particular, the OS will have a separate thread for each process, and that thread will perform OS activities on behalf of the process. In this case we say that each user process is backed by a kernel thread.



When a process issues a system call to read a file, the process's thread will take over, figure out which disk accesses to generate, and issue the low-level instructions required to start the transfer. It then suspends until the disk finishes reading in the data. When a process starts up a remote TCP connection, its thread handles the low-level details of sending out network packets.



Having a separate thread for each activity allows the programmer to program the actions associated with that activity as a single serial stream of actions and events. The programmer does not have to deal with the complexity of interleaving multiple activities on the same thread. Why allow threads to access the same memory? Because inside the OS, threads must coordinate their activities very closely.

• If two processes issue read file system calls at close to the same time, the OS must make sure that it serializes the disk requests appropriately.
• When one process allocates memory, its thread must find some free memory and give it to the process. The OS must ensure that multiple threads allocate disjoint pieces of memory.

Having threads share the same address space makes it much easier to coordinate activities: one can build data structures that represent system state, and have threads read and write those data structures to figure out what to do when they need to process a request. One complication that threads must deal with: asynchrony. Asynchronous events happen arbitrarily as the thread is executing, and may interfere with the thread's activities unless the programmer does something to limit the asynchrony. Examples:

• An interrupt occurs, transferring control away from one thread to an interrupt handler.
• A time-slice switch occurs, transferring control from one thread to another.
• Two threads running on different processors read and write the same memory.

Asynchronous events, if not properly controlled, can lead to incorrect behavior. Examples:

• Two threads need to issue disk requests. The first thread starts to program the disk controller (assume it is memory-mapped, and one must issue multiple writes to specify a disk operation). In the meantime, the second thread runs on a different processor and also issues the memory-mapped writes to program the disk controller. The disk controller gets horribly confused and reads the wrong disk block.
• Two threads need to write to the display. The first thread starts to build its request, but before it finishes, a time-slice switch occurs and the second thread starts its request. The combination of the two threads issues a forbidden request sequence, and smoke starts pouring out of the display.
• For accounting reasons, the operating system keeps track of how much time is spent in each user program. It also keeps a running sum of the total amount of time spent in all user programs. Two threads increment their local counters for their processes, then concurrently increment the global counter. Their increments interfere, and the recorded total time spent in all user processes is less than the sum of the local times.





So programmers need to coordinate the activities of the multiple threads so that these bad things don't happen. Key mechanism: synchronization operations. These operations allow threads to control the timing of their events relative to events in other threads. Appropriate use allows programmers to avoid problems like the ones outlined above.

Thread Creation, Manipulation, and Synchronization
We first must postulate a thread creation and manipulation interface. We will use the one in Nachos:

class Thread {
public:
    Thread(char* debugName);
    ~Thread();
    void Fork(void (*func)(int), int arg);
    void Yield();
    void Finish();
};

The Thread constructor creates a new thread. It allocates a data structure with space for the TCB.

• To actually start the thread running, you must tell it what function to start running when it runs. The Fork method gives it the function and a parameter to the function.
• What does Fork do? It first allocates a stack for the thread. It then sets up the TCB so that when the thread starts running, it will invoke the function and pass it the correct parameter. It then puts the thread on a run queue someplace. Fork then returns, and the thread that called Fork continues.

How does the OS set up the TCB so that the thread starts running at the function? First, it sets the stack pointer in the TCB to the stack. Then it sets the PC in the TCB to be the first instruction in the function. Then it sets the register in the TCB holding the first parameter to the parameter. When the thread system restores the state from the TCB, the function will magically start to run. The system maintains a queue of runnable threads. Whenever a processor becomes idle, the thread scheduler grabs a thread off of the run queue and runs the thread. Conceptually, threads execute concurrently; this is the best way to reason about the behavior of threads. But in practice, the OS has only a finite number of processors, and it can't run all of the runnable threads at once. So we must multiplex the runnable threads on the finite number of processors.
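The TCB setup just described (stack pointer, PC, argument register) can be sketched as follows. The struct and its field names are hypothetical, chosen to mirror the three steps in the text rather than any real thread package:

```cpp
// Hypothetical saved-state record for a not-yet-started thread.
struct TCB {
    char* sp;          // saved stack pointer
    void (*pc)(int);   // saved program counter: first instruction to run
    int arg0;          // saved register holding the first parameter
};

// Mirror of what Fork() is described as doing before the thread first runs.
TCB* setup_tcb(void (*func)(int), int arg, char* stack, int stack_size) {
    TCB* tcb = new TCB;
    tcb->sp = stack + stack_size;  // stacks typically grow downward,
                                   // so start at the top of the region
    tcb->pc = func;                // "restore" will jump here
    tcb->arg0 = arg;               // func receives this as its parameter
    return tcb;
}
```

When the thread system later restores this saved state onto the processor, execution begins at `func(arg)` on the new stack, which is exactly the "magic" the text refers to.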




Let's do a few thread examples. First example: two threads that increment a variable.

int a = 0;

void sum(int p) {
    a++;
    print("%d : a = %d\n", p, a);

}

void main() {
    Thread *t = new Thread("child");
    t->Fork(sum, 1);
    sum(0);
}


• The two calls to sum run concurrently. What are the possible results of the program?
• To understand this fully, we must break the sum subroutine up into its primitive components. Sum first reads the value of a into a register. It then increments the register, then stores the contents of the register back into a. It then reads the values of the control string, p, and a into the registers that it uses to pass arguments to the print routine. It then calls print, which prints out the data.

The best way to understand the instruction sequence is to look at the generated assembly language (cleaned up just a bit). You can have the compiler generate assembly code instead of object code by giving it the -S flag. It will put the generated assembly in a file with the same name as the .c or .cc file, but with a .s suffix.



la      a, %r0
ld      [%r0], %r1
add     %r1, 1, %r1
st      %r1, [%r0]
ld      [%r0], %o1      ! parameters are passed starting with %o0, %o1, ...
la      .L14, %o0       ! address of the format string
call    print

So when they execute concurrently, the result depends on how the instructions interleave. What are the possible results?

0: 1   1: 2
0: 1   1: 1
1: 2   0: 1
1: 1   0: 1
1: 1   0: 2
0: 2   1: 2
0: 2   1: 1
1: 2   0: 2

So the results are nondeterministic: you may get different results when you run the program more than once, so it can be very difficult to reproduce bugs. Nondeterministic execution is one of the things that make writing parallel programs much more difficult than writing serial programs.

• Chances are the programmer is not happy with all of the possible results listed above. Probably the programmer wanted the value of a to be 2 after both threads finish. To achieve this, we must make the increment operation atomic; that is, we must prevent the interleaving of the instructions in a way that would interfere with the additions.
• Concept of an atomic operation: an atomic operation is one that executes without any interference from other operations; in other words, it executes as one unit. Typically one builds complex atomic operations up out of sequences of primitive operations. In our case, the primitive operations are the individual machine instructions.
• More formally, if several atomic operations execute, the final result is guaranteed to be the same as if the operations executed in some serial order.
• In our case above, we build an increment operation up out of load, store, and add machine instructions, and we want the increment operation to be atomic. We use synchronization operations to make code sequences atomic. First synchronization abstraction: semaphores. A semaphore is conceptually a counter that supports two atomic operations, P and V. Here is the Semaphore interface from Nachos:

class Semaphore {
public:
    Semaphore(char* debugName, int initialValue);
    ~Semaphore();
    void P();
    void V();
};

• Here is what the operations do:




o Semaphore(name, count): creates a semaphore and initializes the counter to count.
o P(): atomically waits until the counter is greater than 0, then decrements the counter and returns.
o V(): atomically increments the counter.

• Here is how we can use the semaphore to make the sum example work:

int a = 0;
Semaphore *s;

void sum(int p) {
    int t;
    s->P();
    a++;
    t = a;
    s->V();
    print("%d : a = %d\n", p, t);
}

void main() {
    Thread *t = new Thread("child");
    s = new Semaphore("s", 1);
    t->Fork(sum, 1);
    sum(0);
}

• We are using semaphores here to implement a mutual exclusion mechanism. The idea behind mutual exclusion is that only one thread at a time should be allowed to do something; in this case, only one thread should access a. We use mutual exclusion to make operations atomic. The code that performs the atomic operation is called a critical section.
• Semaphores do much more than mutual exclusion. They can also be used to synchronize producer/consumer programs. The idea is that the producer is generating data and the consumer is consuming data. So a Unix pipe has a producer and a consumer. You can also think of a person typing at a keyboard as a producer, and the shell program reading the characters as a consumer.
• Here is the synchronization problem: make sure that the consumer does not get ahead of the producer, while still letting the producer produce without waiting for the consumer to consume. We can use semaphores to do this. Here is how it works:





5emaphore *s; void consumer (int dumm7) )hile (,) s*+1 (); consume the ne8t unit of data void producer(int dumm7) )hile (,) produce the ne8t unit of data s*+6 (); void main ) s ! ne) 5emaphore ($s$, "); Thread *t ! ne) Thread ($consumer$); t*+Fork(consumer, ,); t ! ne) Thread ($producer$); t*+Fork(producer, ,); ' #n some sense the semaphore is an abstraction of the collection of data. • #n the real world pragmatics intrude. #f we let the producer run forever and never run the consumer we have to store all of the produced data somewhere. $ut no machine has an infinite amount of storage. So we want to let the producer to get ahead of the consumer if it can but only a given amount ahead. /e need to implement a bounded buffer which can hold only 6 items. #f the bounded buffer is full the producer must wait before it can put any more data in. 5emaphore *full; 5emaphore *empt7; void consumer (int dumm7) 9hile (,) full*+1 (); consume the ne8t unit of data empt7*+6 (); void producer (int dumm7) )hile (,) 52

empt7*+1 (); produce the ne8t unit of data full*+6 (); void main () empt7 ! ne) 5emaphore ($empt7$, N); full ! ne) 5emaphore($full$, "); Thread *t ! ne) Thread ($consumer$); t*+Fork(consumer, ,); t ! ne) Thread ($producer$); t*+Fork(producer, ,); ' An e.ample of where you might use a producer and consumer in an operating system is the console (a device that reads and writes characters from and to the system console). 8ou would probably use semaphores to ma!e sure you don;t try to read a character before it is typed. • • Semaphores are one synchroni3ation abstraction. There is another called loc!s and condition variables. (oc!s are an abstraction specifically for mutual e.clusion only. =ere is the 6achos loc! interface4 class 3ock public: 3ock (char* debugName); F<== ~3ock (); :: deallocate lock :: initiali;e lock to be

        void Acquire();  // these are the only operations on a lock
        void Release();  // they are both *atomic*
    };

• A lock can be in one of two states: locked and unlocked. Semantics of lock operations:
o Lock(name): creates a lock that starts out in the unlocked state.
o Acquire(): Atomically waits until the lock state is unlocked, then sets the lock state to locked.
o Release(): Atomically changes the lock state to unlocked from locked.
• In assignment 1 you will implement locks in Nachos on top of semaphores.



• What are the requirements for a locking implementation?
o Only one thread can acquire the lock at a time. (safety)
o If multiple threads try to acquire an unlocked lock, one of the threads will get it. (liveness)
o All unlocks complete in finite time. (liveness)



• What are desirable properties for a locking implementation?
o Efficiency: take up as few resources as possible.
o Fairness: threads acquire the lock in the order they ask for it. There are also weaker forms of fairness.
o Simple to use.



• When we use locks, we typically associate a lock with the pieces of data that multiple threads access. When one thread wants to access a piece of data, it first acquires the lock. It then performs the access, then unlocks the lock. So the lock allows threads to perform complicated atomic operations on each piece of data.
• Can you implement an unbounded buffer using only locks? There is a problem - if the consumer wants to consume a piece of data before the producer produces the data, it must wait. But locks do not allow the consumer to wait until the producer produces the data. So the consumer must loop until the data is ready. This is bad because it wastes CPU resources. There is another synchronization abstraction, called condition variables, just for this kind of situation. Here is the Nachos interface:

    class Condition {
      public:
        Condition(char* debugName);
        ~Condition();
        void Wait(Lock *conditionLock);
        void Signal(Lock *conditionLock);
        void Broadcast(Lock *conditionLock);
    };







• Semantics of condition variable operations:
o Condition(name): creates a condition variable.
o Wait(Lock *l): Atomically releases the lock and waits. When Wait returns, the lock will have been reacquired.


o Signal(Lock *l): Enables one of the waiting threads to run. When Signal returns, the lock is still acquired.
o Broadcast(Lock *l): Enables all of the waiting threads to run. When Broadcast returns, the lock is still acquired.
• All of the lock arguments must be the same lock. In assignment 1 you will implement condition variables in Nachos on top of semaphores.
• Typically you associate a lock and a condition variable with a data structure. Before the program performs an operation on the data structure, it acquires the lock. If it has to wait before it can perform the operation, it uses the condition variable to wait for another operation to bring the data structure into a state where it can perform the operation. In some cases you need more than one condition variable.
• Let's say that we want to implement an unbounded buffer using locks and condition variables. In this case we have 2 consumers.

    Lock *l;
    Condition *c;
    int avail = 0;
    void consumer(int dummy) {
      while (1) {
        l->Acquire();
        if (avail == 0) {
          c->Wait(l);
        }



        consume the next unit of data
        avail--;
        l->Release();
      }
    }
    void producer(int dummy) {
      while (1) {
        l->Acquire();
        produce the next unit of data
        avail++;
        c->Signal(l);
        l->Release();
      }
    }
    void main() {

      l = new Lock("l");
      c = new Condition("c");
      Thread *t = new Thread("consumer");
      t->Fork(consumer, 1);
      t = new Thread("consumer");
      t->Fork(consumer, 2);
      t = new Thread("producer");
      t->Fork(producer, 1);
    }



• There are two variants of condition variables: Hoare condition variables and Mesa condition variables. For Hoare condition variables, when one thread performs a Signal, the very next thread to run is the waiting thread. For Mesa condition variables, there are no guarantees about when the signalled thread will run. Other threads that acquire the lock can execute between the signaller and the waiter. The example above will work with Hoare condition variables but not with Mesa condition variables.
• What is the problem with Mesa condition variables? Consider the following scenario: three threads, thread 1 producing data, threads 2 and 3 consuming data.
o Thread 2 calls consumer and suspends.
o Thread 1 calls producer and signals thread 2.
o Instead of thread 2 running next, thread 3 runs next, calls consumer, and consumes the element. (Note: with Hoare monitors, thread 2 would always run next, so this would not happen.)
o Thread 2 runs and tries to consume an item that is not there. Depending on the data structure used to store produced items, it may get some kind of illegal access error.





• How can we fix this problem? Replace the if with a while.

    void consumer(int dummy) {
      while (1) {
        l->Acquire();
        while (avail == 0) {
          c->Wait(l);
        }
        consume the next unit of data

        avail--;
        l->Release();
      }
    }

• In general, this is a crucial point. Always put whiles around your condition variable code. If you don't, you can get really obscure bugs that show up very infrequently.
• In this example, what is the data that the lock and condition variable are associated with? The avail variable.
• People have developed a programming abstraction that automatically associates locks and condition variables with data. This abstraction is called a monitor. A monitor is a data structure plus a set of operations (sort of like an abstract data type). The monitor also has a lock and, optionally, one or more condition variables. The compiler for the monitor language automatically inserts a lock operation at the beginning of each routine and an unlock operation at the end of the routine, so the programmer does not have to put in the lock operations.
• Monitor languages were popular in the middle 80's - they are in some sense safer because they eliminate one possible programming error. But more recent languages have tended not to support monitors explicitly, and instead expose the locking operations to the programmer. So the programmer has to insert the lock and unlock operations by hand. Java takes a middle ground - it supports monitors, but it also allows programmers to exert finer-grain control over the locked sections by supporting synchronized blocks within methods. But synchronized blocks still present a structured model of synchronization, so it is not possible to mismatch the lock acquire and release.
• Laundromat Example: A local laundromat has switched to a computerized machine allocation scheme. There are N machines, numbered 1 to N. By the front door there are P allocation stations. When you want to wash your clothes, you go to an allocation station and put in your coins. The allocation station gives you a number, and you use that machine. There are also P deallocation stations. When your clothes finish, you give the number back to one of the deallocation stations, and someone else can use the machine. Here is the alpha release of the machine allocation software:

    allocate(int dummy) {
      while (1) {
        wait for coins from user
        n = get();
        give number n to user
      }
    }
    deallocate(int dummy) {
      while (1) {
        wait for number n from user







        put(n);
      }
    }
    main() {
      for (i = 0; i < P; i++) {
        t = new Thread("allocate");
        t->Fork(allocate, 0);
        t = new Thread("deallocate");
        t->Fork(deallocate, 0);
      }
    }

• The key parts of the scheduling are done in the two routines get and put, which use an array data structure a to keep track of which machines are in use and which are free.

    int a[N];
    int get() {
      for (i = 0; i < N; i++) {
        if (a[i] == 0) {
          a[i] = 1;
          return (i+1);
        }
      }
    }
    void put(int i) {
      a[i-1] = 0;
    }

• It seems that the alpha software isn't doing all that well. Just looking at the software, you can see that there are several synchronization problems.
• The first problem is that sometimes two people are assigned to the same machine. Why does this happen? We can fix this with a lock:

    int a[N];
    Lock *l;
    int get() {
      l->Acquire();
      for (i = 0; i < N; i++) {
        if (a[i] == 0) {
          a[i] = 1;
          l->Release();

          return (i+1);
        }
      }
      l->Release();
    }
    void put(int i) {
      l->Acquire();
      a[i-1] = 0;
      l->Release();
    }

• So now we have fixed the multiple assignment problem. But what happens if someone comes into the laundry when all of the machines are already taken? What does get return? We must fix it so that the system waits until there is a machine free before it returns a number. The situation calls for condition variables.

    int a[N];
    Lock *l;
    Condition *c;
    int get() {
      l->Acquire();
      while (1) {
        for (i = 0; i < N; i++) {
          if (a[i] == 0) {
            a[i] = 1;
            l->Release();
            return (i+1);
          }
        }
        c->Wait(l);
      }
    }
    void put(int i) {
      l->Acquire();
      a[i-1] = 0;
      c->Signal(l);
      l->Release();
    }



• What data is the lock protecting? The a array.




• When would you use a broadcast operation? Whenever you want to wake up all waiting threads, not just one. For an event that happens only once - for example, a bunch of threads may wait until a file is deleted; the thread that actually deleted the file could use a broadcast to wake up all of the threads. Also use a broadcast for allocation/deallocation of variable sized units. Example: concurrent malloc/free.

    Lock *l;
    Condition *c;
    char *malloc(int s) {
      l->Acquire();
      while (cannot allocate a chunk of size s) {
        c->Wait(l);
      }
      allocate chunk of size s;
      l->Release();
      return pointer to allocated chunk;
    }
    void free(char *m) {
      l->Acquire();
      deallocate m;
      c->Broadcast(l);
      l->Release();
    }

• Example with malloc/free. Initially we start out with 10 bytes free.

    Time  Process 1                   Process 2                     Process 3
    1     malloc(10) - succeeds       malloc(5) - suspends on lock  malloc(5) - suspends on lock
    2                                 gets lock - waits
    3                                                               gets lock - waits
    4     free(10) - broadcast
    5                                 resume malloc(5) - succeeds
    6                                                               resume malloc(5) - succeeds
    7     malloc(7) - waits
    8                                                               malloc(3) - waits
    9                                 free(5) - broadcast
    10    resume malloc(7) - waits                                  resume malloc(3) - succeeds

• What would happen if we changed c->Broadcast(l) to c->Signal(l)? At step 10, process 3 would not wake up, and it would not get the chance to allocate available memory. What would happen if we changed the while loop to an if?
• You will be asked to implement condition variables as part of assignment 1. The following implementation is INCORRECT. Please do not turn this implementation in.

    class Condition {
      private:
        int waiting;
        Semaphore *sema;
    };
    void Condition::Wait(Lock* l) {
      waiting++;
      l->Release();
      sema->P();
      l->Acquire();
    }
    void Condition::Signal(Lock* l) {
      if (waiting > 0) {
        sema->V();
        waiting--;
      }
    }

As we mentioned earlier, in many respects threads operate in the same way as processes. Some of the similarities and differences are:

Similarities
• Like processes, the threads share the CPU, and only one thread is active (running) at a time.
• Like processes, the threads within a process execute sequentially.
• Like processes, a thread can create children.
• And like processes, if one thread is blocked, another thread can run.

Differences
• Unlike processes, threads are not independent of one another.
• Unlike processes, all threads can access every address in the task.
• Unlike processes, threads are designed to assist one another. Note that processes might or might not assist one another, because processes may originate from different users.

Why Threads?
Following are some reasons why we use threads in designing operating systems.
1. A process with multiple threads makes a great server, for example a printer server.
2. Because threads can share common data, they do not need to use interprocess communication.

3. Because of their very nature, threads can take advantage of multiprocessors.
Threads are cheap in the sense that:
1. They only need a stack and storage for registers, therefore threads are cheap to create.
2. Threads use very few resources of the operating system in which they are working. That is, threads do not need a new address space, global data, program code, or operating system resources.
3. Context switching is fast when working with threads. The reason is that we only have to save and/or restore the PC, SP and registers.
But this cheapness does not come free - the biggest drawback is that there is no protection between threads.

User-Level Threads and Kernel-Level Threads
User-Level Threads
User-level threads are implemented in user-level libraries, rather than via system calls, so thread switching does not need to call the operating system or cause an interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and manages them as if they were single-threaded processes.
Advantages:

The most obvious advantage of this technique is that a user-level threads package can be implemented on an operating system that does not support threads. Some other advantages are:
• User-level threads do not require modification to the operating system.
• Simple Representation: Each thread is represented simply by a PC, registers, stack, and a small control block, all stored in the user process address space.
• Simple Management: Creating a thread, switching between threads, and synchronizing between threads can all be done without intervention of the kernel.
• Fast and Efficient: Thread switching is not much more expensive than a procedure call.
Disadvantages:
• There is a lack of coordination between threads and the operating system kernel. Therefore, a process as a whole gets one time slice, irrespective of whether the process has one thread or 1000 threads within it. It is up to each thread to relinquish control to the other threads.
• User-level threads require non-blocking system calls, i.e. a multithreaded kernel. Otherwise, the entire process will block in the kernel, even if there are runnable threads left in the process. For example, if one thread causes a page fault, the whole process blocks.



Kernel-Level Threads
In this method, the kernel knows about and manages the threads. No runtime system is needed in this case. Instead of a thread table in each process, the kernel has a thread table that keeps track of all threads in the system. In addition, the kernel also maintains the traditional process table to keep track of processes. The operating system kernel provides system calls to create and manage threads.
Advantages:
• Because the kernel has full knowledge of all threads, the scheduler may decide to give more time to a process having a large number of threads than to a process having a small number of threads.
• Kernel-level threads are especially good for applications that frequently block.

Disadvantages:
• Kernel-level threads are slow and inefficient. For instance, thread operations are hundreds of times slower than those of user-level threads.
• Since the kernel must manage and schedule threads as well as processes, it requires a full thread control block (TCB) for each thread to maintain information about threads. As a result, there is significant overhead and increased kernel complexity.


Ad;anta e" o8 Thread" o;er Mu(tip(e Proce""e" • 'onte.t Switching Threads are very ine.pensive to create and destroy and they are ine.pensive to represent. ,or e.ample they re%uire space to store the "' the S" and the general-purpose registers but they do not re%uire space to share memory information #nformation about open files of #/O devices in use etc. /ith so little conte.t it is much faster to switch between threads. #n other words it is relatively easier for a conte.t switch using threads. Sharing Treads allow the sharing of a lot resources that cannot be shared in process for e.ample sharing code section data section Operating System resources li!e open file etc.



Disadvantages of Threads over Multiprocesses
• Blocking: The major disadvantage is that if the kernel is single-threaded, a system call by one thread will block the whole process, and the CPU may be idle during the blocking period.
• Security: Since there is extensive sharing among threads, there is a potential security problem. It is quite possible that one thread overwrites the stack of another thread (or damages shared data), although it is very unlikely since threads are meant to cooperate on a single task.

Applications that Benefit from Threads
A proxy server satisfying the requests of a number of computers on a LAN would benefit from a multi-threaded process. In general, any program that has to do more than one task at a time could benefit from multitasking. For example, a program that reads input, processes it, and writes output could have three threads, one for each task.
Applications that cannot Benefit from Threads
Any sequential process that cannot be divided into parallel tasks will not benefit from threads, as each thread would block until the previous one completes. For example, a program that displays the time of day would not benefit from multiple threads.
Resources used in Thread Creation and Process Creation

When a new thread is created, it shares its code section, data section, and operating system resources like open files with the other threads. But it is allocated its own stack, register set, and program counter. The creation of a new process differs from that of a thread mainly in the fact that all the resources a thread shares must be provided explicitly for each process. So though two processes may be running the same piece of code, they need to have their own copy of the code in main memory to be able to run. Two processes also do not share other resources with each other. This makes the creation of a new process very costly compared to that of a new thread.

Context Switch
To give each process on a multiprogrammed machine a fair share of the CPU, a hardware clock generates interrupts periodically. This allows the operating system to schedule all processes in main memory (using a scheduling algorithm) to run on the CPU at equal intervals. Each time a clock interrupt occurs, the interrupt handler checks how much time the currently running process has used. If it has used up its entire time slice, then the CPU scheduling algorithm (in the kernel) picks a different process to run. Each switch of the CPU from one process to another is called a context switch.
Major Steps of Context Switching
• The values of the CPU registers are saved in the process table entry of the process that was running just before the clock interrupt occurred.
• The registers are loaded from the process picked by the CPU scheduler to run next.

In a multiprogrammed uniprocessor computing system, context switches occur frequently enough that all processes appear to be running concurrently. If a process has more than one thread, the operating system can use the context switching technique to schedule the threads so they appear to execute in parallel. This is the case if threads are implemented at the kernel level. Threads can also be implemented entirely at the user level in run-time libraries. Since in this case no thread scheduling is provided by the operating system, it is the responsibility of the programmer to yield the CPU frequently enough in each thread so that all threads in the process can make progress.
Action of Kernel to Context Switch among Threads
Threads share a lot of resources with the other peer threads belonging to the same process. So a context switch among threads of the same process is easy. It involves a switch of the register set, the program counter, and the stack, and it is relatively easy for the kernel to accomplish this task.

Action of Kernel to Context Switch among Processes
Context switches among processes are expensive. Before a process can be switched, its process control block (PCB) must be saved by the operating system. The PCB consists of the following information:
• The process state.
• The program counter, PC.
• The values of the different registers.
• The CPU scheduling information for the process.
• Memory management information regarding the process.
• Possible accounting information for this process.



#/O status information of the process.

When the PCB of the currently executing process has been saved, the operating system loads the PCB of the next process that has to run on the CPU. This is a heavy task and it takes a lot of time.
User Threads
User threads are implemented entirely in user space. The programmer of the thread library writes code to synchronize threads and to context switch them, and they all run in one process. The operating system is unaware that a thread system is even running. User-level threads replicate some amount of kernel-level functionality in user space. Examples of user-level thread systems are Nachos and Java (on OSes that don't support kernel threads). Because the OS treats the running process like any other, there is no additional kernel overhead for user-level threads. However, the user-level threads only run when the OS has scheduled their underlying process (making a blocking system call blocks all the threads).
Kernel Threads
Some OS kernels support the notion of threads and schedule them directly. There are system calls to create threads and manipulate them in ways similar to processes. Synchronization and scheduling may be provided by the kernel. Kernel-level threads have more overhead in the kernel (a kernel thread control block) and more overhead in their use (manipulating them requires a system call). However, the abstraction is cleaner (threads can make system calls independently).


CHAPTER 5: THE CENTRAL PROCESSING UNIT (CPU)
This chapter gives some more detail on the Central Processing Unit (CPU) and leads up to where we can write significant programs in assembly/machine code. First we will give an overview of how a processor and memory function together to execute a single machine instruction - the famous fetch-decode-execute cycle. A CPU consists of three major parts:
1. The internal registers, the ALU, and the connecting buses - sometimes called the data path;
2. The input-output interface, which is the gateway through which data are sent to and received from main memory and input-output devices;
3. The control part, which directs the activities of the data path and input-output interface, e.g. opening and closing access to buses, selecting the ALU function, etc. We will avoid going into much detail about the control.
A fourth part, main memory, is never far from the CPU, but from a logical point of view is best kept separate. We will pay most attention to the data path part of the processor and what must happen in it to cause useful things to happen - to cause program instructions to be executed. In the system we describe, the control part is implemented by microprogram, i.e. the fetching, decoding and execution of a machine instruction are implemented by the execution of a set of sequencing steps called a microprogram. Note on terminology: the term microprogram was devised in the early 1950s, long before microprocessors were ever dreamt of.

The Architecture of Mic-1
Figure 5.1 shows the data path part of our hypothetical CPU, from (Tanenbaum, 1990) page 170 onwards. Here we briefly describe the components of Figure 5.1. Then we give a qualitative discussion of how it executes program instructions. Finally, we describe the execution of instructions in some detail.

Figure 5.1: Mic-1 CPU (from Tanenbaum, Structured Computer Organisation, 3rd ed.)


Registers
There are 16 identical 16-bit registers. But they are not general purpose; each has a special use:
PC, program counter: The PC points to the memory location that holds the next instruction to be executed;
AC, accumulator: The accumulator is like the display register in a calculator; most operations use it implicitly as an unmentioned input, and the result of any operation is placed in it. For now we can ignore all the others, though we give brief descriptions below.
SP, stack-pointer: Used for maintaining a data area called the stack; the stack is used for remembering where we came from when we call subprograms; likewise for remembering data when an interrupt is being processed; it is also used as a communication medium for passing data to subprograms; finally, it is used as a storage area for local variables in subprograms;
IR, Instruction Register: Holds the instruction (the actual instruction data) currently being executed.
TIR, Temporary Instruction Register: Holds temporary versions of the instruction while it is being decoded;
0, +1, -1: Constants; it is handy to have copies of them close by - it avoids wasting time accessing main memory.
AMASK: Another constant; used for masking (ANDing) out the address part of the instruction, i.e. AMASK AND IR yields the address.
SMASK: Ditto for stack (relative) addresses.
A, B, C, ..., F: General purpose registers; but general purpose only for the microprogrammer, i.e. the assembly language cannot address them.
Internal Buses
There are three internal buses: the A and B (source) buses and the C (destination) bus.
External Buses
The address bus and the data bus. A minor point to note: many buses, in particular those of many of the Intel 80x86 family, use the same physical bus (connections) for both address and data; it is simple to do - the control part of the bus just has to make sure all users of the bus know when it carries data and when it carries an address.
Latches
The A and B latches hold stable versions of the A and B buses.
There would be problems if, for example, AC were connected straight into the A input of the ALU while the output of the ALU was connected back to AC: which version of AC should be used? The answer would be continuously changing.
A-Multiplexer (AMUX)

The ALU input A can be fed with either: (i) the contents of the A latch; or (ii) the contents of the MBR, i.e. what was originally the contents of a memory location.
ALU
In Mac-1a the ALU may perform just one of four functions:
1. A + B; note 'plus' rather than 'or';
2. A AND B;
3. A (straight through; B ignored);
4. NOT A (B ignored).

Any other functions have to be programmed.
Shifter
The shifter is not a register - it passes the ALU output straight through: shifted left, shifted right, or not shifted.
Memory Address Register (MAR), Memory Buffer Register (MBR) and Memory
The MAR is a register which is used as a gateway - a 'buffer' - onto the address bus. Likewise the MBR (it might be better to call it the memory data register) for the data bus. The memory is considered to be a collection of cells or locations, each of which can be addressed individually and thus written to or read from. Effectively, memory is like an array in C, Basic or any other high-level language. For brevity, we shall refer to this memory 'array' as M, the address of a general cell as x, and the contents of the cell at address x as M[x].

To read from a memory cell, the controller must cause the following to happen:
1. Put an address in the MAR;

2. Request a read - by asserting a read control line;
3. At some time later, the contents of M[MAR] appear in the MBR, from where the controller can cause them to be ...
4. ... transferred to the AC or somewhere else.
To write to a memory cell, the controller must cause something similar to happen:
1. Put an address in the MAR;
2. Put the data in the MBR;
3. Request a write - by asserting a write control line;
4. At some time later, the data arrive in the memory cell M[MAR].

It is a feature of all general purpose computers that executable instructions and data occupy the same memory space. Often, programs are organised so that there are blocks of instructions and blocks of data. But there is no fundamental reason, except tidiness and efficiency, why instructions and data cannot be mixed up together.
Register Transfer Language
To describe the details of operation of the CPU, we use a simple language called Register Transfer Language (RTL). The notation is as follows. M[x] denotes the contents of location x. Think of an envelope with $100 in it and your address on it: the address identifies the envelope, and M[address] is the $100 inside. Reg denotes a register; Reg is one of PC, IR, AC, SP, etc. M[Reg] denotes the contents of the address contained in the register - think of an envelope containing another envelope's address. We use <- to denote transfer: A <- B. Pronounce this as 'A gets B'. In the case of A <- M[x], we say 'A gets the contents of x'.

Simple Model of a Computer - Part 3
Back in section 2.7 we produced a simple model of a computer. Here we show it again, in Figure 5.2.


Figure 5.2: Mechanical Computer
At the end of section 2.7 we admitted that we had been telling only half the truth! And we admitted that we had to fit the program into memory as well. Fine, here goes. We are going to use the same program. In this more realistic model, the person operating the CPU has no list of instructions available on the desk, but must read one instruction at a time from memory. Recall what was needed: add the contents of memory cell 0 to the contents of memory cell 1 and store the result in cell 2; if the result is greater-than-or-equal-to 40, put 1 in cell 3, otherwise put 0 in cell 3. (We are adding marks, and cell 3 contains an indicator of Pass (1) or Fail (0).) Now for the program with appropriate numerical code (so that instructions can be stored in memory). Each numerically coded instruction is given as four hexadecimal digits; the first digit gives the operation required (load, add, store, ...) - the opcode; the last three digits give the address or data - the operand. The opcodes are as follows:

Figure 5.3: Opcodes
I have to renumber the program steps from P1-P14 to P101 ... for reasons which will soon become evident. Also, we will use hexadecimal numbering.

"<A<4 (oad contents of memory A into A'. 'ode4 A AAA "<A94 Add contents of memory < to contents of A'. 'ode4 9 AA< "<AE4 Store the contents of A' in memory 9. 'ode4 < AA9 "<AD4 (oad the constant DA into the A'. 'ode4 C A9H (DAdec is 9H=e.) "<A@4 Store the contents of A' in memory D. 'ode4 < AAD "<AB4 (oad the contents of memory 9 into A'. 'ode4 A AA9 "<AC4 Subtract contents of memory D from contents of A'. 'ode4 E AAD "<AH4 #f A' is positive :ump to instruction "<Ac. 'ode4 D <Ac "<A?4 (oad the constant A into the A'. 'ode C AAA "<Aa4 Store the contents of A' in memory E. 'ode < AAE "<Ab4 Mump to "<Ae. 'ode B <Ae "<Ac4 (oad the constant < into the A'. 'ode C AA< "<Ad4 Store the contents of A' in memory E. 'ode < AAE "<Ae4 Stop. /e now have to revise ,igure B.9 to show the program ,igure B.D. The revisions are as follows4 • • • Show the additional memory (containing the program). Show a "rogram 'ounter ("') register that !eeps trac! of the address of the ne.t instruction. Show the #nstruction 5egister (#5); this tells the '"& operator what to do for the current step.

Figure 5.4: Mechanical Computer with Program


In this revised model, the CPU operator has no list of instructions on his/her desk (the CPU); he/she must go through the following cycle of steps for each instruction:

Fetch: (a) Take the number in the PC; (b) place it in the MAR; (c) shout "Bus!"; (d) add one to the number in the PC - to make it point to the next step; (e) wait until a new number arrives in the MBR; (f) take the number in the MBR and put it in the Instruction Register.
Decode: (a) Take the number in the IR; (b) take the top digit (the opcode), look it up in Figure 5.3 and see what has to be done; (c) take the number in the bottom three digits - this signifies the operand.
Execute: Perform the action required. E.g. add the contents of memory 1 to the contents of AC (2 001). The opcode is 2, the operand is 001. We have already done this: (a) write 1 on a piece of paper and place it in the MAR; (b) put a tick against Read; (c) shout "Bus!"; (d) some time later the contents of cell 1 (33) will arrive in the MBR; (e) look at what is in the AC and in the MBR, and use the calculator to add them (22 + 33); (f) write down a copy of the result and put it in the AC. Thus, in the case shown, a piece of paper with 55 on it would be put into the AC.
If the operation is a jump, then all the operator does is take the operand (the jump-to address) and place it in the PC - thus stopping the PC pointing to the next instruction in sequence.
There we have it. The famous fetch-decode-execute cycle. The CPU is a pretty busy place!

The Fetch-Decode-Execute Cycle
How does the CPU and its controller execute a sequence of instructions? Let us start by considering the execution of the instruction at location 0x100; what follows is an endless loop of the so-called fetch-decode-execute cycle.
Fetch: Read the next instruction and put it in the Instruction Register. Point to the next instruction, ready for the next Fetch.
Decode: Figure out what the instruction means.
Execute: Do what is required by that instruction; if it is a JUMP type instruction, then revise the PC so that it points to the jumped-to instruction. Go back to Fetch.

Instruction Set
We now examine the instruction set by which assembly programmers can program the machine. We will call the machine Mac-1a; Mac-1a is a restricted version of Tanenbaum's Mac-1. The main characteristics of Mac-1a are: data word length 16 bits; address size 12 bits. Exercise: What is the maximum number of words we can have in the main memory of Mac-1a? (Neglect memory mapped input-output.) How many bytes? There are two addressing modes, immediate and direct; we will neglect Tanenbaum's local and indirect for the meanwhile. It is accumulator based: that is, everything is done through the AC. Thus 'Add' is done as follows: put operand 1 in the AC, add the contents of the memory location, and the result is put in the AC; if necessary, i.e. if we want to retain the result, the contents of the AC are then copied to memory.

The Mac-1a programmer has no access to the PC or other CPU registers. Also, for present purposes, assume that the SP does not exist. A limited version of the Mac-1 instruction set is shown in Figure 6.5. The columns are as follows:
• Binary code for instruction: i.e. what the instruction looks like in computer memory - machine code.
• Mnemonic: The name given to the instruction; used when coding in assembly code.
• Long name: Descriptive name for the instruction.
• Action: What the instruction actually does, described formally in register transfer language (RTL).


Figure 6.5: Mac-1a Instruction Set (limited version of Mac-1)

Microprogram Control versus Hardware Control
Control of the CPU - fetch, decode, execute - is done by a microcontroller which obeys a program of microinstructions. We might think of the microcontroller as a black box such as that shown in Figure 6.6. The microcontroller has a set of inputs and a set of outputs - just like any other circuit: ALU, multiplexer, etc. Therefore, instead of microprogramming, it can be made from logic hardware.

Figure 6.6: Controller black box, either microcontroller or logic.

To design the circuit, all you have to do is prepare a truth table (input columns for the 4-bit op-code, and one output column per control signal) and generate the logic. There is no reason why this hardware circuit could not decode an instruction in a fraction of a clock period, i.e. a lot faster than the microcode solution. The microprogrammed solution allows arbitrarily complex instructions to be built up. It may also be more flexible: for example, there were many machines that users could microprogram themselves, and there were computers which differed only by their microcode - perhaps one optimised for execution of C programs, another for COBOL programs. On the other hand, if implemented on a chip, the control store takes up a lot of chip space. And, as you can see by examining (Tanenbaum 1990), microcode interpretation may be relatively slow - and it gets slower the more instructions there are.


Figure 6.7 shows the full Mac-1 CPU with its microcontroller unit.

Figure 6.7: Mac-1 CPU including control (from Tanenbaum, Structured Computer Organisation, 3rd ed.)

CISC versus RISC
Machines with large sets of complex (and perhaps slow) instructions, implemented with microcode, are called CISC - complex instruction set computer. Those with small sets of relatively simple instructions, probably implemented in logic, are called RISC - reduced instruction set computer. Most early machines - before about 1965 - were RISC. Then the fashion switched to CISC. Now the fashion is switching back to RISC, albeit with some special go-faster features that were not present on early RISC. CISC machines are easier to program in machine and assembly code (see next chapter) because they have a richer set of instructions. But nowadays fewer and fewer programmers use assembly code, and compilers are becoming better. It comes down to a trade-off: complexity of 'silicon' (microcode and CISC) or complexity of software (highly efficient optimising compilers and RISC).

CPU Scheduling
What is CPU scheduling? Determining which processes run when there are multiple runnable processes. Why is it important? Because it can have a big effect on resource utilization and the overall performance of the system. By the way, the world went through a long period (late 80's, early 90's) in which the most popular operating systems (DOS, Mac) had NO sophisticated CPU scheduling algorithms. They were single threaded and ran one process at a time until the user directed them to run another process. Why was this true? More recent systems (Windows NT) are back to having sophisticated CPU scheduling algorithms. What drove the change, and what will happen in the future?

Basic assumptions behind most scheduling algorithms:
• There is a pool of runnable processes contending for the CPU.
• The processes are independent and compete for resources.
• The job of the scheduler is to distribute the scarce resource of the CPU to the different processes "fairly" (according to some definition of fairness) and in a way that optimizes some performance criteria.

In general these assumptions are starting to break down. First of all, CPUs are not really that scarce - almost everybody has several, and pretty soon people will be able to afford lots. Second, many applications are starting to be structured as multiple cooperating processes. So a view of the scheduler as mediating between competing entities may be partially obsolete. How do processes behave? First, the CPU/IO burst cycle: a process will run for a while (the CPU burst), perform some IO (the IO burst), then run for a while more (the next CPU burst). How long between IO operations? It depends on the process.
• IO bound processes: processes that perform lots of IO operations. Each IO operation is followed by a short CPU burst to process the IO, then more IO happens.
• CPU bound processes: processes that perform lots of computation and do little IO. They tend to have a few long CPU bursts.

One of the things a scheduler will typically do is switch the CPU to another process when one process does IO. Why? The IO will take a long time, and we don't want to leave the CPU idle while we wait for the IO to finish. When we look at CPU burst times across the whole system, we see the exponential or hyperexponential distribution in Fig. 5.2. What are the possible process states?
• Running - the process is running on the CPU.
• Ready - ready to run, but not actually running on the CPU.
• Waiting - waiting for some event, like IO, to happen.

When do scheduling decisions take place? When does the CPU choose which process to run? There are a variety of possibilities:
• When a process switches from running to waiting. This could be because of an IO request, a wait for a child to terminate, or a wait for a synchronization operation (like a lock acquisition) to complete.
• When a process switches from running to ready - on completion of an interrupt handler, for example. A common example of an interrupt handler is the timer interrupt in interactive systems. If the scheduler switches processes in this case, it has preempted the running process. Another common case of an interrupt handler is the IO completion handler.
• When a process switches from the waiting to the ready state (on completion of IO or acquisition of a lock, for example).
• When a process terminates.

How do we evaluate a scheduling algorithm? There are many possible criteria:
• CPU Utilization: Keep CPU utilization as high as possible. (What is utilization, by the way?)
• Throughput: number of processes completed per unit time.
• Turnaround Time: mean time from submission to completion of a process.
• Waiting Time: amount of time spent ready to run but not running.
• Response Time: time between submission of a request and the first response to the request.
• Scheduler Efficiency: the scheduler doesn't perform any useful work, so any time it takes is pure overhead. We therefore need to make the scheduler very efficient.

Big difference: batch and interactive systems. In batch systems, we typically want good throughput or turnaround time. In interactive systems, both of these are still usually important (after all, we want some computation to happen), but response time is usually a primary consideration. And for some systems, throughput or turnaround time is not really relevant - some processes conceptually run forever. There is also a difference between long term and short term scheduling. A long term scheduler is given a set of processes and decides which ones should start to run. Once they start running, they may suspend because of IO or because of preemption. A short term scheduler decides which of the available jobs that the long term scheduler has decided are runnable should actually run. Let's start looking at several vanilla scheduling algorithms.

First-Come First-Served. One ready queue; the OS runs the process at the head of the queue, and new processes come in at the end of the queue. A process does not give up the CPU until it either terminates or performs IO. Consider the performance of the FCFS algorithm for three compute-bound processes: P1 (takes 24 seconds), P2 (takes 3 seconds) and P3 (takes 3 seconds). If they arrive in the order P1, P2, P3, what is the
• Waiting Time? (0 + 24 + 27) / 3 = 17

• Turnaround Time? (24 + 27 + 30) / 3 = 27.
• Throughput? 3 jobs in 30 seconds, i.e. one job every 10 seconds.

What if the processes come in the order P2, P3, P1? What is the
• Waiting Time? (0 + 3 + 6) / 3 = 3
• Turnaround Time? (3 + 6 + 30) / 3 = 13.
• Throughput? 3 jobs in 30 seconds, i.e. one job every 10 seconds.
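The two FCFS examples above can be checked with a short simulation, assuming (as in the text) that all jobs arrive at time 0 in the given order:

```python
# FCFS average waiting and turnaround time, all jobs arriving at time 0.

def fcfs_stats(bursts):
    waiting, turnaround, clock = [], [], 0
    for b in bursts:
        waiting.append(clock)       # queued since time 0, starts at `clock`
        clock += b
        turnaround.append(clock)    # completion time minus arrival time (0)
    n = len(bursts)
    return sum(waiting) / n, sum(turnaround) / n
```

`fcfs_stats([24, 3, 3])` gives (17.0, 27.0), and `fcfs_stats([3, 3, 24])` gives (3.0, 13.0) - the two orderings above.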

Shortest-Job-First (SJF) can eliminate some of the variance in waiting and turnaround time. In fact, it is optimal with respect to average waiting time. Big problem: how does the scheduler figure out how long it will take the process to run? For a long term scheduler running on a batch system, the user will give an estimate. It is usually pretty good - if it is too short, the system will cancel the job before it finishes; if too long, the system will hold off on running the process. So users give pretty good estimates of overall running time. A short-term scheduler must use the past to predict the future. The standard way: use a time-decayed, exponentially weighted average of the previous CPU bursts for each process. Let Tn be the measured time of the nth burst and sn be the predicted size of the next CPU burst. Then choose a weighting factor w, where 0 <= w <= 1, and compute s(n+1) = w Tn + (1 - w) sn. s0 is defined as some default constant or system average. w tells how to weight the past relative to the future. If we choose w = .5, the last observation has as much weight as the entire rest of the history. If we choose w = 1, only the last observation has any weight. Do a quick example.

Preemptive vs. non-preemptive SJF scheduler. A preemptive scheduler reruns the scheduling decision when a process becomes ready. If the new process has priority over the running process, the CPU preempts the running process and executes the new process. A non-preemptive scheduler only makes a scheduling decision when the running process voluntarily gives up the CPU; in effect, it allows every running process to finish its CPU burst. Consider 4 processes P1 (burst time 8), P2 (burst time 4), P3 (burst time 9), P4 (burst time 5) that arrive one time unit apart, in the order P1, P2, P3, P4. Assume that after a burst happens, the process is not re-enabled for a long time (at least 100, for example). What does a preemptive SJF scheduler do? What about a non-preemptive scheduler?

Priority Scheduling. Each process is given a priority, then the CPU executes the process with the highest priority.
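The exponential-average prediction s(n+1) = w Tn + (1 - w) sn can be sketched as follows. The initial estimate s0 = 10 and the measured burst values are invented for illustration:

```python
# Exponentially weighted prediction of the next CPU burst:
#   s(n+1) = w * T(n) + (1 - w) * s(n)

def predict_bursts(measured, w=0.5, s0=10.0):
    s = s0
    preds = [s]                    # preds[n] is the prediction for burst n
    for t in measured:
        s = w * t + (1 - w) * s    # blend the new observation into the estimate
        preds.append(s)
    return preds
```

With w = 0.5 and measured bursts [6, 4], the predictions are [10.0, 8.0, 6.0]; with w = 1, the prediction is always just the last observation.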
If multiple processes with the same priority are runnable, use some other criterion - typically FCFS. SJF is an example of a priority-based scheduling algorithm. With the exponential decay algorithm above, the priorities of a given process change over time. Assume we have 5 processes: P1 (burst time 10, priority 3), P2 (burst time 1, priority 1), P3 (burst time 2, priority 3), P4 (burst time 1, priority 4), P5 (burst time 5, priority 2), where lower numbers represent higher priorities. What would a standard priority scheduler do? The big problem with priority scheduling algorithms is starvation, or blocking, of low-priority processes. We can use aging to prevent this - make the priority of a process go up the longer it stays runnable but isn't run.
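A sketch of a non-preemptive priority scheduler for the five processes above, assuming all are ready at time 0 and that ties are broken FCFS (Python's stable sort preserves arrival order):

```python
# Non-preemptive priority scheduling: lower priority number = higher priority,
# equal priorities broken FCFS by arrival order (stable sort).

def priority_order(procs):
    """procs: list of (name, burst, priority) tuples in arrival order."""
    return [name for name, _, _ in sorted(procs, key=lambda p: p[2])]
```

For the example, `priority_order` runs P2 first (priority 1), then P5, then P1 and P3 (tied at priority 3, FCFS order), then P4.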

What about interactive systems? We cannot just let any process run on the CPU until it gives it up - we must give a response to users in a reasonable time. So we use an algorithm called round-robin scheduling. It is similar to FCFS, but with preemption. We have a time quantum or time slice. Let the first process in the queue run until it expires its quantum (i.e. runs for as long as the time quantum), then run the next process in the queue. Implementing round-robin requires timer interrupts. When we schedule a process, we set the timer to go off after the time quantum amount of time expires. If the process does IO before the timer goes off, no problem - just run the next process. But if the process expires its quantum, we do a context switch: save the state of the running process and run the next process. How well does RR work? Well, it gives good response time, but can give bad waiting time. Consider the waiting times under round robin for 3 processes, P1 (burst time 24), P2 (burst time 3) and P3 (burst time 4), with time quantum 4. What happens, and what is the average waiting time? What gives the best waiting time? What happens with a really small quantum? It looks like you've got a CPU that is 1/n as powerful as the real CPU, where n is the number of processes. The problem with a small quantum: context switch overhead. What about having a really small quantum supported in hardware? Then you have something called multithreading. Give the CPU a bunch of registers and heavily pipeline the execution. Feed the processes into the pipe one by one. Treat memory access like IO - suspend the thread until the data comes back from memory; in the meantime, execute other threads. Use computation to hide the latency of accessing memory. What about a really big quantum? It turns into FCFS. Rule of thumb: we want 80 percent of the CPU bursts to be shorter than the time quantum.

Multilevel Queue Scheduling - like RR, except we have multiple queues. Typically we classify processes into separate categories and give a queue to each category.
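The round-robin example above can be worked through with a small simulation, assuming all processes arrive at time 0 in the given order and ignoring context-switch cost:

```python
from collections import deque

# Round-robin waiting-time simulation: all processes arrive at time 0,
# context-switch overhead ignored.

def rr_waiting(bursts, quantum):
    remaining = list(bursts)
    waiting = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        for j in queue:              # everyone still queued waits while i runs
            waiting[j] += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)          # unfinished: go to the back of the queue
    return waiting
```

For the example above, `rr_waiting([24, 3, 4], 4)` gives per-process waiting times [7, 4, 7], an average of 6; the classic variant with P3 = 3 gives [6, 4, 7].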
So we might have system, interactive, and batch processes, with the priorities in that order. We could also allocate a percentage of the CPU to each queue.

Multilevel Feedback Queue Scheduling - like multilevel queue scheduling, except processes can move between queues as their priority changes. It can be used to give IO bound and interactive processes CPU priority over CPU bound processes. It can also prevent starvation by increasing the priority of processes that have been idle for a long time. A simple example of a multilevel feedback queue scheduling algorithm: have 3 queues, numbered 0, 1, 2, with corresponding priority. So, for example, we execute a task in queue 2 only when queues 0 and 1 are empty. A process goes into queue 0 when it becomes ready. When we run a process from queue 0, we give it a quantum of 8 ms. If it expires its quantum, it moves to queue 1. When we execute a process from queue 1, we give it a quantum of 16. If it expires its quantum, it moves to queue 2. In queue 2 we run a RR scheduler with a large quantum if in an interactive system, or an FCFS scheduler if in a batch system. Of course, we preempt queue 2 processes when a new process becomes ready.

Another example of a multilevel feedback queue scheduling algorithm: the Unix scheduler. We will go over a simplified version that does not include kernel priorities. The point of the algorithm is to fairly allocate the CPU between processes, with processes that have not recently used a lot of CPU resources given priority over processes that have.

Processes are given a base priority of 60, with lower numbers representing higher priorities. The system clock generates an interrupt between 50 and 100 times a second, so we will assume a value of 60 clock interrupts per second. The clock interrupt handler increments a CPU usage field in the PCB of the interrupted process every time it runs. The system always runs the highest priority process. If there is a tie, it runs the process that has been ready longest. Every second, it recalculates the priority and CPU usage field for every process according to the following formulas:
• CPU usage field = CPU usage field / 2
• Priority = CPU usage field / 2 + base priority

So when a process has not used much CPU recently, its priority rises. The priorities of IO bound processes and interactive processes therefore tend to be high, and the priorities of CPU bound processes tend to be low (which is what you want). Unix also allows users to provide a "nice" value for each process. Nice values modify the priority calculation as follows:
• Priority = CPU usage field / 2 + base priority + nice value
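One round of the recalculation above can be sketched with integer arithmetic (historical Unix kernels used integer division; the tick counts in the example are invented):

```python
# One per-second recalculation of the simplified Unix scheduler:
#   usage    = usage / 2
#   priority = usage / 2 + base + nice     (smaller number = higher priority)

def recalc(usage, base=60, nice=0):
    usage = usage // 2                     # decay the accumulated CPU usage
    priority = usage // 2 + base + nice
    return usage, priority
```

A process that accumulated 60 clock ticks gets `recalc(60)` -> (30, 75); a second later, `recalc(30)` -> (15, 67), so its priority number falls (i.e. its priority rises) as its usage decays.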

So you can reduce the priority of your process to be "nice" to other processes (which may include your own). In general, multilevel feedback queue schedulers are complex pieces of software that must be tuned to meet requirements.

Anomalies and system effects associated with schedulers. Priority interacts with synchronization to create a really nasty effect called priority inversion. A priority inversion happens when a low-priority thread acquires a lock, then a high-priority thread tries to acquire the lock and blocks. Any middle-priority threads will prevent the low-priority thread from running and unlocking the lock; in effect, the middle-priority threads block the high-priority thread. How do we prevent priority inversions? Use priority inheritance: any time a thread holds a lock that other threads are waiting on, give that thread the priority of the highest-priority thread waiting to get the lock. The problem is that priority inheritance makes the scheduling algorithm less efficient and increases the overhead.

Preemption can interact with synchronization in a multiprocessor context to create another nasty effect - the convoy effect. One thread acquires the lock, then suspends. Other threads come along and need to acquire the lock to perform their operations. Everybody suspends until the thread that holds the lock wakes up. At this point the threads are synchronized and will convoy their way through the lock, serializing the computation. This drives down processor utilization. If we have non-blocking synchronization via operations like LL/SC, we don't get convoy effects caused by suspending a thread competing for access to a resource. Why not? Because threads don't hold resources and prevent other threads from accessing them.

A similar effect arises when scheduling CPU and IO bound processes. Consider a FCFS algorithm with several IO bound processes and one CPU bound process. All of the IO bound processes execute their bursts quickly and queue up for access to the IO device.
The CPU bound process then executes for a long time. During this time, all of the IO bound processes have their IO requests satisfied and move back into the run queue. But they don't run - the CPU bound process is running instead - so the IO device idles. Finally, the CPU bound process gets off the CPU, and all of the IO bound processes run for a short time, then queue up again for the IO devices. The result is poor utilization of the IO device - it is busy for a time while it processes the IO requests, then idle while the IO bound processes wait in the run queues for their short CPU bursts. In this case, an easy solution is to give IO bound processes priority over CPU bound processes. In general, a convoy effect happens when a set of processes need to use a resource for a short time, and one process holds the resource for a long time, blocking all of the other processes. It causes poor utilization of the other resources in the system.

CPU/Process Scheduling
The assignment of physical processors to processes allows processors to accomplish work. The problem of determining when processors should be assigned, and to which processes, is called processor scheduling or CPU scheduling. When more than one process is runnable, the operating system must decide which one to run first. The part of the operating system concerned with this decision is called the scheduler, and the algorithm it uses is called the scheduling algorithm.

Goals of Scheduling (Objectives)
In this section we try to answer the following question: what does the scheduler try to achieve? Many objectives must be considered in the design of a scheduling discipline. In particular, a scheduler should consider fairness, efficiency, response time, turnaround time, throughput, etc. Some of these goals depend on the system one is using, for example a batch system, an interactive system or a real-time system, but there are also some goals that are desirable in all systems.

General Goals
Fairness: Fairness is important under all circumstances. A scheduler makes sure that each process gets its fair share of the CPU and that no process suffers indefinite postponement. Note that giving equivalent or equal time is not fair; think of safety control and payroll at a nuclear plant.
Policy Enforcement: The scheduler has to make sure that the system's policy is enforced. For example, if the local policy is safety, then the safety control processes must be able to run whenever they want to, even if it means a delay in payroll processes.
Efficiency: The scheduler should keep the system (or in particular the CPU) busy one hundred percent of the time when possible. If the CPU and all the input/output devices can be kept running all the time, more work gets done per second than if some components are idle.
Response Time: A scheduler should minimize the response time for interactive users.
Turnaround: A scheduler should minimize the time batch users must wait for output.
Throughput: A scheduler should maximize the number of jobs processed per unit time.

A little thought will show that some of these goals are contradictory. It can be shown that any scheduling algorithm that favors some class of jobs hurts another class of jobs. The amount of CPU time available is finite, after all.

Preemptive vs Nonpreemptive Scheduling

Scheduling algorithms can be divided into two categories with respect to how they deal with clock interrupts.

Nonpreemptive Scheduling
A scheduling discipline is nonpreemptive if, once a process has been given the CPU, the CPU cannot be taken away from that process. The following are some characteristics of nonpreemptive scheduling:
1. In a nonpreemptive system, short jobs are made to wait by longer jobs, but the overall treatment of all processes is fair.
2. In a nonpreemptive system, response times are more predictable, because incoming high priority jobs cannot displace waiting jobs.
3. In nonpreemptive scheduling, a scheduler executes jobs in the following two situations:
a. When a process switches from the running state to the waiting state.
b. When a process terminates.

Preemptive Scheduling
A scheduling discipline is preemptive if, once a process has been given the CPU, the CPU can be taken away. The strategy of allowing processes that are logically runnable to be temporarily suspended is called Preemptive Scheduling, and it is in contrast to the "run to completion" method.

Scheduling Algorithms
CPU scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated the CPU. The following are some scheduling algorithms we will study:
• FCFS Scheduling
• Round Robin Scheduling
• SJF Scheduling
• SRT Scheduling
• Priority Scheduling
• Multilevel Queue Scheduling
• Multilevel Feedback Queue Scheduling

First-Come-First-Served (FCFS) Scheduling
Other names for this algorithm are:
• First-In-First-Out (FIFO)

• Run-to-Completion
• Run-Until-Done

First-Come-First-Served is perhaps the simplest scheduling algorithm. Processes are dispatched according to their arrival time on the ready queue. Being a nonpreemptive discipline, once a process has the CPU, it runs to completion. FCFS scheduling is fair in the formal, or human, sense of fairness, but it is unfair in the sense that long jobs make short jobs wait, and unimportant jobs make important jobs wait. FCFS is more predictable than most other schemes. The FCFS scheme is not useful in scheduling interactive users because it cannot guarantee good response time. The code for FCFS scheduling is simple to write and understand. One of the major drawbacks of this scheme is that the average waiting time is often quite long. The First-Come-First-Served algorithm is rarely used as the master scheme in modern operating systems, but it is often embedded within other schemes.

Round Robin Scheduling
One of the oldest, simplest, fairest and most widely used algorithms is round robin (RR). In round robin scheduling, processes are dispatched in a FIFO manner but are given a limited amount of CPU time called a time-slice or a quantum. If a process does not complete before its CPU time expires, the CPU is preempted and given to the next process waiting in the queue. The preempted process is then placed at the back of the ready list. Round Robin Scheduling is preemptive (at the end of the time-slice); therefore it is effective in timesharing environments in which the system needs to guarantee reasonable response times for interactive users. The only interesting issue with the round robin scheme is the length of the quantum. Setting the quantum too short causes too many context switches and lowers CPU efficiency. On the other hand, setting the quantum too long may cause poor response time and approximates FCFS. In any event, the average waiting time under round robin scheduling is often quite long.
Shortest-Job-First (SJF) Scheduling
Another name for this algorithm is Shortest-Process-Next (SPN). Shortest-Job-First (SJF) is a non-preemptive discipline in which the waiting job (or process) with the smallest estimated run-time-to-completion is run next. In other words, when the CPU is available, it is assigned to the process that has the smallest next CPU burst. SJF scheduling is especially appropriate for batch jobs for which the run times are known in advance. Since the SJF scheduling algorithm gives the minimum average waiting time for a given set of processes, it is probably optimal. The SJF algorithm favors short jobs (or processes) at the expense of longer ones.

The obvious problem with the SJF scheme is that it requires precise knowledge of how long a job or process will run, and this information is not usually available. The best the SJF algorithm can do is to rely on user estimates of run times. In a production environment where the same jobs run regularly, it may be possible to provide reasonable estimates of run time, based on the past performance of the process. But in a development environment, users rarely know how their program will execute. Like FCFS, SJF is non-preemptive; therefore, it is not useful in a timesharing environment in which reasonable response time must be guaranteed.

Priority Scheduling
The basic idea is straightforward: each process is assigned a priority, and the runnable process with the highest priority is allowed to run. Equal-priority processes are scheduled in FCFS order. The Shortest-Job-First (SJF) algorithm is a special case of the general priority scheduling algorithm: an SJF algorithm is simply a priority algorithm where the priority is the inverse of the (predicted) next CPU burst. That is, the longer the CPU burst, the lower the priority, and vice versa. Priority can be defined either internally or externally. Internally defined priorities use some measurable quantities or qualities to compute the priority of a process.

Examples of internal priorities are:
• Time limits.
• Memory requirements.
• File requirements, for example the number of open files.
• CPU vs I/O requirements.

Externally defined priorities are set by criteria that are external to the operating system, such as:
• The importance of the process.
• The type or amount of funds being paid for computer use.
• The department sponsoring the work.
• Politics.

Priority scheduling can be either preemptive or non-preemptive:
• A preemptive priority algorithm will preempt the CPU if the priority of the newly arrived process is higher than the priority of the currently running process.
• A non-preemptive priority algorithm will simply put the new process at the head of the ready queue.


A major problem with priority scheduling is indefinite blocking, or starvation. A solution to the problem of indefinite blockage of low-priority processes is aging. Aging is a technique of gradually increasing the priority of processes that wait in the system for a long period of time.

Multilevel Queue Scheduling
A multilevel queue scheduling algorithm partitions the ready queue into several separate queues. In multilevel queue scheduling, processes are permanently assigned to one queue, based on some property of the process, such as:
• Memory size
• Process priority
• Process type

The algorithm chooses the process from the occupied queue that has the highest priority and runs that process either:
• Preemptively, or
• Non-preemptively

Each queue has its own scheduling algorithm or policy.

Possibility I
If each queue has absolute priority over lower-priority queues, then no process in a lower-priority queue can run unless the queues for all higher-priority processes are empty. For example, in the above figure, no process in the batch queue can run unless the queues for system processes, interactive processes, and interactive editing processes are all empty.

Possibility II
If there is a time slice between the queues, then each queue gets a certain amount of CPU time, which it can then schedule among the processes in its queue. For instance:
• 80% of the CPU time to the foreground queue, using RR.
• 20% of the CPU time to the background queue, using FCFS.

Since processes do not move between queues, this policy has the advantage of low scheduling overhead, but it is inflexible.

Multilevel Feedback Queue Scheduling
The multilevel feedback queue scheduling algorithm allows a process to move between queues. It uses many ready queues and associates a different priority with each queue. The algorithm chooses the process with the highest priority from the occupied queues and runs that process either preemptively or non-preemptively. If the process uses too much CPU time, it will be moved to a lower-priority queue. Similarly, a process that waits too long in a lower-priority queue may be moved to a higher-priority queue. Note that this form of aging prevents starvation.
• A process entering the ready queue is placed in queue 0.
• If it does not finish within 8 milliseconds, it is moved to the tail of queue 1.
• If it again does not complete, it is preempted and placed into queue 2.
• Processes in queue 2 run on an FCFS basis, and only when queues 0 and 1 are empty.
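The three-queue example above can be sketched as follows. The job burst times are invented, and new arrivals during execution are not modelled:

```python
from collections import deque

# Sketch of the three-level feedback queue above: quantum 8 in queue 0,
# quantum 16 in queue 1, FCFS (run to completion) in queue 2.
# A single batch of jobs; no new arrivals while running.

def mlfq_trace(bursts):
    queues = [deque(), deque(), deque()]
    for i, b in enumerate(bursts):
        queues[0].append((i, b))         # every new job starts in queue 0
    quanta = [8, 16, None]               # None = run to completion
    trace = []                           # (job, queue level, time run)
    while any(queues):
        level = next(l for l in range(3) if queues[l])   # highest non-empty
        i, rem = queues[level].popleft()
        q = quanta[level]
        run = rem if q is None else min(q, rem)
        trace.append((i, level, run))
        if rem - run > 0:                # quantum expired: demote one level
            queues[level + 1].append((i, rem - run))
    return trace
```

For bursts [30, 5]: job 0 runs 8 in queue 0, job 1 finishes its 5 in queue 0, job 0 then runs 16 in queue 1 and its final 6 in queue 2 - the demotion pattern described above.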

CHAPTER 6: INTERPROCESS COMMUNICATION
Since processes frequently need to communicate with other processes, there is a need for well-structured communication, without using interrupts, among processes.

Race Conditions
In operating systems, processes that are working together often share some common storage (main memory, a file, etc.) that each process can read and write. When two or more processes are reading or writing some shared data, and the final result depends on who runs precisely when, we have what are called race conditions. Concurrently executing threads that share data need to synchronize their operations and processing in order to avoid race conditions on shared data. Only one "customer" thread at a time should be allowed to examine and update the shared variable. Race conditions are also possible inside operating systems. If the ready queue is implemented as a linked list, and the ready queue is being manipulated during the handling of an interrupt, then interrupts must be disabled to prevent another interrupt before the first one completes. If interrupts are not disabled, the linked list could become corrupt.

Critical Section
How do we avoid race conditions?


The key to preventing trouble involving shared storage is to find some way to prohibit more than one process from reading and writing the shared data simultaneously. That part of the program where the shared memory is accessed is called the Critical Section. To avoid race conditions and flawed results, one must identify the code in the Critical Section of each thread. The characteristic properties of code that forms a Critical Section are:
• Code that references one or more variables in a "read-update-write" fashion while any of those variables is possibly being altered by another thread.
• Code that alters one or more variables that are possibly being referenced in "read-update-write" fashion by another thread.
• Code that uses a data structure while any part of it is possibly being altered by another thread.
• Code that alters any part of a data structure while it is possibly in use by another thread.

Here the important point is that when one process is executing shared modifiable data in its critical section, no other process should be allowed to execute in its own critical section.

Mutua( E?c(u"ion
Mutual exclusion is a way of making sure that if one process is using shared modifiable data, the other processes will be excluded from doing the same thing. Formally, while one process executes the shared variable, all other processes desiring to do so at the same moment should be kept waiting; when that process has finished executing the shared variable, one of the processes waiting to do so should be allowed to proceed. In this fashion, each process executing the shared data (variables) excludes all others from doing so simultaneously. This is called Mutual Exclusion.

Note that mutual exclusion needs to be enforced only when processes access shared modifiable data; when processes are performing operations that do not conflict with one another, they should be allowed to proceed concurrently.

Mutual Exclusion Conditions

If we could arrange matters such that no two processes were ever in their critical sections simultaneously, we could avoid race conditions. We need four conditions to hold to have a good solution for the critical-section problem (mutual exclusion):

• No two processes may be inside their critical sections at the same moment.
• No assumptions are made about relative speeds of processes or the number of CPUs.
• No process outside its critical section should block other processes.
• No process should have to wait arbitrarily long to enter its critical section.
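As an illustrative sketch (not part of the original pseudocode), Python's threading.Lock can enforce mutual exclusion on a shared counter, turning the read-update-write into an indivisible critical section:

```python
# Mutual exclusion with a lock: each thread's read-update-write on the
# shared counter is wrapped in a critical section, so no update can be
# lost regardless of how the scheduler interleaves the threads.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:          # enter critical section (mutual exclusion)
            counter += 1    # read-update-write, now safe from interleaving
        # leaving the 'with' block exits the critical section

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # always 40000 with the lock in place
```

Without the lock, some increments could be lost, exactly as in the race condition described earlier.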

Propo"a(" 8or Achie;in Mutua( E?c(u"ion
The mutual exclusion problem is to devise a pre-protocol (or entry protocol) and a post-protocol (or exit protocol) that keep two or more threads from being in their critical sections at the same time. Tanenbaum examines several proposals for the critical-section, or mutual exclusion, problem.

Problem: When one process is updating shared modifiable data in its critical section, no other process should be allowed to enter its critical section.

Proposal 1 - Disabling Interrupts (Hardware Solution)

Each process disables all interrupts just after entering its critical section and re-enables all interrupts just before leaving it. With interrupts turned off, the CPU cannot be switched to another process; hence no other process will enter its critical section, and mutual exclusion is achieved. Disabling interrupts is sometimes a useful technique within the kernel of an operating system, but it is not appropriate as a general mutual exclusion mechanism for user processes. The reason is that it is unwise to give user processes the power to turn off interrupts.

Propo"a( $ - Loc0 @aria:(e *So8t4are So(ution+ #n this solution we consider a single shared (loc!) variable initially A. /hen a process wants to enter in its critical section it first test the loc!. #f loc! is A the process first sets it to < and then enters the critical section. #f the loc! is already < the process :ust waits until (loc!) variable becomes A. Thus a A means that no process in its critical section and < means hold your horses - some process is in its critical section. The flaw in this proposal can be best e.plained by e.ample. Suppose process A sees that the loc! is A. $efore it can set the loc! to < another process $ is scheduled runs and sets the loc! to <. /hen the process A runs again it will also set the loc! to < and two processes will be in their critical section simultaneously. Propo"a( % - Strict A(teration #n this proposed solution the integer variable ;turn; !eeps trac! of whose turn is to enter the critical section. #nitially process A inspect turn finds it to be A and enters in its critical section. "rocess $ also finds it to be A and sits in a loop continually testing ;turn; to see when it becomes <.'ontinuously testing a variable waiting for some value to appear is called the $usy-/aiting. Ta!ing turns is not a good idea when one of the processes is much slower than the other. Suppose process A finishes its critical section %uic!ly so both processes are now in their noncritical section. This situation violates above mentioned condition E. )"in S!"te#" ca((" F"(eepF and F4a0eupF 89

Basically, what the solutions described above do is this: when a process wants to enter its critical section, it checks to see whether entry is allowed. If it is not, the process goes into a tight loop and waits (i.e., starts busy waiting) until it is allowed to enter. This approach wastes CPU time. Now look at an interprocess communication primitive: the pair sleep-wakeup.

• Sleep: a system call that causes the caller to block, that is, be suspended until some other process wakes it up.
• Wakeup: a system call that wakes up a process.
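The tight-loop busy waiting that sleep and wakeup are designed to avoid can be sketched with two Python threads taking strict turns, as in Proposal 3 above. The names process, turn and order are illustrative:

```python
# Strict alternation: the shared variable 'turn' says whose turn it is.
# Each thread busy-waits (spins) until it is its turn, runs its critical
# section, then hands the turn to the other thread.
import threading

turn = 0
order = []     # records which thread entered the critical section, in order
ROUNDS = 5

def process(my_id, other_id):
    global turn
    for _ in range(ROUNDS):
        while turn != my_id:   # busy waiting: burns CPU until it is our turn
            pass
        order.append(my_id)    # critical section
        turn = other_id        # post-protocol: let the other process go next

t0 = threading.Thread(target=process, args=(0, 1))
t1 = threading.Thread(target=process, args=(1, 0))
t0.start(); t1.start()
t0.join(); t1.join()

print(order)  # strictly alternates: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

The spinning loop is exactly the CPU waste the text describes: a waiting thread keeps running, doing nothing useful, where sleep/wakeup would suspend it instead.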

Both the "sleep" and "wakeup" system calls have one parameter that represents a memory address, used to match up sleeps and wakeups.

The Bounded Buffer Producers and Consumers

The bounded-buffer producer-consumer problem assumes that there is a fixed buffer size, i.e., a finite number of slots is available.

Statement: Suspend the producers when the buffer is full, suspend the consumers when the buffer is empty, and make sure that only one process at a time manipulates the buffer, so there are no race conditions or lost updates.

As an example of how the sleep and wakeup system calls are used, consider the producer-consumer problem, also known as the bounded-buffer problem. Two processes share a common fixed-size (bounded) buffer. The producer puts information into the buffer and the consumer takes information out. Trouble arises when:

1. The producer wants to put new data in the buffer, but the buffer is already full. Solution: the producer goes to sleep, to be awakened when the consumer has removed data.
2. The consumer wants to remove data from the buffer, but the buffer is already empty. Solution: the consumer goes to sleep until the producer puts some data in the buffer and wakes the consumer up.

This approach also leads to the same race conditions we have seen in earlier approaches, because access to "count" is unconstrained. The essence of the problem is that a wakeup call sent to a process that is not sleeping is lost.

Se#aphore"
A semaphore is a protected variable whose value can be accessed and altered only by the operations P and V and an initialization operation ("semaphore-initialize"). Binary semaphores can assume only the value 0 or the value 1; counting semaphores (also called general semaphores) can assume any nonnegative value.

The P (or wait or sleep or down) operation on semaphore S, written as P(S) or wait(S), operates as follows:

P(S):   IF   S > 0
        THEN S := S - 1
        ELSE (wait on S)

The V (or signal or wakeup or up) operation on semaphore S, written as V(S) or signal(S), operates as follows:

V(S):   IF   (one or more processes are waiting on S)
        THEN (let one of these processes proceed)
        ELSE S := S + 1
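As a sketch, the P and V operations above can be built on a condition variable. The class name CountingSemaphore and its fields are illustrative, not a standard API:

```python
# A counting semaphore built from a condition variable. P blocks while
# the count is zero; V either lets a waiter proceed or increments the
# count, matching the pseudocode above.
import threading

class CountingSemaphore:
    def __init__(self, initial=0):
        self._count = initial
        self._cond = threading.Condition()

    def P(self):                      # wait / down / sleep
        with self._cond:
            while self._count == 0:   # no units available: wait on S
                self._cond.wait()
            self._count -= 1          # S := S - 1

    def V(self):                      # signal / up / wakeup
        with self._cond:
            self._count += 1          # S := S + 1
            self._cond.notify()       # let one waiting process proceed

s = CountingSemaphore(initial=2)
s.P(); s.P()     # both succeed immediately; count is now 0
s.V()            # count back to 1
s.P()            # succeeds again without blocking
print("count:", s._count)   # 0
```

Note that the lock inside the condition variable is what makes P and V indivisible, as the next paragraph requires.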

Operations " and G are done as single indivisible atomic action. #t is guaranteed that once a semaphore operations has stared no other process can access the semaphore until operation has completed. 0utual e.clusion on the semaphore S is enforced within "(S) and G(S). #f several processes attempt a "(S) simultaneously only process will be allowed to proceed. The other processes will be !ept waiting but the implementation of " and G guarantees that processes will not suffer indefinite postponement. Semaphores solve the lost-wa!eup problem. Producer-Con"u#er Pro:(e# )"in Se#aphore" The Solution to producer-consumer problem uses three semaphores namely full empty and mute.. The semaphore ;full; is used for counting the number of slots in the buffer that are full. The ;empty; for counting the number of slots that are empty and semaphore ;mute.; to ma!e sure that the producer and consumer do not access modifiable shared section of the buffer simultaneously. #nitiali3ation • • • Set full buffer slots to A. i.e. semaphore ,ull W A. Set empty buffer slots to 6. i.e. semaphore empty W 6. ,or control access to critical section set mute. to <. i.e. semaphore mute. W <.

"roducer ( ) /=#(2 (true) produce-#tem ( ); " (empty); " (mute.); enter-#tem ( ) G (mute.) G (full); 91

Consumer()
    WHILE (true)
        P(full);
        P(mutex);
        remove-Item();
        V(mutex);
        V(empty);
        consume-Item(Item);
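The same structure can be sketched with Python's built-in threading.Semaphore; the buffer size and item count below are arbitrary choices for illustration:

```python
# Bounded-buffer producer-consumer using semaphores, mirroring the
# pseudocode above: 'empty' counts free slots, 'full' counts used slots,
# and 'mutex' protects the buffer itself.
import threading
from collections import deque

N = 4                               # buffer size
buffer = deque()
empty = threading.Semaphore(N)      # initially N empty slots
full = threading.Semaphore(0)       # initially 0 full slots
mutex = threading.Lock()            # mutual exclusion on the buffer
ITEMS = 20
consumed = []

def producer():
    for item in range(ITEMS):
        empty.acquire()             # P(empty): wait for a free slot
        with mutex:                 # P(mutex) ... V(mutex)
            buffer.append(item)     # enter-Item
        full.release()              # V(full): one more full slot

def consumer():
    for _ in range(ITEMS):
        full.acquire()              # P(full): wait for an item
        with mutex:
            consumed.append(buffer.popleft())   # remove-Item
        empty.release()             # V(empty): one more empty slot

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()

print(consumed == list(range(ITEMS)))  # True: every item consumed, in order
```

Because the semaphores count slots, neither thread ever busy-waits, and no wakeup can be lost.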

CHAPTER 7 DEADLOCK
Definition
"Crises and deadlocks, when they occur, have at least this advantage, that they force us to think." - Jawaharlal Nehru (1889-1964), Indian political leader

A set of processes is in a deadlock state if each process in the set is waiting for an event that can be caused only by another process in the set. In other words, each member of the set of deadlocked processes is waiting for a resource that can be released only by a deadlocked process. None of the processes can run, none of them can release any resources, and none of them can be awakened. It is important to note that the number of processes and the number and kind of resources possessed and requested are unimportant.

The resources may be either physical or logical. Examples of physical resources are printers, tape drives, memory space and CPU cycles. Examples of logical resources are files, semaphores and monitors.

The simplest example of deadlock is where process 1 has been allocated non-shareable resource A, say a tape drive, and process 2 has been allocated non-shareable resource B, say a printer. Now, if it turns out that process 1 needs resource B (the printer) to proceed and process 2 needs

resource A (the tape drive) to proceed, and these are the only two processes in the system, then each is blocked by the other and all useful work in the system stops. This situation is termed deadlock. The system is in a deadlock state because each process holds a resource being requested by the other process, and neither process is willing to release the resource it holds.

Preemptable and Nonpreemptable Resources

Resources come in two flavors: preemptable and nonpreemptable. A preemptable resource is one that can be taken away from a process with no ill effects. Memory is an example of a preemptable resource. On the other hand, a nonpreemptable resource is one that cannot be taken away from a process without causing ill effect. For example, CD resources are not preemptable at an arbitrary moment. Reallocating resources can resolve deadlocks that involve preemptable resources. Deadlocks that involve nonpreemptable resources are difficult to deal with.

Deadlock Conditions
Nece""ar! and Su88icient /ead(oc0 Condition" 'offman (<?C<) identified four (D) conditions that must hold simultaneously for there to be a deadloc!. Mutua( E?c(u"ion Condition: The resources involved are non-shareable. 2.planation4 At least one resource (thread) must be held in a non-shareable mode that is only one process at a time claims e.clusive control of the resource. #f another process re%uests that resource the re%uesting process must be delayed until the resource has been released. 5o(d and 6ait Condition: 5e%uesting process hold already resources while waiting for re%uested resources. 2.planation4 There must e.ist a process that is holding a resource already allocated to it while waiting for additional resource that are currently being held by other processes. No-Pree#pti;e Condition: 5esources already allocated to a process cannot be preempted. 2.planation4 5esources cannot be removed from the processes are used to completion or released voluntarily by the process holding it. Circu(ar 6ait Condition: The processes in the system form a circular list or chain where each process in the list is waiting for a resource held by the ne.t process in the list. As an e.ample consider the traffic deadloc! in the following figure


Consider each section of the street as a resource.

• The mutual exclusion condition applies, since only one vehicle can be on a section of the street at a time.
• The hold-and-wait condition applies, since each vehicle is occupying a section of the street and waiting to move on to the next section of the street.
• The no-preemption condition applies, since a section of the street that is occupied by a vehicle cannot be taken away from it.
• The circular wait condition applies, since each vehicle is waiting on the next vehicle to move. That is, each vehicle in the traffic is waiting for a section of street held by the next vehicle in the traffic.

The simple rule to avoid traffic deadlock is that a vehicle should only enter an intersection if it is assured that it will not have to stop inside the intersection.

It is not possible to have a deadlock involving only one single process. A deadlock involves a circular "hold-and-wait" condition between two or more processes, so "one" process cannot hold a resource yet be waiting for another resource that it itself holds. In addition, deadlock is not possible between two threads in a process, because it is the process that holds resources, not the thread; that is, each thread has access to the resources held by the process.

Dealing with the Deadlock Problem
In general, there are four strategies for dealing with the deadlock problem:

• The Ostrich Approach: just ignore the deadlock problem altogether.
• Deadlock Detection and Recovery: detect deadlock and, when it occurs, take steps to recover.
• Deadlock Avoidance: avoid deadlock by careful resource scheduling.
• Deadlock Prevention: prevent deadlock by resource scheduling so as to negate at least one of the four conditions.

Deadlock Prevention

Havender, in his pioneering work, showed that since all four of the conditions are necessary for deadlock to occur, it follows that deadlock might be prevented by denying any one of the conditions.

Elimination of the "Mutual Exclusion" Condition: The mutual exclusion condition must hold for non-shareable resources. That is, several processes cannot simultaneously share a single resource. This condition is difficult to eliminate because some resources, such as the tape drive and printer, are inherently non-shareable. Note that shareable resources, like a read-only file, do not require mutually exclusive access and thus cannot be involved in deadlock.

Elimination of the "Hold and Wait" Condition: There are two possibilities for eliminating the second condition. The first alternative is that a process be granted all of the resources it needs at once, prior to execution. The second alternative is to disallow a process from requesting resources whenever it has previously allocated resources. The first strategy requires that all of the resources a process will need be requested at once. The system must grant resources on an "all or none" basis. If the complete set of resources needed by a process is not currently available, then the process must wait until the complete set is available. While the process waits, however, it may not hold any resources. Thus the "wait for" condition is denied and deadlocks simply cannot occur. This strategy can lead to serious waste of resources. For example, a program requiring ten tape drives must request and receive all ten drives before it begins executing. If the program needs only one tape drive to begin execution, and does not need the remaining tape drives for several hours, then substantial computer resources (nine tape drives) will sit idle for several hours. This strategy can also cause indefinite postponement (starvation), since not all of the required resources may become available at once.
Elimination of the "No-Preemption" Condition: The no-preemption condition can be alleviated by forcing a process waiting for a resource that cannot immediately be allocated to relinquish all of its currently held resources, so that other processes may use them to finish. Suppose a system does allow processes to hold resources while requesting additional resources. Consider what happens when a request cannot be satisfied: a process holds resources that a second process may need in order to proceed, while the second process may hold the resources needed by the first process. This is a deadlock. The strategy requires that when a process holding some resources is denied a request for additional resources, it must release its held resources and, if necessary, request them again together with the additional resources. Implementation of this strategy denies the "no-preemption" condition effectively. It carries a high cost: when a process releases resources, it may lose all its work to that point. One serious consequence of this strategy is the possibility of indefinite postponement (starvation), as a process might be held off indefinitely while it repeatedly requests and releases the same resources.

Elimination of the "Circular Wait" Condition: The last condition, the circular wait, can be denied by imposing a total ordering on all of the resource types and then forcing all processes to request resources in order (increasing or decreasing). This strategy imposes a total ordering of all resource types and requires that each process request resources in numerical order (increasing or decreasing) of enumeration. With this rule, the resource allocation graph can never have a cycle.
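The resource-ordering strategy can be sketched with two locks standing in for numbered resources. Because both threads acquire them in the same global order, no circular wait can form; the lock and thread names here are illustrative:

```python
# Eliminating circular wait by lock ordering: every thread acquires the
# locks in a fixed global order (lock_a before lock_b). With opposite
# orders in the two threads, this program could deadlock instead.
import threading

lock_a = threading.Lock()   # resource number 1
lock_b = threading.Lock()   # resource number 2
log = []

def worker(name):
    # Both threads request resources in increasing numerical order.
    with lock_a:
        with lock_b:
            log.append(name)    # critical section using both resources

t1 = threading.Thread(target=worker, args=("t1",))
t2 = threading.Thread(target=worker, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()

print(sorted(log))  # ['t1', 't2'] -- both threads finished; no deadlock
```

If one worker instead took lock_b first, each thread could end up holding one lock while waiting for the other, which is exactly the circular wait the ordering rule forbids.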

For example, provide a global numbering of all the resources, as shown:

1 - Card reader
2 - Printer
3 - Plotter
4 - Tape drive
5 - Card punch

Now the rule is this: processes can request resources whenever they want to, but all requests must be made in numerical order. A process may request first a printer and then a tape drive (order: 2, 4), but it may not request first a plotter and then a printer (order: 3, 2). The problem with this strategy is that it may be impossible to find an ordering that satisfies everyone.

Deadlock Avoidance

This approach to the deadlock problem anticipates deadlock before it actually occurs. It employs an algorithm to assess the possibility that deadlock could occur, and acts accordingly. This method differs from deadlock prevention, which guarantees that deadlock cannot occur by denying one of the necessary conditions of deadlock. If the necessary conditions for a deadlock are in place, it is still possible to avoid deadlock by being careful when resources are allocated. Perhaps the most famous deadlock avoidance algorithm, due to Dijkstra [1965], is the Banker's algorithm, so named because the process is analogous to that used by a banker in deciding whether a loan can be safely made.

Banker's Algorithm

In this analogy:

Customers   →   processes
Units       →   resources (tape drives, say)
Banker      →   Operating System

Customers   Used   Max
A           0      6
B           0      5
C           0      4
D           0      7

Available Units = 10

Fig. 1

In the above figure we see four customers, each of whom has been granted a number of credit units. The banker reserved only 10 units, rather than 22, to service them. At a certain moment the situation becomes:

Customers   Used   Max
A           1      6
B           1      5
C           2      4
D           4      7

Available Units = 2

Fig. 2

Safe State: The key to a state being safe is that there is at least one way for all customers to finish. In this analogy, the state of figure 2 is safe because with 2 units left the banker can delay any request except C's, thus letting C finish and release all four of its resources. With four units in hand, the banker can let either D or B have the necessary units, and so on.

Unsafe State: Consider what would happen if a request from B for one more unit were granted in figure 2 above. We would have the following situation:

Customers   Used   Max
A           1      6
B           2      5
C           2      4
D           4      7

Available Units = 1

Fig. 3

This is an unsafe state. If all the customers, namely A, B, C and D, asked for their maximum loans, the banker could not satisfy any of them, and we would have a deadlock. It is important to note that an unsafe state does not imply the existence, or even the eventual existence, of a deadlock. What an unsafe state does imply is simply that some unfortunate sequence of events might lead to a deadlock.
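The safety test at the heart of the Banker's algorithm can be sketched directly from the figures above; the function name is_safe is illustrative:

```python
# Safety check for the Banker's algorithm, run on the states shown in
# Fig. 2 and Fig. 3. A state is safe if some order exists in which every
# customer can be granted its maximum and then finish, releasing its units.

def is_safe(used, maximum, available):
    finished = set()
    while len(finished) < len(used):
        # find a customer whose remaining need fits in the available units
        candidate = next((c for c in used
                          if c not in finished
                          and maximum[c] - used[c] <= available), None)
        if candidate is None:
            return False              # nobody can finish: unsafe
        available += used[candidate]  # candidate finishes, releasing its units
        finished.add(candidate)
    return True

maximum = {"A": 6, "B": 5, "C": 4, "D": 7}
state2 = {"A": 1, "B": 1, "C": 2, "D": 4}    # Fig. 2: 2 units available
state3 = {"A": 1, "B": 2, "C": 2, "D": 4}    # Fig. 3: 1 unit available

print(is_safe(state2, maximum, available=2))   # True: C can finish first
print(is_safe(state3, maximum, available=1))   # False: no customer can finish
```

In the Fig. 2 state, C needs only 2 more units, so it can always complete and return 4 units to the pool; in the Fig. 3 state, every customer needs more than the single unit on hand.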

The Banker's algorithm is thus to consider each request as it occurs and see whether granting it leads to a safe state. If it does, the request is granted; otherwise, it is postponed until later. Habermann [1969] has shown that executing the algorithm has complexity proportional to N², where N is the number of processes, and since the algorithm is executed each time a resource request occurs, the overhead is significant.

Deadlock Detection

Deadlock detection is the process of actually determining that a deadlock exists and identifying the processes and resources involved in the deadlock. The basic idea is to check allocation against resource availability for all possible allocation sequences to determine whether the system is in a deadlocked state. Of course, the deadlock detection algorithm is only half of this strategy. Once a deadlock is detected, there needs to be a way to recover. Several alternatives exist:

• Temporarily preempt resources from deadlocked processes.
• Back off a process to some checkpoint, allowing preemption of a needed resource and restarting the process at the checkpoint later.
• Successively kill processes until the system is deadlock-free.

These methods are expensive in the sense that each iteration calls the detection algorithm until the system proves to be deadlock-free. The complexity of the algorithm is O(N²), where N is the number of processes. Another potential problem is starvation: the same process may be killed repeatedly.
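Deadlock detection on a wait-for graph reduces to cycle detection; here is a minimal sketch, with illustrative names:

```python
# Deadlock detection on a wait-for graph: an edge p -> q means process p
# is waiting for a resource held by process q. A deadlock exists exactly
# when this graph contains a cycle.

def find_deadlock(wait_for):
    visited, on_path = set(), set()

    def has_cycle(p):
        if p in on_path:
            return True       # reached a process already on this path: cycle
        if p in visited:
            return False      # already explored from here, no cycle found
        visited.add(p)
        on_path.add(p)
        for q in wait_for.get(p, []):
            if has_cycle(q):
                return True
        on_path.discard(p)
        return False

    return any(has_cycle(p) for p in wait_for)

# P1 waits for P2 and P2 waits for P1: the two-process deadlock from earlier.
print(find_deadlock({"P1": ["P2"], "P2": ["P1"]}))   # True
# P1 waits for P2, but P2 waits for nothing: no cycle, no deadlock.
print(find_deadlock({"P1": ["P2"], "P2": []}))       # False
```

The depth-first search visits each process and edge once, so a single detection pass is linear in the size of the graph; the O(N²) cost quoted above comes from rerunning detection over allocation sequences.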

CHAPTER 8 MEMORY MANAGEMENT


About Memory
A Macintosh computer's available RAM is used by the Operating System, applications, and other software components such as device drivers and system extensions. This section describes both the general organization of memory by the Operating System and the organization of the memory partition allocated to your application when it is launched. This section also provides a preliminary description of three related memory topics:

• temporary memory
• virtual memory
• 24- and 32-bit addressing

For more complete information on these three topics, you need to read the remaining chapters in this book.

Organization of Memory by the Operating System

When the Macintosh Operating System starts up, it divides the available RAM into two broad sections. It reserves for itself a zone or partition of memory known as the system partition. The system partition always begins at the lowest addressable byte of memory (memory address 0) and extends upward. The system partition contains a system heap and a set of global variables, described in the next two sections.

All memory outside the system partition is available for allocation to applications or other software components. In system software version 7.0 and later (or when MultiFinder is running in system software versions 5.0 and 6.0), the user can have multiple applications open at once. When an application is launched, the Operating System assigns it a section of memory known as its application partition. In general, an application uses only the memory contained in its own application partition.

Figure 1-1 illustrates the organization of memory when several applications are open at the same time. The system partition occupies the lowest position in memory. Application partitions occupy part of the remaining space. Note that application partitions are loaded into the top part of memory first.


In Figure 1-1, three applications are open, each with its own application partition. The application labeled Application 1 is the active application.

The System Heap

The main part of the system partition is an area of memory known as the system heap. In general, the system heap is reserved for exclusive use by the Operating System and other system software components, which load into it various items such as system resources, system code segments, and system data structures. All system buffers and queues, for example, are allocated in the system heap. The system heap is also used for code and other resources that do not belong to specific applications, such as code resources that add features to the Operating System or that provide control of special-purpose peripheral equipment. System patches and system extensions (stored as code resources of type 'INIT') are loaded into the system heap during the system startup process. Hardware device drivers (stored as code resources of type 'DRVR') are loaded into the system heap when the driver is opened.

Most applications don't need to load anything into the system heap. In certain cases, however, you might need to load resources or code segments into the system heap. For example, if you want a vertical retrace task to continue to execute even when your application is in the background, you

need to load the task and any data associated with it into the system heap. Otherwise, the Vertical Retrace Manager ignores the task when your application is in the background.

The System Global Variables

The lowest part of memory is occupied by a collection of global variables called system global variables (or low-memory system global variables). The Operating System uses these variables to maintain different kinds of information about the operating environment. For example, the Ticks global variable contains the number of ticks (sixtieths of a second) that have elapsed since the system was most recently started up. Similar variables contain, for example, the height of the menu bar (MBarHeight) and pointers to the heads of various operating-system queues (DTQueue, FSQHdr, VBLQueue, and so forth). Most low-memory global variables are of this variety: they contain information that is generally useful only to the Operating System or other system software components.

Other low-memory global variables contain information about the current application. For example, the ApplZone global variable contains the address of the first byte of the active application's partition. The ApplLimit global variable contains the address of the last byte the active application's heap can expand to include. The CurrentA5 global variable contains the address of the boundary between the active application's global variables and its application parameters. Because these global variables contain information about the active application, the Operating System changes the values of these variables whenever a context switch occurs.

In general, it is best to avoid reading or writing low-memory system global variables. Most of these variables are undocumented, and the results of changing their values can be unpredictable. Usually, when the value of a low-memory global variable is likely to be useful to applications, the system software provides a routine that you can use to read or write that value.
For example, you can get the current value of the Ticks global variable by calling the TickCount function. In rare instances there is no routine that reads or writes the value of a documented global variable. In those cases, you might need to read or write that value directly. See the chapter "Memory Manager" in this book for instructions on reading and writing the values of low-memory global variables from a high-level language.

Organization of Memory in an Application Partition

When your application is launched, the Operating System allocates for it a partition of memory called its application partition. That partition contains required segments of the application's code, as well as other data associated with the application. Figure 1-2 illustrates the general organization of an application partition.

Figure 1-2. Organization of an application partition


Your application partition is divided into three major parts:

• the application stack
• the application heap
• the application global variables and A5 world

The heap is located at the low-memory end of your application partition and always expands (when necessary) toward high memory. The A5 world is located at the high-memory end of your application partition and is of fixed size. The stack begins at the low-memory end of the A5 world and expands downward, toward the top of the heap.

As you can see in Figure 1-2, there is usually an unused area of memory between the stack and the heap. This unused area provides space for the stack to grow without encroaching upon the space assigned to the application heap. In some cases, however, the stack might grow into space reserved for the application heap. If this happens, it is very likely that data in the heap will become corrupted.

The ApplLimit global variable marks the upper limit to which your heap can grow. If you call the MaxApplZone procedure at the beginning of your program, the heap immediately extends all the way up to this limit. If you were to use all of the heap's free space, the Memory Manager would not allow you to allocate additional blocks above ApplLimit. If you do not call MaxApplZone, the heap grows toward ApplLimit whenever the Memory Manager finds that there is not enough memory in the heap to fill a request. However, once the heap grows up to ApplLimit, it can grow no further. Thus, whether you maximize your application heap or not, you can use only the space between the bottom of the heap and ApplLimit.

Unlike the heap, the stack is not bounded by ApplLimit. If your application uses heavily nested procedures with many local variables, or uses extensive recursion, the stack could grow downward

beyond ApplLimit. Because you do not use Memory Manager routines to allocate memory on the stack, the Memory Manager cannot stop your stack from growing beyond ApplLimit and possibly encroaching upon space reserved for the heap. However, a vertical retrace task checks approximately 60 times each second to see whether the stack has moved into the heap. If it has, the task, known as the "stack sniffer," generates a system error. This system error alerts you that you have allowed the stack to grow too far, so that you can make adjustments. See "Changing the Size of the Stack" on page 1-39 for instructions on how to change the size of your application stack.

The Application Stack

The stack is an area of memory in your application partition that can grow or shrink at one end while the other end remains fixed. This means that space on the stack is always allocated and released in LIFO (last-in, first-out) order. The last item allocated is always the first to be released. It also means that the allocated area of the stack is always contiguous. Space is released only at the top of the stack, never in the middle, so there can never be any unallocated "holes" in the stack.

By convention, the stack grows from high memory toward low memory addresses. The end of the stack that grows or shrinks is usually referred to as the "top" of the stack, even though it's actually at the lower end of the memory occupied by the stack.

Because of its LIFO nature, the stack is especially useful for memory allocation connected with the execution of functions or procedures. When your application calls a routine, space is automatically allocated on the stack for a stack frame. A stack frame contains the routine's parameters, local variables, and return address. Figure 1-3 illustrates how the stack expands and shrinks during a function call. The leftmost diagram shows the stack just before the function is called. The middle diagram shows the stack expanded to hold the stack frame.
Once the function is executed, the local variables and function parameters are popped off the stack. If the function is a Pascal function, all that remains is the previous stack with the function result on top.

Figure 1-3. The application stack
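The LIFO discipline described above can be sketched by modeling the stack as a list of frames. The routine names and frame layout here are illustrative, not the Macintosh frame format:

```python
# A toy model of stack-frame allocation: calling a routine pushes a frame
# (parameters, locals, return info); returning pops it. Space is only
# ever released from the top, never from the middle.

stack = []                              # grows and shrinks at one end only

def call(routine, params):
    frame = {"routine": routine, "params": params, "locals": {}}
    stack.append(frame)                 # allocate the frame on the stack
    return frame

def ret():
    return stack.pop()                  # last allocated, first released

call("main", ())
call("draw_window", (42,))
call("draw_title", ("hello",))

top = ret()                             # draw_title's frame comes off first
print(top["routine"], len(stack))       # draw_title 2
```

Because allocation and release happen only at one end, the allocated region stays contiguous, which is exactly why the stack never develops "holes."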


The Application Heap

An application heap is the area of memory in your application partition in which space is dynamically allocated and released on demand. The heap begins at the low-memory end of your application partition and extends upward in memory. The heap contains virtually all items that are not allocated on the stack. For instance, your application heap contains the application's code segments and the resources that are currently loaded into memory. The heap also contains other dynamically allocated items such as window records, dialog records, document data, and so forth.

You allocate space within your application's heap by making calls to the Memory Manager, either directly (for instance, using the NewHandle function) or indirectly (for instance, using a routine such as NewWindow, which calls Memory Manager routines). Space in the heap is allocated in blocks, which can be of any size needed for a particular object. The Memory Manager does all the necessary housekeeping to keep track of blocks in the heap as they are allocated and released. Because these operations can occur in any order, the heap doesn't usually grow and shrink in an orderly way, as the stack does. Instead, after your application has been running for a while, the heap can tend to become fragmented into a patchwork of allocated and free blocks, as shown in Figure 1-4. This fragmentation is known as heap fragmentation.

Figure 1-4. A fragmented heap

One result of heap fragmentation is that the Memory Manager might not be able to satisfy your application's request to allocate a block of a particular size. Even though there is enough free space available, the space is broken up into blocks smaller than the requested size. When this happens, the Memory Manager tries to create the needed space by moving allocated blocks together, thus

collecting the free space in a single larger block. This operation is known as heap compaction. Figure 1-5 shows the results of compacting the fragmented heap shown in Figure 1-4.

Figure 1-5 A compacted heap

Heap fragmentation is generally not a problem as long as the blocks of memory you allocate are free to move during heap compaction. There are, however, two situations in which a block is not free to move: when it is a nonrelocatable block, and when it is a locked relocatable block. To minimize heap fragmentation, you should use nonrelocatable blocks sparingly, and you should lock relocatable blocks only when absolutely necessary.

The Application Global Variables and A5 World

Your application's global variables are stored in an area of memory near the top of your application partition known as the application A5 world. The A5 world contains four kinds of data:

• application global variables
• application QuickDraw global variables
• application parameters
• the application's jump table

Each of these items is of fixed size, although the sizes of the global variables and of the jump table may vary from application to application. Figure 1-6 shows the standard organization of the A5 world.

Figure 1-6 Organization of an application's A5 world

The system global variable CurrentA5 points to the boundary between the current application's global variables and its application parameters. For this reason, the application's global variables are found as negative offsets from the value of CurrentA5. This boundary is important because the Operating System uses it to access the following information from your application: its global variables, its QuickDraw global variables, the application parameters, and the jump table. This information is known collectively as the A5 world because the Operating System uses the microprocessor's A5 register to point to that boundary.

Your application's QuickDraw global variables contain information about its drawing environment. For example, among these variables is a pointer to the current graphics port.

Your application's jump table contains an entry for each of your application's routines that is called by code in another segment. The Segment Manager uses the jump table to determine the address of any externally referenced routines called by a code segment.

The application parameters are 32 bytes of memory located above the application global variables; they're reserved for use by the Operating System. The first long word of those parameters is a pointer to your application's QuickDraw global variables.

Temporary Memory

In the Macintosh multitasking environment, each application is limited to a particular memory partition (whose size is determined by information in the 'SIZE' resource of that application). The size of your application's partition places certain limits on the size of your application heap and hence on the sizes of the buffers and other data structures that your application uses.

In general, you specify an application partition size that is large enough to hold all the buffers, resources, and other data that your application is likely to need during its execution. If for some reason you need more memory than is currently available in your application heap, you can ask
the Operating System to let you use any available memory that is not yet allocated to any other application. This memory, known as temporary memory, is allocated from the available

unused RAM; usually, that memory is not contiguous with the memory in your application's zone. Figure 1-7 shows an application using some temporary memory.

Figure 1-7 Using temporary memory allocated from unused RAM

In Figure 1-7, Application 1 has almost exhausted its application heap. As a result, it has requested and received a large block of temporary memory, extending from the top of Application 2's partition to the top of the allocatable space. Application 1 can use the temporary memory in whatever manner it desires.

Your application should use temporary memory only for occasional short-term purposes that could be accomplished in less space, though perhaps less efficiently. For example, if you want to copy a large file, you might try to allocate a fairly large buffer of temporary memory. If you receive the temporary memory, you can copy data from the source file into the destination file using the large buffer. If, however, the request for temporary memory fails, you can instead use a smaller buffer within your application heap. Although using the smaller buffer might prolong the copying operation, the file is nonetheless copied.

One good reason for using temporary memory only occasionally is that you cannot assume that you will always receive the temporary memory you request. For example, in Figure 1-7, all the available memory is allocated to the two open applications; any further requests by either one for some temporary memory would fail. For complete details on using temporary memory, see the chapter "Memory Manager" in this book.

Virtual Memory

In system software version 7.0 and later, suitably equipped Macintosh computers can take advantage of a feature of the Operating System known as virtual memory, by which the machines have a logical address space that extends beyond the limits of the available physical memory. Because of virtual memory, a user can load more programs and data into the logical address space than would fit in the computer's physical RAM.

The Operating System extends the address space by using part of the available secondary storage (that is, part of a hard disk) to hold portions of applications and data that are not currently needed in RAM. When some of those portions of memory are needed, the Operating System swaps out unneeded parts of applications or data to the secondary storage, thereby making room for the parts that are needed.

It is important to realize that virtual memory operates transparently to most applications. Unless your application has time-critical needs that might be adversely affected by the operation of virtual memory, or installs routines that execute at interrupt time, you do not need to know whether virtual memory is operating. For complete details on virtual memory, see the chapter "Virtual Memory Manager" later in this book.

Addressing Modes

On suitably equipped Macintosh computers, the Operating System supports 32-bit addressing, that is, the ability to use 32 bits to determine memory addresses. Earlier versions of system software use 24-bit addressing, where the upper 8 bits of memory addresses are ignored or used as flag bits. In a 24-bit addressing scheme, the logical address space has a size of 16 MB. Because 8 MB of this total are reserved for I/O space, ROM, and slot space, the largest contiguous program address space is 8 MB. When 32-bit addressing is in operation, the maximum program address space is 1 GB.
The ability to operate with 32-bit addressing is available only on certain Macintosh models, namely those with systems that contain a 32-bit Memory Manager. (For compatibility reasons, these systems also contain a 24-bit Memory Manager.) In order for your application to work when the machine is using 32-bit addressing, it must be 32-bit clean, that is, able to run in an environment where all 32 bits of a memory address are significant. Fortunately, writing applications that are 32-bit clean is relatively easy if you follow the guidelines in Inside Macintosh. In general, applications are not 32-bit clean because they manipulate flag bits in master pointers directly (for instance, to mark the associated memory blocks as locked or purgeable) instead of using Memory Manager routines to achieve the desired result.

Heap Management
Applications allocate and manipulate memory primarily in their application heap. As you have seen, space in the application heap is allocated and released on demand. When the blocks in your heap are free to move, the Memory Manager can often reorganize the heap to free space when necessary to fulfill a memory-allocation request. In some cases, however, blocks in your heap cannot move. In these cases, you need to pay close attention to memory allocation and management to avoid fragmenting your heap and running out of memory.

This section provides a general description of how to manage blocks of memory in your application heap. It describes

• relocatable and nonrelocatable blocks
• properties of relocatable blocks
• heap purging and compaction
• heap fragmentation
• dangling pointers
• low-memory conditions

Relocatable and Nonrelocatable Blocks

You can use the Memory Manager to allocate two different types of blocks in your heap: nonrelocatable blocks and relocatable blocks. A nonrelocatable block is a block of memory whose location in the heap is fixed. In contrast, a relocatable block is a block of memory that can be moved within the heap (perhaps during heap compaction). The Memory Manager sometimes moves relocatable blocks during memory operations so that it can use the space in the heap optimally.

The Memory Manager provides data types that reference both relocatable and nonrelocatable blocks. It also provides routines that allow you to allocate and release blocks of both types. To reference a nonrelocatable block, you can use a pointer variable, defined by the Ptr data type.

TYPE
  SignedByte = -128..127;
  Ptr        = ^SignedByte;

A pointer is simply the address of an arbitrary byte in memory, and a pointer to a nonrelocatable block of memory is simply the address of the first byte in the block, as illustrated in Figure 1-8. After you allocate a nonrelocatable block, you can make copies of the pointer variable. Because a pointer is the address of a block of memory that cannot be moved, all copies of the pointer correctly reference the block as long as you don't dispose of it.

Figure 1-8 A pointer to a nonrelocatable block


The pointer variable itself occupies 4 bytes of space in your application partition. Often the pointer variable is a global variable and is therefore contained in your application's A5 world. But the pointer can also be allocated on the stack or in the heap itself.

To reference relocatable blocks, the Memory Manager uses a scheme known as double indirection. The Memory Manager keeps track of a relocatable block internally with a master pointer, which itself is part of a nonrelocatable master pointer block in your application heap and can never move. When the Memory Manager moves a relocatable block, it updates the master pointer so that it always contains the address of the relocatable block. You reference the block with a handle, defined by the Handle data type.

TYPE
  Handle = ^Ptr;

A handle contains the address of a master pointer. The left side of Figure 1-9 shows a handle to a relocatable block of memory located in the middle of the application heap. If necessary (perhaps to make room for another block of memory), the Memory Manager can move that block down in the heap, as shown in the right side of Figure 1-9.

Figure 1-9 A handle to a relocatable block


Master pointers for relocatable objects in your heap are always allocated in your application heap. Because the blocks of master pointers are nonrelocatable, it is best to allocate them as low in your heap as possible. You can do this by calling the MoreMasters procedure when your application starts up.

Whenever possible, you should allocate memory in relocatable blocks. This gives the Memory Manager the greatest freedom when rearranging the blocks in your application heap to create a new block of free memory. In some cases, however, you may be forced to allocate a nonrelocatable block of memory. When you call the Window Manager function NewWindow, for example, the Window Manager internally calls the NewPtr function to allocate a new nonrelocatable block in your application partition. You need to exercise care when calling Toolbox routines that allocate such blocks, lest your application heap become overly fragmented.

Using relocatable blocks makes the Memory Manager more efficient at managing available space, but it does carry some overhead. As you have seen, the Memory Manager must allocate extra memory to hold master pointers for relocatable blocks. It groups these master pointers into nonrelocatable blocks. For large relocatable blocks, this extra space is negligible, but if you allocate many very small relocatable blocks, the cost can be considerable. For this reason, you should avoid allocating a very large number of handles to small blocks; instead, allocate a single large block and use it as an array to hold the data you need.

Properties of Relocatable Blocks

As you have seen, a heap block can be either relocatable or nonrelocatable. The designation of a block as relocatable or nonrelocatable is a permanent property of that block. If relocatable, a block can be either locked or unlocked; if it's unlocked, a block can be either purgeable or

unpurgeable. These attributes of relocatable blocks can be set and changed as necessary. The following sections explain how to lock and unlock blocks, and how to mark them as purgeable or unpurgeable.

Locking and Unlocking Relocatable Blocks

Occasionally, you might need a relocatable block of memory to stay in one place. To prevent a block from moving, you can lock it, using the HLock procedure. Once you have locked a block, it won't move. Later, you can unlock it, using the HUnlock procedure, allowing it to move again.

In general, you need to lock a relocatable block only if there is some danger that it might be moved during the time that you read or write the data in that block. This might happen, for instance, if you dereference a handle to obtain a pointer to the data and (for increased speed) use the pointer within a loop that calls routines that might cause memory to be moved. If, within the loop, the block whose data you are accessing is in fact moved, then the pointer no longer points to that data; this pointer is said to dangle.

Using locked relocatable blocks can, however, slow the Memory Manager down as much as using nonrelocatable blocks. The Memory Manager can't move locked blocks. In addition, except when you allocate memory and resize relocatable blocks, it can't move relocatable blocks around locked relocatable blocks (just as it can't move them around nonrelocatable blocks). Thus, locking a block in the middle of the heap for long periods of time can increase heap fragmentation.

Locking and unlocking blocks every time you want to prevent a block from moving can become troublesome. Fortunately, the Memory Manager moves unlocked relocatable blocks only at well-defined, predictable times. In general, each routine description in Inside Macintosh indicates whether the routine could move or purge memory. If you do not call any of those routines in a section of code, you can rely on all blocks to remain stationary while that code executes.
Note that the Segment Manager might move memory if you call a routine located in a segment that is not currently resident in memory.

Purging and Reallocating Relocatable Blocks

One advantage of relocatable blocks is that you can use them to store information that you would like to keep in memory to make your application more efficient, but that you don't really need if available memory space becomes low. For example, your application might, at the beginning of its execution, load user preferences from a preferences file into a relocatable block. As long as the block remains in memory, your application can access information from the preferences file without actually reopening the file. However, reopening the file probably wouldn't take enough time to justify keeping the block in memory if memory space were scarce.

By making a relocatable block purgeable, you allow the Memory Manager to free the space it occupies if necessary. If you later want to prohibit the Memory Manager from freeing the space occupied by a relocatable block, you can make the block unpurgeable. You can use the HPurge and HNoPurge procedures to change back and forth between these two states. A block you create by calling NewHandle is initially unpurgeable.

Once you make a relocatable block purgeable, you should subsequently check handles to that block before using them if you call any of the routines that could move or purge memory. If a handle's master pointer is set to NIL, then the Operating System has purged its block. To use the information formerly in the block, you must reallocate space for it (perhaps by calling the ReallocateHandle procedure) and then reconstruct its contents (for example, by rereading the preferences file). Figure 1-10 illustrates the purging and reallocating of a relocatable block. When the block is purged, its master pointer is set to NIL. When it is reallocated, the handle correctly references a new block, but that block's contents are initially undefined.

Figure 1-10 Purging and reallocating a relocatable block

Me#or! Re"er;ation The 0emory 0anager does its best to prevent situations in which nonrelocatable bloc!s in the middle of the heap trap relocatable bloc!s. /hen it allocates new nonrelocatable bloc!s it attempts to reserve memory for them as low in the heap as possible. The 0emory 0anager reserves memory for a nonrelocatable bloc! by moving unloc!ed relocatable bloc!s upward until it has created a space large enough for the new bloc!. /hen the 0emory 0anager can successfully pac! all nonrelocatable bloc!s into the bottom of the heap no nonrelocatable bloc! can trap a relocatable bloc! and it has successfully prevented heap fragmentation. ,igure <-<< illustrates how the 0emory 0anager allocates nonrelocatable bloc!s. Although it could place a bloc! of the re%uested si3e at the top of the heap it instead reserves space for the bloc! as close to the bottom of the heap as possible and then puts the bloc! into that reserved space. 1uring this process the 0emory 0anager might even move a relocatable bloc! over a nonrelocatable bloc! to ma!e room for another nonrelocatable bloc!. ,igure <-<<. Allocating a nonrelocatable bloc!


When allocating a new relocatable block, you can, if you want, manually reserve space for the block by calling the ReserveMem procedure. If you do not, the Memory Manager looks for space big enough for the block as low in the heap as possible, but it does not create space near the bottom of the heap for the block if there is already enough space higher in the heap.

Heap Purging and Compaction

When your application attempts to allocate memory (for example, by calling either the NewPtr or NewHandle function), the Memory Manager might need to compact or purge the heap to free memory and to fuse many small free blocks into fewer large free blocks. The Memory Manager first tries to obtain the requested amount of space by compacting the heap; if compaction fails to free the required amount of space, the Memory Manager then purges the heap.

When compacting the heap, the Memory Manager moves unlocked relocatable blocks down until they reach nonrelocatable blocks or locked relocatable blocks. You can compact the heap manually by calling either the CompactMem function or the MaxMem function.

In a purge of the heap, the Memory Manager sequentially purges unlocked, purgeable relocatable blocks until it has freed enough memory or until it has purged all such blocks. It purges a block by deallocating it and setting its master pointer to NIL. If you want, you can manually purge a few blocks or an entire heap in anticipation of a memory shortage. To purge an individual block manually, call the EmptyHandle procedure. To purge your entire heap manually, call the PurgeMem procedure or the MaxMem function.

Heap Fragmentation

Heap fragmentation can slow your application by forcing the Memory Manager to compact or purge your heap to satisfy a memory-allocation request.
In the worst cases, when your heap is severely fragmented by locked or nonrelocatable blocks, it might be impossible for the Memory Manager to find the requested amount of contiguous free space, even though that much space is actually free in your heap. This can have disastrous consequences for your application. For example,

if the Memory Manager cannot find enough room to load a required code segment, your application will crash. Obviously, it is best to minimize the amount of fragmentation that occurs in your application heap.

It might be tempting to think that, because the Memory Manager controls the movement of blocks in the heap, there is little that you can do to prevent heap fragmentation. In reality, however, fragmentation does not strike your application's heap by chance. Once you understand the major causes of heap fragmentation, you can follow a few simple rules to minimize it.

The primary causes of heap fragmentation are indiscriminate use of nonrelocatable blocks and indiscriminate locking of relocatable blocks. Each of these creates immovable blocks in your heap, thus creating "roadblocks" for the Memory Manager when it rearranges the heap to maximize the amount of contiguous free space. You can significantly reduce heap fragmentation simply by exercising care when you allocate nonrelocatable blocks and when you lock relocatable blocks.

Throughout this section, you should keep in mind the following rule: the Memory Manager can move a relocatable block around a nonrelocatable block (or a locked relocatable block) at these times only:

• When the Memory Manager reserves memory for a nonrelocatable block (or when you manually reserve memory before allocating a block), it can move unlocked relocatable blocks upward over nonrelocatable blocks to make room for the new block as low in the heap as possible.
• When you attempt to resize a relocatable block, the Memory Manager can move that block around other blocks if necessary.



In contrast, the Memory Manager cannot move relocatable blocks over nonrelocatable blocks during compaction of the heap.

Deallocating Nonrelocatable Blocks

One of the most common causes of heap fragmentation is also one of the most difficult to avoid. The problem occurs when you dispose of a nonrelocatable block in the middle of the pile of nonrelocatable blocks at the bottom of the heap. Unless you immediately allocate another nonrelocatable block of the same size, you create a gap where the nonrelocatable block used to be. If you later allocate a slightly smaller nonrelocatable block, that gap shrinks. However, small gaps are inefficient because of the small likelihood that future memory allocations will create blocks small enough to occupy the gaps.

It would not matter if the first block you allocated after deleting the nonrelocatable block were relocatable. The Memory Manager would place the block in the gap if possible. If you were later to allocate a nonrelocatable block as large as or smaller than the gap, the new block would take the place of the relocatable block, which would join other relocatable blocks in the middle of the heap, as desired. However, the new nonrelocatable block might be smaller than the original nonrelocatable block, leaving a small gap.

Whenever you dispose of a nonrelocatable block that you have allocated, you create small gaps, unless the next nonrelocatable block you allocate happens to be the same size as the disposed block. These small gaps can lead to heavy fragmentation over the course of your application's execution. Thus, you should try to avoid disposing of and then reallocating nonrelocatable blocks during program execution.

Re"er;in Me#or! Another cause of heap fragmentation ironically occurs because of a limitation of memory reservation a process designed to prevent it. 0emory reservation never ma!es fragmentation worse than it would be if there were no memory reservation. Ordinarily memory reservation ensures that allocating nonrelocatable bloc!s in the middle of your application;s e.ecution causes no problems. Occasionally however memory reservation can cause fragmentation either when it succeeds but leaves small gaps in the reserved space or when it fails and causes a nonrelocatable bloc! to be allocated in the middle of the heap. The 0emory 0anager uses memory reservation to create space for nonrelocatable bloc!s as low as possible in the heap. (8ou can also manually reserve memory for relocatable bloc!s but you rarely need to do so.) =owever when the 0emory 0anager moves a bloc! up during memory reservation that bloc! cannot overlap its previous location. As a result the 0emory 0anager might need to move the relocatable bloc! up more than is necessary to contain the new nonrelocatable bloc! thereby creating a gap between the top of the new bloc! and the bottom of the relocated bloc!. 0emory reservation can also fragment the heap if there is not enough space in the heap to move the relocatable bloc! up. #n this case the 0emory 0anager allocates the new nonrelocatable bloc! above the relocatable bloc!. The relocatable bloc! cannot then move over the nonrelocatable bloc! e.cept during the times described previously. Loc0in Re(ocata:(e B(oc0" (oc!ed relocatable bloc!s present a special problem. /hen relocatable bloc!s are loc!ed they can cause as much heap fragmentation as nonrelocatable bloc!s. One solution is to reserve memory for all relocatable bloc!s that might at some point need to be loc!ed and to leave them loc!ed for as long as they are allocated. 
This solution has drawbacks, however, because then the blocks would lose any flexibility that being relocatable otherwise gives them. Deleting a locked relocatable block can create a gap, just as deleting a nonrelocatable block can.

An alternative, partial solution is to move relocatable blocks to the top of the heap before locking them. The MoveHHi procedure allows you to move a relocatable block upward until it reaches the top of the heap, a nonrelocatable block, or a locked relocatable block. This has the effect of partitioning the heap into four areas, as illustrated in Figure 1-12. At the bottom of the heap are the nonrelocatable blocks. Above those blocks are the unlocked relocatable blocks. At the top of the heap are the locked relocatable blocks. Between the locked relocatable blocks and the unlocked relocatable blocks is an area of free space. The principal idea behind moving relocatable blocks to the top of the heap and locking them there is to keep the contiguous free space as large as possible.

Figure 1-12 An effectively partitioned heap


Using MoveHHi is, however, not always a perfect solution to handling relocatable blocks that need to be locked. The MoveHHi procedure moves a block upward only until it reaches either a nonrelocatable block or a locked relocatable block. Unlike NewPtr and ReserveMem, MoveHHi does not currently move a relocatable block around one that is not relocatable.

Even if MoveHHi succeeds in moving a block to the top area of the heap, unlocking or deleting locked blocks can cause fragmentation if you don't unlock or delete those blocks beginning with the lowest locked block. A relocatable block that is locked at the top area of the heap for a long period of time could trap other relocatable blocks that were locked for short periods of time but then unlocked.

This suggests that you need to treat relocatable blocks locked for a long period of time differently from those locked for a short period of time. If you plan to lock a relocatable block for a long period of time, you should reserve memory for it at the bottom of the heap before allocating it, then lock it for the duration of your application's execution (or for as long as the block remains allocated). Do not reserve memory for relocatable blocks you plan to allocate for only short periods of time. Instead, move them to the top of the heap (by calling MoveHHi) and then lock them.

In practice, you apply the same rules to relocatable blocks that you reserve space for and leave permanently locked as you apply to nonrelocatable blocks: try not to allocate such blocks in the middle of your application's execution, and don't dispose of and reallocate such blocks in the middle of your application's execution.

After you lock relocatable blocks temporarily, you don't need to move them manually back into the middle area when you unlock them. Whenever the Memory Manager compacts the heap or moves another relocatable block to the top heap area, it brings all unlocked relocatable blocks at the bottom of that partition back into the middle area.
When moving a block to the top area, be sure to call MoveHHi on the block and then lock the block, in that order.

Allocating Nonrelocatable Blocks

As you have seen, there are two reasons for not allocating nonrelocatable blocks during the middle of your application's execution. First, if you also dispose of nonrelocatable blocks in the middle of your application's execution, then allocation of new nonrelocatable blocks is likely to create small gaps, as discussed earlier. Second, even if you never dispose of nonrelocatable blocks until your application terminates, memory reservation is an imperfect process, and the Memory Manager could occasionally place new nonrelocatable blocks above relocatable blocks.

There is, however, an exception to the rule that you should not allocate nonrelocatable blocks in the middle of your application's execution. Sometimes you need to allocate a nonrelocatable block only temporarily. If, between the times that you allocate and dispose of a nonrelocatable block, you allocate no additional nonrelocatable blocks and do not attempt to compact the heap, then you have done no harm. The temporary block cannot create a new gap because the Memory Manager places no other block over the temporary block.

Summary of Preventing Fragmentation

Avoiding heap fragmentation is not difficult. It simply requires that you follow a few rules as closely as possible. Remember that allocation of even a small nonrelocatable block in the middle of your heap can ruin a scheme to prevent fragmentation of the heap, because the Memory Manager does not move relocatable blocks around nonrelocatable blocks when you call MoveHHi or when it attempts to compact the heap. If you adhere to the following rules, you are likely to avoid significant heap fragmentation:

• At the beginning of your application's execution, call the MaxApplZone procedure once and the MoreMasters procedure enough times so that the Memory Manager never needs to call it for you.
• Try to anticipate the maximum number of nonrelocatable blocks you will need and allocate them at the beginning of your application's execution.
• Avoid disposing of and then reallocating nonrelocatable blocks during your application's execution.
• When allocating relocatable blocks that you need to lock for long periods of time, use the ReserveMem procedure to reserve memory for them as close to the bottom of the heap as possible, and lock the blocks immediately after allocating them.
• If you plan to lock a relocatable block for a short period of time and allocate nonrelocatable blocks while it is locked, use the MoveHHi procedure to move the block to the top of the heap and then lock it. When the block no longer needs to be locked, unlock it.
• Remember that you need to lock a relocatable block only if you call a routine that could move or purge memory and you then use a dereferenced handle to the relocatable block, or if you want to use a dereferenced handle to the relocatable block at interrupt time.






"erhaps the most difficult restriction is to avoid disposing of and then reallocating nonrelocatable bloc!s in the middle of your application;s e.ecution. Some Toolbo. routines re%uire you to use nonrelocatable bloc!s and it is not always easy to anticipate how many such bloc!s you will need. #f you must allocate and dispose of bloc!s in the middle of your program;s e.ecution you might want to place used bloc!s into a lin!ed list of free bloc!s instead of disposing of them. #f you !now how many nonrelocatable bloc!s of a certain si3e your application is li!ely to need you can add that many to the beginning of the list at the beginning of your application;s e.ecution. #f you need a nonrelocatable bloc! later you can chec! the lin!ed list for a bloc! of the e.act si3e instead of simply calling the 6ew"tr function. /an (in Pointer" Accessing a relocatable bloc! by double indirection through its handle instead of through its master pointer re%uires an e.tra memory reference. ,or efficiency you might sometimes want to 118

dereference the handle--that is ma!e a copy of the bloc!;s master pointer--and then use that pointer to access the bloc! by single indirection. /hen you do this however you need to be particularly careful. Any operation that allocates space from the heap might cause the relocatable bloc! to be moved or purged. #n that event the bloc!;s master pointer is correctly updated but your copy of the master pointer is not. As a result your copy of the master pointer is a dangling pointer. 1angling pointers are li!ely to ma!e your application crash or produce garbled output. &nfortunately it is often easy during debugging to overloo! situations that could leave pointers dangling because pointers dangle only if the relocatable bloc!s that they reference actually move. 5outines that can move or purge memory do not necessarily do so unless memory space is tight. Thus if you improperly dereference a handle in a section of code that code might still wor! properly most of the time. #f however a dangling pointer does cause errors they can be very difficult to trace. This section describes a number of situations that can cause dangling pointers and suggests some ways to avoid them. Co#pi(er /ere8erencin Some of the most difficult dangling pointers to isolate are not caused by any e.plicit dereferencing on your part but by implicit dereferencing on the part of the compiler. ,or e.ample suppose you use a handle called my=andle to access the fields of a record in a relocatable bloc!. 8ou might use "ascal;s /#T= statement to do so as follows4 /#T= my=andlecc 1O $2)#6 ... 261; A compiler is li!ely to dereference my=andle so that it can access the fields of the record without double indirection. =owever if the code between the $2)#6 and 261 statements causes the 0emory 0anager to move or purge memory you are li!ely to end up with a dangling pointer. The easiest way to prevent dangling pointers is simply to loc! the relocatable bloc! whose data you want to read or write. $ecause the bloc! 
is loc!ed and cannot move the master pointer is guaranteed always to point to the beginning of the bloc!;s data. (isting <-< illustrates one way to avoid dangling pointers by loc!ing a relocatable bloc!. (isting <-<. (oc!ing a bloc! to avoid dangling pointers 6>< orig5tate: 5ignedA7te; (original attributes of handle'

orig5tate :! EGet5tate(Eandle(m7Hata));(get handle attributes' IoveEEi(Eandle(m7Hata)); E3ock(Eandle(m7Hata)); 9DTE m7HataJJ HK (move the handle high' (lock the handle' (fill in )indo) data' 119

A=GDN edit<ec :! T=Ne)(gHest<ect, g6ie)<ect); v5croll :! GetNe)@ontrol(r65croll, m79indo)); h5croll :! GetNe)@ontrol(rE5croll, m79indo)); file<efNum :! "; )indo)Hirt7 :! F>35=; =NH; E5et5tate (orig5tate); (reset handle attributes'

The handle myData needs to be locked before the WITH statement because the functions TENew and GetNewControl allocate memory and hence might move the block whose handle is myData.

You should be careful to lock blocks only when necessary, because locked relocatable blocks can increase heap fragmentation and slow down your application unnecessarily. You should lock a handle only if you dereference it, directly or indirectly, and then use a copy of the original master pointer after calling a routine that could move or purge memory. When you no longer need to reference the block with the master pointer, you should unlock the handle. In Listing 1-1, the handle myData is never explicitly unlocked. Instead, the original attributes of the handle are saved by calling HGetState and later are restored by calling HSetState. This strategy is preferable to just calling HLock and HUnlock.

A compiler can generate hidden dereferencing, and hence potential dangling pointers, in other ways as well, for instance, by assigning the result of a function that might move or purge blocks to a field in a record referenced by a handle. Such problems are particularly common in code that manipulates linked data structures. For example, you might use this code to allocate a new element of a linked list:

myHandle^^.nextHandle := NewHandle(SizeOf(myLinkedElement));

This can cause problems because your compiler could dereference myHandle before calling NewHandle. Therefore, you should either lock myHandle before performing the allocation or use a temporary variable to allocate the new handle, as in the following code:

tempHandle := NewHandle(SizeOf(myLinkedElement));
myHandle^^.nextHandle := tempHandle;

Passing fields of records as arguments to routines that might move or purge memory can cause similar problems if the records are in relocatable blocks referred to with handles. Problems arise only when you pass a field by reference rather than by value.
"ascal conventions call for all arguments larger than D bytes to be passed by reference. #n "ascal a variable is also passed by reference when the routine called re%uests a variable parameter. $oth of the following lines of code could leave a pointer dangling4 T2&pdate(hT2cc.view5ect hT2); #nval5ect(the'ontrolcc.contrl5ect); 120

These problems occur because a compiler may dereference a handle before calling the routine to which you pass the handle. Then that routine may move memory before it uses the dereferenced handle, which might by then be invalid. As before, you can solve these problems by locking the handles or by using temporary variables.

Loading Code Segments

If you call an application-defined routine located in a code segment that is not currently in RAM, the Segment Manager might need to move memory when loading that code segment, thus jeopardizing any dereferenced handles you might be using. For example, suppose you call an application-defined procedure ManipulateData, which manipulates some data at an address passed to it in a variable parameter.

PROCEDURE MyRoutine;
BEGIN
   ...
   ManipulateData(myHandle^);
   ...
END;

You can create a dangling pointer if ManipulateData and MyRoutine are in different segments and the segment containing ManipulateData is not loaded when MyRoutine is executed. The danger arises because you have passed a dereferenced copy of myHandle as an argument to ManipulateData. If the Segment Manager must allocate a new relocatable block for the segment containing ManipulateData, it might move myHandle to do so. If so, the dereferenced handle would dangle.

A similar problem can occur if you assign the result of a function in a nonresident code segment to a field in a record referred to by a handle. You need to be careful even when passing a field in a record referenced by a handle to a routine in the same code segment as the caller, or when assigning the result of a function in the same code segment to such a field. If that routine could call a Toolbox routine that might move or purge memory, or call a routine in a different nonresident code segment, then you could indirectly cause a pointer to dangle.

Callback Routines

Code segmentation can also lead to a different type of dangling-pointer problem when you use callback routines. The problem rarely arises, but it is difficult to debug.
Some Toolbox routines require that you pass a pointer to a procedure in a variable of type ProcPtr. Ordinarily, it does not matter whether the procedure you pass in such a variable is in the same code segment as the routine that calls it or in a different code segment. For example, suppose you call TrackControl as follows:

myPart := TrackControl(myControl, myEvent.where, @MyCallBack);

If MyCallBack were in the same code segment as this line of code, then a compiler would pass to TrackControl the absolute address of the MyCallBack procedure. If it were in a different code segment, then the compiler would take the address from the jump table entry for MyCallBack. Either way, TrackControl should call MyCallBack correctly.

Occasionally, you might use a variable of type ProcPtr to hold the address of a callback procedure and then pass that address to a routine. Here is an example:

my"roc 4W d0y'all$ac!; ... my"art 4W Trac!'ontrol(my'ontrol my2vent.where my"roc); As long as these lines of code are in the same code segment and the segment is not unloaded between the e.ecution of those lines the preceding code should wor! perfectly. Suppose however that my"roc is a global variable and the first line of the code is in a different segment from the call to Trac!'ontrol. Suppose further that the 0y'all$ac! procedure is in the same segment as the first line of the code (which is in a different segment from the call to Trac!'ontrol). Then the compiler might place the absolute address of the 0y'all$ac! routine into the variable my"roc. The compiler cannot reali3e that you plan to use the variable in a different code segment from the one that holds both the routine you are referencing and the routine you are using to initiali3e the my"roc variable. $ecause 0y'all$ac! and the call to Trac!'ontrol are in different code segments the Trac!'ontrol procedure re%uires that you pass an address in the :ump table not an absolute address. Thus in this hypothetical situation my"roc would reference 0y'all$ac! incorrectly. To avoid this problem ma!e sure to place in the same segment any code in which you assign a value to a variable of type "roc"tr and any code in which you use that variable. #f you must put them in different code segments then be sure that you place the callbac! routine in a code segment different from the one that initiali3es the variable. In;a(id 5and(e" An invalid handle refers to the wrong area of memory :ust as a dangling pointer does. There are three types of invalid handles4 empty handles disposed handles and fa!e handles. 8ou must avoid empty disposed or fa!e handles as carefully as dangling pointers. ,ortunately it is generally easier to detect and thus to avoid invalid handles. /i"po"ed 5and(e": A disposed handle is a handle whose associated relocatable bloc! has been disposed of. /hen you dispose of a relocatable bloc! 
(perhaps by calling the procedure 1ispose=andle) the 0emory 0anager does not change the value of any handle variables that previously referenced that bloc!. #nstead those variables still hold the address of what once was the relocatable bloc!;s master pointer. $ecause the bloc! has been disposed of however the contents of the master pointer are no longer defined. (The master pointer might belong to a subse%uently allocated relocatable bloc! or it could become part of a lin!ed list of unused master pointers maintained by the 0emory 0anager.) #f you accidentally use a handle to a bloc! you have already disposed of you can obtain une.pected results. #n the best cases your application will crash. #n the worst cases you will get garbled data. #t might however be difficult to trace the cause of the garbled data because your application can continue to run for %uite a while before the problem begins to manifest itself. 8ou can avoid these problems %uite easily by assigning the value 6#( to the handle variable after you dispose of its associated bloc!. $y doing so you indicate that the handle does not point anywhere in particular. #f you subse%uently attempt to operate on such a bloc! the 0emory 0anager will probably generate a nil=andle2rr result code. #f you want to ma!e certain that a handle is not disposed of before operating on a relocatable bloc! you can test whether the value of the handle is 122

ND3, as follo)s: DF m7Eandle C+ ND3 TE=N 222; (handle is valid, so )e can operate on it here'

E#pt! 5and(e": An empty handle is a handle whose master pointer has the value 6#(. /hen the 0emory 0anager purges a relocatable bloc! for e.ample it sets the bloc!;s master pointer to 6#(. The space occupied by the master pointer itself remains allocated and handles to the purged bloc! continue to point to the master pointer. This is useful because if you later reallocate space for the bloc! by calling 5eallocate=andle the master pointer will be updated and all e.isting handles will correctly access the reallocated bloc!. Once again however inadvertently using an empty handle can give une.pected results or lead to a system crash. #n the 0acintosh Operating System 6#( technically refers to memory location A. $ut this memory location holds a value. #f you doubly dereference an empty handle you reference whatever data is found at that location and you could obtain une.pected results that are difficult to trace. 8ou can chec! for empty handles much as you chec! for disposed handles. Assuming you set handles to 6#( when you dispose of them you can use the following code to determine whether a handle both points to a valid master pointer and references a nonempty relocatable bloc!4 DF m7Eandle C+ ND3 TE=N DF m7EandleJ C+ ND3 TE=N 222 ()e can operate on the relocatable block here'

Note that because Pascal evaluates expressions completely, you need two IF-THEN statements rather than one compound statement, in case the value of the handle itself is NIL. Most compilers, however, allow you to use "short-circuit" Boolean operators to minimize the evaluation of expressions. For example, if your compiler uses the operator & as a short-circuit operator for AND, you could rewrite the preceding code like this:

IF (myHandle <> NIL) & (myHandle^ <> NIL) THEN
   ...   {we can operate on the relocatable block here}

In this case, the second expression is evaluated only if the first expression evaluates to TRUE.

It is useful during debugging to set memory location 0 to an odd number, such as $50FFC001. This causes the Operating System to crash immediately if you attempt to dereference an empty handle, which lets you fix problems at once that might otherwise require extensive debugging.

Fake Handles: A fake handle is a handle that was not created by the Memory Manager. Normally, you create handles by directly or indirectly calling the Memory Manager function NewHandle (or one of its variants, such as NewHandleClear). You create a fake handle (usually inadvertently) by directly assigning a value to a variable of type Handle, as illustrated in Listing 1-2.

Listing 1-2. Creating a fake handle

FUNCTION MakeFakeHandle: Handle;    {DON'T USE THIS FUNCTION!}
   CONST
      kMemoryLoc = $100;               {a random memory location}
   VAR
      myHandle:   Handle;
      myPointer:  Ptr;
   BEGIN
      myPointer := Ptr(kMemoryLoc);    {the address of some memory}
      myHandle := @myPointer;          {the address of a pointer}
      MakeFakeHandle := myHandle;
   END;

Remember that a real handle contains the address of a master pointer. The fake handle manufactured by the function MakeFakeHandle in Listing 1-2 contains an address that may or may not be the address of a master pointer. If it isn't the address of a master pointer, then you virtually guarantee chaotic results if you pass the fake handle to a system software routine that expects a real handle.

For example, suppose you pass a fake handle to the MoveHHi procedure. After allocating a new relocatable block high in the heap, MoveHHi is likely to copy the data from the original block to the new block by dereferencing the handle and using what is supposedly a master pointer. Because the value of a fake handle probably isn't the address of a master pointer, however, MoveHHi copies invalid data. (Actually, it's unlikely that MoveHHi would ever get that far; it would probably run into problems when attempting to determine the size of the original block from the block header.)

Not all fake handles are as easy to spot as those created by the MakeFakeHandle function defined in Listing 1-2. You might, for instance, attempt to copy the data in an existing record (myRecord) into a new handle, as follows:

myHandle := NewHandle(SizeOf(myRecord));    {create a new handle}
myHandle^ := @myRecord;                     {DON'T DO THIS!}

The second line of code does not make myHandle a handle to the beginning of the myRecord record. Instead, it overwrites the master pointer with the address of that record, making myHandle a fake handle.

A correct way to create a new handle to some existing data is to make a copy of the data using the PtrToHand function, as follows:

myErr := PtrToHand(@myRecord, myHandle, SizeOf(myRecord));

The Memory Manager provides a set of pointer- and handle-manipulation routines that can help you avoid creating fake handles.

Low-Memory Conditions


It is particularly important to make sure that the amount of free space in your application heap never gets too low. For example, you should never deplete the available heap memory to the point that it becomes impossible to load required code segments. As you have seen, your application will crash if the Segment Manager is called to load a required code segment and there is not enough contiguous free memory to allocate a block of the appropriate size.

You can take several steps to help maximize the amount of free space in your heap. For example, you can mark as purgeable any relocatable blocks whose contents could easily be reconstructed. By making a block purgeable, you give the Memory Manager the freedom to release that space if heap memory becomes low. You can also help maximize the available heap memory by intelligently segmenting your application's executable code and by periodically unloading any unneeded segments. The standard way to do this is to unload every nonessential segment at the end of your application's main event loop.

Memory Cushions: These two measures (making blocks purgeable and unloading segments) help you only by releasing blocks that have already been allocated. It is even more important to make sure, before you attempt to allocate memory directly, that you don't deplete the available heap memory. Before you call NewHandle or NewPtr, you should check that if the requested amount of memory were in fact allocated, the remaining amount of free space in the heap would not fall below a certain threshold. The free memory defined by that threshold is your memory cushion. You should not simply inspect the handle or pointer returned to you and make sure that its value isn't NIL, because you might have succeeded in allocating the space you requested but left the amount of free space dangerously low. You also need to make sure that indirect memory allocation doesn't cut into the memory cushion.
When, for example, you call GetNewDialog, the Dialog Manager might need to allocate space for a dialog record; it also needs to allocate heap space for the dialog item list and any other custom items in the dialog. Before calling GetNewDialog, therefore, you need to make sure that the amount of space left free after the call is greater than your memory cushion.

The execution of some system software routines requires significant amounts of memory in your heap. For example, some QuickDraw operations on regions can temporarily allocate fairly large amounts of space in your heap. Some of these system software routines, however, do little or no checking to see that your heap contains the required amount of free space. They either assume that they will get whatever memory they need, or they simply issue a system error when they don't get the needed memory. In either case, the result is usually a system crash.

You can avoid these problems by making sure that there is always enough space in your heap to handle these hidden memory allocations. Experience has shown that 40 KB is a reasonably safe size for this memory cushion. If you can consistently maintain that amount of free space in your heap, you can be reasonably certain that system software routines will get the memory they need to operate. You also generally need a larger cushion (about 70 KB) when printing.

Memory Reserves

Unfortunately, there are times when you might need to use some of the memory in the cushion yourself. It is better, for instance, to dip into the memory cushion if necessary to save a user's document than to reject the request to save the document. Some actions your application performs should not be rejectable simply because they require it to reduce the amount of free space below a desired minimum.

Instead of relying on just the free memory of a memory cushion, you can allocate a memory reserve: some additional emergency storage that you release when free memory becomes low. The

important difference between this memory reserve and the memory cushion is that the memory reserve is a block of allocated memory, which you release whenever you detect that essential tasks have dipped into the memory cushion.

That emergency memory reserve might provide enough memory to compensate for any essential tasks that you fail to anticipate. Because you allow essential tasks to dip into the memory cushion, the release itself of the memory reserve should not be a cause for alarm. Using this scheme, your application releases the memory reserve as a precautionary measure during ordinary operation. Ideally, however, the application should never actually deplete the memory cushion and use the memory reserve.

Grow-Zone Functions

The Memory Manager provides a particularly easy way for you to make sure that the emergency memory reserve is released when necessary. You can define a grow-zone function that is associated with your application heap. The Memory Manager calls your heap's grow-zone function only after other techniques of freeing memory to satisfy a memory request fail (that is, after compacting and purging the heap and extending the heap zone to its maximum size). The grow-zone function can then take appropriate steps to free additional memory.

A grow-zone function might dispose of some blocks or make some unpurgeable blocks purgeable. When the function returns, the Memory Manager once again purges and compacts the heap and tries to reallocate memory. If there is still insufficient memory, the Memory Manager calls the grow-zone function again (but only if the function returned a nonzero value the previous time it was called). This mechanism allows your grow-zone function to release just a little bit of memory at a time. If the amount it releases at any time is not enough, the Memory Manager calls it again and gives it the opportunity to take more drastic measures. As the most drastic step to freeing memory in your heap, you can release the emergency reserve.
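The grow-zone mechanism described above might be sketched as follows. This is only an outline: gEmergencyMemory and kEmergencyMemorySize are hypothetical names for the reserve handle and its size, and a complete grow-zone function must also take care not to release the block the Memory Manager is currently operating on and, in some situations, to set up the A5 register before touching global variables.

```pascal
FUNCTION MyGrowZone (cbNeeded: Size): LongInt;
BEGIN
   IF (gEmergencyMemory <> NIL) & (gEmergencyMemory^ <> NIL) THEN
      BEGIN
         EmptyHandle(gEmergencyMemory);        {purge the reserve; its master pointer survives}
         MyGrowZone := kEmergencyMemorySize;   {report how many bytes were freed}
      END
   ELSE
      MyGrowZone := 0;                         {nothing left to release; stop calling us}
END;
```

You would install such a function once, early in execution, by calling SetGrowZone(@MyGrowZone), and reallocate the reserve (for example, with ReallocateHandle) whenever memory becomes plentiful again.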

)"in Me#or!
This section describes how you can use the Memory Manager to perform the most typical memory management tasks. In particular, this section shows how you can

• set up your application heap at application launch time
• determine how much free space is available in your application heap
• allocate and release blocks of memory in your heap
• define and install a grow-zone function

The techniques described in this section are designed to minimize fragmentation of your application heap and to ensure that your application always has sufficient memory to complete any essential operations.

Setting Up the Application Heap

When the Process Manager launches your application, it calls the Memory Manager to create and initialize a memory partition for your application. The Process Manager then loads code segments into memory and sets up the stack, heap, and A5 world (including the jump table) for your application.

To help prevent heap fragmentation, you should also perform some setup of your own early in your application's execution. Depending on the needs of your application, you might want to

• change the size of your application's stack
• expand the heap to the heap limit
• allocate additional master pointer blocks

The following sections describe in detail how and when to perform these operations.

Changing the Size of the Stack: Most applications allocate space on their stack in a predictable way and do not need to monitor stack space during their execution. For these applications, stack usage usually reaches a maximum in some heavily nested routine. If the stack in your application can never grow beyond a certain size, then to avoid collisions between your stack and heap you simply need to ensure that your stack is large enough to accommodate that size. If you never encounter system error 28 (generated by the stack sniffer when it detects a collision between the stack and the heap) during application testing, then you probably do not need to increase the size of your stack.

Some applications, however, rely heavily on recursive programming techniques, in which one routine repeatedly calls itself or a small group of routines repeatedly call each other. In these applications, even routines with just a few local variables can cause stack overflow, because each time a routine calls itself, a new copy of that routine's parameters and variables is appended to the stack. The problem can become particularly acute if one or more of the local variables is a string, which can require up to 256 bytes of stack space.

You can help prevent your application from crashing because of insufficient stack space by expanding the size of your stack. If your application does not depend on recursion, you should do this only if you encounter system error 28 during testing. If your application does depend on recursion, you might consider expanding the stack so that your application can perform deeply nested recursive computations. In addition, some object-oriented languages (for example, C++) allocate space for objects on the stack. If you are using one of these languages, you might need to expand your stack.

To increase the size of your stack, you simply reduce the size of your heap.
Because the heap cannot grow above the boundary contained in the ApplLimit global variable, you can lower the value of ApplLimit to limit the heap's growth. By lowering ApplLimit, you are technically not making the stack bigger; you are just preventing collisions between it and the heap.

By default, the stack can grow to 8 KB on Macintosh computers without Color QuickDraw and to 32 KB on computers with Color QuickDraw. (The size of the stack for a faceless background process is always 8 KB, whether Color QuickDraw is present or not.) You should never decrease the size of the stack, because future versions of system software might increase the default amount of space allocated for the stack. For the same reason, you should not set the stack to a predetermined absolute size or calculate a new absolute size for the stack based on the microprocessor's type. If you must modify the size of the stack, you should increase the stack size only by some relative amount that is sufficient to meet the increased stack requirements of your application. There is no maximum size to which the stack can grow.

Listing 1-3 defines a procedure that increases the stack size by a given value. It does so by determining the current heap limit, subtracting the value of the extraBytes parameter from that value, and then setting the application limit to the difference.

Listing 1-3. Increasing the amount of space allocated for the stack

PROCEDURE IncreaseStackSize (extraBytes: Size);
BEGIN
   SetApplLimit(Ptr(ORD4(GetApplLimit) - extraBytes));
END;

You should call this procedure at the beginning of your application, before you call the MaxApplZone procedure (as described in the next section). If you call IncreaseStackSize after you call MaxApplZone, it has no effect, because the SetApplLimit procedure cannot change the ApplLimit global variable to a value lower than the current top of the heap.

Expanding the Heap: Near the beginning of your application's execution, before you allocate any memory, you should call the MaxApplZone procedure to expand the application heap immediately to the application heap limit. If you do not do this, the Memory Manager gradually expands your heap as memory needs require. This gradual expansion can result in significant heap fragmentation if you have previously moved relocatable blocks to the top of the heap (by calling MoveHHi) and locked them (by calling HLock). When the heap grows beyond those locked blocks, they are no longer at the top of the heap. Your heap then remains fragmented for as long as those blocks remain locked.

Another advantage to calling MaxApplZone is that doing so is likely to reduce the number of relocatable blocks that are purged by the Memory Manager. The Memory Manager expands your heap to fulfill a memory request only after it has exhausted other methods of obtaining the required amount of space, including compacting the heap and purging blocks marked as purgeable. By expanding the heap to its limit, you can prevent the Memory Manager from purging blocks that it otherwise would purge. This, together with the fact that your heap is expanded only once, can make memory allocation significantly faster.
A((ocatin Ma"ter Pointer B(oc0" After calling 0a.ApplKone you should call the 0ore0asters procedure to allocate as many new nonrelocatable bloc!s of master pointers as your application is li!ely to need during its e.ecution. 2ach bloc! of master pointers in your application heap contains BD master pointers. The Operating System allocates one bloc! of master pointers as your application is loaded into memory and every relocatable bloc! you allocate needs one master pointer to reference it. #f when you allocate a relocatable bloc! there are no unused master pointers in your application heap the 0emory 0anager automatically allocates a new bloc! of master pointers. ,or several reasons however you should try to prevent the 0emory 0anager from calling 0ore0asters for you. ,irst 0ore0asters e.ecutes more slowly if it has to move relocatable bloc!s up in the heap to ma!e room for the new nonrelocatable bloc! of master pointers. /hen your application first starts running there are no such bloc!s that might have to be moved. Second the new nonrelocatable bloc! of master pointers is li!ely to fragment your application heap. At any time the 0emory 0anager is forced to call 0ore0asters for you there are already at least BD relocatable bloc!s allocated in your heap. &nless all or most of those bloc!s are loc!ed high in the heap (an unli!ely situation) the new nonrelocatable bloc! of master pointers might be allocated above e.isting relocatable bloc!s. This increases heap fragmentation. To prevent this fragmentation you should call 0ore0asters at the beginning of your application enough times to ensure that the 0emory 0anager never needs to call it for you. ,or 128

e.ample if your application never allocates more than EAA relocatable bloc!s in its heap then five calls to the 0ore0asters should be enough. #t;s better to call 0ore0asters too many times than too few so if your application usually allocates about <AA relocatable bloc!s but sometimes might allocate <AAA in a particularly busy session you should call 0ore0asters enough times at the beginning of the program to cover the larger figure. 8ou can determine empirically how many times to call 0ore0asters by using a low-level debugger. ,irst remove all the calls to 0ore0asters from your code and then give your application a rigorous wor!out opening and closing windows dialog bo.es and des! accessories as much as any user would. Then find out from your debugger how many times the system called 0ore0asters. To do so count the nonrelocatable bloc!s of si3e e<AA bytes (decimal 9@B or BD D). $ecause of 0emory 0anager si3e corrections you should also count any nonrelocatable bloc!s of si3e e<AH e<A' or e<<A bytes. (8ou should also chec! to ma!e sure that your application doesn;t allocate other nonrelocatable bloc!s of those si3es. #f it does subtract the number it allocates from the total.) ,inally call 0ore0asters at least that many times at the beginning of your application. (isting <-D illustrates a typical se%uence of steps to configure your application heap and stac!. The 1oSet&p=eap procedure defined there increases the si3e of the stac! by E9 I$ e.pands the application heap to its new limit and allocates five additional bloc!s of master pointers. (isting <-D. Setting up your application heap and stac! 1<K@=HL<= Ho5etLpEeap; @KN5T k=8tra5tack5pace ! OR"""; kIoreIaster@alls ! T; 6>< count: A=GDN Dncrease5tack5i;e(k=8tra5tack5pace); Ia8>pplUone; FK< count :! 
, TK kIoreIaster@alls HK IoreIasters; =NH; To reduce heap fragmentation you should call 1oSet&p=eap in a code segment that you never unload (possibly the main segment) rather than in a special initiali3ation code segment. This is because 0ore0asters allocates a nonrelocatable bloc!. #f you call 0ore0asters from a code segment that is later purged the new master pointer bloc! is located above the purged space thereby increasing fragmentation. /eter#inin the A#ount o8 Free Me#or! $ecause space in your heap is limited you cannot usually honor every user re%uest that would re%uire your application to allocate memory. ,or e.ample every time the user opens a new window 129 (VQ more master ptrs' (increase stack si;e' (e8tend heap to limit' Dnteger; (/B SA' (for /B" master ptrs'

you probably need to allocate a new window record and other associated data structures. If you allow the user to open windows endlessly, you risk running out of memory. This might adversely affect your application's ability to perform important operations such as saving existing data in a window.

It is important, therefore, to implement some scheme that prevents your application from using too much of its own heap. One way to do this is to maintain a memory cushion that can be used only to satisfy essential memory requests. Before allocating memory for any nonessential task, you need to ensure that the amount of memory that remains free after the allocation exceeds the size of your memory cushion. You can do this by calling the function IsMemoryAvailable, defined in Listing 1-5.

Listing 1-5. Determining whether allocating memory would deplete the memory cushion

FUNCTION IsMemoryAvailable (memRequest: LongInt): Boolean;
   VAR
      total:   LongInt;    {total free memory if heap purged}
      contig:  LongInt;    {largest contiguous block if heap purged}
   BEGIN
      PurgeSpace(total, contig);
      IsMemoryAvailable := ((memRequest + kMemCushion) < contig);
   END;

The IsMemoryAvailable function calls the Memory Manager's PurgeSpace procedure to determine the size of the largest contiguous block that would be available if the application heap were purged; that size is returned in the contig parameter. If the size of the potential memory request together with the size of the memory cushion is less than the value returned in contig, IsMemoryAvailable is set to TRUE, indicating that it is safe to allocate the specified amount of memory; otherwise, IsMemoryAvailable returns FALSE.

Notice that the IsMemoryAvailable function does not itself cause the heap to be purged or compacted; the Memory Manager does so automatically when you actually attempt to allocate the memory.

Usually the easiest way to determine how big to make your application's memory cushion is to experiment with various values. You should attempt to find the lowest value that allows your application to execute successfully, no matter how hard you try to allocate memory to make the application crash. As an extra guarantee against your application's crashing, you might want to add some memory to this value. As indicated earlier in this chapter, 40 KB is a reasonable size for most applications.

CONST
   kMemCushion = 40 * 1024;    {size of memory cushion}

You should call the IsMemoryAvailable function before all nonessential memory requests, no matter how small. For example, suppose your application allocates a new, small relocatable block each time a user types a new line of text. That block might be small, but thousands of such blocks

could take up a considerable amount of space. Therefore, you should check to see if there is sufficient memory available before allocating each one. You should never, however, call the IsMemoryAvailable function before an essential memory request. When deciding how big to make the memory cushion for your application, you must make sure that essential requests can never deplete all of the cushion. Note that when you call the IsMemoryAvailable function for a nonessential request, essential requests might have already dipped into the memory cushion. In that case, IsMemoryAvailable returns FALSE, no matter how small the nonessential request is.

Some actions should never be rejectable. For example, you should guarantee that there is always enough memory free to save open documents and to perform typical maintenance tasks such as updating windows. Other user actions are likely to be always rejectable. For example, because you cannot allow the user to create an endless number of documents, you should make the New Document and Open Document menu commands rejectable.

Although the decisions of which actions to make rejectable are usually obvious, modal and modeless dialog boxes present special problems. If you want to make such dialog boxes available at all costs, you must ensure that you allocate a large enough memory cushion to handle the maximum number of these dialog boxes that the user could open at once. If you consider a certain dialog box (for instance, a spelling checker) nonessential, you must be prepared to inform the user that there is not enough memory to open it if memory space becomes low.

Allocating Blocks of Memory

As you have seen, a key element of the memory-management scheme presented in this chapter is to disallow any nonessential memory allocation requests that would deplete the memory cushion. In practice, this means that before calling NewHandle, NewPtr, or another function that allocates memory, you should check
that the amount of space remaining after the allocation, if successful, exceeds the size of the memory cushion. An easy way to do this is never to allocate memory for nonessential tasks by calling NewHandle or NewPtr directly. Instead, call a function such as NewHandleCushion, defined in Listing 1-6, or NewPtrCushion, defined in Listing 1-7.

Listing 1-6. Allocating relocatable blocks

FUNCTION NewHandleCushion (logicalSize: Size): Handle;
   BEGIN
      IF NOT IsMemoryAvailable(logicalSize) THEN
         NewHandleCushion := NIL
      ELSE
         BEGIN
            SetGrowZone(NIL);            {remove grow-zone function}
            NewHandleCushion := NewHandleClear(logicalSize);
            SetGrowZone(@MyGrowZone);    {install grow-zone function}
         END;
   END;

The NewHandleCushion function first calls IsMemoryAvailable to determine whether allocating the requested number of bytes would deplete the memory cushion. If so, NewHandleCushion returns NIL to indicate that the request has failed. Otherwise, if there is indeed sufficient space for the new block, NewHandleCushion calls NewHandleClear to allocate the relocatable block. Before calling NewHandleClear, however, NewHandleCushion disables the grow-zone function for the application heap. This prevents the grow-zone function from releasing any emergency memory reserve your application might be maintaining.

You can define a function NewPtrCushion to handle allocation of nonrelocatable blocks, as shown in Listing 1-7.

Listing 1-7. Allocating nonrelocatable blocks

FUNCTION NewPtrCushion (logicalSize: Size): Ptr;
   BEGIN
      IF NOT IsMemoryAvailable(logicalSize) THEN
         NewPtrCushion := NIL
      ELSE
         BEGIN
            SetGrowZone(NIL);            {remove grow-zone function}

            NewPtrCushion := NewPtrClear(logicalSize);
            SetGrowZone(@MyGrowZone);    {install grow-zone function}
         END;
   END;

Listing 1-8 illustrates a typical way to call NewPtrCushion.

Listing 1-8. Allocating a dialog record

FUNCTION GetDialog (dialogID: Integer): DialogPtr;
   VAR
      myPtr: Ptr;    {storage for the dialog record}
   BEGIN
      myPtr := NewPtrCushion(SizeOf(DialogRecord));
      IF MemError = noErr THEN
         GetDialog := GetNewDialog(dialogID, myPtr, WindowPtr(-1))
      ELSE
         GetDialog := NIL;    {can't get memory}
   END;

When you allocate memory directly, you can later release it by calling the DisposeHandle and DisposePtr procedures. When you allocate memory indirectly by calling a Toolbox routine, there is always a corresponding Toolbox routine to release that memory. For example, the DisposeWindow procedure releases memory allocated with the NewWindow function. Be sure to use these special Toolbox routines instead of the generic Memory Manager routines when applicable.

Maintaining a Memory Reserve

A simple way to help ensure that your application always has enough memory available for essential operations is to maintain an emergency memory reserve. This memory reserve is a block of memory that your application uses only for essential operations and only when all other heap space has been allocated. This section illustrates one way to implement a memory reserve in your application. To create and maintain an emergency memory reserve, you follow three distinct steps:

• When your application starts up, allocate a block of reserve memory. Because you allocate the block, it is no longer free in the heap and does not enter into the free-space determination done by IsMemoryAvailable.
• When your application needs to fulfill an essential memory request and there isn't enough space in your heap to satisfy the request, you can release the reserve. This effectively ensures that you always have the memory you request, at least for essential operations.
• Each time through your main event loop, check whether the reserve has been released. If it has, attempt to recover the reserve. If you cannot recover the reserve, warn the user that memory is critically short.

To refer to the emergency reserve, you can declare a global variable of type Handle.

VAR
   gEmergencyMemory: Handle;    {handle to emergency memory reserve}

Listing 1-9 defines a function that you can call early in your application's execution (before entering your main event loop) to create an emergency memory reserve. This function also installs the application-defined grow-zone procedure.

Listing 1-9. Creating an emergency memory reserve

PROCEDURE InitializeEmergencyMemory;
   BEGIN
      gEmergencyMemory := NewHandle(kEmergencyMemorySize);
      SetGrowZone(@MyGrowZone);
   END;

The InitializeEmergencyMemory procedure defined in Listing 1-9 simply allocates a relocatable block of a predefined size. That block is the emergency memory reserve. A reasonable size for the





memory reserve is whatever size you use for the memory cushion. Once again, 40 KB is a good size for many applications.

CONST
   kEmergencyMemorySize = 40 * 1024;    {size of memory reserve}

When using a memory reserve, you need to change the IsMemoryAvailable function defined earlier in Listing 1-5. You need to make sure, when determining whether a nonessential memory allocation request should be honored, that the memory reserve has not been released. To check that the memory reserve is intact, use the function IsEmergencyMemory, defined in Listing 1-10.

Listing 1-10. Checking the emergency memory reserve

FUNCTION IsEmergencyMemory: Boolean;
   BEGIN
      IsEmergencyMemory := (gEmergencyMemory <> NIL) & (gEmergencyMemory^ <> NIL);
   END;

Then you can replace the function IsMemoryAvailable defined in Listing 1-5 by the version defined in Listing 1-11.

Listing 1-11. Determining whether allocating memory would deplete the memory cushion

FUNCTION IsMemoryAvailable (memRequest: LongInt): Boolean;
   VAR
      total:   LongInt;    {total free memory if heap purged}
      contig:  LongInt;    {largest contiguous block if heap purged}
   BEGIN
      IF NOT IsEmergencyMemory THEN    {is emergency memory available?}
         IsMemoryAvailable := FALSE
      ELSE
         BEGIN
            PurgeSpace(total, contig);
            IsMemoryAvailable := ((memRequest + kMemCushion) < contig);
         END;
   END;

As you can see, this is exactly like the earlier version, except that it indicates that memory is not available if the memory reserve is not intact.

Once you have allocated the memory reserve early in your application's execution, it should be released only to honor essential memory requests when there is no other space available in your heap. You can install a simple grow-zone function that takes care of releasing the reserve at the proper moment.

Each time through your main event loop, you can check whether the reserve is still intact. To do this, add these lines of code to your main event loop, before you make your event call:

IF NOT IsEmergencyMemory THEN
   RecoverEmergencyMemory;

The RecoverEmergencyMemory procedure, defined in Listing 1-12, simply attempts to reallocate the memory reserve.

Listing 1-12. Reallocating the emergency memory reserve

PROCEDURE RecoverEmergencyMemory;
   BEGIN
      ReallocateHandle(gEmergencyMemory, kEmergencyMemorySize);
   END;

If you are unable to reallocate the memory reserve, you might want to notify the user that, because memory is in short supply, steps should be taken to save any important data and to free some memory.

Defining a Grow-Zone Function

The Memory Manager calls your heap's grow-zone function only after other attempts to obtain enough memory to satisfy a memory allocation request have failed. A grow-zone function should be of the following form:

FUNCTION MyGrowZone (cbNeeded: Size): LongInt;

The Memory Manager passes to your function (in the cbNeeded parameter) the number of bytes it needs. Your function can do whatever it likes to free that much space in the heap. For example, your grow-zone function might dispose of certain blocks or make some unpurgeable blocks purgeable. Your function should return the number of bytes, if any, it managed to free.

When the function returns, the Memory Manager once again purges and compacts the heap and tries again to allocate the requested amount of memory. If there is still insufficient memory, the Memory Manager calls your grow-zone function again, but only if the function returned a nonzero value when last called. This mechanism allows your grow-zone function to release memory gradually; if the amount it releases is not enough, the Memory Manager calls it again and gives it the opportunity to take more drastic measures.

Typically, a grow-zone function frees space by calling the EmptyHandle procedure, which purges a relocatable block from the heap and sets the block's master pointer to NIL. This is preferable to disposing of the space (by calling the DisposeHandle procedure), because you are likely to want to reallocate the block.

The Memory Manager might designate a particular relocatable block in the heap as protected; your grow-zone function should not move or purge that block. You can determine which block, if any, the Memory Manager has protected by calling the GZSaveHnd function in your grow-zone function.

Listing 1-13 defines a very basic grow-zone function. The MyGrowZone function attempts to create space in the application heap simply by releasing the block of emergency memory. First, however, it checks that (1) the emergency memory hasn't already been released and (2) the emergency memory is not a protected block of memory (as it would be, for example, during an attempt to reallocate the emergency memory block). If either of these conditions isn't true, then MyGrowZone returns 0 to indicate that no memory was released.

Listing 1-13. A grow-zone function that releases emergency storage

FUNCTION MyGrowZone (cbNeeded: Size): LongInt;
   VAR
      theA5: LongInt;    {value of A5 when function is called}
   BEGIN
      theA5 := SetCurrentA5;    {remember current value of A5; install ours}
      IF (gEmergencyMemory^ <> NIL) & (gEmergencyMemory <> GZSaveHnd) THEN
         BEGIN
            EmptyHandle(gEmergencyMemory);
            MyGrowZone := kEmergencyMemorySize;
         END
      ELSE
         MyGrowZone := 0;       {no more memory to release}
      theA5 := SetA5(theA5);    {restore previous value of A5}
   END;

The function MyGrowZone defined in Listing 1-13 saves the current value of the A5 register when it begins and then restores the previous value before it exits. This is necessary because your grow-zone function might be called at a time when the system is attempting to allocate memory and the value in the A5 register is not correct. See the chapter "Memory Management Utilities" in this book for more information about saving and restoring the A5 register.
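The cushion-and-reserve policy described in this chapter is specific to the classic Mac OS Memory Manager, but the underlying idea is portable to any allocator. The following is a minimal, hypothetical Python sketch of the same policy; all names here (CushionedHeap, alloc, essential) are invented for illustration and are not the Memory Manager API.

```python
class CushionedHeap:
    """Toy allocator policy: refuse nonessential requests that would dip
    into a fixed cushion, and keep an emergency reserve that only
    essential requests may consume (cf. IsMemoryAvailable/MyGrowZone)."""

    def __init__(self, capacity, cushion, reserve):
        self.free = capacity - reserve  # reserve is pre-allocated, like gEmergencyMemory
        self.cushion = cushion
        self.reserve = reserve

    def alloc(self, size, essential=False):
        if not essential:
            # Mirrors Listing 1-11: refuse if the reserve is gone, or if
            # the request plus the cushion would not fit in free space.
            if self.reserve == 0 or size + self.cushion >= self.free:
                return False
        if size <= self.free:
            self.free -= size
            return True
        if essential and size <= self.free + self.reserve:
            # Mirrors the grow-zone function releasing the reserve.
            self.free += self.reserve
            self.reserve = 0
            self.free -= size
            return True
        return False

heap = CushionedHeap(capacity=1000, cushion=100, reserve=100)
```

With this instance, a 700-unit nonessential request succeeds, a further 150-unit nonessential request is refused (it would dip into the cushion), and a 250-unit essential request succeeds by consuming the reserve.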

CHAPTER 9: CACHING AND INTRO TO FILE SYSTEMS
Introduction to File Systems
• File systems. Important topic - the most crucial data is stored in file systems, and file system performance is a crucial component of overall system performance. In practice it is maybe the most important.
• What are files? Data that is readily available but stored on non-volatile media. Standard place to store files: on a hard disk or floppy disk. The data may also be a network away. Most systems let you organize files into a tree structure, so you have directories and files.
• What is stored in files? LaTeX source, Nachos source, FrameMaker source, C++ object files, executables, Perl scripts, shell files, databases, PostScript files, etc.
• Meaning of a file depends on the tools that manipulate it. The meaning of a LaTeX file is different for the LaTeX executable than for a standard text editor. Executable file format has meaning to the OS. Object file format has meaning to the linker.
• Some systems support a lot of different file types explicitly. Macintosh and IBM mainframes do this. Knowledge of file types is built into the OS, and the OS handles different kinds of files differently.
• In Unix, the meaning of a file is simply a sequence of bytes. How do Unix tools tell file types apart? By looking at contents! For example, how does Unix tell executables apart from shell scripts apart from Perl files when it executes them?
  o Perl scripts - start with #!/usr/bin/perl. In general, if a file starts with #!tool, the Unix shell interprets the file using tool.
  o Shell scripts - start with a #.



  o How about executables? Start with the Unix executable magic number. Recall the Nachos object file format.
  o What about PostScript files? Start with something like %!PS-Adobe-2.0, which printing utilities recognize.
Single exception: directories and symbolic links are explicitly tagged in Unix.
• What about Macintosh? All files have a type (pict, text) and the name of the program that created the file. When you double-click on the file, it automatically starts the program that created the file and loads the file. Have to have utilities that twiddle the file metadata (types and program names).
• What about DOS? Has an ad-hoc file typing mechanism built into file naming conventions. So .com and .exe identify two different kinds of executables, and .bat identifies a text batch file. These are enforced by the OS (because it is involved with launching executables). Other file extensions are recognized by other programs but not by the OS.
• File attributes:
  o Name
  o Type - in Unix, implicit.
  o Location - where the file is stored on disk
  o Size
  o Protection
  o Time, date, and user identification.
• All file system information is stored in nonvolatile storage in a way that it can be reconstructed on a system crash. Very important for data security.
• How do programs access files? Several general ways:
  o Sequential - open it, then read or write from beginning to end.
  o Direct - specify the starting address of the data.
  o Indexed - index the file by identifier (name, for example), then retrieve the record associated with the name.
• Files may be accessed more than one way. A payroll file, for example, may be accessed sequentially by the paycheck program and indexed by the personnel office. Nachos executable files are accessed directly.
• File structure can be optimized for a given access mode.
  o For sequential access, can have the file just laid out sequentially on disk. What is the problem?
  o For direct access, can have a disk block table telling where each disk block is. To access indexed data, first traverse the disk block table to find the right disk block, then go to the block containing the data.
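The disk-block-table lookup just described is a simple two-step mapping. A toy sketch (the "disk" and block table are plain Python containers; all names here are invented for illustration):

```python
BLOCK_SIZE = 512  # bytes per disk block (assumed)

def read_byte(block_table, disk, offset):
    """Direct access via a per-file disk block table: map a byte offset
    in the file to (disk block, offset within block), then read it."""
    index = offset // BLOCK_SIZE     # which file-relative block holds the byte
    within = offset % BLOCK_SIZE     # where inside that block
    disk_block = block_table[index]  # table maps file block -> disk block
    return disk[disk_block][within]

# Example: a two-block file whose data lives in disk blocks 3 and 7.
disk = {3: b"x" * BLOCK_SIZE, 7: bytes(range(256)) * 2}
table = [3, 7]
```

Note that only one "disk access" (the final indexing step) is needed once the block table is in memory, which is the point of the scheme.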





  o For more sophisticated indexed access, may build an index file. Example: IBM ISAM (Indexed Sequential Access Method). The user selects a key, and the system builds a two-level index for the key. Uses binary search at each level of the index, then linear search within the final block. Notice how memory hierarchy considerations drive file implementation.
• Easy to simulate a sequential access file given a direct access file - just keep track of the current file position. But simulating a direct access file with a sequential access file is a lot harder.
• Fundamental design choice: lots of file formats or few file formats? Unix: few (one) file format. VMS: few (three). IBM: lots (I don't know just how many).
• Advantage of lots of file formats: the user probably has one that fits the bill. Disadvantage: the OS becomes larger, and the system becomes harder to use (must choose a file format; if you get it wrong, it is a big problem).
• Directory structure. To organize files, many systems provide a hierarchical file system arrangement. Can have files, and then directories of files. Common arrangement: a tree of files. Naming can be absolute, relative, or both.
• There is sometimes a need to share files between different parts of the tree. So the structure becomes a graph, and you can get to the same file in multiple ways. Unix supports two kinds of links:
  o Symbolic links: the directory entry is the name of another file. If that file is moved, the symbolic link still points to the (non-existent) file. If another file is copied into that spot, the symbolic link all of a sudden points to it.
  o Hard links: stick with the file. If the file is moved, the hard link still points to it. To get rid of the file, must delete it from all places that have hard links to it.
The link command (ln) sets these links up.
• Uses for soft links? Can have two people share files. Can also set up source directories, then link compilation directories to source directories. A typically useful file system structuring tool.
• Graph structure introduces complications. First, must be sure not to delete hard-linked files until all pointers to them are gone. Standard solution: reference counts. Second, only want to traverse files once, even if there are multiple references to the same file. Standard solution: marking. cp does not handle this well for soft links; tar handles it well.
• What about cyclic graph structures? The problem is that cycles may make reference counts not work - you can have a section of the graph that is disconnected from the rest, yet all its entries have positive reference counts. Only solution: garbage collect. Not done very often because it takes so long. Unix prevents users from making hard links create cycles by only allowing hard links to point to files, not directories. But there are still some cycles in the structure.
• Memory-mapped files. Standard view of the system: data is stored in the address space of a process, but the data goes away when the process dies. If you want to preserve data, must write it to disk, then read it back in again when you need it.









Writing I/O routines to dump data to disk and back again is a real hassle. What is worse, if programs share data using files, you must maintain consistency between the file and the data read in via some other mechanism.
• Solution: memory-mapped files. Can map part of a file into a process's address space and read and write the file like a normal piece of memory. Sort of like memory-mapped I/O generalized to user level. So processes can share persistent data directly with no hassles, and programs can dump data structures to disk without having to write routines to linearize output and read data structures back in. Used for stuff like snapshot files in interactive systems.
• In Unix, the system call that sets this up is the mmap system call. How is sharing set up for processes on the same machine? What about processes on different machines?
• Next issue: protection. Why is protection necessary? Because people want to share files, but not share all aspects of all files. Want protection on an individual file and operation basis.
  o Professor wants students to read but not write assignments.
  o Professor wants to keep the exam in the same directory as assignments, but students should not be able to read the exam.
  o Can execute but not write commands like cp, cat, etc.
For convenience, want to create coarser-grain concepts:
• All people in a research group should be able to read and write source files; others should not be able to access them.
• Everybody should be able to read files in a given directory.






• Conceptually, have operations (open, read, write, execute), resources (files), and principals (users or processes). Can describe the desired protection using an access matrix: list the principals across the top and the resources down the side. Each entry of the matrix lists the operations that the principal can perform on the resource.
• Two standard mechanisms for access control: access lists and capabilities.
  o Access lists: for each resource (like a file), give a list of principals allowed to access that resource and the access they are allowed to perform. So each row of the access matrix is an access list.
  o Capabilities: for each resource and access operation, give out capabilities that give the holder the right to perform the operation on that resource. Capabilities must be unforgeable. Each column of the access matrix is a capability list.
• Instead of organizing access lists on a principal-by-principal basis, can organize on a group basis.
• Who controls access lists and capabilities? Done under OS control. Will talk more about security later.
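The row/column duality just described can be made concrete with a toy access matrix. In the sketch below (names and data invented for illustration), rows are resources and columns are principals, matching the layout in the notes:

```python
# Toy access matrix: resource -> {principal -> set of allowed operations}.
matrix = {
    "exam.tex": {"professor": {"read", "write"}},
    "hw1.tex":  {"professor": {"read", "write"}, "student": {"read"}},
}

def access_list(resource):
    """One row of the matrix: who may do what to this resource (an ACL)."""
    return matrix.get(resource, {})

def capability_list(principal):
    """One column of the matrix: what this principal may do to each resource."""
    return {res: ops[principal] for res, ops in matrix.items() if principal in ops}

def allowed(principal, op, resource):
    """A single matrix entry lookup."""
    return op in matrix.get(resource, {}).get(principal, set())
```

Slicing the same matrix by row gives access lists (stored with the resource, as Unix does); slicing by column gives capability lists (held by the principal).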



• What is the Unix security model? Have three operations - read, write, and execute. Each file has an owner and a group. Protections are given for each operation on the basis of owner, group, and everybody else. Like everything else in Unix, it is a fairly simple and primitive protection strategy. Unix file listing:

drwxr-xr-x  2 martin faculty  2048 May 15 21:03 ./
drwxr-xr-x  7 martin faculty   512 May  3 17:46 ../
-rw-r-----  1 martin faculty   213 Apr 19 22:27 a0.aux
-rw-r-----  1 martin faculty  3488 Apr 19 22:27 a0.dvi
-rw-r-----  1 martin faculty  1218 Apr 19 22:27 a0.log
-rw-r--r--  1 martin faculty 36617 Apr 19 22:27 a0.ps
-rwxr-xr-x  1 martin faculty  2599 Apr  5 18:07 a0.tex
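The owner/group/other convention above amounts to picking one of three bit-triples from the file's mode and testing a bit. A simplified sketch (real kernels also handle root, setuid bits, supplementary groups, and more):

```python
R, W, X = 4, 2, 1  # permission bits within one rwx triple

def may_access(mode, want, uid, gid, file_uid, file_gid):
    """Pick the owner, group, or other triple from the low 9 mode bits
    (e.g. 0o640 for -rw-r-----), then test the requested bit."""
    if uid == file_uid:
        bits = (mode >> 6) & 7   # owner triple
    elif gid == file_gid:
        bits = (mode >> 3) & 7   # group triple
    else:
        bits = mode & 7          # everybody else
    return bool(bits & want)
```

For example, with mode 0o640 (like a0.aux above), the owner can read and write, group members can only read, and everybody else gets nothing.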

• How are files implemented on a standard hard-disk-based system? It is up to the OS to implement them. Why must the OS do this? Protection.
• What does a disk look like? It is a stack of platters. Each platter may have two surfaces (one per side). There is one disk head per surface. The surfaces revolve beneath the heads, with the heads riding on a cushion of air. The heads move back and forth between the platters as a unit. The area beneath a stationary head is a track. The set of tracks that can be accessed without moving the heads is a cylinder. Each track is broken up into sectors. A sector is the unit of disk transfer.
• To read a given sector, first move the heads to that sector's cylinder (seek time), then wait for the sector to rotate under the head (latency time), then copy the data off of the disk into memory (transfer time).
• Typical hard disk statistics (Sequel 5400 from August 1993, 5.25-inch, 4.0 GB):
  o Platters: 13
  o Read/write heads: 26
  o Tracks/surface: 3,058
  o Track capacity (bytes): 40,448 - 60,928
  o Bytes/sector: 512 - 520
  o Sectors/track: 79 - 119
  o Media transfer rate (MB/s): 3.6 - 5.5
  o Track-to-track seek: 1.3 ms
  o Max seek: 25 ms





  o Average seek: 12 ms
  o Rotational speed: 5,400 rpm
  o Average latency: 5.6 ms
• How does this compare to timings for a standard workstation? The DECstation 5000 is a standard workstation available in 1993. Had a 33 MHz MIPS R3000 and 60 ns memory. How many instructions can it execute in 30 ms (roughly the time for an average seek plus average latency)? 33 × 30 × 1,000 = 990,000. Plus, many operations require multiple disk accesses.
• What does the disk look like to the OS? It is just a sequence of sectors. All sectors in a track are in sequence; all tracks in a cylinder are in sequence; adjacent cylinders are in sequence. The OS may logically link several disk sectors together to increase the effective disk block size.
• How does the OS access the disk? There is a piece of hardware on the disk called a disk controller. The OS issues instructions to the disk controller, using either I/O instructions or memory-mapped I/O operations.
• In effect, the disk is just a big array of fixed-size chunks. The job of the OS is to implement file system abstractions on top of these chunks.







File System Implementation
• Discuss several file system implementation strategies.
• First implementation strategy: contiguous allocation. Lay out the file in contiguous disk blocks. Used in VM/CMS - an old IBM interactive system.
Advantages:
  o Quick and easy calculation of the block holding the data - just an offset from the start of the file!
  o For sequential access, almost no seeks required.
  o Even direct access is fast - just seek and read. Only one disk access.
Disadvantages:
  o Where is the best place to put a new file?
  o Problems when the file gets bigger - may have to move the whole file!!
  o External fragmentation.
  o Compaction may be required, and it can be very expensive.
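The "just an offset from the start of the file" calculation for contiguous allocation is a single line of arithmetic. A toy sketch (a file is described only by its starting disk block; the 512-byte block size is an assumption carried over from the sector size discussed above):

```python
BLOCK_SIZE = 512

def block_of(start_block, byte_offset):
    """Contiguous allocation: the disk block holding byte_offset is the
    file's start block plus the file-relative block index."""
    return start_block + byte_offset // BLOCK_SIZE
```

This is why direct access needs only one disk access under contiguous allocation: no table or pointer chain ever has to be consulted.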

• Next strategy: linked allocation. All files are stored in fixed-size blocks. Link adjacent blocks together like a linked list.
Advantages:
  o No more variable-sized file allocation problems. Everything takes place in fixed-size chunks, which makes allocation a lot easier.
  o No more external fragmentation.
  o No need to compact or relocate files.
Disadvantages:
  o Potentially terrible performance for direct access files - have to follow pointers from one disk block to the next!
  o Even sequential access is less efficient than for contiguous files, because it may generate long seeks between blocks.
  o Reliability - if you lose one pointer, you have big problems.
• FAT allocation. Instead of storing the next-block pointer in each block, have a table of next pointers indexed by disk block. Still have to linearly traverse the next pointers, but at least you don't have to go to disk for each of them - can just cache the FAT and do the traversal all in memory. MS-DOS and OS/2 use this scheme. The table pointer of the last block in a file holds an EOF value; free blocks have a table pointer of 0. Allocation of free blocks with the FAT scheme is straightforward: search for the first block with a 0 table pointer.
• Indexed schemes. Give each file an index table. Each entry of the index points to a disk block containing the actual file data. Supports fast direct file access, and not bad for sequential access.
• Question: how to allocate the index table? It must be stored on disk like everything else in the file system, and you have basically the same alternatives as for the file itself: contiguous, linked, and multilevel index. In practice, some combination scheme is usually used. This whole discussion is reminiscent of paging discussions.
• Will now discuss how traditional Unix lays out a file system.
• First 8 KB - label + boot block. Next 8 KB - superblock plus free inode and disk block cache. Next 64 KB - inodes; each inode corresponds to one file. From there until the end of the file system - disk blocks. Each disk block consists of a number of consecutive sectors.
• What is in an inode? Information about a file; each inode corresponds to one file. Important fields:
  o Mode. This includes protection information and the file type. The file type can be normal file (-), directory (d), or symbolic link (l).
  o Owner.
  o Number of links - the number of directory entries that point to this inode.
  o Length - how many bytes long the file is.
  o Nblocks - the number of disk blocks the file occupies.








  o Array of 10 direct block pointers. These are the first 10 blocks of the file.
  o One indirect block pointer. Points to a block full of pointers to disk blocks.
  o One doubly indirect block pointer. Points to a block full of pointers to blocks full of pointers to disk blocks.
  o One triply indirect block pointer. (Not currently used.)
• So a file consists of an inode and the disk blocks that it points to.
• Nblocks and Length do not contain redundant information - files can have holes. A hole shows up as block pointers that point to block 0 - i.e., nothing in that block.
• Assume the block size is 512 bytes (i.e., one sector). To access any of the first 512 × 10 bytes of the file, can just go straight from the inode. To access data farther in, must go through at least one level of indirection.
• What does a directory look like? It is a file consisting of a list of (name, inode number) pairs. In early Unix systems, the name was a maximum of 14 characters long and the inode number was 2 bytes. Later versions of Unix removed this restriction; each directory entry became variable length and also included the length of the file name.
• Why don't inodes contain names? Because we would like a file to be able to have multiple names.
• How does Unix implement the directories . and ..? They are just names in the directory. . points to the inode of the directory itself, while .. points to the inode of the directory's parent directory. So there are some circularities in the file system structure.
• A user can refer to files in one of two ways: relative to the current directory or relative to the root directory. Where does lookup for the root start? By convention, inode number 2 is the inode for the top directory. If a name starts with /, lookup starts at the file for inode number 2.
• How does the system convert a name to an inode? There is a routine called namei that does it. Do a simple file system example: draw out inodes and disk blocks, etc. Include counts, lengths, etc.
• What about symbolic links? A symbolic link is a file containing a file name. Whenever a Unix operation has the name of the symbolic link as a component of a file name, it macro-substitutes the name stored in the file in for that component.
• What disk accesses take place when you list a directory, cd to a directory, or cat a file? Is there any difference between ls and ls -F?
• What about when you use the Unix rm command? Does it always delete the file? NO - it decrements the reference count. If the count is then 0, it frees up the space. Does this algorithm work for directories? NO - a directory has a reference to itself (.). Use a different command.
• When writing a file, the system may need to allocate more inodes and disk blocks. The superblock keeps track of data that helps this process along. A superblock contains:



• •



• • •

• •



o the size of the file system
o the number of free blocks in the file system
o the list of free blocks available in the file system
o the index of the next free block in the free block list
o the size of the inode list
o the number of free inodes in the file system
o a cache of free inodes
o the index of the next free inode in the inode cache

• The kernel maintains the superblock in memory and periodically writes it back to disk. Because the superblock contains crucial information, it is replicated on disk in case part of the disk fails.
• When the OS wants to allocate an inode, it first looks in the inode cache. The inode cache is a stack of free inodes; the index points to the top of the stack. When the OS allocates an inode, it just decrements the index. If the inode cache is empty, it linearly searches the inode list on disk to find free inodes. An inode is free if its type field is 0. So when the OS searches the inode list for free inodes, it keeps looking until it wraps around or fills the inode cache in the superblock. It keeps track of where it stopped looking, and will start looking there next time.
• To free an inode, put it in the superblock's inode cache if there is room. If not, don't do much of anything beyond checking against the number recording where the OS stopped looking for inodes the last time it filled the cache: make that number the minimum of the freed inode number and the number already there.
• The OS stores the list of free disk blocks as follows. The list consists of a sequence of disk blocks. Each disk block in this sequence stores a sequence of free disk block numbers. The first number in each disk block is the number of the next disk block in this sequence; the rest of the numbers are the numbers of free disk blocks. The superblock holds the first disk block in this sequence.
• To allocate a disk block, check the superblock's block of free disk block numbers. If there are at least two numbers, grab the one at the top and decrement the index of the next free block. If there is only one number left, it contains the index of the next block in the disk block sequence.
Copy that disk block into the superblock's free disk block list and use it as the free disk block. To free a disk block, do the reverse. If there is room in the superblock's disk block, push the number on there. If not, write the superblock's disk block into the newly freed block, then put the number of the newly freed disk block in as the first number in the superblock's disk block.

• Note that the OS maintains a complete list of free disk blocks, but only a cache of free inodes. Why is this?
o The kernel can determine whether an inode is free just by looking at it, but it cannot do that with a disk block - any bit pattern is OK for a disk block.
o It is easy to store lots of free disk block numbers in one disk block, but inodes aren't large enough to store lots of inode numbers.
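The free-block chain described above can be modeled in a few lines. This is a toy sketch: the block numbers, the cache capacity, and the use of plain Python lists are all illustrative, not the on-disk format.

```python
# Toy model of the free-block chain: the superblock caches one block of
# free-block numbers, and slot 0 always names the next block in the chain
# (0 terminates it). All block numbers here are made up for illustration.
class FreeList:
    def __init__(self, cache, disk):
        self.cache = cache   # superblock's in-memory block of free numbers
        self.disk = disk     # chain blocks still on disk: number -> contents

    def alloc(self):
        if len(self.cache) > 1:
            return self.cache.pop()        # at least two numbers: take the top
        # Only the link is left: copy that block's contents into the
        # superblock and hand the link block itself out as the free block.
        link = self.cache[0]
        self.cache = self.disk.pop(link)
        return link

    def free(self, n, capacity=4):
        if len(self.cache) < capacity:
            self.cache.append(n)           # room in the superblock's block
        else:
            # Write the current cache into block n and start a new cache
            # whose first entry links to it.
            self.disk[n] = self.cache
            self.cache = [n]
```

Note how allocation consumes the link block itself after copying its contents, exactly as in the description above.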











o Users consume disk blocks faster than inodes, so pauses to search for inodes aren't as bad as searching for disk blocks would be.
o Inodes are small enough to read many of them in a single disk operation, so scanning lists of inodes is not so bad.

• Synchronizing multiple file accesses: what should the correct semantics be for concurrent reads and writes to the same file? Reads and writes should be atomic:
o If a read executes concurrently with a write, the read should observe either the entire write or none of it.
o Reads can execute concurrently with each other with no atomicity constraints.
• How do we implement these atomicity constraints? Implement reader-writer locks for each open file. Here are the operations:
o Acquire read lock: blocks until no other process has a write lock, then increments the read lock count and returns.
o Release read lock: decrements the read lock count.
o Acquire write lock: blocks until no other process has a write or read lock, then sets the write lock flag and returns.
o Release write lock: clears the write lock flag.
• Obtain read or write locks inside the kernel's system call handler. On a Read system call, obtain the read lock, perform all the file operations required to read the appropriate part of the file, then release the read lock and return. On a Write system call, do something similar, except with write locks.
• What about Create, Open, Close and Delete calls? If multiple processes have a file open and a process calls Delete on that file, all processes must close the file before it is actually deleted. So yet another form of synchronization is required.
• How should we organize the synchronization? Have a global file table in addition to the local (per-process) file tables. What does each file table do?
o Global File Table: Indexed by some global file id - for example, the inode index would work. Each entry has a reader/writer lock, a count of the number of processes that have the file open, and a bit that says whether or not to delete the file when the last process that has it open closes it.
It may have other data, depending on what other functionality the file system supports.
o Local File Table: Indexed by the open file id for that process. Holds a pointer to the current position in the open file, where the next Read or Write operation will start.

• For your Nachos assignments you do not have to implement reader/writer locks - you can just use a simple mutual exclusion lock.
• What are the sources of inefficiency in this file system? There are two kinds: wasted time and wasted space.
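The reader/writer lock operations listed earlier can be sketched with Python's threading primitives. This is a minimal version under the stated semantics only; it ignores writer starvation and fairness.

```python
import threading

# Sketch of the reader/writer lock described above: readers share the file,
# a writer excludes everyone. No fairness or writer preference.
class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # read lock count
        self._writer = False     # write lock flag

    def acquire_read(self):
        with self._cond:
            while self._writer:              # block while a write lock is held
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

A kernel's Read handler would bracket its work with acquire_read/release_read, and Write with the write-lock pair, just as described above.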







Wasted time comes from waiting to access the disk. The basic problem with the system described above is that it scatters related items all around the disk:
o Inodes are separated from the files they describe.
o Inodes in the same directory may be scattered around inode space.
o The disk blocks that store one file are scattered around the disk.
So the system may spend all of its time moving the disk heads and waiting for the disk to revolve.



The initial layout attempts to minimize these phenomena by setting up the free lists so that they allocate consecutive disk blocks for new files. So files tend to be consecutive on disk. But as the file system is used, the layout gets scrambled: the free list order becomes increasingly randomized, and the disk blocks for files get spread all over the disk.

Just how bad is it? Well, in traditional Unix the disk block size equaled the sector size, which was 512 bytes. When they went from 3BSD to 4.0BSD they doubled the disk block size. This more than doubled the disk performance. Two factors:
o Each block access fetched twice as much data, so it amortized the disk seek overhead over more data.
o The file blocks were bigger, so more files fit into the direct section of the inode index.
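The amortization argument can be made concrete with a toy model: each access pays a fixed positioning cost (seek plus rotation) and then transfers one block. The 30 ms positioning cost and 500 bytes/ms transfer rate below are illustrative assumptions, not measurements from any particular disk.

```python
# Back-of-the-envelope model of seek amortization: every block access pays
# a fixed positioning cost before transferring block_bytes of data.
def effective_bandwidth(block_bytes, seek_ms=30.0, rate_bytes_per_ms=500.0):
    transfer_ms = block_bytes / rate_bytes_per_ms
    return block_bytes / (seek_ms + transfer_ms)   # bytes per millisecond
```

As long as the positioning cost dominates the transfer time, doubling the block size nearly doubles the delivered bandwidth, which is the effect 4.0BSD observed.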



But it was still pretty bad. When the file system was first created, it got transfer rates of up to 175 KBytes per second. After a few weeks this deteriorated down to 30 KBytes per second. What is worse, this is only about 4 percent (!!!!) of the maximum disk throughput. So the obvious fix is to make the block size even bigger.

• Wasted space comes from internal fragmentation. Each file with anything in it (even a small one) takes up at least one disk block. So if the file size is not an even multiple of the disk block size, there will be wasted space off the end of the last disk block in the file. And since most files are small, there may not be lots of full disk blocks in the middle of files.

Just how bad is it? It gets worse for larger block sizes (so maybe making the block size bigger to get more of the disk transfer rate isn't such a good idea...). Measurements on a file system at Berkeley calculated the size and percentage of waste based on disk block size. Here are some numbers:

Space Used (MBytes)   Percent Waste   Organization
775.2                 0.0             Data only, no separation between files
828.7                 6.9             Data + inodes, 512 byte block
866.5                 11.8            Data + inodes, 1024 byte block
948.5                 22.4            Data + inodes, 2048 byte block
1128.3                45.6            Data + inodes, 4096 byte block

Notice that a problem is that the presence of small files kills large file performance. If the system only had large files, we would make the block size large and amortize the seek overhead down to some very small number. But small files take up a full disk block, and large disk blocks waste space.

In 4.2BSD they attempted to fix some of these problems. They introduced the concept of a cylinder group. A cylinder group is a set of adjacent cylinders. A file system consists of a set of cylinder groups. Each cylinder group has a redundant copy of the super block, space for inodes, and a bit map describing the available blocks in the cylinder group. Default policy: allocate 1 inode per 2048 bytes of space in the cylinder group.

The basic idea behind cylinder groups: put related information together in the same cylinder group and unrelated information apart in different cylinder groups. Use a bunch of heuristics. Try to put all inodes for a given directory in the same cylinder group. Also try to put the blocks for one file adjacent in the cylinder group. The bitmap as a storage device makes it easier to find adjacent groups of blocks. For long files, redirect blocks to a new cylinder group every megabyte. This spreads stuff out over the disk at a large enough granularity to amortize the seek time.

An important point to making this scheme work well: keep a free space reserve (5 to 10 percent). Once above this reserve, only the supervisor can allocate disk blocks. If the disk is almost completely full, the allocation scheme cannot keep related data together and degenerates to random allocation.

They also increased the block size. The minimum block size is now 4096 bytes. This helps read and write bandwidth for big files. But doesn't it waste a lot of space for small files? Solution: introduce the concept of a disk block fragment. Each disk block can be chopped up into 2, 4, or 8 fragments. Each file contains at most one fragment, which holds the last part of the data in the file. So if you have 8 small files, together they occupy only one disk block. The system can also allocate larger fragments if the end of the file is larger than one eighth of the disk block. The bit map is laid out at the granularity of fragments. When the size of the file increases, the system may need to copy out the last fragment if it gets too big. So it may copy a file multiple times as the file grows. The Unix utilities try to avoid this problem by growing files a disk block at a time.

Bottom line: this helped a lot - read bandwidth went up to 43 percent of the peak disk transfer rate for large files.
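The fragment scheme reduces to a simple allocation rule: whole blocks for the body of the file, with the tail rounded up to whole fragments. The sketch below fixes one fragment size (a quarter block) for simplicity; as noted above, FFS also allowed halves and eighths.

```python
# Sketch of the 4.2BSD fragment rule: full 4096-byte blocks for the body
# of the file, and the tail rounded up to 1024-byte fragments.
def allocated_bytes(size, block=4096, frag=1024):
    full_blocks = size // block
    tail = size - full_blocks * block
    frags = (tail + frag - 1) // frag          # round tail up to fragments
    return full_blocks * block + frags * frag
```

A 100-byte file now costs one fragment instead of a whole block, so four such files share a block here (eight of them, with eighth-size fragments).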

• Another standard mechanism that can really help disk performance is a disk block cache. The OS maintains a cache of disk blocks in main memory. When a request comes in, it can satisfy the request locally if the data is in the cache. This is part of almost any IO system in a modern machine and can really help performance.
• How does the caching algorithm work? Devote part of main memory to cached data. When reading a file, put its blocks into the disk block cache. Before reading a file, check to see if the appropriate disk blocks are already in the cache.
• What about the replacement policy? We have many of the same options as for paging algorithms: LRU, FIFO with second chance, etc. How easy is it to implement LRU for disk blocks? Pretty easy - the OS gets control every time a disk block is accessed, so it can implement an exact LRU algorithm easily. How easy was it to implement an exact LRU algorithm for virtual memory pages? How easy was it to implement an approximate LRU algorithm for virtual memory pages? Bottom line: a different context makes different cache replacement policies appropriate for disk block caches.
• What is the bad case for all LRU algorithms? Sequential accesses. What is the common case for file access? Sequential accesses. How to fix this? Use free-behind for large sequentially accessed files - as soon as the system finishes reading one disk block and moves to the next, it ejects the first disk block from the cache. So what cache replacement policy do you use? The best choice depends on how the file is accessed, so the policy choice is difficult because the system may not know.
• Can use read-ahead to improve file system performance. Most files are accessed sequentially, so the system can optimistically prefetch disk blocks ahead of the one that is being read. Prefetching is a general technique used to increase the performance of fetching data from long-latency devices: it tries to hide latency by running something else concurrently with the fetch.
• With disk block caching, physical memory serves as a cache for the files stored on disk.
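An exact LRU block cache of the kind described above takes only a few lines with an ordered map; block contents are elided here, since only the replacement bookkeeping matters.

```python
from collections import OrderedDict

# Minimal exact-LRU disk block cache. Keys are block numbers; the values
# would be the block contents (stand-ins here).
class BlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # oldest entry first

    def access(self, blockno):
        """Touch a block; return True on a hit, False on a miss."""
        hit = blockno in self.blocks
        if hit:
            self.blocks.move_to_end(blockno)     # now most recently used
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[blockno] = None          # stand-in for block data
        return hit
```

This exactness is exactly what the OS can afford for disk blocks (it sees every access) but not for memory pages, where it must approximate.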
• With virtual memory, physical memory serves as a cache for processes stored on disk. So we have one physical resource shared by two parts of the system. How much of the resource should the file cache and virtual memory each get?
o Fixed allocation: each gets a fixed amount. Problem - not flexible enough for all situations.
o Adaptive: if the machine runs an application that uses lots of files, give more space to the file cache; if it runs applications that need more memory, give more to the virtual memory subsystem. SunOS does this.

• How should writes be handled? Can you avoid going to disk on writes? Possible answers:
o No - the user wants the data on stable storage; that's why he wrote it to a file.

o Yes - keep the data in memory for a short time, and you can get big performance improvements. Maybe the file is deleted, so you never need to touch the disk at all; this is especially useful for /tmp files. Or the system can batch up lots of small writes into one larger write, or give the disk scheduler more flexibility. In general it depends on the needs of the system.
• One more question - do you keep data that has been written back to disk in the file cache? Probably - it may be read in the near future, so the system should keep it resident locally.
• One common problem with file caches: if the system uses the file system as backing store, it can run into double caching. It ejects a page and the page gets written back to a file, but disk blocks from recently written files may be cached in memory in the file cache. In effect, file caching interferes with the performance of the virtual memory system. Fix this by not caching backing store files.
• An important issue for file systems is crash recovery. The system must maintain enough information on disk to recover from crashes. So modifications must be carefully sequenced to leave the disk in a recoverable state at all times.



An old homework problem
Consider a virtual memory system with 32-bit addresses and 8 KB pages. Each page table entry is 2 bytes long, and there is one level of page table. The entire 32-bit address space is available to processes. Consider a process that has allocated the top and bottom 128 MB of its address space. How much memory does its page table use? (Note that 1 MB = 1048576 bytes.) Now consider the same process in a two-level paging system with the same size pages, but with page identifiers of 10 and 9 bits for the first and second levels of the page table. How much memory does this page table consume for the same process?

Answer: The first page table takes the same amount of space for all processes. 8K pages imply that the offset takes up 13 bits, leaving 19 bits for the page identifier. There are 2^19 two-byte entries, which is 2^20 = 1048576 bytes (or 1 MB). The second scheme requires one main page table of 2^10 two-byte entries, which takes up 2048 bytes (or 2 KB). Each secondary page table addresses 2^9 = 512 pages. Because each page is 8K, each secondary page table addresses 4194304 bytes (4 MB) of data. Each 128 MB of data requires 32 secondary page tables, for a total of 64 secondary page tables, each of 2^9 two-byte entries (1024 bytes). The whole page table takes 2048 + 64 × 1024 = 67584 bytes, or 66 KB.

Another old homework problem

Consider a virtual memory system. This system also has a virtual memory hardware cache on the processor. The hit rates and service times for the layers of the memory system are:

Caching level   Hit Rate   Service Time
CPU Cache       90%        1 ns

Main Memory     75%        10 ns
Page Fault      100%       10 milliseconds (includes translation and retrieval)

The layers are tried in order, so a main memory hit only occurs on a cache miss; a main memory hit is therefore (0.1 × 0.75) = 7.5% of all memory references. What is the average service time of a memory access on this machine? Note that the service time for a layer is paid on a hit or a miss: the time to serve a cache miss that hits in main memory is the service time of the cache plus the service time of the memory. The service time must be paid to determine that the item you want is not at this level.

Which improves memory performance more: improving the processor cache service time by 10% (a 0.9 ns lookup time), or increasing the main memory hit rate by 5% (to 80%) by choosing a better page replacement algorithm? Calculate the average memory access time for both changes.

Answer: The absolute rates are:

Caching level   Total Hit Rate             Service Time
Cache           90%                        1 ns
Main Memory     (0.1 × 0.75) = 7.5%        10 ns
Page Fault      (0.1 × 0.25 × 1) = 2.5%    10 ms

So the average memory access time is

access time = 0.9(1 × 10^-9) + 0.075(1 × 10^-9 + 10 × 10^-9) + 0.025(1 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            = 1(1 × 10^-9) + 0.1(10 × 10^-9) + 0.025(10 × 10^-3)
            = 250002 × 10^-9 s = 250.002 μs

Reducing the cache service time to 0.9 ns gives a memory access time of

access time = 0.9(0.9 × 10^-9) + 0.075(0.9 × 10^-9 + 10 × 10^-9) + 0.025(0.9 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            = 1(0.9 × 10^-9) + 0.1(10 × 10^-9) + 0.025(10 × 10^-3)
            = 250001.9 × 10^-9 s = 250.0019 μs

Increasing the main memory hit rate to 80% gives

access time = 0.9(1 × 10^-9) + 0.08(1 × 10^-9 + 10 × 10^-9) + 0.02(1 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            = 1(1 × 10^-9) + 0.1(10 × 10^-9) + 0.02(10 × 10^-3)
            = 200002 × 10^-9 s = 200.002 μs

Take the increased hit rate.
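The arithmetic above can be checked mechanically. The probabilities and service times are the numbers from the problem; times are in nanoseconds.

```python
# Check of the access-time arithmetic above: cache hits 90% at 1 ns, main
# memory serves 75% of cache misses at 10 ns, and the rest page-fault at
# 10 ms (10**7 ns). Every level's service time is paid on the way down.
def avg_access_ns(cache_hit, mem_hit, t_cache, t_mem, t_fault):
    p_mem = (1 - cache_hit) * mem_hit            # e.g. 0.1 * 0.75 = 7.5%
    p_fault = (1 - cache_hit) * (1 - mem_hit)    # e.g. 0.1 * 0.25 = 2.5%
    return (cache_hit * t_cache
            + p_mem * (t_cache + t_mem)
            + p_fault * (t_cache + t_mem + t_fault))
```

The comparison falls out immediately: the faster cache saves a fraction of a nanosecond, while the better replacement algorithm removes a fifth of the page-fault cost.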

File Systems
File systems provide applications with permanent storage. More than that, they organize and protect data, and provide a clean interface for manipulating that data. It's no exaggeration to say that providing a file system is one of the major services of general purpose operating systems, and of less general ones as well. (Even the Palm Pilot has permanent storage.)

Files

A file is a persistent, hardware-independent, named, protected collection of bits, together with a collection of operations that can be executed on them. The access operations generally impose an order on the bits. These attributes define what files are used for.

Persistence implies that the bytes have a meaning that extends in time. Memory used in calculating intermediate results doesn't have that attribute: one wouldn't store the memory used in a computation in a file, because it has no long-term use. Because the data in files has this long-term significance, files are stored on more permanent media. These days the most common medium is still magnetic disk, although several others are making bids. Some other media that can contain files are memory, flash memory[1], tapes, CD-ROMs, and more esoteric media. Basically anything that can hold information permanently and be read by a computer has held a file system, or will eventually.

By definition, files are largely medium-independent. The same operations are generally allowed on files regardless of the underlying storage medium. There are obvious exceptions - you can write to a CD-ROM at most once, and there are obvious drawbacks to trying to move from a byte at the front of a tape to one at the back. In general, however, code that manipulates files on one medium will work on others. This saves a lot of programmer time, as we'll see.

Finally, file systems provide a way to name files. This is a seemingly simple function that turns out to be enormously powerful.[2] File systems provide ways to name files that span multiple media on the same machine (the UNIX file system), loosely connected local area networks (the Network File System (NFS)), and even global name spaces (the Andrew File System (AFS)). Providing a name space outside the confines of memory addresses allows processes to share data and communicate.

Because files are outside memory, they are also outside the protection of the memory protection system.[3]
As a result, the file system has to impose ideas of user identity and related privileges on the data.

The File Abstraction

Although the idea of an abstract named collection of bytes is easy enough to grasp, the Devil is in the details. Despite the fairly simple idea of what a file is, files on different operating systems can be remarkably different. We'll discuss the variations in the file abstraction along the following axes:
• Naming
• Data Structure and Access Patterns
• File Types
• Attributes
• Operations

1. Flash memory holds the data stored in it even when the power is off. So does core memory, but your generation will only see that in museums.
2. Magicians and conjurers have long believed that to know a thing's name gives a person power over that thing. So it is with computers. Naming information and the operations thereon is the heart of computer science.

3. Unless we put them there, like Multics does. Even then, the initial permissions on the memory segments are derived from the files themselves.

Naming

A file generally has a name, a string of bits that (usually) correspond to human-readable letters. The operating system defines what characters are valid in file names and any equivalence classes between them. For example, UNIX allows any byte except hex 0x2f (ASCII for /) to appear in a filename. MS-DOS limits the character set to uppercase letters and a few symbols (letters are converted to uppercase in all file names). AmigaDOS allows you to specify any capitalization but internally ignores case: "File", "file", and "FILE" all refer to the same file, although it will appear in directory listings spelled as the creator of the file spelled it.[4]

Beyond the character set, operating systems impose a structure on the names of files. Under UNIX the restraints are minor - filenames can't contain /.[5] Compare this to MS-DOS (and the Windows systems that basically sit atop it), which requires an 8-character file name and a three-character extension. (Systems other than MS-DOS have the idea of an extension, or a naming convention for related files.)

Different operating systems depend to differing extents on the structure of file names. MS-DOS defines an executable file by its extension, while UNIX generally takes the extension as a hint. Other programs, like compilers, place varying degrees of emphasis on file names. For example, gcc uses the file extension to determine which of the languages it supports should be used to compile the source file, although the behavior can be overridden.

File Structure

In its simplest form, often called a flat file, a file is a collection of ordered bytes. Some systems, however, place additional structure on files. For example, files under some operating systems consist of records, or collections of bytes. Rather than reading single bytes or seeking to arbitrary offsets, such files are always accessed in terms of records.
The records can be fixed length or variable. Arranging a file as records implies the existence of a schema (a description of the record), either embedded in the file in a manner that the OS can read, or kept separately in the system (perhaps elsewhere in the file system). Many people think of record-based files as database entries, and that's one common use of them. Another common type of record-based file was the card file - a file that was a sequence of 80-character records, the electronic equivalent of a stack of punched cards or of printer lines.

Record-based files may display an ordering that is independent of the way the bits are ordered on the underlying storage medium. The internal structure of the file may reflect this possibility and be significantly more complex than a flat file. We will discuss the details of this when we discuss the implementation of file systems.

Record-based files impose a structure on the data and allow the operating system to keep that structure intact. The flip side is that record-based files are less flexible: translating between formats, or adding a new format, is frequently difficult.
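Fixed-length records of the card-file kind are easy to illustrate with Python's struct module. The 80-byte record format is the card-image example from the text; the blank-padded contents are made up.

```python
import struct

# Fixed-length records, card-file style: the "file" is a sequence of
# 80-byte card images, accessed by record number, never by byte offset.
RECORD = struct.Struct("80s")

def read_record(data, n):
    start = n * RECORD.size                    # record n begins here
    (card,) = RECORD.unpack_from(data, start)
    return card.rstrip(b" ")                   # strip the card's blank padding
```

In a true record-based file system this boundary arithmetic lives in the OS, not the application; the sketch only shows what the abstraction looks like.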

I should note that record-based access is provided by and enforced by the operating system. It's not an application convention (although many applications create the illusion of record-based files in a flat file system). Record-based file systems cannot seek to a specific byte in the file or ask for a set of bytes that spans a record boundary. Records are the fundamental building block of files in such systems, in the same way that bytes are the fundamental building blocks of flat files. (Alternatively, flat files are files with 1-byte unformatted records.)

Related to the building blocks of the files is the access method that a file supports. The access methods are an abstraction of the underlying hardware. A file system that supports only sequential access requires all files to be read or written from the first to the last byte only. Such access methods are appropriate for files residing on a magnetic tape, for example.

Fi(e" on di"0 or C/-ROM
Note well the distinction between an application's access pattern and the access method allowed by the OS. An application may sequentially read a configuration file that supports random access: the operating system allows any access pattern, but the application chose to read it sequentially. The reverse cannot be done; a file that supports only sequential access cannot have its bytes read in another order. Other access methods include read-only, for unmodifiable files, or indexed, for record-based files that have been sorted multiple ways (the OS must support that, of course).

File Types

Files have various uses: they hold data for various programs, programs themselves, free-form text intended to be read by humans, and other things. Some operating systems allow the contents of a file to be directly encoded as a file type. File types can be an explicit piece of information remembered by the operating system (a file attribute - see below) or can be a combination of name, permissions (another attribute - see below), and contents. How types are encoded in the system and how rigidly type restrictions are enforced control how type-based the file system is.

Strongly-typed operating systems require file types to be encoded in every file and enforce restrictions. As a result, files generally have only one function, and their use can be easily controlled. Older business computers, especially mainframes, had strongly-typed systems. The advantage is that file use, and the procedures that underlie it, can be tightly constrained. If files that can be printed as checks can only be created by a few trusted programs, it makes forging a check more difficult. It also makes creating one for a legitimate purpose that the system hasn't been programmed for more difficult.

Other general purpose systems, like UNIX and Windows, rely on a combination of filenames, permissions, and file contents to provide typing. These systems generally only care about determining whether a file is executable (that is, contains a program that can be run) or not, leaving the business of discriminating between data types to the applications that use them. (Applications use similar criteria to differentiate.)

UNIX considers a file executable if the user has the right to execute the file encoded in the permissions, and if the file is correctly formatted as an executable. Formatting is generally checked by a magic number in the first few bytes of the file. MS-DOS relies on the file extension and the internal format. Check out the file command in UNIX for a list of some of the magic numbers used by modern versions of UNIX. How intrinsic types are in the file system is a tradeoff between codifying practices in the OS and allowing adaptability.

Permissions and Attributes

Permissions encode the operations allowable on a file and which users[2] are allowed to perform them. You can think of this as a list of all the possible operations on a file and the users allowed to perform each one. In practice the representations are smaller and the listings less exhaustive. Consider the UNIX file permissions: each file has an owning user and group, and the rights to read, write, and execute the file are controlled for each of those entities and for other users. For example, a file might be readable and writable by its owner, readable by members of its group, and not accessible to other users. We will talk more about permissions in a few days, but right now the owner, group, and permissions are interesting as attributes of the file.

Attributes are meta-data: that is, data about the file itself, not the data within it. Permissions are a good example: they control what processes may access the underlying data of the file, but are independent of that data.

Some other common attributes are:
• creator
• owner
• system file flag
• hidden flag
• temporary flag
• creation time
• modification time
• information about the last backup
• lock information
• current size
• maximum size

1. Again, this is an OS feature, not a simulation by the application. The read system calls return the records in sorted order.
2. Well, processes acting on behalf of users. We'll discuss it in a little while.
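The owner/group/other scheme reduces to a small decision procedure, shown here with the standard permission-bit constants. This is a simplification: a real kernel also considers the superuser and supplementary groups, which this sketch ignores.

```python
import stat

# Simplified UNIX-style check for the write right: the owner's bits take
# precedence, then the group's, then everyone else's.
def can_write(mode, is_owner, in_group):
    if is_owner:
        return bool(mode & stat.S_IWUSR)   # owner write bit
    if in_group:
        return bool(mode & stat.S_IWGRP)   # group write bit
    return bool(mode & stat.S_IWOTH)       # other write bit
```

For the example above - a mode of 0o644, owner read/write, group and others read-only - only the owner may write.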

Meta-data is used for a variety of reasons. Some of it is for human use: a data provenance. Some of it is for internal OS use, for example the backup flags. The ability of the file system to store both data and data about the data is an important aspect of the system. For example, the make utility would be useless without the meta-data telling the program the relative ages of the files.

File Operations

File operations are a generally simple and intuitive set of operations on files. Not all of these are supported by all file systems, and in some file systems certain operations are implemented in terms of others. Still, these give a good feel for file operations.

Create: Create a new file. This may allocate space for the file or just reserve the name for future action.

Delete: Delete an existing file. The ability to delete a file is distinct from (but often related to) the ability to modify its contents. Some operating systems use the existence of a file to start a service, so the existence of such a file is its most important attribute and should therefore be protected.

Open: This lets the OS know that the current process will be interested in a file soon. In some sense it's extraneous, but in cases where the file resides on a medium that has significant startup cost (a robotic tape cabinet), not returning from the open call until the file is ready for access is a good idea. Your Nachos work should give you an idea about some of the OS setup work that's done here.

Close: Let the OS know that the process is done with this file and that the OS can reclaim the resources allocated to manipulating it. (The data and meta-data are updated but remain in the file system, of course.) Some systems delay writes or cache data for future reads; close is an indication to them that pending writes must be flushed and that cached reads can be discarded.

Seek: Files have a notion of the current byte (or record) of the file that will next be accessed. On files that can be randomly accessed, seek allows the calling process to set the current byte (or record).

Read: Get some data from the file.
The OS may take this opportunity to predict future behavior, collect statistics, or do other support work, in addition to bringing the data from secondary storage into the process's address space. Reading occurs at the current byte/record.

Write: Write some data to the file. Strictly, this means changing existing data in the file to a new value, but many systems also use the write system call to append data. Like read, there may be secondary actions associated with this. Writing occurs at the current byte/record. Note that reading and writing may cache data, and such caches have to be coordinated so that all processes see a consistent version of the file.

Append: Add data to the end of the file. This shares aspects with write, but includes the idea that the underlying file is changing size: append means that the OS must increase the allocation of storage to the file. As I mentioned above, some OSes determine whether a write system call causes an append or a write based on the current byte and the length of the buffer written.

Get/Set Attributes: For those attributes that can be modified directly by users (for example, the backup flag), this provides access.

Rename: Change a name of the file in the file system. This may be an operation on the file or on a directory, depending on the file system.
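The operations above map directly onto POSIX-style calls. A minimal sequence, using the os module on a throwaway temp file:

```python
import os
import tempfile

# The operation list above as actual system calls, in order.
def demo():
    fd, path = tempfile.mkstemp()      # Create + Open
    try:
        os.write(fd, b"hello world")   # Write at the current position (0)
        os.lseek(fd, 6, os.SEEK_SET)   # Seek: set the current byte to 6
        data = os.read(fd, 5)          # Read 5 bytes from the current byte
    finally:
        os.close(fd)                   # Close: release OS resources
        os.unlink(path)                # Delete: remove the name (and file)
    return data
```

Note how read and write both operate at the implicit current position that seek manipulates, exactly as described above.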

Memory Mapping Files
As we discussed in the memory management unit, it is sometimes convenient to move a file into memory and access it directly there. The easiest way to do this is to make the file on disk (or whatever) the backing store for a segment (or for a section of a paged VM space) and let the paging system handle the writes. The big issue here is consistency: when do changes in memory get reflected to the file in secondary storage? The paging system is probably pretty lazy about getting changes out to a file compared to a file write, but writing a whole page out on every memory write is probably too slow. One solution is to keep track of which files are memory mapped (as an attribute) and serve file reads of such files from the memory system rather than the file.

Putting Devices in the File System

The simple, powerful semantics of files lend themselves to controlling a variety of resources. For example, terminal input can be thought of as a sequential read-only file and terminal output as a sequential write-only file. This means that programs can be written to take an input and output file and transparently run interactively or from disk files. Furthermore, by introducing memory-resident sequential files that are written by one process and read by another, called pipes, programs can be linked together in arbitrary ways. The OS has to do extra work to make a terminal look like a file or to create a pipe, but the result is a simpler programming model for developers: I/O is file I/O.

UNIX takes this to an extreme: nearly every OS resource has a file system interface. Physical disk drives are a special type of file; so are terminals, modems, and printers. All of these can be accessed through the familiar file system interface. More recently, the data of running processes has been added to the list of data accessible through the file system. Furthermore, the access control mechanisms of files can be directly applied to prevent unauthorized manipulations of the hardware. Raw disk drives or modem devices can only be accessed by privileged users, and those accesses are controlled by the same system as file accesses. A unified access control system is easier to use, and only having one to debug implies that it will be more secure. All is not a bed of roses, though: hardware has features that are not easily expressed as file operations.
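The memory-mapped file idea above can be sketched with Python's mmap module, which uses the OS mapping facility underneath; ordinary memory writes land in the file once the mapping is flushed (the file name is made up for illustration):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mapped.dat")
with open(path, "wb") as f:
    f.write(b"X" * 4096)               # a one-page file to use as backing store

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)       # map the whole file into memory
    m[0:5] = b"hello"                  # an ordinary memory write...
    m.flush()                          # ...pushed out to secondary storage here
    m.close()

with open(path, "rb") as f:
    first = f.read(5)
print(first)  # b'hello'
```

The explicit flush call is exactly the consistency point the text raises: until the paging system writes the dirty page back, the file and the memory image can disagree.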


CHAPTER 10: DIRECTORIES AND SECURITY
The terms protection and security are often used together, and the distinction between them is a bit blurred, but security is generally used in a broad sense to refer to all concerns about controlled access to facilities, while protection describes the specific technological mechanisms that support security.

Security
As in any other area of software design, it is important to distinguish between policies and mechanisms. Before you can start building machinery to enforce policies, you need to establish what policies you are trying to enforce. Many years ago I heard a story about a software firm that was hired by a small savings and loan corporation to build a financial accounting system. The chief financial officer used the system to embezzle millions of dollars and fled the country. The losses were so great the S&L went bankrupt, and the loss of the contract was so bad the software company also went belly-up. Did the accounting system have a good or bad security design? The problem wasn't unauthorized access to information, but rather authorization to the wrong person. The situation is analogous to the old saw that every program is correct according to some specification. Unfortunately, we don't have the space here to go into the whole question of security policies. We will just assume that terms like "authorized access" have some well-defined meaning in a particular context.

Threats

Any discussion of security must begin with a discussion of threats. After all, if you don't know what you're afraid of, how are you going to defend against it? Threats are generally divided into three main categories:

• Unauthorized disclosure. A "bad guy" gets to see information he has no right to see (according to some policy that defines "bad guy" and "right to see").
• Unauthorized updates. The bad guy makes changes he has no right to make.
• Denial of service. The bad guy interferes with legitimate access by other users.

There is a wide spectrum of denial-of-service threats. At one end, it overlaps with the previous category: a bad guy deleting a good guy's file could be considered an unauthorized update. At the other end of the spectrum, blowing up a computer with a hand grenade is not usually considered an unauthorized update. As this second example illustrates, some denial-of-service threats can only be countered by physical security. No matter how well your OS is designed, it can't protect my files from his hand grenade. Another form of denial-of-service threat comes from unauthorized consumption of resources, such as filling up the disk, tying up the CPU with an infinite loop, or crashing the system by triggering some bug in the OS. While there are software defenses against these threats, they are generally considered in the context of other parts of the OS rather than security and protection. In short, discussions of software mechanisms for computer security generally focus on the first two threats. In response to these threats, countermeasures also fall into various categories. As programmers, we tend to think of technological tricks, but it is also important to realize that a

complete security design must involve physical components (such as locking the computer in a secure building with armed guards outside) and human components (such as a background check to make sure your CFO isn't a crook, or checking to make sure those armed guards aren't taking bribes).

The Trojan Horse

Break-in techniques come in numerous forms. One general category of attack that comes in a great variety of disguises is the Trojan Horse scam. The name comes from Greek mythology. The ancient Greeks were attacking the city of Troy, which was surrounded by an impenetrable wall. Unable to get in, they left a huge wooden horse outside the gates as a "gift" and pretended to sail away. The Trojans brought the horse into the city, where they discovered that the horse was filled with Greek soldiers, who defeated the Trojans to win the Rose Bowl (oops, wrong story). In software, a Trojan Horse is a program that does something useful -- or at least appears to do something useful -- but also subverts security somehow. In the personal computer world, Trojan horses are often computer games infected with "viruses."

Here's the simplest Trojan Horse program I know of. Log onto a public terminal and start a program that does something like this:

    print("login: ");
    name = readALine();
    turnOffEchoing();
    print("password: ");
    passwd = readALine();
    sendMail("badguy", name, passwd);
    print("login incorrect");
    exit();

A user walking up to the terminal will think it is idle. He will attempt to log in, typing his login name and password. The Trojan Horse program sends this information to the bad guy, prints the message login incorrect, and exits. After the program exits, the system will generate a legitimate login: prompt, and the user, thinking he mistyped his password (a common occurrence, because the password is not echoed), will try again, log in successfully, and have no suspicion that anything was wrong.
Note that the Trojan Horse program doesn't actually have to do anything useful, it just has to appear to.

Design Principles

1. Public Design: A common mistake is to try to keep a system secure by keeping its algorithms secret. That's a bad idea for many reasons. First, it gives a kind of all-or-nothing security: as soon as anybody learns about the algorithm, security is all gone. In the words of Benjamin Franklin, "Two people can keep a secret if one of them is dead." Second, it is usually not that hard to figure out the algorithm, by seeing how the system responds to various inputs, decompiling the code, etc. Third, publishing the algorithm can have beneficial effects. The bad guys have probably already figured out your algorithm and found its weak points. If you publish it, perhaps some good guys will notice bugs or loopholes and tell you about them so you can fix them.

2. Default = No Access: Start out by granting as little access as possible and adding privileges only as needed. If you forget to grant access where it is legitimately needed, you'll soon find out about it. Users seldom complain about having too much access.

3. Timely Checks: Checks tend to "wear out." For example, the longer you use the same password, the higher the likelihood it will be stolen or deciphered. Be careful: this principle can be overdone. Systems that force users to change passwords frequently encourage them to use particularly bad ones. A system that forced users to supply a password every time they wanted to open a file would inspire all sorts of ingenious ways to avoid the protection mechanism altogether.

4. Minimum Privilege: This is an extension of point 2. A person (or program, or process) should be given just enough powers to get the job done. In other contexts this principle is called "need to know." It implies that the protection mechanism has to support fine-grained control.

5. Simple, Uniform Mechanisms: Any piece of software should be as simple as possible (but no simpler!) to maximize the chances that it is correctly and efficiently implemented. This is particularly important for protection software, since bugs are likely to be usable as security loopholes. It is also important that the interface to the protection mechanisms be simple, easy to understand, and easy to use. It is remarkably hard to design good, foolproof security policies; policy designers need all the help they can get.

6. Appropriate Levels of Security: You don't store your best silverware in a box on the front lawn, but you also don't keep it in a vault at the bank. The US Strategic Air Defense calls for a different level of security than my records of the grades for this course. Not only does excessive security mechanism add unnecessary cost and performance degradation, it can actually lead to a less secure system.
If the protection mechanisms are too hard to use, users will go out of their way to avoid using them.

Authentication

Authentication is a process by which one party convinces another of its identity. A familiar instance is the login process, through which a human user convinces the computer system that he has the right to use a particular account. If the login is successful, the system creates a process and associates with it the internal identifier that identifies the account. Authentication occurs in other contexts, and it isn't always a human being that is being authenticated. Sometimes a process needs to authenticate itself to another process. In a networking environment, a computer may need to authenticate itself to another computer. In general, let's call the party that wants to be authenticated the client and the other party the server.

One common technique for authentication is the use of a password. This is the technique used most often for login. There is a value, called the password, that is known to both the server and to legitimate clients. The client tells the server who he claims to be and supplies the password as proof. The server compares the supplied password with what he knows to be the true password for that user. Although this is a common technique, it is not a very good one. There are lots of things wrong with it.

Direct attacks on the password

The most obvious way of breaking in is a frontal assault on the password: simply try all possible passwords until one works. The main defense against this attack is the time it takes to try

lots of possibilities. If the client is a computer program (perhaps masquerading as a human being), it can try lots of combinations very quickly, but if the password is long enough, even the fastest computer cannot succeed in a reasonable amount of time. If the password is a string of 8 letters and digits, there are 2,821,109,907,456 possibilities. A program that tried one combination every millisecond would take 89 years to get through them all. If users are allowed to pick their own passwords, however, they are likely to choose "cute doggie names," common words, names of family members, etc. That cuts down the search space considerably. A password cracker can go through dictionaries, lists of common names, etc. It can also use biographical information about the user to narrow the search space. There are several defenses against this sort of attack:

• The system chooses the password. The problem with this is that the password will not be easy to remember, so the user will be tempted to write it down or store it in a file, making it easy to steal. This is not a problem if the client is not a human being.
• The system rejects passwords that are too "easy to guess." In effect, it runs a password cracker when the user tries to set his password and rejects the password if the cracker succeeds. This has many of the disadvantages of the previous point. Besides, it leads to a sort of arms race between crackers and checkers.
• The password check is artificially slowed down, so that it takes longer to go through lots of possibilities. One variant of this idea is to hang up a dial-in connection after three unsuccessful login attempts, forcing the bad guy to take the time to redial.
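The search-space arithmetic above is easy to check: 36 symbols (26 letters plus 10 digits) in each of 8 positions, at one guess per millisecond:

```python
# 8-character passwords drawn from 26 letters + 10 digits
combinations = 36 ** 8
print(combinations)                      # 2821109907456

# one guess per millisecond -> total seconds -> years
seconds = combinations / 1000
years = seconds / (60 * 60 * 24 * 365)
print(round(years))                      # 89
```

Doubling the length to 16 characters pushes the same attack to about 2.5e11 years, which is why length matters far more than cleverness in a password.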





Ea;e"droppin C This is a far bigger program for passwords than brute force attac!s. #n comes in many disguises. • • (oo!ing over someone;s shoulder while he;s typing his password. 0ost systems turn off echoing or echo each character as an asteris! to mitigate this problem. 5eading the password file. #n order to verify that the password is correct the server has to have it stored somewhere. #f the bad guy can somehow get access to this file he can pose as anybody. /hile this isn;t a threat on its own (after all why should the bad guy have access to the password file in the first placeL) it can magnify the effects of an e.isting security lapse. &ni. introduced a clever fi. to this problem that has since been almost universally copied. &se some hash function f and instead of storing password store f(password). The hash function should have two properties4 (i!e any hash function it should generate all possible result values with roughly e%ual probability and in addition it should be very hard to invert--that is given f(password) it should be hard to recover password. #t is %uite easy to devise functions with these properties. /hen a client sends his password the server applies f to it and compares the result with the value stored in the password file. Since only f(password) is stored in the password file nobody can find out the password for a given user even with full access to the password file and logging in re%uires !nowing password not f(password). #n fact this techni%ue is so secure it has become customary to ma!e the password file publicly readableV /ire tapping. #f the bad guy can somehow intercept the information sent from the client to the server password-based authentication brea!s down altogether. #t is increasingly the case the authentication occurs over an insecure channel such as a dial-up line or a local-area networ!. 6ote that the &ni. 
scheme of storing f(password) is of no help here, since the password is sent in its original form ("plaintext" in the jargon of encryption) from the client to the server. We will consider this problem in more detail below.
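The f(password) idea can be sketched with an ordinary cryptographic hash. (Historical Unix used the crypt function, with salting; SHA-256 here is a simplified stand-in, and the user name and password are made up.)

```python
import hashlib

def f(password: str) -> str:
    """One-way hash: easy to compute, hard to invert."""
    return hashlib.sha256(password.encode()).hexdigest()

# The "password file" stores only f(password), never the password itself.
password_file = {"solomon": f("correct horse")}

def check_login(user: str, supplied: str) -> bool:
    # Hash what the client sent and compare against the stored hash.
    return password_file.get(user) == f(supplied)

print(check_login("solomon", "correct horse"))  # True
print(check_login("solomon", "wrong guess"))    # False
```

Even with full read access to password_file, an attacker learns only the hashes; logging in still requires the original password.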





Spoofing

This is the worst threat of all. How does the client know that the server is who it appears to be? If the bad guy can pose as the server, he can trick the client into divulging his password. We saw a form of this attack above. It would seem that the server needs to authenticate itself to the client before the client can authenticate itself to the server. Clearly there's a chicken-and-egg problem here. Fortunately, there's a very clever and general solution to this problem.

Challenge-response

There are a wide variety of authentication protocols, but they are all based on a simple idea. As before, we assume that there is a password known to both the (true) client and the (true) server. Authentication is a four-step process:

• The client sends a message to the server saying who he claims to be and requesting authentication.
• The server sends a challenge to the client consisting of some random value x.
• The client computes g(password, x) and sends it back as the response. Here g is a hash function similar to the function f above, except that it has two arguments. It should have the property that it is essentially impossible to figure out password even if you know both x and g(password, x).
• The server also computes g(password, x) and compares it with the response it got from the client.



Clearly this algorithm works if both the client and server are legitimate. An eavesdropper could learn the user's name, x, and g(password, x), but that wouldn't help him pose as the user: if he tried to authenticate himself to the server, he would get a different challenge x', and would have no way to respond. Even a bogus server is no threat, since the challenge provides him with no useful information. Similarly, a bogus client does no harm to a legitimate server, except for tying him up in a useless exchange (a denial-of-service problem!).
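The four steps above can be sketched as follows. The function g is built from SHA-256 for illustration (a real protocol would use an HMAC construction); the shared password is made up:

```python
import hashlib
import secrets

def g(password: str, challenge: str) -> str:
    """Two-argument one-way hash: hard to recover password from (x, g(password, x))."""
    return hashlib.sha256((password + challenge).encode()).hexdigest()

password = "shared secret"            # known to the true client and true server

# Step 2: the server issues a random challenge x.
x = secrets.token_hex(16)

# Step 3: the client responds with g(password, x); the password never travels.
response = g(password, x)

# Step 4: the server computes the same value and compares.
server_ok = (g(password, x) == response)
print(server_ok)                      # True

# An eavesdropper who captured (x, response) gains nothing: the next login
# gets a fresh challenge x', and the old response will not match it.
x_prime = secrets.token_hex(16)
replay_ok = (g(password, x_prime) == response)
print(replay_ok)                      # False
```

The fresh random challenge is what defeats replay: the response is useless for any challenge other than the one it answered.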

Protection Mechanisms
Fir"t< "o#e ter#ino(o ! O:Dect": The things to which we wish to control access. They include physical (hardware) ob:ects as well as software ob:ects such as files databases semaphores or processes. As in ob:ectoriented programming each ob:ect has a type and supports certain operations as defined by its type. #n simple protection systems the set of operations is %uite limited4 read write and perhaps e.ecute append and a few others. ,ancier protection systems support a wider variety of types and operations perhaps allowing new types and operations to be dynamically defined. Principa(": #ntuitively *users+--the ones who do things to ob:ects. "rincipals might be individual persons groups or pro:ects or roles such as *administrator.+ Often each process is associated with a particular principal the owner of the process. Ri ht": "ermissions to invo!e operations. 2ach right is the permission for a particular principal to perform a particular operation on a particular ob:ect. ,or e.ample principal solomon might have read rights for a particular file ob:ect. 162

/o#ain": Sets of rights. 1omains may overlap. 1omains are a form of indirection ma!ing it easier to ma!e wholesale changes to the access environment of a process. There may be three levels of indirection4 A principal owns a particular process which is in a particular domain which contains a set of rights such as the right to modify a particular file. 'onceptually the protection state of a system is defined by an access matri.. The rows correspond to principals (or domains) the columns correspond to ob:ects and each cell is a set of rights. ,or e.ample if accessRsolomonQR>/tmp/foo>Q W f read write g Then # have read and write access to file >/tmp/foo>. # say *conceptually+ because the access is never actually stored anywhere. #t is very large and has a great deal of redundancy (for e.ample my rights to a vast number of ob:ects are e.actly the same4 noneV) so there are much more compact ways to represent it. The access information is represented in one of two ways by columns which are called access control lists (A'(s) and by rows called capability lists. Acce"" Contro( Li"t" An A'( (pronounced *ac!le+) is a list of rights associated with an ob:ect. A good e.ample of the use of A'(s is the Andrew ,ile System (A,S) originally created at 'arnegie-0ellon &niversity and now mar!eted by Transarc 'orporation as an add-on to &ni.. This file system is widely used in the 'omputer Sciences 1epartment. 8our home directory is in A,S. A,S associates an A'( with each directory but the A'( also defines the rights for all the files in the directory (in effect they all share the same A'(). 8ou can list the A'( of a directory with the fs listacl command4

    % fs listacl /u/c/s/cs537-1/public
    Access list for /u/c/s/cs537-1/public is
    Normal rights:
      system:administrators rlidwka
      system:anyuser rl
      solomon rlidwka

The entry system:anyuser rl means that the principal system:anyuser (which represents the role "anybody at all") has rights r (read files in the directory) and l (list the files in the directory and read their attributes). The entry solomon rlidwka means that I have all seven rights supported by AFS. In addition to r and l, they include the rights to insert new files in the directory (i.e., create files), delete files, write files, lock files, and administer the ACL itself. This last right is very powerful: it allows me to add, delete, or modify ACL entries. I thus have the power to grant or deny any rights to this directory to anybody. The remaining entry in the list shows that the principal system:administrators has the same rights I do (namely, all rights). This principal is the name of a group of other principals. The command pts membership system:administrators lists the members of the group.

Ordinary Unix also uses an ACL scheme to control access to files, but in a much stripped-down form. Each process is associated with a user identifier (uid) and a group identifier (gid), each of which is a 16-bit unsigned integer. The inode of each file also contains a uid and a gid, as well as a nine-bit protection mask called the mode of the file. The mask is composed of three groups of three bits. The first group indicates the rights of the owner: one bit each for read access, write access, and

execute access (the right to run the file as a program). The second group similarly lists the rights of the file's group, and the remaining three bits indicate the rights of everybody else. For example, the mode 111 101 101 (0755 in octal) means that the owner can read, write, and execute the file, while members of the owning group and others can read and execute, but not write, the file. Programs that print the mode usually use the characters rwx- rather than 0 and 1. Each zero in the binary value is represented by a dash, and each 1 is represented by r, w, or x, depending on its position. For example, the mode 111101101 is printed as rwxr-xr-x.

In somewhat more detail, the access-checking algorithm is as follows: the first three bits are checked to determine whether an operation is allowed if the uid of the file matches the uid of the process trying to access it. Otherwise, if the gid of the file matches the gid of the process, the second three bits are checked. If neither of the ids match, the last three bits are used. The code might look something like this:

    boolean accessOK(Process p, Inode i, int operation) {
        int mode;
        if (p.uid == i.uid)
            mode = i.mode >> 6;
        else if (p.gid == i.gid)
            mode = i.mode >> 3;
        else
            mode = i.mode;
        switch (operation) {
            case READ:    mode &= 4; break;
            case WRITE:   mode &= 2; break;
            case EXECUTE: mode &= 1; break;
        }
        return (mode != 0);
    }

(The expression i.mode >> 3 denotes the value i.mode shifted right by three bit positions, and the operation mode &= 4 clears all but the third bit from the right of mode.) Note that this scheme can actually give a random user more powers over the file than its owner. For example, the mode ---r--rw- (000 100 110 in binary) means that the owner cannot access the file at all, while members of the group can only read the file and others can both read and write. On the other hand, the owner of the file (and only the owner) can execute the chmod system call, which changes the mode bits to any desired value.
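The rwx printing convention described above is easy to mirror in a few lines (format_mode is an illustrative name, not a standard library function; Python's stat.filemode does something similar for full mode words):

```python
def format_mode(mode: int) -> str:
    """Render a 9-bit Unix protection mask as the familiar rwx string."""
    out = []
    for shift in (6, 3, 0):                     # owner, group, other
        bits = (mode >> shift) & 0o7
        out.append("r" if bits & 4 else "-")    # read bit = 4
        out.append("w" if bits & 2 else "-")    # write bit = 2
        out.append("x" if bits & 1 else "-")    # execute bit = 1
    return "".join(out)

print(format_mode(0o755))  # rwxr-xr-x
print(format_mode(0o046))  # ---r--rw-  (the "anomalous" mode from the text)
```

The second example is the curious case discussed above: octal 046 gives the owner nothing while others may read and write.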
When a new file is created, it gets the uid and gid of the process that created it and a mode supplied as an argument to the creat system call. Most modern versions of Unix actually implement a slightly more flexible scheme for groups: a process has a set of gids, and the check to see whether the file is in the process's group checks whether any of the process's gids match the file's gid:

    boolean accessOK(Process p, Inode i, int operation) {

        int mode;
        if (p.uid == i.uid)
            mode = i.mode >> 6;
        else if (p.gidSet.contains(i.gid))
            mode = i.mode >> 3;
        else
            mode = i.mode;
        switch (operation) {
            case READ:    mode &= 4; break;
            case WRITE:   mode &= 2; break;
            case EXECUTE: mode &= 1; break;
        }
        return (mode != 0);
    }

Under this scheme, when a new file is created, it gets the uid of the process that created it and the gid of the containing directory. There are system calls to change the uid or gid of a file. For obvious security reasons, these operations are highly restricted. Some versions of Unix only allow the owner of the file to change its gid, only allow him to change it to one of his own gids, and don't allow him to change the uid at all.

For directories, "execute" permission is interpreted as the right to get the attributes of files in the directory. Write permission is required to create or delete files in the directory. This rule leads to the surprising result that you might not have permission to modify a file, yet be able to delete it and replace it with another file of the same name but with different contents!

Unix has another very clever feature -- so clever that it is patented! The file mode actually has a few more bits that I have not mentioned. One of them is the so-called setuid bit. If a process executes a program stored in a file with the setuid bit set, the uid of the process is set equal to the uid of the file. This rather curious rule turns out to be a very powerful feature, allowing the simple rwx permissions directly supported by Unix to be used to define arbitrarily complicated protection policies. As an example, suppose you wanted to implement a mail system that works by putting all mail messages into one big file, say /usr/spool/mbox. I should be able to read only those messages that mention me in the To: or Cc: fields of the header. Here's how to use the setuid feature to implement this policy. Define a new uid mail, make it the owner of /usr/spool/mbox, and set the mode of the file to rw------- (i.e.,
the owner mail can read and write the file, but nobody else has any access to it). Write a program for reading mail, say /usr/bin/readmail. This file is also owned by mail and has mode srwxr-xr-x; the 's' means that the setuid bit is set. My process can execute this program (because the "execute by anybody" bit is on), and when it does, it suddenly changes its uid to mail, so that it has complete access to /usr/spool/mbox. At first glance, it would seem that letting my process pretend to be owned by another user would be a big security hole, but it isn't, because processes don't have free will: they can only do what the program tells them to do. While my process is running readmail, it is following instructions written by the designer of the mail system, so it is safe to let it have access

appropriate to the mail system. There's one more feature that helps readmail do its job. A process really has two uids, called the effective uid and the real uid. When a process executes a setuid program, its effective uid changes to the uid of the program, but its real uid remains unchanged. It is the effective uid that is used to determine what rights it has to what files, but there is a system call to find out the real uid of the current process. Readmail can use this system call to find out what user called it, and then only show the appropriate messages.

Capabilities

An alternative to ACLs are capabilities. A capability is a "protected pointer" to an object. It designates an object and also contains a set of permitted operations on the object. For example, one capability may permit reading from a particular file, while another allows both reading and writing. To perform an operation on an object, a process makes a system call, presenting a capability that points to the object and permits the desired operation. For capabilities to work as a protection mechanism, the system has to ensure that processes cannot mess with their contents. There are three distinct ways to ensure the integrity of a capability.

Tagged architecture

Some computers associate a tag bit with each word of memory, marking the word as a capability word or a data word. The hardware checks that capability words are only assigned from other capability words. To create or modify a capability, a process has to make a kernel call.

Separate capability segments

If the hardware does not support tagging individual words, the OS can protect capabilities by putting them in a separate segment and using the protection features that control access to segments.

Encryption

Each capability can be extended with a cryptographic checksum that is computed from the rest of the content of the capability and a secret key. If a process modifies a capability, it cannot modify the checksum to match without access to the key. Only the kernel knows the key. Each time a process presents a capability to the kernel to invoke an operation, the kernel checks the checksum to make sure the capability hasn't been tampered with.

Capabilities, like segments, are a "good idea" that somehow seldom seems to be implemented in real systems in full generality. Like segments, capabilities show up in an abbreviated form in many systems. For example, the file descriptor for an open file in Unix is a kind of capability. When a process tries to open a file for writing, the system checks the file's ACL to see whether the access is permitted. If it is, the process gets a file descriptor for the open file, which is a sort of capability to the file that permits write operations. Unix uses the separate segment approach to protect the capability: the capability itself is stored in a table in the kernel, and the process has only an indirect reference to it (the index of the slot in the table). File descriptors are not full-fledged capabilities, however. For example, they cannot be stored in files, because they go away when the process terminates.
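The file-descriptor-as-capability idea can be observed directly: the permission check happens once at open time, and the descriptor that comes back only permits the operations it was opened for (the path below is a temporary file, for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cap.txt")
with open(path, "w") as f:
    f.write("secret data")

# The ACL/mode check happens at open time; the returned descriptor is a
# read-only "capability", held indirectly via a slot in a kernel table.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 6)
print(data)                      # b'secret'

# Using the capability for an operation it does not permit fails,
# regardless of what the file's mode bits would otherwise allow.
try:
    os.write(fd, b"overwrite")
    writable = True
except OSError:
    writable = False
print(writable)                  # False
os.close(fd)
```

The failed write illustrates why a descriptor is only a partial capability: the permitted operations were fixed when the kernel created it, not rechecked against the ACL on each use.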

/irectorie"
Directories are collections of files. Their primary function is to impose a naming system on files and organize them relative to each other. Directories provide a mechanism for collecting the names of files together so that a user can find the file(s) they're interested in. They are an important piece of metadata.

Flat Directories

The simplest kind of directory is a flat list of filenames and the information needed to find the respective file contents. These are rarely used today, although they still do exist: the Palm Pilot has a flat file name space, and IBM MVS, which will never completely die, has a flat file name space. Flat file systems are not often used because they provide no inherent organization of the files and are difficult to make efficient for large collections of files. Name space collisions have to be avoided in a flat space, so generally some naming convention is followed to prevent them. A common one is to prepend filenames with the user's name. For example, a collection of my files might have names like:

    DISK1.FABER.HOME.PROFILE
    DISK1.FABER.PROJECT1.GRADES

Problems with this include enforcing the conventions (what if other users choose a different character to separate parts of the file name?) and the inefficiency of holding all the system's files in one big list. Consider scanning the system directory to print all the files that I own: either the whole table will have to be searched, or it will have to be kept sorted. Keeping a large list sorted adds overhead, and scanning a large table linearly is not efficient. Also, updating such a centralized structure will require synchronization: all file creations and deletions will have to enforce exclusion on the table, or we have to introduce a very fine-grained locking mechanism. Basically, a single directory doesn't scale easily to many users, both in terms of technical operation and user behavior.

Hierarchical Directories
A natural organization of files is into a hierarchy. That is, files can be seen as belonging to related classes, and each class has a directory:

• All Files
• User files
• Instructor files
• Faber's files

This is implemented as a tree of directories, where each entry in a directory describes either a file or another directory. If the hierarchy is chosen carefully, the result is many small directories: scanning each one is a reasonable amount of work, and synchronization is maintained per directory. Because separate tasks can be confined to separate directories, contention can be made rare.


Besides the efficiency issues, hierarchies are a natural way to organize many systems of data. The desktop metaphor of files in folders and cabinets underscores this (although the hierarchy afforded by hierarchical file systems is richer, because there can be nearly arbitrary levels of nesting). Given that we want to impose such a structure on our file names, we have to describe a syntax to find a file in the tree. This is done by giving a path through the directory tree. The strings that represent such paths are called pathnames. A character is chosen as the path separator; a path is then the list of directories traversed in order to reach the file. Using / as a separator, a pathname for the file in the diagram above is /etc/ast/fn.

It's inconvenient to name files with their entire pathname all the time; after all, we put related files in the same directory so they'd be close together, and specifying the long pathname hides that. One solution is to add the concept of a current directory to the system and allow paths to be specified relative to it. Paths that begin with the path separator are absolute pathnames; paths that do not are relative, and have the current directory prepended. To facilitate relative naming, many filesystems have special names that refer to the current directory and the parent directory. These are often . and .. respectively. An example relative pathname is ./test/halt, which means the file halt in the directory test, a subdirectory of the current directory.

Links

Directories impose a naming structure on files, and in some systems offer the opportunity to give a file multiple names. If the information about a file (its attributes and OS data) is not stored directly in a directory entry, a file may be pointed to by entries in several directories. For example, the file in the directory tree above can be named as /etc/ast/fn or /etc/jim/f2. Such multiple naming is called linking. There are two major forms of links: hard and soft.

A hard link is a direct link from the directory entry to the internal file data. All hard links are equivalent, and in file systems that support them, a file cannot be deleted without deleting all its hard links. In general, hard links are restricted to parts of the file system that share internal information. To preserve the tree structure of hierarchical file systems, they generally can't link to directories.

Soft links (often also called symbolic links) are a path translation: a pathname that points to the file (or directory) on which to operate. These paths can be absolute or relative. Because they are a visible pathname translation, they are often allowed to point to directories; programs that rely on the hierarchical nature of the file system (like system utilities) can detect and ignore them. Because they are a translation, soft links can access any parts of the file system they can address, but because they are not linked closely with the internal structure of the file system, they may not be updated when a file is deleted or moved. It's possible for a symbolic link to point to a file that no longer exists. This is called a dangling pointer problem, by analogy with the same problem involving freed memory in a program.

Incidentally, relative pathnames provide the answer to a UNIX puzzle: how do you delete a file named -f? Plain rm -f fails, as the filename is taken as a switch. The solution is to use a longer relative path: rm ./-f.
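The behavior of hard links and dangling soft links described above can be demonstrated directly with the POSIX calls UNIX exposes. This is a minimal sketch assuming a POSIX system; the directory and filenames are throwaway temporaries, not the examples from the text.

```python
import os
import tempfile

# Create a file, then give it a hard link and a soft (symbolic) link.
d = tempfile.mkdtemp()
target = os.path.join(d, "fn")
with open(target, "w") as f:
    f.write("data")

hard = os.path.join(d, "fn-hard")
os.link(target, hard)        # a second directory entry for the same file data

soft = os.path.join(d, "fn-soft")
os.symlink(target, soft)     # a stored pathname, not a reference to the data

os.remove(target)            # remove the original name

# The data survives: a file is not deleted until its last hard link is gone.
print(open(hard).read())                             # -> data

# The soft link now dangles: the pathname it stores no longer resolves.
print(os.path.lexists(soft), os.path.exists(soft))   # -> True False
```

lexists reports that the link itself still exists as a directory entry, while exists follows the translation and finds nothing, which is exactly the dangling-pointer situation described above.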

Directory Operations
Like files, directories have well-defined operations:
Create: Allocate space for a new directory and create the special entries (. and ..) in it.
Delete: Remove a directory. Most OSes require the directory to be empty.

Open: Analogous to file open.
Read: Get the information about one or more files.
Write: Change the information about one or more files.
Metadata Manipulation: Change the permissions or some other field associated with this directory.
Link: Add a link to an existing file.
Unlink: Remove a link to an existing file (if this is the last link and the file is closed, this usually implies removal of the file).
Rename: Really covered by write, as is file renaming.

Other Directory Systems

Although hierarchical systems are by far the most common, there are some other interesting ways to think about file naming:

• Attribute-based naming. Not all data can be organized nicely into a hierarchy. For example, is the right hierarchy /usr/faber/pubs/papers/RST or /usr/faber/papers/published/RST? Attribute-based naming says that filenames should be the set of attributes that a file satisfies. For example, the filename above might be (user=faber, type=published-paper, topic=RST packets).
• Temporal filesystems. The folks at Lucent have added a time axis to their Plan 9 filesystem. Rather than overwriting files on each update, they save all the versions (well, actually they save the state once a day), and in addition to giving a pathname you can also give a temporal coordinate. You can change directory into the source code from last week. It's a strange but powerful idea.
• User-specific namespaces. The folks from Lucent also allow dynamic binding of their file systems. A user specifies a set of directories to bind to a name, and thereafter the user has their own version of that directory. The idea of an execution path is turned into a customized /bin directory.
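The attribute-based lookup in the first bullet can be sketched as a query over sets of attribute=value pairs. This is a toy illustration, not Plan 9 or any real attribute filesystem; the files and attributes are invented.

```python
# A "name" in an attribute-based system is a set of attribute=value
# pairs; lookup returns every file whose attributes satisfy the query,
# with no single hierarchy privileged over another.

files = [
    {"user": "faber", "type": "published-paper", "topic": "RST"},
    {"user": "faber", "type": "draft", "topic": "RST"},
    {"user": "chen", "type": "published-paper", "topic": "queueing"},
]

def lookup(**query):
    """Return all files matching every attribute in the query."""
    return [f for f in files if all(f.get(k) == v for k, v in query.items())]

print(lookup(user="faber", topic="RST"))
```

Note that lookup(user="faber", topic="RST") and lookup(topic="RST", user="faber") are the same name: there is no /pubs-before-/published ordering decision to make.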





Naming Systems
The mapping from pathname to filename is just one example of a naming system. A naming system maps a string to a resource (or maybe just to another string). Being able to name a resource is the first step in being able to manipulate it. Some other interesting naming systems are:

• The Domain Name System: hierarchical, distributed naming of computers on the Internet. Each "directory," or domain, is resolved by a different machine in the Internet. The names are parsed backward: edu is resolved before usc, which is resolved before aludra, in aludra.usc.edu. (The path separator is ".".)
• X.500: an alternative, attribute-based naming system for hosts and users. The names look like PN=faber, CO=US, IN=USC, DEPT=CS, INST=ISI.
• Printer names: strings bind to printers in a flat namespace.





• URLs: a combination of DNS and hierarchical filenames. The name compactly describes the communication protocol to use, the name of the machine to contact, and the file to ask about (or other service-dependent identifier).
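The right-to-left DNS resolution described above can be sketched as a walk down a delegation tree. The zone table here is invented for illustration (the address is a placeholder, not aludra's real one); real resolution involves querying separate nameservers over the network.

```python
# DNS names are resolved right to left: edu first, then usc, then
# aludra. Model the delegations as a dictionary keyed by label prefixes.

zones = {
    (): "root servers",
    ("edu",): "edu registry",
    ("edu", "usc"): "USC nameserver",
    ("edu", "usc", "aludra"): "128.125.x.x",   # placeholder host record
}

def resolve(hostname):
    """List the authorities consulted, outermost domain first."""
    labels = tuple(reversed(hostname.split(".")))   # ('edu', 'usc', 'aludra')
    return [zones[labels[:i]] for i in range(len(labels) + 1)]

print(resolve("aludra.usc.edu"))
```

Each step hands off to a more specific authority, which is what lets different machines resolve different "directories" of the name space.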

Most successful systems have a significant naming component. The decision of which elements in a system to name, and how to name them, is significant.

Security and the File System
File System Security

The problem addressed by the security system is how information and resources are protected from people. Issues include the contents of data files, which is a privacy issue, and the use of resources, which is an accounting issue. Security must pervade the system, or the system is insecure; but the file system is a particularly good place to discuss security, because its protection mechanisms are visible and the things it protects are very concrete (for a computer system).

We're talking about some interesting stuff when we talk about security. For certain people who like puzzles, finding loopholes in security systems and understanding them to the point of breaking them is a challenge. I understand the lure of this. Remember, however, that everyone using these machines is a student like yourself, who deserves the same respect that you do. Breaking into another person's files is like breaking into their home, and should not be taken lightly, either by those breaking in or those who catch them. Uninvited intrusions should be dealt with harshly (for example, it's a felony to break into a machine that stores medical records). If you really want to play around with UNIX security, get yourself a Linux box and play to your heart's content; don't break into someone's account here and start deleting files.

Policies and Mechanisms

Policies are real-world statements about the protection that the system provides. These are all statements of (significantly different) policies:

• Users should not be able to read each other's mail.
• No student should be able to see answer keys before they are made public.
• All users should have access to all data.

The various systems in a computer system that control access to resources are the mechanisms that are used to implement a policy. A good security system is one with clearly stated policy objectives that have been effectively translated into mechanisms.
The fact that data security does not stop with computer security cannot be overstated. If your computer is perfectly secure and an employee photocopies printouts of your new chip design, don't blame the computer security system.

Design Principles
Although every security system is different, some overriding principles make sense. Here is a list generated by Saltzer and Schroeder from their experience on MULTICS that remains valid today. (These are fun to apply to caper movies: next time you watch Mission Impossible or Sneakers or War Games, try to spot the security flaws that let the intruders work their magic.)

Public design. Surprisingly, public designs tend to be more secure than private ones. The reason is that the security community as a whole reviews them and reports flaws that can be fixed.

Even if you take pains to keep the source code of your system secret, you should assume that attackers have access to your code. The bad guys will share knowledge; the good guys should too.

Default access is no access. This holds for subsystems just like login screens. It sounds like a platitude, but it is a principle worth following at all levels. People who need a certain access will let you know about it quickly.

Test for current authority. Just because the user had the right to perform an operation a millisecond ago doesn't mean they can do it now. Test the authority every time, so that revocation of that authority is meaningful.

Give each entity the least privilege required for it to do its job. This may mean creating a bunch of fine-grained privilege levels. The more privilege an entity possesses, the more costly a mistake or misuse of that entity is. Printer daemons that run as root can cause logins that run as root.

Build in security from the start. Adding security later almost never works. There are too many holes to plug, and as a practical matter, security is nearly impossible to add to a fundamentally insecure system. In order to make such a design integrable, it must be simple and capable of being applied uniformly.

The system must be acceptable to the users. All security systems are a compromise between security and usability. The more features a system has, the more opportunities there are for exploitation. Furthermore, if a security feature is too onerous to the users, they will just invent ways to circumvent it, and these circumventions are then available to attackers. An unacceptable security system is automatically attacked from within.

A Sampling of Protection Mechanisms
The idea of protection domains originated with MULTICS and is a key one for understanding computer security. Imagine a matrix with all protection domains on one axis and all system resources (files) on the other. The contents of each cell in the matrix are the operations permitted by a process (or thread) in that domain on that resource. For example:

             File 1    File 2    Domain 1    Domain 2
  Domain 1   RW        RWX       -           Enter
  Domain 2   R         -         -           -

Notice that once domains are defined, the ability to change domains becomes another part of the domain system: processes in given domains are allowed to enter other domains. A process's initial domain is a function of the user who starts the process and the process itself. While the pure domain model makes protection easy to understand, it is almost never implemented: holding the domains as a matrix doesn't scale.

Domains and Rings

UNIX divides processes into two parts, a user part and a kernel part. When running as a user, the process has limited abilities, and to access hardware it has to trap into the kernel. The kernel can access all OS and hardware, and decides what it will do on a user's behalf based on credentials stored in the PCB. This is a simplification of the MULTICS system of protection rings. Rather than two levels, MULTICS had a 64-ring system, where each ring was more privileged than the ones surrounding it and checked similar credentials before using its increased powers.

Access Control Lists

Another representation of the domain concept is Access Control Lists (ACLs). These are lists, attached to each resource (file), that describe the valid operations on it. Generally the ACL languages are rich enough to describe users and groups of users economically. This economy comes from wildcarding and exclusion operators: wildcarding provides a way to describe all users meeting a given criterion; exclusion operators allow exclusion of a set of users. Conceptually, though, each file contains a list of the users that can operate on the file.

The UNIX file protection system is similar, but simplified. There are 9 (really 12) bits associated with each file that determine the read, write, and execute permissions of the owner, members of the owning group, and everyone else. Two of the other three bits allow limited domain switching: they are the setuid and setgid bits, which allow processes running the program to change their user or group id to be that of the owner of the file. When the owner of a file is root, this can convey considerable new power on the process.

ACLs are useful, and they support revocation of rights. That is, when a user is reading the file and the owner wants to stop that, the owner can remove that right. Because the system checks current authority (see above), the read will be stopped.

Capabilities

Another way to encode domain rights is to encode a process's rights in its pointer to the object.
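Returning to the UNIX mode bits described above, decoding them can be sketched with Python's stat module, which exposes the standard bit masks. The mode value here is hypothetical (a setuid executable, the shape of a passwd-like program), chosen to show all three permission classes plus a domain-switching bit.

```python
import stat

# 0o4755: setuid bit on, owner rwx, group r-x, other r-x.
mode = 0o4755

owner = (bool(mode & stat.S_IRUSR), bool(mode & stat.S_IWUSR), bool(mode & stat.S_IXUSR))
group = (bool(mode & stat.S_IRGRP), bool(mode & stat.S_IWGRP), bool(mode & stat.S_IXGRP))
other = (bool(mode & stat.S_IROTH), bool(mode & stat.S_IWOTH), bool(mode & stat.S_IXOTH))
setuid = bool(mode & stat.S_ISUID)   # limited domain switching

print(owner)    # -> (True, True, True)
print(other)    # -> (True, False, True)
print(setuid)   # -> True: the process runs with the file owner's uid
```

The nine rwx bits are the simplified per-file "ACL"; the setuid bit is the controlled domain switch, which is why a setuid-root file deserves special scrutiny.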
For example, file rights would be in the file descriptor, memory rights in a memory pointer. Such pointers with protection information encoded in them are called capabilities. Capabilities are kept in special lists (called C-lists) that must be protected from direct manipulation by processes. One way is in hardware: the memory actually contains bits, untouchable by the CPU in user mode, that mark a memory location as holding a capability. Another is to make the C-list part of the PCB, manipulable only by the OS. A third is to have the OS encrypt the capabilities with a key unknown to the user.

Capabilities have operations defined on them, like making copies with reduced or amplified rights. When the process presents the capability to the OS, the OS need not verify anything about the user, only whether the capability is valid. That property makes it hard to revoke a capability, although there are a couple of ways: embedded validity checks and indirect access.

Authentication and Security

Central to the idea of protection systems is the idea of an authentication system. An authentication system proves the identities of elements with which a computer system interacts. This can include users and other systems.

In distributed systems, authentication should be two-way: the user should authenticate to the machine, and the machine to the user. Generally, authentication is accomplished by means of the exchange of a shared secret. The most common shared secret is a password.

Passwords

A password is a string of characters that the user and computer system agree will establish the user's identity to the system. The analogy is to physical passwords, where people who wanted access to a military facility had to recite an unusual phrase to establish their identity to those inside the fort. Computer passwords are often the weakest part of a computer security system, especially if the passwords can be guessed off-line, that is, without alerting the system under attack that it is under attack. Passwords can be stolen (physically or electronically) or guessed. There are several good rules for choosing a computer password:

• Choose a long one. Most systems allow eight or ten letters: use 'em all. There are only 140,608 3-letter (cap and lower case) passwords; there are more than 50 trillion 8-letter combinations. Guessing 1 in 50 trillion is hundreds of millions of times harder than 1 in 140,000.
• Don't use a common phrase or name. A seminal work in computer security ran a cracking program on a couple hundred donated password files that tested common English words and the top 100 (or so) female names, and had an ungodly (better than 50%) hit rate. Hopefully education has gotten better since. Note that "common phrase" means anything available in the system dictionary, at least. In my opinion, you're better off not using any English at all, and non-English words fare little better. No science fiction or fantasy words, either.
• Include some non-letters, e.g. TJekd.
• Don't write it down. You've just changed a difficult puzzle into a physical search.
• Don't get too attached to it; you should change passwords relatively frequently. Every six months or so is a good idea.

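The password-space arithmetic above is worth checking directly: 52 upper- and lower-case letters per position.

```python
# How many all-letter passwords exist at each length?
three_letter = 52 ** 3
eight_letter = 52 ** 8

print(three_letter)                  # -> 140608
print(eight_letter > 50 * 10**12)    # -> True (about 53 trillion)

# How much harder is an 8-letter password to guess than a 3-letter one?
ratio = eight_letter // three_letter  # exactly 52**5, about 380 million
print(ratio)
```

So each extra letter multiplies the search space by 52, and the jump from three letters to eight buys a factor of 52^5, roughly 380 million.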

Password Storage

It's possible to store passwords in the open without immediately giving away their contents. The system uses a 1-way function: a function that is relatively easy to compute but difficult to invert (essentially, the only way to invert it is to compute all the forward transforms, looking for one that matches). Systems like UNIX don't store the password, but the result of a 1-way function on the password. To check a user's password, the system takes the password as input, computes the 1-way function on it, and compares it with the result in the password file. If they match, the password was (with high probability) correct. Note that even knowing the algorithm and the stored result, it's still impossible to easily invert the function.
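The store-the-hash scheme above can be sketched with a standard hash function. This is an illustration of the idea only: SHA-256 is a real 1-way function, but production systems use deliberately slow, salted password hashes (bcrypt, scrypt, and the like) to make the dictionary attacks discussed below more expensive. The password string is invented.

```python
import hashlib
import os

def store(password):
    """Store a random salt and the 1-way function of salt + password,
    never the password itself."""
    salt = os.urandom(8)
    return salt, hashlib.sha256(salt + password.encode()).digest()

def check(password, entry):
    """Recompute the forward transform and compare; no inversion needed."""
    salt, digest = entry
    return hashlib.sha256(salt + password.encode()).digest() == digest

entry = store("Tr1cky-Pass")
print(check("Tr1cky-Pass", entry))   # -> True
print(check("guess", entry))         # -> False
```

The salt makes identical passwords hash differently, so an attacker cannot precompute one dictionary of hashes and use it against every account.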


Although it's theoretically reasonable to leave a hashed password file in the open (1-way functions used this way are also called hash functions), it is rarely done anymore. There are a couple of reasons:

• In practice, bad passwords are not uncommon enough, so rather than having to try all the passwords (or half the passwords, on average), trying a large dictionary of common passwords is often enough to break into some account on the system.
• A password file can be attacked off-line, with the system under attack completely unaware that it is under attack. By forcing the attacker to actually try passwords on the system they're invading, the system can detect an attack.

Other Shared Secrets

Some other forms of shared secrets include:

• Shared real secrets: the user gives the system some information that "only the user knows," and the system quizzes the user on it instead of a password. Good in that the user rarely has to write such information down; bad in that there isn't much information that can't be found by a determined investigator.
• Code books: a frequent system is to ask the user for a word from a code book. This was in vogue for a while with anti-piracy systems; to gain access to a program, the program would ask the user for the nth word on the mth page of the manual. In practice, it meant that the pirate photocopied the manual.
• One-time passwords: the computer generates a table of passwords for the user, each of which is to be used once. When the user tries to log in, the computer asks for the next password in the sequence. The advantage is that if an attacker manages to steal a password, it cannot be reused. The disadvantage is that an attacker can steal the list (and a user is unlikely to memorize a set of single-use passwords).
• Challenge/response: the system and user agree on some (one-way) function or transformation. At login time, the computer presents the user with a value (called the challenge) and the user responds with the transform of the value.
For example, if the function were the square root, a challenge of 9 would be correctly answered with a response of 3. In practice the functions are more complex and usually encoded in hardware. The hardware is often password protected, so that theft of the hardware only means that the user cannot log in, not that the intruder can.
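The challenge/response exchange above can be sketched in code: first the text's toy square-root transform, then a more realistic keyed one-way transform. The HMAC variant is an illustrative sketch, not any particular token's protocol; the shared secret and challenge value are invented.

```python
import hashlib
import hmac

# The toy transform from the text: challenge 9 -> response 3.
def toy_response(challenge):
    return challenge ** 0.5

# A more realistic sketch: the transform is a keyed one-way function,
# so eavesdropping on one exchange doesn't reveal the secret.
SECRET = b"shared-secret"   # hypothetical shared secret

def response(challenge: bytes) -> bytes:
    return hmac.new(SECRET, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, answer: bytes) -> bool:
    return hmac.compare_digest(response(challenge), answer)

print(toy_response(9))                              # -> 3.0
c = b"nonce-42"                                     # fresh per login attempt
print(verify(c, response(c)))                       # -> True
```

Because the system issues a fresh challenge each time, a recorded response is useless for a later login, which is the whole point of the scheme.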







Physical ID

Another shared secret can be physical attributes of the human who wants to access the system. Several body measurements identify a user with significant precision: finger lengths, retina patterns, fingerprints, etc. Controlling access based on physical features has problems if the features are damaged (cutting one's fingertip could confuse a fingerprint scanner). It also raises the grisly possibility of theft of those features: one way to beat a thumbprint scanner is to physically acquire someone's thumb.

A Sampling of Attacks

Some common attacks on computer systems:

Trojan horse: a seemingly benign program that steals information as part of its function. An example is a script that mimics the login prompt, takes a user's password, saves it for the owner of the script, and logs the user in. The legitimate user has access to the account, but so does the owner of the script.

Password guessing: we talked about this with passwords. I pause to mention the infamous TENEX security hole that Tanenbaum discusses. TENEX allowed user functions to be called on each page fault. Some clever user realized that this allowed guessing a password letter by letter instead of all at once. The password being guessed was laid out with its first letter at the end of one page and the rest on the next page, which was forced out of memory. Passwords were checked letter by letter, sequentially: if the first letter was correct, there would be a page fault when the system faulted the second page into memory to check the next letter. By repeating the process, the whole password could be guessed sequentially. This is an interesting example of how multiple OS features combine to affect security.

Social engineering: this is by far the most difficult to control. An attacker simply lies to a human being and gets the information they want. The only real cure for this is to educate anyone who has security information (that is, everyone) about security.

Buffer overruns: forcing a program to overrun a variable on the stack and insert code in it that the attacker wants run.

Backdoors: sometimes developers leave privileged debugging hooks in place in production systems. One of the well-known offenders here is sendmail. Other production systems used to ship with well-known user names and well-known passwords for remote maintenance.

Viruses and Worms

Viruses are programs contained in other programs, often for malicious purposes. (They needn't be malicious, though; one can imagine benign programs propagated the same way, virus checkers for example.)
Worms are self-replicating, independent programs. The distinction is in the method of transmission: a virus needs a host program to be run to propagate it; a worm has no such host and propagates itself. Both have made national news in their malevolent forms, but both could be used for benign purposes.

Covert Channels

A covert channel is an unintentional communication channel in the system. For example, two processes banned from communicating directly can use the following scheme: one process repeatedly performs a computation known to take a fixed time. The other process alternately loads and unloads the machine with computationally intensive child processes, depending on the bit it wants to send. Loading the machine corresponds to a 1 and unloading the machine a 0. The listening process knows that if its computation takes longer than usual it should record a 1, and if it's shorter, record a 0. The two can work out the timing and loading (statistically if necessary) to communicate. Covert channels are necessarily low bandwidth, and stopping them is difficult. (In the example above, the system would have to guarantee a fixed system load, which would mean slowing the system when it was unloaded.) Most systems don't stop covert channels; only systems that hold serious enough data try.
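The load-modulation channel above can be sketched as a deterministic simulation: the sender chooses a per-slot load, and the receiver infers each bit from how long its fixed probe computation "took." The timing constants are invented; a real channel would measure wall-clock time and need statistics to fight noise.

```python
# Simulated timing covert channel: one bit per time slot.
BASE = 10      # time units the probe computation takes on an unloaded machine
PENALTY = 5    # extra time units when the sender loads the machine

def sender(bits):
    """Per slot: load the machine (extra work) for 1, stay idle for 0."""
    return [PENALTY if b else 0 for b in bits]

def receiver(slot_loads):
    """Time the probe each slot; slower than BASE means the sender sent 1."""
    observed = [BASE + load for load in slot_loads]
    return [1 if t > BASE else 0 for t in observed]

message = [1, 0, 1, 1, 0]
print(receiver(sender(message)))   # -> [1, 0, 1, 1, 0]
```

The simulation makes the defense cost visible too: to close the channel, the system would have to make every slot take BASE + PENALTY regardless of load, slowing everyone down.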


CHAPTER 11: FILE SYSTEM IMPLEMENTATION
,irst we loo! at files from the point of view of a person or program using the file system and then we consider how this user interface is implemented.

The User Interface to Files
Just as the process abstraction beautifies the hardware by making a single CPU (or a small number of CPUs) appear to be many CPUs, one per "user," the file system beautifies the hardware disk, making it appear to be a large number of disk-like objects called files. Like a disk, a file is capable of storing a large amount of data cheaply, reliably, and persistently. The fact that there are lots of files is one form of beautification: each file is individually protected, so each user can have his own files without the expense of requiring each user to buy his own disk. Each user can have lots of files, which makes it easier to organize persistent data. The filesystem also makes each individual file more beautiful than a real disk. At the very least, it erases block boundaries, so a file can be any length (not just a multiple of the block size) and programs can read and write arbitrary regions of the file without worrying about whether they cross block boundaries. Some systems (not Unix) also provide assistance in organizing the contents of a file.

Systems use the same sort of device (a disk drive) to support both virtual memory and files. The question arises why these have to be distinct facilities, with vastly different user interfaces. The answer is that they don't. In Multics, there was no difference whatsoever. Everything in Multics was a segment. The address space of each running process consisted of a set of segments (each with its own segment number), and the "file system" was simply a set of named segments. To access a segment from the file system, a process would pass its name to a system call that assigned a segment number to it. From then on, the process could read and write the segment simply by executing ordinary loads and stores. For example, if the segment was an array of integers, the program could access the ith number with ordinary array subscript notation (a[i]), rather than having to seek to the appropriate offset and then execute a read system call. If the block of the file containing this value wasn't in memory, the array access would cause a page fault, which was serviced as explained in the previous chapter.
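The load-and-store access pattern described above survives in modern UNIX as memory-mapped files, the closest everyday analogue of a Multics segment. A minimal sketch, using a throwaway temporary file rather than anything from the text:

```python
import mmap
import os
import tempfile

# Create a one-page file and map it into the address space.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)

segment = mmap.mmap(fd, 4096)   # the file now looks like a byte array

segment[100] = 42               # a plain store: no write() system call
value = segment[100]            # a plain load: no read() system call
print(value)                    # -> 42

segment.close()
os.close(fd)
os.remove(path)
```

If the touched page isn't resident, the indexing operation faults it in, exactly the mechanism the text attributes to Multics segment access.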


This user-interface idea, sometimes called "single-level store," is a great idea. So why is it not common in current operating systems? In other words, why are virtual memory and files presented as very different kinds of objects? There are several explanations one might propose:

The address space of a process is small compared to the size of a file system. There is no reason why this has to be so. In Multics, a process could have up to 256K segments, but each segment was limited to 64K words. Multics allowed for lots of segments because every "file" in the file system was a segment. The upper bound of 64K words per segment was considered large by the standards of the time; the hardware actually allowed segments of up to 256K words (over one megabyte). Most new processors introduced in the last few years allow 64-bit virtual addresses. In a few years, such processors will dominate. So there is no reason why the virtual address space of a process cannot be large enough to include the entire file system.

The virtual memory of a process is transient, going away when the process terminates, while files must be persistent. Multics showed that this doesn't have to be true. A segment can be designated as "permanent," meaning that it should be preserved after the process that created it terminates. Permanent segments do raise a need for one "file-system-like" facility: the ability to give names to segments, so that new processes can find them.

Files are shared by multiple processes, while the virtual address space of a process is associated with only that process. Most modern operating systems (including most variants of Unix) provide some way for processes to share portions of their address spaces anyhow, so this is a particularly weak argument for a distinction between files and segments.

The real reason single-level store is not ubiquitous is probably a concern for efficiency. The usual file-system interface encourages a particular style of access: open a file, go through it sequentially, copying big chunks of it to or from main memory, and then close it. While it is possible to access a file like an array of bytes, jumping around and accessing the data in tiny pieces, it is awkward. Operating system designers have found ways to implement files that make the common "file-like" style of access very efficient. While there appears to be no reason in principle why memory-mapped files cannot be made to give similar performance when they are accessed in this way, in practice the added functionality of mapped files always seems to pay a price in performance. Besides, if it is easy to jump around in a file, applications programmers will take advantage of it, overall performance will suffer, and the file system will be blamed.

Naming

Every file system provides some way to give a name to each file. We will consider only names for individual files here, and talk about directories later.
The name of a file is (at least sometimes) meant to be used by human beings, so it should be easy for humans to use. Different operating systems put different restrictions on names:

Size


Some systems put severe restrictions on the length of names. For example, DOS restricts names to 11 characters, while early versions of Unix (and some still in use today) restrict names to 14 characters. The Macintosh operating system, Windows 95, and most modern versions of Unix allow names to be essentially arbitrarily long. I say "essentially" since names are meant to be used by humans, so they don't really need to be all that long. A name that is 100 characters long is just as difficult to use as one that is forced to be under 11 characters long (but for different reasons). Most modern versions of Unix, for example, restrict names to a limit of 255 characters.

Case

Are upper and lower case letters considered different? The Unix tradition is to consider the names Foo and foo to be completely different and unrelated names. In DOS and its descendants, however, they are considered the same. Some systems translate names to one case (usually upper case) for storage. Others retain the original case but consider it simply a matter of decoration. For example, if you create a file named "Foo," you could open it as "foo" or "FOO," but if you list the directory, you would still see the file listed as "Foo."

Character Set

Different systems put different restrictions on what characters can appear in file names. The Unix directory structure supports names containing any character other than NUL (the byte consisting of all zero bits), but many utility programs (such as the shell) would have trouble with names that have spaces, control characters, or certain punctuation characters (particularly the path separator /). MacOS allows all of these (e.g., it is not uncommon to see a file name with the copyright symbol © in it). With the world-wide spread of computer technology, it is becoming increasingly important to support languages other than English, and in fact alphabets other than Latin. There is a move to support character strings (and in particular file names) in the Unicode character set, which devotes 16 bits to each character rather than 8, and can represent the alphabets of all major modern languages, from Arabic to Devanagari to Telugu to Khmer.

Format

It is common to divide a file name into a base name and an extension that indicates the type of the file. DOS requires that each name be composed of a base name of eight or fewer characters and an extension of three or fewer characters. When the name is displayed, it is represented as base.extension. Unix internally makes no such distinction, but it is a common convention to include exactly one period in a file name (e.g. foo.c for a C source file).

File Structure

Unix hides the "chunkiness" of tracks, sectors, etc. and presents each file as a "smooth" array of bytes with no internal structure. Application programs can, if they wish, use the bytes in the file to represent structures. For example, a widespread convention in Unix is to use the newline character (the character with bit pattern 00001010) to break text files into lines. Some other systems provide a variety of other types of files. The most common are files that consist of an array of fixed or variable size records, and files that form an index mapping keys to values.
Indexed files are usually implemented as B-trees.

File Types
Most systems divide files into various "types." The concept of "type" is a confusing one, partially because the term "type" can mean different things in different contexts. Unix initially supported only four types of files: directories, two kinds of special files (discussed later), and "regular" files. Just about any type of file is considered a "regular" file by Unix. Within this category, however, it is useful to distinguish text files from binary files; within binary files there are executable files (which contain machine-language code) and data files; text files might be source files in a particular programming language (e.g. C or Java) or they may be human-readable text in some mark-up language such as html (hypertext markup language). Data files may be classified according to the program that created them or is able to interpret them; e.g., a file may be a Microsoft Word document or Excel spreadsheet or the output of TeX. The possibilities are endless. In general (not just in Unix) there are three ways of indicating the type of a file:

• The operating system may record the type of a file in meta-data stored separately from the file, but associated with it. Unix only provides enough meta-data to distinguish a regular file from a directory (or special file), but other systems support more types.

• The type of a file may be indicated by part of its contents, such as a header made up of the first few bytes of the file. In Unix, files that store executable programs start with a two-byte magic number that identifies them as executable and selects one of a variety of executable formats. In the original Unix executable format, called the a.out format, the magic number is the octal number 0407, which happens to be the machine code for a branch instruction on the PDP-11 computer, one of the first computers to implement Unix. The operating system could run a file by loading it into memory and jumping to the beginning of it.
The 0407 code, interpreted as an instruction, jumps to the word following the 16-byte header, which is the beginning of the executable code in this format. The PDP-11 computer is extinct by now, but it lives on through the 0407 code!

• The type of a file may be indicated by its name. Sometimes this is just a convention, and sometimes it's enforced by the OS or by certain programs. For example, the Unix Java compiler refuses to believe that a file contains Java source unless its name ends with .java.
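The header-based approach in the second bullet can be sketched as follows. The class name, method names, and the big-endian byte order used here are illustrative assumptions (the real a.out header stored the magic as a little-endian PDP-11 word); this is a toy classifier, not a loader.

```java
// Sketch: classifying a file by a two-byte "magic number" in its header.
// The constant mimics the old a.out magic 0407 (octal); names and byte
// order are invented for illustration.
public class MagicProbe {
    static final int AOUT_MAGIC = 0407; // octal 0407 = 263 decimal

    // Interpret the first two bytes of a header as a 16-bit number
    // (big-endian here, an arbitrary choice for this sketch).
    static int magicOf(byte[] header) {
        if (header.length < 2) return -1;
        return ((header[0] & 0xFF) << 8) | (header[1] & 0xFF);
    }

    static boolean looksExecutable(byte[] header) {
        return magicOf(header) == AOUT_MAGIC;
    }
}
```

A system using this scheme never has to trust the file's name: the first bytes of the contents decide how the file is treated.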





Some systems enforce the types of files more vigorously than others. File types may be enforced

• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.

Unix tends to be very lax in enforcing types.

Access Modes

Systems support various access modes for operations on a file.

Sequential: Read or write the next record or next n bytes of the file. Usually sequential access also allows a rewind operation.

Random: Read or write the nth record or bytes i through j. Unix provides an equivalent facility by adding a seek operation to the sequential operations listed above. This packaging of operations allows random access but encourages sequential access.
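The sequential-plus-seek packaging just described can be sketched with an in-memory "file" that keeps a seek pointer; the class and method names are invented for illustration, not a real OS interface.

```java
// Sketch: a byte-array "file" with a seek pointer. read() is sequential;
// seek() turns the same interface into random access.
public class SeekableFile {
    private final byte[] data;
    private int pos = 0; // the seek pointer

    public SeekableFile(byte[] data) { this.data = data; }

    // Sequential read: return up to n bytes starting at the seek pointer.
    // Near the end of the file this returns fewer than n bytes.
    public byte[] read(int n) {
        int count = Math.min(n, data.length - pos);
        byte[] out = new byte[count];
        System.arraycopy(data, pos, out, 0, count);
        pos += count;
        return out;
    }

    // Random access: reposition the seek pointer (clamped to the file).
    public void seek(int offset) {
        pos = Math.max(0, Math.min(offset, data.length));
    }

    public int tell() { return pos; }
}
```

Note how random access falls out of two sequential primitives: a program that wants byte n simply seeks to n and then reads.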

Indexed: Read or write the record with a given key. In some cases the "key" need not be unique; there can be more than one record with the same key. In this case, programs use a combination of indexed and sequential operations: Get the first record with a given key, then get other records with the same key by doing sequential reads.

File Attributes

This is the area where there is the most variation among file systems. Attributes can be grouped by general category.

Name, Ownership, and Protection

Owner, owner's "group," creator, access-control list (information about who can do what to this file; for example, perhaps the owner can read or modify it, other members of his group can only read it, and others have no access).

Time stamps

Time created, time last modified, time last accessed, time the attributes were last changed, etc. Unix maintains the last three of these. Some systems record not only when the file was last modified, but by whom.

Sizes

Current size, size limit, "high-water mark," space consumed (which may be larger than size because of internal fragmentation, or smaller because of various compression techniques).

Type Information

As described above: File is ASCII, is executable, is a "system" file, is an Excel spreadsheet, etc.

Misc

Some systems have attributes describing how the file should be displayed when a directory is listed. For example, MacOS records an icon to represent the file and the screen coordinates where it was last displayed. DOS has a "hidden" attribute, meaning that the file is not normally shown. Unix achieves a similar effect by convention: The ls program that is usually used to list files does not show files with names that start with a period unless you explicitly request it to (with the -a option). Unix records a fixed set of attributes in the meta-data associated with a file.
If you want to record some fact about the file that is not included among the supported attributes, you have to use one of the tricks listed above for recording type information: encode it in the name of the file, put it into the body of the file itself, or store it in a file with a related name (e.g. "foo.attributes"). Other systems (notably MacOS and Windows NT) allow new attributes to be invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-name, attribute-value) pairs. The attribute name can be any four-character string, and the attribute value can be anything at all. Indeed, some kinds of files put the entire "contents" of the file in an attribute and leave the "body" of the file (called the data fork) empty.

Operations

POSIX, a standard API (application programming interface) based on Unix, provides the following operations (among others) for manipulating files:

• fd = open(name, operation)
• fd = creat(name, mode)
• status = close(fd)
• byte_count = read(fd, buffer, byte_count)
• byte_count = write(fd, buffer, byte_count)
• offset = lseek(fd, offset, whence)
• status = link(oldname, newname)
• status = unlink(name)
• status = stat(name, buffer)
• status = fstat(fd, buffer)
• status = utimes(name, times)
• status = chown(name, owner, group) or fchown(fd, owner, group)
• status = chmod(name, mode) or fchmod(fd, mode)
• status = truncate(name, size) or ftruncate(fd, size)

Some types of arguments and results need explanation.

Status: Many functions return a "status" which is either 0 for success or -1 for errors (there is another mechanism to get more information about what went wrong). Other functions also use -1 as a return value to indicate an error.

Name: A character-string name for a file.

Fd: A "file descriptor," which is a small non-negative integer used as a short temporary name for a file during the lifetime of a process.

Buffer: The memory address of the start of a buffer for supplying or receiving data.

Whence: One of three codes, signifying from start, from end, or from current location.

Mode: A bit-mask specifying protection information.

Operation: An integer code: one of read, write, read and write, and perhaps a few other possibilities such as append only.

The open call finds a file and assigns a descriptor to it. It also indicates how the file will be used by this process (read only, read/write, etc.). The creat call is similar, but creates a new (empty) file. The mode argument specifies protection attributes (such as "writable by owner but read-only by others") for the new file. (Most modern versions of Unix have merged creat into open by adding an optional mode argument and allowing the operation argument to specify that the file is automatically created if it doesn't already exist.) The close call simply announces that fd is no longer in use and can be reused for another open or creat.

The read and write operations transfer data between a file and memory. The starting location in memory is indicated by the buffer parameter; the starting location in the file (called the seek pointer) is wherever the last read or write left off. The result is the number of bytes transferred. For write, it is normally the same as the byte_count parameter unless there is an error. For read, it may be smaller if the seek pointer starts out near the end of the file. The lseek operation adjusts the seek pointer (it is also automatically updated by read and write). The specified offset is added to zero, the current seek pointer, or the current size of the file, depending on the value of whence.

The function link adds a new name (alias) to a file, while unlink removes a name. There is no function to delete a file; the system automatically deletes it when there are no remaining names for it. The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed, documented format), while the remaining functions can be used to update the meta-data: utimes updates time stamps, chown updates ownership, chmod updates protection information, and truncate changes the size (files can be made bigger by write, but only truncate can make them smaller). Most come in two flavors: one that takes a file name and one that takes a descriptor for an open file. For more details, type a command such as man 2 stat to any Unix system. The "2" means to look in section 2 of the manual, where system calls are explained. Other systems have similar operations, and perhaps a few more. For example, indexed or indexed sequential files would require a version of seek to specify a key rather than an offset. It is also common to have a separate append operation for writing to the end of a file.
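The rule for combining offset and whence in lseek can be shown as a small model. The constant names SET/CUR/END stand in for the three whence codes described above; this is an illustrative sketch, not the real system call.

```java
// Sketch: computing a new seek pointer from (offset, whence), modeling
// the lseek rule in the text: the offset is added to zero, the current
// seek pointer, or the current file size.
public class Lseek {
    public static final int SET = 0; // measure from start of file
    public static final int CUR = 1; // measure from current position
    public static final int END = 2; // measure from end of file

    public static long newOffset(long current, long fileSize,
                                 long offset, int whence) {
        switch (whence) {
            case SET: return offset;
            case CUR: return current + offset;
            case END: return fileSize + offset;
            default:  throw new IllegalArgumentException("bad whence");
        }
    }
}
```

A negative offset combined with END is the usual way to position just before the end of a file.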

The User Interface to Directories
We already talked about file names. One important feature that a file name should have is that it be unambiguous: There should be at most one file with any given name. The symmetrical condition, that there be at most one name for any given file, is not necessarily a good thing. Sometimes it is handy to be able to give multiple names to a file. When we consider implementation, we will describe two different ways to implement multiple names for a file, each with slightly different semantics. If there are a lot of files in a system, it may be difficult to avoid giving two files the same name, particularly if there are multiple users independently making up names. One technique to assure uniqueness is to prefix each file name with the name (or user id) of the owner. In some early operating systems, that was the only assistance the system gave in preventing conflicts. A better idea is the hierarchical directory structure, first introduced by Multics, then popularized by Unix, and now found in virtually every operating system. You probably already know about hierarchical directories, but I would like to describe them from an unusual point of view and then explain how this point of view is equivalent to the more familiar version. Each file is named by a sequence of names. Although all modern operating systems use this technique, each uses a different character to separate the components of the sequence when displaying it as a character string: Multics uses '>', Unix uses '/', DOS and its descendants use '\', and MacOS uses ':'. Sequences make it easy to avoid naming conflicts. First assign a sequence to each user, and only let him create files with names that start with that sequence. For example, I might be assigned the sequence ("usr", "solomon"), written in Unix as /usr/solomon. So far, this is the same as just appending the user name to each file name. But it allows me to further classify my own files to prevent conflicts.
When I start a new project, I can create a new sequence by appending the name of the project to the end of the sequence assigned to me and then use this prefix for all files in the project. For example, I might choose /usr/solomon/cs537 for files associated with this course and

name them /usr/solomon/cs537/foo, /usr/solomon/cs537/bar, etc. As an extra aid, the system allows me to specify a "default prefix" and a short-hand for writing names that start with that prefix. In Unix, I use the system call chdir to specify a prefix, and whenever I use a name that does not start with '/', the system automatically adds that prefix.

It is customary to think of the directory system as a directed graph with names on the edges. Each path in the graph is associated with a sequence of names: the names on the edges that make up the path. For that reason, the sequence of names is usually called a path name. One node is designated as the root node, and the rule is enforced that there cannot be two edges with the same name coming out of one node. With this rule, we can use path names to name nodes: Start at the root node and treat the path name as a sequence of directions telling us which edge to follow at each step. It may be impossible to follow the directions (because they tell us to use an edge that does not exist), but if it is possible to follow the directions, they will lead us unambiguously to one node. Thus path names can be used as unambiguous names for nodes. In fact, as we will see, this is how the directory system is actually implemented. However, I think it is useful to think of "path names" simply as long names to avoid naming conflicts, since it clearly separates the interface from the implementation.
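The "default prefix" rule just described can be sketched in a few lines: a name starting with '/' is used as-is, anything else gets the current prefix prepended. The class and method names are invented for illustration.

```java
// Sketch: combining a default prefix (current working directory) with a
// name, following the Unix rule described above.
public class PathPrefix {
    public static String resolve(String cwd, String name) {
        if (name.startsWith("/"))
            return name;              // absolute name: prefix is ignored
        if (cwd.endsWith("/"))
            return cwd + name;        // avoid a doubled separator
        return cwd + "/" + name;      // relative name: prepend the prefix
    }
}
```

This is pure string manipulation; the real work of turning the resulting path into a file happens later, in the directory lookup described below.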

Implementing File Systems
Files

We will assume that all the blocks of the disk are given block numbers, starting at zero and running through consecutive integers up to some maximum. We will further assume that blocks with numbers that are near each other are located physically near each other on the disk (e.g. same cylinder), so that the arithmetic difference between the numbers of two blocks gives a good estimate of how long it takes to get from one to the other. First, let's consider how to represent an individual file. There are (at least!) four possibilities:

Contiguous: The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent any file with a pair of numbers: the block number of the first block and the length of the file (in blocks). The advantages of this approach are

• It's simple.
• The blocks of the file are all physically near each other on the disk and in order, so that a sequential scan through the file will be fast.

The problem with this organization is that you can only grow a file if the block following the last block in the file happens to be free. Otherwise, you would have to find a long enough run of free blocks to accommodate the new length of the file and copy it. As a practical matter, operating systems that use this organization require the maximum size of the file to be declared when it is created and pre-allocate space for the whole file. Even then, storage allocation has all the problems we considered when studying main-memory allocation, including external fragmentation.

Linked List: A file is represented by the block number of its first block, and each block contains the block number of the next block of the file. This representation avoids the problems of the contiguous representation: We can grow a file by linking any disk block onto the end of the list, and there is no external fragmentation. However, it introduces a new problem: Random access is effectively impossible. To find the 100th block of a file, we have to read the first 99 blocks just to follow the list. We also lose the advantage of very fast sequential access to the file, since its blocks may be scattered all over the disk. However, if we are careful when choosing blocks to add to a file, we can retain pretty good sequential access performance.

Both the space overhead (the percentage of the space taken up by pointers) and the time overhead (the percentage of the time seeking from one place to another) can be decreased by using larger blocks. The hardware designer fixes the block size (which is usually quite small), but the software can get around this problem by using "virtual" blocks, sometimes called clusters. The OS simply treats each group of (say) four contiguous physical disk sectors as one cluster. Large clusters, particularly if they can be variable size, are sometimes called extents. Extents can be thought of as a compromise between linked and contiguous allocation.

Disk Index: The idea here is to keep the linked-list representation, but take the link fields out of the blocks and gather them together all in one place. This approach is used in the "FAT" file system of DOS, OS/2, and older versions of Windows. At some fixed place on disk, allocate an array I with one element for each block on the disk, and move the link field from block n to I[n] (see Figure 11.17 on page 382). The whole array of links, called a file allocation table (FAT), is now small enough that it can be read into main memory when the system starts up. Accessing the 100th block of a file still requires walking through 99 links of a linked list, but now the entire list is in memory, so the time to traverse it is negligible (recall that a single disk access takes as long as 10s or even 100s of thousands of instructions). This representation has the added advantage of getting the "operating system" stuff (the links) out of the pages of "user data." The pages of user data are now full-size disk blocks, and lots of algorithms work better with chunks that are a power of two bytes long. Also, it means that the OS can prevent users (who are notorious for screwing things up) from getting their grubby hands on the system data. The main problem with this approach is that the index array I can get quite large with modern disks.
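Walking a FAT chain to find the nth block of a file can be sketched as follows. The array contents, the class name, and the end-of-file marker value are invented for illustration; the point is that the whole walk happens in memory.

```java
// Sketch: finding the nth block of a file under the FAT scheme described
// above. fat[b] holds the number of the block that follows block b in
// its file; EOF marks the last block of a chain.
public class FatWalk {
    static final int EOF = -1;

    public static int nthBlock(int[] fat, int firstBlock, int n) {
        int b = firstBlock;
        for (int i = 0; i < n; i++) {   // follow n links, all in memory
            b = fat[b];
            if (b == EOF)
                throw new IllegalArgumentException("file is too short");
        }
        return b;
    }
}
```

Even though finding block 100 still means following 99 links, no disk access is needed for the walk itself; the only disk access is the one that finally reads the data block.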
For example, consider a 2 GB disk with 2K blocks. There are a million blocks, so a block number must be at least 20 bits. Rounded up to an even number of bytes, that's 3 bytes (4 bytes if we round up to a word boundary), so the array I is three or four megabytes. While that's not an excessive amount of memory given today's RAM prices, if we can get along with less, there are better uses for the memory.

File Index: Although a typical disk may contain tens of thousands of files, only a few of them are open at any one time, and it is only necessary to keep index information about open files in memory to get good performance. Unfortunately, the whole-disk index described in the previous paragraph mixes together index information about all files for the whole disk, making it difficult to cache only information about open files. The inode structure introduced by Unix groups together index information about each file individually. The basic idea is to represent each file as a tree of blocks, with the data blocks as leaves. Each internal block (called an indirect block in Unix jargon) is an array of block numbers listing its children in order. If a disk block is 2K bytes and a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a single root node pointing directly to the leaves) can accommodate files up to 512 blocks, or one megabyte, in size. If the root node is cached in memory, the "address" (block number) of any block of the file can be found without any disk accesses. A two-level tree, with 513 total indirect blocks, can handle files 512 times as large (up to one-half gigabyte). The only problem with this idea is that it wastes space for small files. Any file with more than one block needs at least one indirect block to store its block numbers. A 4K file would require three 2K blocks, wasting up to one third of its space. Since many files are quite small, this is a serious problem. The Unix
solution is to use a different kind of "block" for the root of the tree. An index node (or inode for short) contains almost all the meta-data about a file listed above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are small enough that several of them can be packed into one disk block. In addition to the meta-data, an inode contains the block numbers of the first few blocks of the file. What if the file is too big to fit all its block numbers

into the inode? The earliest version of Unix had a bit in the meta-data to indicate whether the file was "small" or "big." For a big file, the inode contained the block numbers of indirect blocks rather than data blocks. More recent versions of Unix contain pointers to indirect blocks in addition to the pointers to the first few data blocks. The inode contains pointers to (i.e. block numbers of) the first few blocks of the file, a pointer to an indirect block containing pointers to the next several blocks of the file, a pointer to a doubly indirect block, which is the root of a two-level tree whose leaves are the next blocks of the file, and a pointer to a triply indirect block. A large file is thus a lop-sided tree.

A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are four bytes, and the size of a block is a parameter stored in the file system itself, typically 8K (8192 bytes), so 2048 pointers fit in one block. An inode has direct pointers to the first 12 blocks of the file, as well as pointers to singly, doubly, and triply indirect blocks. A file of up to 12+2048+2048*2048 = 4,196,364 blocks, or 34,376,613,888 bytes (about 32 GB), can be represented without using triply indirect blocks, and with the triply indirect block, the maximum file size is (12+2048+2048*2048+2048*2048*2048)*8192 = 70,403,120,791,552 bytes (slightly more than 2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of the file cannot be represented as a 32-bit integer. Modern versions of Unix store the file length as a 64-bit integer, called a "long" integer in Java. An inode is 128 bytes long, allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in one disk block. Since the inode for a file is kept in memory while the file is open, locating an arbitrary block of any file requires at most three I/O operations, not counting the operation to read or write the data block itself.
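The Solaris arithmetic above is easy to check mechanically. The sketch below just restates the numbers from the text (8K blocks, 4-byte block numbers, 12 direct pointers plus single, double, and triple indirect blocks); the class and method names are invented.

```java
// Sketch: the maximum-file-size arithmetic for the Solaris 2.5 inode
// layout described in the text.
public class InodeMath {
    static final long BLOCK = 8192;        // block size in bytes
    static final long PTRS = BLOCK / 4;    // 2048 pointers per block
    static final long DIRECT = 12;         // direct pointers in the inode

    // Blocks reachable without the triply indirect block.
    static long maxBlocksWithoutTriple() {
        return DIRECT + PTRS + PTRS * PTRS;
    }

    // Bytes reachable using all four kinds of pointers.
    static long maxBytesWithTriple() {
        return (DIRECT + PTRS + PTRS * PTRS + PTRS * PTRS * PTRS) * BLOCK;
    }
}
```

Running the two methods reproduces the figures quoted in the text: about 32 GB without the triply indirect block, and about 64 TB with it.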
Directories

A directory is simply a table mapping character-string human-readable names to information about files. The early PC operating system CP/M shows how simple a directory can be. Each entry contains the name of one file, its owner, size (in blocks), and the block numbers of 16 blocks of the file. To represent files with more than 16 blocks, CP/M used multiple directory entries with the same name and different values in a field called the extent number. CP/M had only one directory for the entire system. DOS uses a similar directory entry format, but stores only the first block number of the file in the directory entry. The entire file is represented as a linked list of blocks, using the disk index scheme described above. All but the earliest versions of DOS provide hierarchical directories, using a scheme similar to the one used in Unix. Unix has an even simpler directory format. A directory entry contains only two fields: a character-string name (up to 14 characters) and a two-byte integer called an inumber, which is interpreted as an index into an array of inodes in a fixed, known location on disk. All the remaining information about the file (size, ownership, time stamps, permissions, and an index to the blocks of the file) is stored in the inode rather than the directory entry. A directory is represented like any other file (there's a bit in the inode to indicate that the file is a directory). Thus the inumber in a directory entry may designate a "regular" file or another directory, allowing arbitrary graphs of nodes. However, Unix carefully limits the set of operating system calls to ensure that the set of directories is always a tree. The root of the tree is the file with inumber 1 (some versions of Unix use other conventions for designating the root directory). The entries in each directory point to its children in the tree.
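A Unix-style directory, viewed as a flat table of (inumber, name) entries, can be sketched as follows. The class layout and the lookup helper are illustrative; the "0 means unused or not found" convention follows the description in the text.

```java
// Sketch: a directory as parallel arrays of inumbers and names, with the
// name-to-inumber lookup that path-name translation relies on.
public class Directory {
    int[] inumbers;
    String[] names;

    Directory(int[] inumbers, String[] names) {
        this.inumbers = inumbers;
        this.names = names;
    }

    // Return the inumber for a name, or 0 if there is no such entry
    // (entries with inumber 0 are unused).
    int nameToInumber(String name) {
        for (int i = 0; i < names.length; i++)
            if (inumbers[i] != 0 && names[i].equals(name))
                return inumbers[i];
        return 0;
    }
}
```

Everything else about the file (size, ownership, block index) lives in the inode that the returned inumber selects.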
For convenience, each directory also has two special entries: an entry with name "..", which points to the parent of the directory in the tree, and an entry with name ".", which points to the directory itself. Inumber 0 is not used, so an entry is marked "unused" by setting its inumber field to 0. The algorithm to convert from a path name to an inumber might be written in Java as

    int namei(int current, String[] path) {
        for (int i = 0; i < path.length; i++) {
            if (inode[current].type != DIRECTORY)
                throw new Exception("not a directory");
            current = nameToInumber(inode[current], path[i]);
            if (current == 0)
                throw new Exception("no such file or directory");
        }
        return current;
    }

The procedure nameToInumber(Inode node, String name) (not shown) reads through the directory file represented by the inode node, looks for an entry matching the given name, and returns the inumber contained in that entry. The procedure namei walks the directory tree, starting at a given inode and following a path described by a sequence of strings. There is a procedure with this name in the Unix kernel. Files are always specified in Unix system calls by a character-string path name. You can learn the inumber of a file if you like, but you can't use the inumber when talking to the Unix kernel. Each system call that has a path name as an argument uses namei to translate it to an inumber. If the argument is an absolute path name (it starts with '/'), namei is called with current == 1. Otherwise, current is the current working directory.

Since all the information about a file except its name is stored in the inode, there can be more than one directory entry designating the same file. This allows multiple aliases (called links) for a file. Unix provides a system call link(old-name, new-name) to create new names for existing files. The call link("/a/b/c", "/d/e/f") works something like this:

    if (namei(1, parse("/d/e/f")) != 0)
        throw new Exception("file already exists");
    int dir = namei(1, parse("/d/e"));
    if (dir == 0 || inode[dir].type != DIRECTORY)
        throw new Exception("not a directory");
    int target = namei(1, parse("/a/b/c"));
    if (target == 0)
        throw new Exception("no such directory");
    if (inode[target].type == DIRECTORY)
        throw new Exception("cannot link to a directory");

    addDirectoryEntry(inode[dir], target, "f");

The procedure parse (not shown here) is assumed to break up a path name into its components. If, for example, /a/b/c resolves to inumber 123, the entry (123, "f") is added to the directory file designated by "/d/e". The result is that both "/a/b/c" and "/d/e/f" resolve to the same file (the one with inumber 123).

We have seen that a file can have more than one name. What happens if it has no names (does not appear in any directory)? Since the only way to name a file in a system call is by a path name, such a file would be useless. It would consume resources (the inode and probably some data and indirect blocks), but there would be no way to read it, write to it, or even delete it. Unix protects against this "garbage collection" problem by using reference counts. Each inode contains a count of the number of directory entries that point to it. "User" programs are not allowed to update directories directly. System calls that add or remove directory entries (creat, link, mkdir, rmdir, etc.) update these reference counts appropriately. There is no system call to delete a file, only the system call unlink(name), which removes the directory entry corresponding to name. If the reference count of an inode drops to zero, the system automatically deletes the file and returns all of its blocks to the free list. We saw before that the reference counting algorithm for garbage collection has a fatal flaw: If there are cycles, reference counting will fail to collect some garbage. Unix avoids this problem by making sure cycles cannot happen. The system calls are designed so that the set of directories will always be a single tree rooted at inode 1: mkdir creates a new empty directory (except for the . and .. entries) as a leaf of the tree, rmdir is only allowed to delete a directory that is empty (except for the . and .. entries), and link is not allowed to link to a directory.
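The reference-counting rule just described can be sketched in isolation: link bumps the count, unlink drops it, and the file's blocks are freed only when the count reaches zero. All names here are invented for illustration.

```java
// Sketch: the per-inode reference count described above. A "file" is
// deleted (its blocks returned to the free list) only when the last
// directory entry pointing to it is removed.
public class RefCount {
    int links = 0;           // number of directory entries pointing here
    boolean freed = false;   // stands in for "blocks on the free list"

    void link() { links++; }         // a directory entry was added

    void unlink() {                  // a directory entry was removed
        links--;
        if (links == 0)
            freed = true;            // last name gone: delete the file
    }
}
```

This is why Unix has no "delete file" call: unlink removes one name, and deletion happens as a side effect of the count hitting zero.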
Because links to directories are not allowed, the only place the file system is not a tree is at the leaves (regular files), and that cannot introduce cycles. Although this algorithm provides the ability to create aliases for files in a simple and secure manner, it has several flaws:

• It's hard to figure out how to charge users for disk space. Ownership is associated with the file, not the directory entry (the owner's id is stored in the inode). A file cannot be deleted without finding all the links to it and deleting them. If I create a file and you make a link to it, I will continue to be charged for it even if I try to remove it through my original name for it. Worse still, your link may be in a directory I don't have access to, so I may be unable to delete the file even though I'm being charged for its space. Indeed, you could make it much bigger after I have no access to it.
• There is no way to make an alias for a directory.
• As we will see later, links cannot cross boundaries of physical disks.
• Since all aliases are equal, there's no one "true name" for a file. You can find out whether two path names designate the same file by comparing inumbers. There is a system call to get the meta-data about a file, and the inumber is included in that information. But there is no way of going in the other direction: to get a path name for a file given its inumber, or to find a path name of an open file. Even if you remember the path name used to get to the file, that is not a reliable "handle" to the file (for example, to link two files together by storing the name of one in the other). One of the components of the path name could be removed, thus invalidating the name, even though the file still exists under a different name.


While it's not possible to find the name (or any name) of an arbitrary file, it is possible to figure out the name of a directory. Directories do have unique names because the directories form a tree,

and one of the properties of a tree is that there is a unique path from the root to any node. The ".." and "." entries in each directory make this possible. Here, for example, is code to find the name of the current working directory.

    class DirectoryEntry {
        int inumber;
        String name;
    }

    String cwd() {
        FileInputStream thisDir = new FileInputStream(".");
        int thisInumber = nameToInumber(thisDir, ".");
        return getPath(".", thisInumber);
    }

    String getPath(String currentName, int currentInumber) {
        String parentName = currentName + "/..";
        FileInputStream parent = new FileInputStream(parentName);
        int parentInumber = nameToInumber(parent, ".");
        String fname = inumberToName(parent, currentInumber);
        if (parentInumber == 1)
            return "/" + fname;
        else
            return getPath(parentName, parentInumber) + "/" + fname;
    }

The procedure nameToInumber is similar to the procedure with the same name described above, but takes an InputStream as an argument rather than an inode. Many versions of Unix allow a program to open a directory for reading and read its contents just like any other file. In such systems, it would be easy to write nameToInumber as a user-level procedure if you know the format of a directory. The procedure inumberToName is similar, but searches for an entry containing a particular inumber and returns the name field of the entry.

Symbolic Links

To get around the limitations of the original Unix notion of links, more recent versions of Unix introduced the notion of a symbolic link (to avoid confusion, the original kind of link described in the previous section is sometimes called a hard link). A symbolic link is a new type of file, distinguished by a code in the inode from directories, regular files, etc. When the namei procedure that translates path names to inumbers encounters a symlink, it treats the contents of the file as a pathname and

uses it to continue the translation. If the contents of the file is a relative path name (it does not start with a slash), it is interpreted relative to the directory containing the link itself, not the current working directory of the process doing the lookup.

    int namei(int current, String[] path) {
        for (int i = 0; i < path.length; i++) {
            if (inode[current].type != DIRECTORY)
                throw new Exception("not a directory");
            parent = current;
            current = nameToInumber(inode[current], path[i]);
            if (current == 0)
                throw new Exception("no such file or directory");
            if (inode[current].type == SYMLINK) {
                String link = getContents(inode[current]);
                String[] linkPath = parse(link);
                if (link.charAt(0) == '/')
                    current = namei(1, linkPath);
                else
                    current = namei(parent, linkPath);
                if (current == 0)
                    throw new Exception("no such file or directory");
            }
        }
        return current;
    }

The only change from the previous version of this procedure is the addition of the branch that handles nodes of type SYMLINK. Any time the procedure encounters such a node, it recursively calls itself to translate the contents of the file, interpreted as a path name, into an inumber. Although the implementation looks complicated, it does just what you would expect in normal situations. For example, suppose there is an existing file named /a/b/c and an existing directory /d. Then the command

    ln -s /a/b /d/e

makes the path name /d/e a synonym for /a/b, and also makes /d/e/c a synonym for /a/b/c. From the user's point of view, the picture looks like this:


In implementation terms, the picture looks like this:

where the hexagon denotes a node of type symlink. Here's a more elaborate example that illustrates symlinks with relative path names. Suppose I have an existing directory /usr/solomon/cs537/s90 with various sub-directories, and I am setting up project 5 for this semester. I might do something like this:

    cd /usr/solomon/cs537
    mkdir f96
    cd f96
    ln -s ../s90/proj5 proj5.old
    cat proj5.old/foo.c
    cd /usr/solomon/cs537
    cat f96/proj5.old/foo.c
    cat s90/proj5/foo.c

Logically, the situation looks like this:


And physically, it looks like this:

All three of the cat commands refer to the same file. The added flexibility of symlinks over hard links comes at the expense of less security. Symlinks are neither required nor guaranteed to point to valid files. You can remove a file out from under a symlink, and in fact you can create a symlink to a non-existent file. Symlinks can also have cycles. For example, this works fine:

    cd /usr/solomon
    mkdir bar
    ln -s /usr/solomon foo
    ls /usr/solomon/foo/foo/foo/foo/bar
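Because of cycles like the one just created, a resolution loop has to bound how many symlink expansions it will perform. The following is an illustrative Python model, not kernel code: the namespace dictionary, the ("symlink", target) encoding, and the MAXSYMLINKS value are all invented for the example.

```python
# Toy namespace: maps a path either to ordinary content or to a
# ("symlink", target) pair that redirects the translation, as namei does.
MAXSYMLINKS = 8  # real kernels use a small fixed limit like this

def resolve(ns, path, depth=0):
    """Translate path to its final name, expanding symlinks with a cap."""
    if depth > MAXSYMLINKS:
        raise OSError("too many links")
    entry = ns.get(path)
    if isinstance(entry, tuple) and entry[0] == "symlink":
        # Follow the link target, counting one more level of expansion.
        return resolve(ns, entry[1], depth + 1)
    return path

ns = {
    "/a/b": ("symlink", "/c"),
    "/c": "file",
    "/loop": ("symlink", "/loop"),   # a cycle, like ln -s /usr/solomon foo
}
```

Resolving /a/b terminates normally, while resolving /loop hits the limit and fails, which is exactly the "too many links" behavior described next.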

However, in some cases symlinks can cause infinite loops or infinite recursion in the namei procedure. The real version in Unix puts a limit on how many times it will iterate and returns an error code of "too many links" if the limit is exceeded. Symlinks to directories can also cause the "change directory" command cd to behave in strange ways. Most people expect that the two commands

    cd foo

    cd ..

cancel each other out. But in the last example, the commands

    cd /usr/solomon
    cd foo
    cd ..

would leave you in the directory /usr. Some shell programs treat cd specially and remember what alias you used to get to the current directory: after cd /usr/solomon; cd foo; cd foo, the current directory is /usr/solomon/foo/foo, which is an alias for /usr/solomon, but the command cd .. is treated as if you had typed cd /usr/solomon/foo.

Mounting

What if your computer has more than one disk? In many operating systems (including DOS and its descendants) a pathname starts with a device name, as in C:\usr\solomon (by convention, C is the name of the default hard disk). If you leave the device prefix off a path name, the system supplies a default current device, similar to the current directory. Unix allows you to glue together the directory trees of multiple disks to create a single unified tree. There is a system call

    mount(device, mount_point)

where device names a particular disk drive and mount_point is the path name of an existing node in the current directory tree (normally an empty directory). The result is similar to a hard link: the mount point becomes an alias for the root directory of the indicated disk. Here's how it works: the kernel maintains a table of existing mounts, represented as (device1, inumber, device2) triples. During namei, whenever the current (device, inumber) pair matches the first two fields of one of the entries, the current device and inumber become device2 and 1, respectively. Here's the expanded code:

    int namei(int curi, int curdev, String[] path) {
        for (int i = 0; i < path.length; i++) {
            if (disk[curdev].inode[curi].type != DIRECTORY)
                throw new Exception("not a directory");
            parent = curi;
            curi = nameToInumber(disk[curdev].inode[curi], path[i]);
            if (curi == 0)
                throw new Exception("no such file or directory");
            if (disk[curdev].inode[curi].type == SYMLINK) {
                String link = getContents(disk[curdev].inode[curi]);

                String[] linkPath = parse(link);
                if (link.charAt(0) == '/')
                    curi = namei(1, curdev, linkPath);
                else
                    curi = namei(parent, curdev, linkPath);
                if (curi == 0)
                    throw new Exception("no such file or directory");
            }
            int newdev = mountLookup(curdev, curi);
            if (newdev != -1) {
                curdev = newdev;
                curi = 1;
            }
        }
        return curi;
    }

In this code, we assume that mountLookup searches the mount table for a matching entry, returning -1 if no matching entry is found. There is also a special case (not shown here) for "..", so that the ".." entry in the root directory of a mounted disk behaves like a pointer to the parent directory of the mount point.

The Network File System (NFS) from Sun Microsystems extends this idea to allow you to mount a disk from a remote computer. The device argument to the mount system call names the remote computer as well as the disk drive, and both pieces of information are put into the mount table. Now there are three pieces of information to define the "current directory": the inumber, the device, and the computer. If the current computer is remote, all operations (read, write, creat, delete, mkdir, rmdir, etc.) are sent as messages to the remote computer. Information about remote open files, including a seek pointer and the identity of the remote machine, is kept locally. Each read or write operation is converted locally to one or more requests to read or write blocks of the remote file. NFS caches blocks of remote files locally to improve performance.

Special Files

I said that the Unix mount system call has the name of a disk device as an argument. How do you name a device? The answer is that devices appear in the directory tree as special files. An inode whose type is "special" (as opposed to "directory," "symlink," or "regular") represents some sort of I/O device. It is customary to put special files in the directory /dev, but since it is the inode that is marked "special," they can be anywhere. Instead of containing pointers to disk blocks, the inode of a special file contains information (in a machine-dependent format) about the device. The operating system

tries to make the device look as much like a file as possible, so that ordinary programs can open, close, read, or write the device just like a file. Some devices look more like real files than others. A disk device looks exactly like a file: reads return whatever is on the disk, and writes can scribble anywhere on the disk. For obvious security reasons, the permissions for the raw disk devices are highly restrictive. A tape drive looks sort of like a disk, but a read will return only the next physical block of data on the device, even if more is requested.

The special file /dev/tty represents the terminal. Writes to /dev/tty display characters on the screen. Reads from /dev/tty return characters typed on the keyboard. The seek operation on a device like /dev/tty updates the seek pointer, but the seek pointer has no effect on reads or writes. Reads of /dev/tty are also different from reads of a file in that they may return fewer bytes than requested: normally, a read will return characters only up through the next end-of-line. If the number of bytes requested is less than the length of the line, the next read will get the remaining bytes. A read call will block the caller until at least one character can be returned. On machines with more than one terminal, there are multiple terminal devices with names like /dev/tty0, /dev/tty1, etc.

Some devices, such as a mouse, are read-only; write operations on them have no effect. Other devices, such as printers, are write-only; attempts to read from them give an end-of-file indication (a return value of zero). There is a special file called /dev/null that does nothing at all: reads return end-of-file, and writes send their data to the garbage bin. (New EPA rules require that this data be recycled. It is now used to generate federal regulations and other meaningless documents.) One particularly interesting device is /dev/mem, which is an image of the memory space of the current process.
In a sense, this device is the exact opposite of memory-mapped files: instead of making a file look like part of virtual memory, it makes virtual memory look like a device. This idea of making all sorts of things look like files can be very powerful. Some versions of Unix make network connections look like files. Some versions have a directory with one special file for each active process. You can read these files to get information about the states of processes; if you delete one of these files, the corresponding process is killed. Another idea is to have a directory with one special file for each print job waiting to be printed. Although this idea was pioneered by Unix, it is starting to show up more and more in other operating systems.

[Figure: an i-node with direct and indirect sector pointers]

The last 3 sector pointers in an i-node are special. The first points to a structure that contains only pointers to sectors; this is an indirect block. The second points to pointers to pointers to sectors (a double indirect block), and the third to pointers to pointers to pointers to sectors (a triple indirect block). This results in increasing access times for blocks later in the file: large files have longer access times to the end of the file. I-nodes specifically optimize for short files.

Directories

Directories are generally simply files with a special interpretation. Some directory structures contain the name of a file, its attributes, and a pointer(3), either into its FAT list or to its i-node. This choice bears directly on the implementation of linking. If attributes are stored directly in the directory node, (hard) linking is difficult, because changes to the file must be mirrored in all

directories. If the directory entry simply points to a structure (like an i-node) that holds the attributes internally, only that structure needs to be updated.

Multiple Disks

There are two approaches to having multiple disks on a system (where disks are really devices that export a file system interface): the disks can be either explicit or implicit. An example of explicit disk naming is MS-DOS's A:\SYS\FILE.TXT. Other systems, from IBM VM/CMS to AmigaDOS, have done the same thing. Making disks explicit makes the boundaries between physical devices clear. Unix clouds the issue by allowing one device to be grafted onto the name space established by another at a mount point. A mount point looks like a directory to the user, but to the operating system it marks the boundary between devices. As a result, the file system appears to be one seamless name space, but there are subsets of the space on different devices.

3. It's usually a sector number, but thinking of it as a pointer is closer to its function.

File System Implementation - Performance

This lecture discusses the details of making the file system perform well. This is done primarily with two mechanisms: caching and file layout.

Caching

Far and away the most effective strategy for improving file system performance is caching. Every time a file is opened or created, it requires accesses to all the directories from the root to the file itself, and many files are referenced often; consider executables for commonly executed utilities. Accesses to the physical disk for these common cases can be avoided if the blocks that would have been read from the disk are cached in memory rather than being read from disk each time.

Cached Blocks

Caching is accomplished by setting aside a portion of memory and using it to store blocks as they are brought in. Blocks are read or written in memory if possible, and written out to disk when convenient or when a block must be removed from the cache to make way for a new one.
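The cached-blocks scheme just described can be sketched as a small write-back cache with LRU-style replacement. Everything here is an invented illustration (the BlockCache class, and a plain dict standing in for the disk), not any real OS's buffer cache:

```python
from collections import OrderedDict

class BlockCache:
    """Write-back block cache: dirty blocks reach disk only on eviction/sync."""
    def __init__(self, disk, capacity):
        self.disk, self.capacity = disk, capacity
        self.cache = OrderedDict()   # block number -> (data, dirty flag)

    def read(self, blockno):
        if blockno not in self.cache:
            self._install(blockno, self.disk[blockno], dirty=False)
        self.cache.move_to_end(blockno)        # mark most recently used
        return self.cache[blockno][0]

    def write(self, blockno, data):
        self._install(blockno, data, dirty=True)   # disk updated later

    def _install(self, blockno, data, dirty):
        if blockno not in self.cache and len(self.cache) >= self.capacity:
            old, (olddata, olddirty) = self.cache.popitem(last=False)  # evict LRU
            if olddirty:
                self.disk[old] = olddata       # flush only modified blocks
        self.cache[blockno] = (data, dirty)
        self.cache.move_to_end(blockno)

    def sync(self):
        # Write every dirty block out, e.g. "when convenient".
        for b, (data, dirty) in self.cache.items():
            if dirty:
                self.disk[b] = data
                self.cache[b] = (data, False)
```

The gap between write() and sync() is exactly the consistency window discussed next: a crash before the flush loses the cached modifications.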
The important issue with any caching system is the tradeoff between performance and consistency, in this case between the versions of file blocks in the cache and on the disk. During normal operation this distinction isn't terribly interesting: processes always read from the file cache first, and go to the disk only if the block is not cached. The consistency problem arises when the computer fails catastrophically, for example in a power outage, before all the file cache blocks have been written out to disk. When the OS restarts, the file system may be in an inconsistent state. The issue is how writes are cached. There are several models of consistency that can be used, each of which marks a point on the consistency/performance curve.

The Linux EXT2FS file system makes no special concessions to consistency. Its goal is a blazingly fast UNIX file system, and it is willing to risk file system corruption if operation is interrupted at an inopportune moment. The BSD FFS systems take a more conservative approach, but not the most conservative one: they write metadata synchronously to disk. That is, changes to the file system structures themselves, such as the free block list and file allocations, are not cached. Thus it is possible to lose modifications to allocated blocks in a file, but not to have a block appear once on the free list and once on a file's allocation list. The slowest and most reliable choice is to write all data synchronously to disk. This is the choice made by MS-DOS and Windows (for floppies, anyway).

The consistency issue underscores the difference between file cache management and virtual memory management algorithms. In paging, all pages are transient and therefore equally important. In the file system, some blocks contain data that is essential to the continued correct operation of the file system(1) and must be treated specially. Modulo these differences, paging algorithms are a good fit for managing the file cache. In general a variant of LRU is usually used, with modifications appropriate to the level of consistency required. In fact, some systems, like FreeBSD and Solaris, use only one cache for both virtual memory pages and cached file blocks. This allows the number of pages or file blocks to grow as needed.

File System Checking

Most systems have a mechanism for checking file system integrity, even if they take great pains to maintain that integrity. Occasional bad blocks or software bugs may corrupt data, and it is worthwhile to check the file systems periodically to reduce the impact of any damage. The checking process is basically:

• Compare the free list to the list of allocations for all the files. Each block should appear exactly once. Blocks not appearing at all should be considered free. Blocks appearing on both the free list and one file can (probably) be allocated to the file. Blocks allocated to two or more files are a complicated case that probably requires human intervention.

• Check all directories for pointers to the files. If files exist that have no directory entries, delete them and reclaim their allocations. This can be caused by a system like UNIX that allows open files to remain on the disk while a process has them open.
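The first pass above, cross-checking the free list against file allocations, can be sketched in a few lines. The function name and its report format are invented for illustration:

```python
def check_blocks(nblocks, free_list, files):
    """files maps file name -> list of allocated block numbers.
    Returns (missing, doubled, conflicts) per the checking rules above."""
    counts = [0] * nblocks
    owners = {}                       # block -> list of files claiming it
    for b in free_list:
        counts[b] += 1
    for name, blocks in files.items():
        for b in blocks:
            counts[b] += 1
            owners.setdefault(b, []).append(name)
    # Blocks that appear nowhere should be returned to the free list.
    missing = [b for b in range(nblocks) if counts[b] == 0]
    # Blocks on the free list AND in a file can (probably) go to the file.
    doubled = [b for b in free_list if b in owners]
    # Blocks claimed by two or more files need human intervention.
    conflicts = [b for b, ns in owners.items() if len(ns) > 1]
    return missing, doubled, conflicts
```

A real checker like fsck would then repair the free list in place; this sketch only reports the three categories.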



The less consistency the operating system enforces, the more frequently the file system should be checked.

Disk Layout

Avoiding disk seeks is the most fruitful optimization to a file system after caching(2). One source of seeks is seeking from i-nodes allocated at one end of the disk to data blocks at the other; another is seeking from a directory entry to a data block. Consider the i-node case. In the original Berkeley UNIX file system, i-nodes were allocated at the beginning of the disk and data blocks everywhere else. The Berkeley Fast File System disperses i-nodes throughout the disk and does its best to place data sectors near the i-node that owns the file, to reduce seek times.

[Figure: i-node placement in the old file system versus the Fast File System]

Although that example is UNIX-specific, the concept covers many operating systems: when seeks are expensive, try to place related data physically close together on the disk.

A related issue is reducing rotational latency. Files are usually read sequentially, but if file blocks are laid out contiguously on the disk, additional rotational delays can be incurred. The problem is that after the first block is read and returned, the OS issues a request for the next block, which has

partially rotated past the disk head. The disk now must wait one rotation time until the beginning of the sector appears again. By interleaving file blocks, the time to return the block and get the next request can be spent spinning over an unneeded block.

1. It's even a little worse than this, because not only are the blocks essential for correct file system operation, sometimes the information in the blocks is interrelated. The easiest example is that a block being removed from the free list and added to a file depends on at least two blocks being consistently updated.

2. Assuming that seeks are expensive in the file system. That's true of disks, tapes, and CDs, but not of memory. A memory file system can avoid these issues.

[Figure: CPU, memory, and devices attached to a bus]

Frequently, devices are not directly connected to the bus but are managed by controllers. This allows multiple devices to share a single bus slot; bus slots may be a scarce resource. Device controllers are also a frequent location for additional intelligence. Devices and controllers are frequently controlled by accessing device registers, which pass parameters from the CPU to the device (or controller). Such device registers are either accessible through special I/O instructions or by memory-mapping the device registers into the processor's address space.(1)

The operating system is also responsible for coordinating the interrupts generated by the devices (and controllers). Generally a priority ordering is used: some devices can have their handlers interrupted by higher-priority devices. Depending on the services offered by the hardware, this may result in interrupts being delayed or even lost. If a system blocks rather than delays interrupts, it must poll device state when a high-priority interrupt returns, in case a low-priority interrupt has been lost.

Direct Memory Access (DMA)

In simple systems, the CPU must move each data byte to or from the bus using a LOAD or STORE instruction, as if the data were being moved to memory.
This quickly uses up much of the CPU's computational power. In order to allow systems to support high I/O utilization while the CPU gets useful work done on the users' behalf, devices are allowed to access memory directly. This direct access of memory by devices (or controllers) is called Direct Memory Access, commonly abbreviated DMA. The CPU is still responsible for scheduling the memory accesses made by DMA devices, but once the transfer has been programmed, the CPU has no further involvement until it is complete. Typically, DMA devices will issue interrupts on I/O completion.

The I/O Subsystem

The I/O subsystem is concerned with the work of providing the virtual interface to the hardware. This is the code that has to deal with the idiosyncrasies of devices, converting from the OS's logical view to the messy realities of the hardware.

Because this memory is not being manipulated by the CPU, and therefore addresses may not pass through an MMU, DMA devices often confuse, or are confused by, virtual memory. It is important to guarantee that memory intended for use by a DMA device is not manipulated by the paging system while the I/O is being performed. Such pages are usually frozen (or pinned) to avoid changes.

1. That's the address space of the processor itself, not a user process. User processes themselves are generally restricted from accessing devices directly, although such accesses may be allowed to improve performance at the cost of security and stability.

In some sense, DMA is simply an intermediate step to general-purpose programmability on devices and device controllers. Several such smart controllers exist, with features ranging from bit swapping to digital signal processing, checksum calculation, encryption and compression, and general-purpose processors. Dealing with that programmability requires synchronization and care. Moreover, in order for code to be portable, writing an interface to such smart peripherals is often a delicate balancing act between making features available and making the device unrecognizable.

I/O Software

The I/O software of the OS has several goals:

• Device Independence: All peripherals performing the same function should have the same interface. All disks should present logical blocks. All network adapters should accept packets. The protection of devices should be managed consistently; for example, devices should all be accessible by capability, or all through the file system. In practice this is mitigated by the need to expose some features of the hardware.

• Uniform Naming: The OS needs a way to describe the various devices in the system so that it can administer them. Again, the naming system should be as flexible as possible. Systems also have to deal with devices joining or leaving the name space (PCMCIA cards).
• Device Sharing: Most devices are shared at some granularity by processes on a general-purpose computer. It is the I/O system's job to make sure that sharing is fair (for some fairness metric) and efficient.

• Error Handling: Devices can often deal with errors without user input, retrying a disk read or something similar. Fatal errors need to be communicated to the user in an understandable manner as well. Furthermore, although hiding errors can be good at some level, at other levels they should be seen: users must be able to tell that their disks are slowly failing.

• Synchrony and Asynchrony: The I/O system needs to deal with the fact that external devices are not synchronized with the internal clock of the CPU. Events on disk drives occur without any regard for the state of the CPU, and the CPU must deal with that. The I/O system code is what turns asynchronous interrupts into system events that can be handled by the CPU.









Software Levels
Interrupt Handlers

The Interrupt Service Routines (ISRs) are short routines designed to turn the asynchronous events from devices (and controllers) into synchronous ones that the operating system can deal with

in time. While an ISR is executing, some set of interrupts is usually blocked, which is a dangerous state of affairs that should be kept as brief as possible. ISRs generally encode the information about the interrupt into some queue that the OS checks regularly, e.g., on a context switch.

Device Drivers: Device drivers are primarily responsible for issuing the low-level commands to the hardware that get the hardware to do what the OS wants. As a result, much of their code is hardware dependent. Conceptually, perhaps the most important facet of device drivers is the conversion from logical to physical addressing. The OS may be coded in terms of logical block numbers for a file, but it is the device driver that converts such logical addresses to real physical addresses and encodes them in a form that the hardware can understand. Device drivers may also be responsible for programming smart controllers, multiplexing requests and demultiplexing responses, and measuring and reporting device performance.

Device Independent OS Code: This is the part of the OS we have really been talking the most about. It provides consistent device naming and interfaces to the users, enforces protection, and does logical-level caching and buffering. In addition to providing a uniform interface, the uniform interface is sometimes pierced at this level to expose specific hardware features, CD audio capabilities for instance. The device-independent code also provides a consistent error model to users, letting them know what general errors occurred when the device driver couldn't recover.

User Code: Even the OS code is relatively rough and ready. User libraries provide simpler interfaces to I/O systems. A good example is the standard I/O library, which provides a simplified interface to the file system: printf and fopen are easier to use than write and open. Specifically, such systems handle data formatting and buffering.
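The ISR pattern described above, record the event quickly and process it later, can be sketched as follows. The isr/drain split and the handler table are invented for the example; a real kernel would do this with interrupt masking rather than Python functions:

```python
from collections import deque

# Queue filled by interrupt handlers, drained in normal kernel context.
pending = deque()

def isr(device_id, status):
    # Runs with interrupts blocked: do the minimum and return quickly.
    pending.append((device_id, status))

def drain(handlers):
    # Called later (e.g. on a context switch) with interrupts enabled;
    # dispatches each recorded event to its device's handler.
    handled = []
    while pending:
        dev, status = pending.popleft()
        handled.append(handlers[dev](status))
    return handled
```

The key property is that the expensive per-event work happens in drain(), not in the handler, keeping the interrupts-blocked window short.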
Beyond that, there are user-level programs that specifically provide I/O services (daemons). Such programs spool data or directly provide the services users require.

Device Driver Specifics: We now consider some of the details of device drivers. Each device is different, with different purposes, different implementations, and different worldviews. We will consider a representative sample of several kinds of device drivers and the specifics of how they work.

Disk Drives: Disk drivers control one or more physical magnetic disks. The most common use of disks is for file systems, but they are also used for swap space, raw backups, databases, and assorted other purposes. The device driver is responsible for logical-to-physical translations, multiplexing and demultiplexing data transfers, and error handling.

Block Naming: In the simplest case, logical-to-physical mapping is just a matter of placing an ordering on the physical disk sectors that matches the logical disk sectors, which is pretty simple. Differences between logical and physical block sizes can confuse this. The two may differ because multiple disks managed by the same OS may have different physical block sizes. Also, because the file system overhead may depend strongly on the size of the OS's logical blocks, sometimes a logical block size much in excess of the disk's sectors is chosen to

keep the size of the OS tables small. One example of this is large disks using a FAT file system. The size of the FAT is directly dependent on the number of sectors (and is bounded by the size of the entry in a FAT cell); addressing a large disk partition may require using large block sizes.

Error handling may also interfere with a straightforward logical-to-physical mapping. Some smart disk systems leave a few blocks unassigned when the disk is formatted, for use when other blocks go bad. When a bad block is detected, the data is moved from the bad block to one of the set-aside blocks, and the logical-to-physical mapping is changed so that the logical block maps to the set-aside block rather than the bad block. From the OS point of view, this transparently repairs bad blocks. Originally such block-shuffling shenanigans were all done in software, but as disk controllers (and drives) get smarter, these remappings are often done in the hardware. This can make life much trickier.
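The bad-block forwarding just described amounts to one extra table lookup on every logical-to-physical translation. A minimal sketch, with an invented Disk class standing in for drive or firmware state:

```python
class Disk:
    """Toy disk with spare sectors reserved for bad-block forwarding."""
    def __init__(self, nsectors, nspare):
        self.spares = list(range(nsectors, nsectors + nspare))  # set-aside blocks
        self.remap = {}            # bad sector -> spare sector
        self.data = {}             # sector -> contents

    def physical(self, sector):
        # The extra lookup: follow the remap table if this sector went bad.
        return self.remap.get(sector, sector)

    def mark_bad(self, sector):
        spare = self.spares.pop(0)
        # Move the data, then redirect all future accesses transparently.
        self.data[spare] = self.data.get(sector)
        self.remap[sector] = spare

    def read(self, sector):
        return self.data.get(self.physical(sector))

    def write(self, sector, value):
        self.data[self.physical(sector)] = value
```

When this remapping happens in firmware instead, the OS's carefully planned layout can be silently rearranged, which is the "trickier" part noted above.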

Mu(tip(e?in and Ar# Schedu(in
The first issue in multiple.ing multiple re%uests is doing so efficiently. 2fficiently in this case means with returning data to the users as %uic!ly as possible. 0ost of the latency in serving a dis! re%uest is in see!s time i.e. moving the dis! arm over the proper trac!. Scheduling the re%uests to minimi3e see! time is often handled by the device driver. Scheduling user dis! re%uests is called arm scheduling because the problem is essentially scheduling accesses by the dis! arm. Some familiar algorithms appear as well as some new ones. /e will illustrate the scheduling algorithms by using them to order re%uests for the following trac!s (in the oder they were %ueued)4 << < EB <B ED ? <9 • ,#,O4 The re%uests are served in the order they appear4 << < EB <B ED ? <9. This is easy to implement but almost never used. /ithout optimi3ing at all see! times can vary widely based on the applications ma!ing re%uests. Shortest See! ,irst (SS,)4 This is analogous to shortest :ob first the re%uest that re%uires moving the arm the least distance is served ne.t4 << <9 ? <B < ED EB. The Rproblem with this algorithm is that it encourages access patterns that !eep the dis! head in one place. (one accesses to distant trac!s suffer very long access times or in the worst case never get served at all. A compromise between optimi3ation and fairness is needed. The 2levator Algorithm The elevator algorithm tries to !eep the dis! arm moving one direction. On our canonical input with the head moving toward higher sectors the access pattern is << <9 <B ED EB ? <. #n practice the elevator algorithm stri!es a good balance between efficiency and fairness.




There are other issues in multiplexing, mostly related to how intelligent the controller is. An intelligent controller may be able to handle several outstanding requests from the software, in which case the device driver needs to do a little bookkeeping but can generally leave the scheduling to the controller. Of course, if the controller cannot multiplex, the simple arm scheduling above applies.

Error Handling

The disk device driver is the first element of the OS to see an error. Accordingly, it has to adopt some strategy for which errors to try to correct and which to report. There are a variety of things that can go

wrong in the disk, and Tanenbaum discusses quite a few. In most cases the appropriate response to the error is to reset some confused part of the hardware and retry the operation. Errors that are resolved by this are transient errors. In some cases transient errors can be ignored: some of them are not reproducible and will never bother the system again. Some, however, are predictors of future woe. When a sector shows a sharp increase in checksum errors, it is likely that the sector or the disk itself is wearing out. A human being, or a higher level of the operating system, may want to look into the matter further. Good device drivers try to strike a balance between reporting too many and reporting too few errors.
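The reset-and-retry response to transient errors is essentially a bounded loop. A sketch with invented names; a real driver would classify errors and reset only the relevant hardware rather than retrying everything uniformly:

```python
def with_retries(op, reset, limit=3):
    """Try op(); on failure, reset the hardware and retry up to limit times.
    Errors that survive all retries are reported upward as persistent."""
    for attempt in range(limit):
        try:
            return op()
        except IOError:
            reset()          # e.g. recalibrate the drive, clear the controller
    raise IOError("persistent error after %d attempts" % limit)
```

An operation that succeeds within the limit looks, from the caller's point of view, as if nothing went wrong, which is exactly how transient errors get hidden.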

Disk drivers and controllers are getting smarter by the revision. Functionality that was traditionally in the drivers is now being moved to controllers and disk firmware. As a result, device drivers are less concerned with directly manipulating the devices than with programming the controllers to do so. Some important functionalities that have begun to appear in disk hardware:

• Interleaving: The Berkeley Fast File System did clever layout of file blocks within a track to reduce the rotational latency when the file was read sequentially. Many disks today encode this layout as the interleaving of sectors on the disk. The disk firmware renumbers the sectors rather than the OS doing layout.




In some sense this is all positive news: hardware is getting smarter, the software has less to do, and life is great. There are two problems: the software and hardware may be unaware of each other, and burning algorithms into hardware makes them hard to change. For an example of the hardware and software being at cross purposes, consider the disk interleaving case. The file system spends some additional time when blocks are allocated to ensure that they're placed to minimize latency, and then the hardware moves them again because of its interleaving. The result is a block layout that is almost certainly suboptimal. You can find similar problems with the other helpful features above: transparently repaired bad blocks may show up as a performance penalty; caching sectors both in memory and on disk is wasteful and degrades the value of one of the caches; and spending CPU time to schedule the disk arm in the device driver, only to have the controller run the same scheduling algorithm in hardware, is a waste of CPU time. The solution is to make sure that the hardware and OS are aware of what the other is doing. Ideally, the OS should detect hardware features and either disable the ones that are replicated in the OS or disable the OS routines that do work done by the controller. In practice this may be less straightforward. The other problem, lack of flexibility, exists primarily if the features of the hardware cannot be disabled.

As we have seen, for every scheduling algorithm there is a counter scheduling strategy that confuses it. If your workload is a counter strategy for the algorithms wired into your hardware, performance will suffer. Frequently it's faster to tune or recode an algorithm in the OS than in hardware, but if you can't disable the smart feature of the hardware, you're sunk.

Terminals

Terminal is a generic term for the keyboard/screen pair through which much of the computer input in the world today occurs. In times past this was largely through serial line (RS-232) terminals that passed data a bit at a time, although these days a large number of terminals are intelligent or memory mapped (or both). The console keyboard, screen, and mouse of a PC or workstation fall into the latter category.

Serial terminals conceptually process data a character or line at a time. (In reality, data may be transmitted a bit at a time down the serial wire, but the device generally collects 7- or 8-bit words to work with.) Particularly intelligent terminals may have data stream editing capabilities built in, or they may be provided by the device driver. By character editing capabilities, we mean everything from cursor control to simple character erasures. To effect such editing, the device driver often collects characters as the keyboard transmits them, only committing the characters to the standard input of the running process when the enter key is sent. Smart terminals may do the same thing in the hardware.

For output, terminals have a simple language of output functions. A particular string of nonprinting characters may serve to move the output cursor (the point at which the next character will be output) to a given position, or to clear or scroll areas of the screen.(1) The device driver is responsible for arranging for canonical output control sequences to be translated into the specific sequences that the terminal hardware understands.

Even in a world of windowing systems and GUIs, the concept of a terminal is useful for its simplicity and power. The concepts carry over to line-driven modem systems and other simple devices, for example Coke machines.(2) On the other end of the spectrum are bit-mapped displays and modern graphics processors.
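The input-side editing described above (collect keystrokes, apply erasures, commit on Enter) is the heart of what Unix calls canonical mode. A toy sketch, using "\b" for the erase character and "\n" for Enter; the cook function is invented for illustration:

```python
def cook(keystrokes):
    """Apply line editing to a stream of keystrokes; return committed lines."""
    lines, buf = [], []
    for ch in keystrokes:
        if ch == "\b":            # erase: drop the last uncommitted character
            if buf:
                buf.pop()
        elif ch == "\n":          # Enter: commit the edited line to the reader
            lines.append("".join(buf))
            buf = []
        else:
            buf.append(ch)        # still editable; the process hasn't seen it
    return lines
```

The important property is that the reading process never sees the erased characters: editing happens entirely in the driver's buffer before the line is committed.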
A bitmapped display has its drawing memory directly accessible to the driver in a way that allows the driver to draw graphics on the screen directly. Coupled with a pointing device, this allows completely graphically oriented user interfaces (GUIs). The screen driver directly draws the windows and other screen cues that allow a user to navigate the GUI. There are usually routines, either in the kernel or in user libraries, to facilitate such constructs. The interface contains simple interface elements (called Widgets or Gadgets or other things, depending on whether you're on an X machine, an Amiga, or some lesser machine). These libraries create the visual cues for the user and respond to events generated by the pointer driver. The pointer driver keeps track of the user's reference point into the bit-mapped display, which is used to manipulate GUI elements. The driver receives interrupts from the pointer whenever the pointer device is moved, changing the location of the reference point. Generally, any motion results in an interrupt, so the ISRs that follow the pointer must be quick. Most pointers supply only relative motion events;[2] the device driver must track the reference point and update the GUI element indicating its location.

The pointer and drawing routines work together to provide an event stream to the OS and applications, which the applications can then use to get user input. Events are asynchronous notifications that contain information like "the user has activated button number 5" or "the reference pointer is over slider 2." The exact mechanisms of delivering events vary widely, and this class won't really discuss them in detail. Hopefully your understanding of IPC has already given you some ideas: a record-based file of events; signals that carry additional information; monitors with procedures defined for various events.

Beyond bit-mapped displays are terminals with graphics co-processors. These are dedicated processors that do nothing but render detailed images on the terminal, perhaps employing sophisticated lighting and texture effects. The language used to communicate with such processors may be very intricate, and more resembles programming a multiprocessor than running a peripheral. We won't discuss these in detail either, but again, your experiences with interprocess communication should give you some ideas of the interfaces in use here: locks and semaphores to control access to the lists of polygons to be drawn by the co-processor but arranged by the main CPU, for example, or context switching a color map on the co-processor when the reference point moves from one window to another.

There are other devices in the world, too, that we won't have time to investigate: sound recording and playback systems that input and output sampled streams of data that have to be filtered in real time; network adapters that need to fragment and reassemble large data blocks into small ones to be transferred across a network (and that may have to determine a route across such a network).

1. Some terminals allow individual pixels to be addressed, or vectors drawn, for graphics capabilities.
2. Some devices, like touch screens and graphics pads, do give absolute coordinate locations.
And other, stranger things.


CHAPTER 12: NETWORKING
Network
The network layer connects hosts on different physical networks. It extends the ideas of addressing, naming, and routing to their global extreme. The headers added at the network layer are independent of the network hardware. The network layer solves some difficult distributed problems, e.g., how to store routes from every host to every host efficiently. Actually, it just makes sure that certain routers in the network know enough to send the packet in the right general direction, with each router knowing more about its local area. I don't have time to really address these problems in this class, but I strongly advise you to check out one of the networking classes to find out for yourself. Computer networking is becoming a bigger and bigger issue every day. It's a versatile and inexpensive way to share resources and trade data. This section addresses the basic OS issues involved in communicating between computers.

[Figure: hosts connected by routers, one system domain per host; the same picture as the I/O system diagram, relabeled for a network.]

That's the same diagram from our discussion of the I/O system, only relabeled to represent a computer network. Some of the issues are remarkably similar. The system still has to address:

• Asynchrony: Events on different hosts are not synchronized.
• Data corruption and reordering: Reordering is similar to the problem of multiplexing responses and is handled the same way; data in the packets can be used to order them. Because there are more sources of error in the network, the OS has to address errors directly.
• Buffering: Each host is responsible for queuing data until the interested process retrieves it, similar to the way disk blocks are queued.

There are some significant differences between a network and a hardware system, though:

• Autonomy: Individual pieces of hardware in a system are all controlled by the same entity, the owner of the machine or the CPU. In a network, each host may be an autonomous (or self-controlling) entity, with goals that may be in direct opposition to those of other hosts, and no central authority to which to appeal to resolve conflicts. If this weren't enough to worry about in the abstract, there is the problem of two communicating entities sitting in different human domains. The legal requirements on the hardware, or even on the data content, are often at issue.
• Latency: The latency between a CPU and a disk is a few tens of milliseconds at worst, and this is perceived as a glacial pace. The round trip time to a geosynchronous satellite and back is a quarter second. There are documented reports of packets taking minutes to get from host to host in the Internet. The latencies are often considerably higher in a network, but sometimes they are lower: hosts on an uncontested LAN sometimes use a distributed file system to reduce disk latency. The range of latencies with which systems have to contend in networks is the issue.
• Connectivity richness: In a physical box, there are only so many elements that can sit physically on one bus, so the CPU need only concern itself with a few entities. There are millions of computers connected to the Internet. Systems have to exhibit vastly different scaling properties in networks.









Ba"ic Concept"
Because there are so many more elements, connected sparsely, issues of naming, addressing, and routing become paramount. It's important to grasp the distinction between a name, an address, and a route. Names are a convenient way for humans (or programs) to refer to an entity. My name is Ted; my computer's name is vermouth.isi.edu. In both cases this is just a convenient string of characters that refers to a physical entity. Addresses are a special kind of name that can be used to plot a path to reach one of the entities. My address in Madison, Wisconsin was 765 W. Washington Ave. #302, Madison, WI 53715. My computer's address is 128.9.160.247. These names are special because they can be used to convey information to the place they name. (Not all things that are namable have an address.)

205

Routes are a description of how to convey information between two addresses. A route to my address in Wisconsin from USC would be the set of interstate highways and side streets to use to get from here to there. A route to my computer from a computer at USC would be a list of the IP addresses to pass through, in order. In some sense, addresses and routes are the only entities that are required for networking, but having names is so useful that most general purpose networking systems have some naming mechanism. Defining and allocating these entities is one of the most difficult parts of networking. That the Internet provides a global (in the purest sense of that word) naming, addressing, and routing system is nothing short of phenomenal.[1] This is only possible because the system was designed to scale to global sizes. Even at that, cracks are showing: the address space may not be large enough, routing information taxes the ability of hardware to store and search it, and the one thing we had working, naming, is under assault from lawyers.

Another basic concept that underlies networking is the protocol. A protocol is a set of rules that communicating entities follow in order to communicate meaningfully. For example, exchanging electronic mail requires a sequence of exchanges between the mailing machine and the receiving (or forwarding) machine. The mailer identifies itself, the receiver acknowledges it, the mailer tells who the mail is from, the receiver accepts or rejects the address, the mailer tells who the mail is to, and the receiver again accepts or rejects it; finally the message is exchanged and acknowledged. That set of rules is a protocol. Protocols give rise to standards. A standard is a formal presentation of a protocol that has been sanctioned by some official body. For example, the electronic mail protocol above has been sanctioned by the Internet Engineering Task Force (IETF).
If your system claims to exchange RFC822-compliant mail,[2] it must follow those rules; this is called conforming to the standard. Of course, if your mailer doesn't conform to the standard but sends mail without losing any, the only thing that happens is that you can't put an RFC822-compliant sticker on it. However, because they represent an agreement between major practitioners of the field, conforming to standards provides a loose guarantee that systems interoperate.

Another word that networkers use a lot is packet. A packet is like a disk block: it's an element of data exchanged between two hosts. Depending on the underlying hardware and its associated protocols, packets may be fixed or variable length, and there are different maxima and minima for the various packet parameters. It's best to think of them as atomic elements of networking, although as we'll see, that can be an illusion.

The final basic distinction to draw is between connection-oriented and connectionless communications. This is exactly the difference between the post office and the phone system. In the mail, two units of transmission (two letters) have no relation to each other. If you send them to the same place on the same day, you can't usually tell what order they were sent, or even whether they bear any relation to each other, even if sent between the same two people. There's no state that ties them together. On a phone call, the notion that the various transmissions (words, or different family members talking) have some relation is encapsulated by the idea of a call. The words that go in one end of the phone are not arbitrarily reordered, for example.

1. The Internet is not the only such system, of course. Postal addresses form a similar, if less structured, name/address/route space. The impressive part of the Internet is that the space is well enough defined that machines can move the data with minimal per-message human intervention.
2. Internet standards are presented in Request For Comments documents, RFCs for short.

There are networks that support both these paradigms. In fact, each can be supported by the other: electronic mail is connectionless in the sense that each piece of mail has no ordering relative to others, yet the mail is transferred using TCP, a connection-oriented protocol.

The Seven Layer Model

The seven layer model is the OSI (an international standards body) model for designing networking. As a tool for understanding the various issues in networking, it's not bad. As a model for implementation, it's a recipe for a slow network. We'll use it to talk about protocol design, but think more seriously about what you're doing before you implement something this way. Each level of the stack provides services to layers above it using building blocks provided by layers below it. Conceptually this is very nice, but we'll see that some services are replicated and some don't fit neatly into a layer. Conceptually, each layer adds a header to outgoing packets and strips it off incoming packets before passing the packet up or down the stack, as the case may be. Numbering layers from the bottom up, an outgoing packet would have headers:

Physical

The physical layer specifies the format of bits on the wire, and what kinds of wire you can use. This is very nuts-and-bolts electrical (or optical!) engineering stuff, and I won't discuss it in any great detail. Each type of hardware has its own standard: there's an Ethernet standard, an FDDI standard, an X.25 standard, and a bunch more. They tell you what kinds of hardware to buy, how far apart nodes can be (or must be), and what you'd see if you hooked up an oscilloscope (or spectrum analyzer) to the medium.

Link

The link layer describes the protocols used by communicating nodes connected by the same physical hardware. The scope of names, addresses, and routes is therefore constrained. In a shared medium network,[3] the link layer is responsible for medium access.
Medium access is the process of determining which host has the right to send information on the shared medium. There are many ways to do this. Ethernet uses CSMA/CD (Carrier Sense Multiple Access/Collision Detection), which means that each host listens to the shared line and doesn't send until the line is silent. That's the CSMA; the CD is that even listening beforehand, it's possible for two hosts far enough apart to hear the line clear, begin transmitting, and have their signals collide. If that happens, they both stop transmitting and remain silent for a random time period before trying again. The time they remain silent gets geometrically larger. Other medium access methods involve passing a token from host to host. Like the shell in Lord of the Flies, the token allows the holder the right to send uninterrupted. Tokens generally have a fixed lifetime, so that a host can only transmit for a given time period before it is forced to relinquish the token and pass it to the next host. The protocols guarantee that every host gets the token eventually. FDDI (Fiber Distributed Digital Interface) and token rings use tokens. Link layers are also the first layer that detects (and potentially recovers from) transmission errors.
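The geometrically growing silent period is binary exponential backoff. A sketch of the wait computation, abstracting slot times to integers; the cap at 10 doublings follows standard Ethernet practice:

```python
import random

# Binary exponential backoff: after the n-th consecutive collision, wait a
# random number of slot times drawn from a window that doubles each round.
# The cap at 2**10 slots mirrors classic Ethernet.

def backoff_slots(collisions, rng=random.randrange):
    window = 2 ** min(collisions, 10)  # window doubles: geometrically larger waits
    return rng(window)                 # uniform choice avoids synchronized retries
```

The random draw is the whole point: two colliding hosts that waited a deterministic interval would collide again forever.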

207

Error detection is usually accomplished by including a checksum in each packet. A checksum is a mathematical function that depends on the full contents of the packet, like the one-way functions used for authentication. Upon receiving the packet, a host recomputes the function (assuming the checksum field to be 0) and, unless it gets the same answer as the packet contained, rejects the packet. Choosing a good checksum is a difficult tradeoff. The more effective the checksum is at detecting errors, the slower it is to calculate. Because the checksum must be calculated for every packet, the speed of calculating it can determine the network speed. The science of constructing efficient, strong checksums is interesting in its own right. Some link layers correct errors, either by labeling the packets, acknowledging each packet receipt, and retransmitting packets when there is no acknowledgement in a reasonable time, or by sending redundant data and reconstructing damaged packets. The redundant-data idea is also used in disk drive arrays like RAID. You'll hear more about it in CS 555.

Transport

The transport layer provides link-layer-style guarantees across the network layer. For example, transport resends lost packets and prevents reordering. Techniques similar to the link layer's are used for these. Transport also provides demultiplexing within the computer systems. The network layer can name, address, and route to a given computer. Within the computer, the transport layer provides a way to name, address, and route to given processes. Transport is also the layer that addresses global performance of the network, for example congestion control and resource allocation.

3. Ethernets are a shared medium because many hosts use the same wire to communicate, while dial-up modems are a point-to-point medium because the connections directly connect only two hosts. I'd use the party line analogy, but I fear that no one of an age to read or hear this knows what one is.
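One concrete instance of the checksum tradeoff discussed above is the 16-bit ones'-complement checksum that IP uses for its headers: cheap to compute, but weaker at catching errors than a CRC. The receive-side check is exactly as described: recompute with the checksum field treated as zero and compare.

```python
# The 16-bit ones'-complement Internet checksum: sum the packet as 16-bit
# words, fold carries back into the low 16 bits, and complement the result.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # add the next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold any carry back in
    return ~total & 0xFFFF                        # ones' complement of the sum
```

A handy property of this checksum: summing a packet that includes its own correct checksum yields 0, which is how receivers verify it in one pass.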
Se""ion: The session layer provides further multiple.ing control over which endpoint is sending data and some chec! pointing behavior. #tUs not often used. #tUs something of an open %uestion if this functionality is important. Pre"entation: This layer is responsible for reformatting data between machines and providing data-based semantics. 'onverting floating point formats between hosts or only returning pac!ets that contain a given type field are things that fall under "resentationUs umbrella. (i!e the Session layer "resentation isnUt often used. App(ication: These are protocols designed to carry out some useful conrete service. S0T" the email protocol -is an application layer protocol. So is =TT" (although itUs e.tending its tenticles into Session and "resentation as well). These are also standardi3ed; there are lengthy documents on what a valid =TT" re%uest loo!s li!e or on what behavior an ,T" server has to support. TheyUre dry reading but important to interoperability. 208

Other Global Issues
I probably won't have time to talk about these in detail, but other networking issues include:
• Scalable naming
• Authentication and security
• Network management
• Other communication models: broadcast and multicast
• Performance tuning of protocols
• Active networking

Networking Notes

Networking deals with interconnected groups of machines talking with each other. It is a very different field than operating systems, and it involves a lot of standards work, because everyone must agree on what to do when connecting machines together. What is a network? A collection of machines, links, and switches set up so that machines can communicate with each other. Some examples:
• Telephone system. Machines are telephones, links are the telephone lines, and switches are the phone switches.
• Ethernet. Machines are computers, there is one link (the ethernet), and no switches.
• Internet. Machines are computers, and there are multiple links, both long-haul and local-area. The switches are gateways.

A message may have to traverse multiple links and multiple switches to go from source to destination. Circuit-switched versus packet-switched networks: the basic disadvantage of circuit-switched networks is that they cannot use resources flexibly; the basic advantage is that they deliver a guaranteed resource.

Basic Networking Concepts:
• Packetization.
• Addressing.
• Routing.
• Buffering.
• Congestion.
• Flow control.
• Unreliable delivery.
• Fragmentation.

Local Area Networks. These connect machines in a fairly close geographic area. The standard for many years: Ethernet, standardized by Xerox, Intel, and DEC in 1978 and still in wide use. Physical hardware technology: coax cable about 1/2 inch thick. Maximum length: 500 meters. Can extend with repeaters; can only have two repeaters between any two machines, so the maximum length is 1500 meters. Vampire taps connect machines to the Ethernet. Attach an ethernet transceiver to the tap; the transceiver makes the connection between the Ethernet and the host interface. The host interface then connects to the host machine. Ethernet is a 10 Mbps bus with distributed access control. It is a broadcast medium: all transceivers see all packets and pass all packets to the host interface. The host interface chooses the packets the host should receive and discards the others.

Access scheme: carrier sense multiple access with collision detection. Each access point senses the carrier wave to figure out whether the machine is idle. To transmit, it waits until the carrier is idle, then starts transmitting. Each transmission consists of a packet; there is a maximum packet size. Collision detection and recovery: transceivers monitor the carrier during transmission to detect interference. Interference can happen if two transceivers start sending at the same time. If interference happens, the transceiver detects a collision. When a collision is detected, it uses a binary exponential backoff policy to retry the send, adding a random delay to avoid synchronized retries. Is there a fixed bound on how long it will take a packet to get successfully transmitted? Is any packet guaranteed to be transmitted at all?

Addressing. Each host interface has a hardware address built into it. Addresses are 48 bits long. When you change the host interface, the hardware address changes. There are three kinds of addresses:
• Physical address of one network interface.
• Broadcast address for the network. (All 1's.)
• Multicast addresses for a subset of machines on the network.

The host interface looks at all packets on the ethernet. It passes a packet on to the host if the address in the packet matches its physical address or the broadcast address. Some host interfaces can also recognize several multicast addresses and pass packets with those addresses on to the host. How do vendors avoid ethernet physical address clashes? Buy blocks of addresses from a central authority.

Packet (frame) format:
• Preamble. 64 bits of alternating 1 and 0 to synchronize receivers.
• Destination address. 48 bits.
• Source address. 48 bits.
• Packet type. 16 bits. Helps the OS route packets.
• Data. 368-12000 bits (46-1500 bytes).
• CRC. 32 bits.
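Assuming the preamble has already been stripped by the interface hardware (as is typical), pulling the remaining fields out of a received frame might look like this sketch (the CRC is extracted but not verified here):

```python
# Slice a received Ethernet frame into the fields listed above:
# 6-byte destination, 6-byte source, 2-byte type, payload, 4-byte CRC.

def parse_frame(frame: bytes):
    assert len(frame) >= 14 + 4  # header plus 32-bit CRC at minimum
    return {
        "dest": frame[0:6],                            # 48-bit destination address
        "src":  frame[6:12],                           # 48-bit source address
        "type": int.from_bytes(frame[12:14], "big"),   # 16-bit packet type
        "data": frame[14:-4],                          # payload
        "crc":  int.from_bytes(frame[-4:], "big"),     # 32-bit CRC
    }
```

The type field is what makes frames self-identifying: the driver dispatches on it (e.g., 0x0800 for IP) without inspecting the payload.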

Ethernet frames are self-identifying: you can just look at a frame and know what to do with it. You can multiplex multiple protocols on the same machine and network without problems, and the CRC lets a machine identify corrupted packets.

Token-ring networks. An alternative to ethernet-style networks. Arrange the network in a ring and pass a token around that lets a machine transmit. A message flows around the network until it reaches its destination. Some problems: long latency, token regeneration.

ARPANET. Ancestor of the current Internet. A long-haul packet-switched network. It consisted of about 50 BBN C/30 and C/300 computers in the US and Europe connected by long-haul leased data lines. All of the computers were dedicated packet-switching machines (PSNs). Interesting fact: the ARPANET, like the highway system, was initially a DOD project set up officially for defense purposes. In the original ARPANET, each computer connected to the ARPANET connected directly to a PSN. Each packet contained the address of the destination machine, and the PSN network routed the packet to that machine. Now this is totally impractical, and there is a much more complex local structure before you get onto the Internet. The design of the Internet was driven by several factors:
• There will be multiple networks. Different vendors compete, plus there are different technical tradeoffs for local area, wide area, and long haul networks.
• People want universal interconnection.

Since there will be multiple networks around the world, an internetwork, or internet, connects the different networks. So the job of an internet is to route packets between networks. One goal of the Internet: network transparency. We want a universal space of machine identifiers, and to refer to all machines on the internet using this universal space. We do not want to impose a specific interconnection topology or hardware structure.

Internet architecture. Connect two networks using a gateway machine. The job of the gateway is to route packets from one network to another. As network topologies become more complicated, gateways must understand how to route data through intermediate networks to reach a final destination on a remote network. In the Internet, gateways provide all interconnections between physical networks. All gateways route packets based on the network that the destination is on.

Internet addressing. Each host on the Internet has a unique 32-bit address that is used for all Internet traffic to that host. Each internet address is a (netid, hostid) pair. The netid identifies the network that the host is on; the hostid identifies the host within the network.

Three classes of Internet addresses:
• Class A. First bit: 0. Bits 1-7: netid. Bits 8-31: hostid. Can have 128 Class A networks.
• Class B. Bits 0-1: 10. Bits 2-15: netid. Bits 16-31: hostid. Can have 16,384 Class B networks.
• Class C. Bits 0-2: 110. Bits 3-23: netid. Bits 24-31: hostid. Can have about 2 million Class C networks.
• Class D (multicast addresses). Bits 0-3: 1110. Used for Internet multicast.
• Class E. Bits 0-3: 1111. Reserved.
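The class tests above are just comparisons on the leading bits of the 32-bit address:

```python
# Classify a 32-bit Internet address by its leading bits, per the
# class layout listed above.

def address_class(addr: int) -> str:
    if addr >> 31 == 0b0:    return "A"  # first bit 0
    if addr >> 30 == 0b10:   return "B"
    if addr >> 29 == 0b110:  return "C"
    if addr >> 28 == 0b1110: return "D"  # multicast
    return "E"                           # reserved
```

This is why gateways can extract the network portion of an address so quickly: a couple of shifts and compares tell them which bits are the netid.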

Interesting point: the whole structure of the internet is documented in RFCs (Requests For Comments), available over the Internet; use the net search functionality for "RFC" and you'll find pointers. You can read them to figure out what is going on. Gateways can extract the network portion of an address quickly. Gateways have two responsibilities:
• Route packets, based on network id, to a gateway connected to that network.
• If they are connected to the destination network, make sure the packet gets delivered to the correct machine on that network.

Conceptually, an Internet address identifies a host. Exception: gateways have multiple internet addresses, at least one per network that they are connected to. Because the network id is encoded in the Internet address, a machine's internet address must change if it switches networks.

Dotted decimal notation: reading Internet addresses as four decimal integers, with each integer representing one byte.
• cs.stanford.edu - 36.8.0.47 (what kind of network is it on?)
• cs.ucsb.edu - 128.111.41.20
• ecrc.de - 141.1.1.1
• lcs.mit.edu - 18.26.0.36
• sri.org - 199.88.22.5
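Converting between dotted decimal and the underlying 32-bit integer makes the one-byte-per-integer rule concrete:

```python
# Dotted decimal is just the four bytes of the 32-bit address written as
# decimal integers; conversion in both directions is a matter of shifts.

def to_dotted(addr: int) -> str:
    return ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def from_dotted(s: str) -> int:
    a, b, c, d = (int(part) for part in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d
```

For example, 128.111.41.20 is the 32-bit value 0x806F2914; the first byte, 128, is what marks it as a class B address.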

Who assigns internet addresses? The Network Information Center, a centralized authority. It just allocates network ids, leaving the requesting authority to allocate host ids.

Mapping Internet addresses to physical network addresses. We will discuss the case when the physical network is an Ethernet. Given a 32-bit Internet address, a gateway must map it to a 48-bit Ethernet address. It uses the Address Resolution Protocol (ARP). The gateway broadcasts a packet containing the Internet address of the machine that it wants to send the packet to. When that machine receives the packet, it sends back a response containing its physical address. The gateway uses the physical address to send the packet directly to the machine.
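A sketch of this broadcast resolution, together with the cache that is typically kept to suppress repeat ARP traffic. The `broadcast_arp` callable is a hypothetical stand-in for the real network transaction:

```python
# ARP-style resolution with a cache: broadcast only on a cache miss,
# answer all later queries for the same address from the table.

class ArpCache:
    def __init__(self, broadcast_arp):
        self.table = {}                  # internet address -> physical address
        self.broadcast_arp = broadcast_arp

    def resolve(self, ip_addr):
        if ip_addr not in self.table:    # miss: ask the network (one broadcast)
            self.table[ip_addr] = self.broadcast_arp(ip_addr)
        return self.table[ip_addr]       # hit: no network traffic at all
```

Real implementations also age entries out, since a host's physical address changes when its interface hardware is replaced; this sketch keeps entries forever.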

This also works for machines on the same network, even when they are not gateways. Use an address resolution cache to eliminate ARP traffic. ARP request and response frames have specific type fields, set up by the Ethernet standard authority: ARP frames use the Ethernet type field 0806, and RARP frames use 8035. How does a machine find out its own Internet address? It stores it on disk and looks there when it boots up. What if it is diskless? It contacts a server and finds out using Reverse ARP (RARP); see RFC 903, by Ross Finlayson et al. A RARP request is broadcast to all machines on the network. The RARP server looks at the physical address of the requestor and sends it a RARP response containing the internet address. Usually there is a primary RARP server, to avoid excessive traffic.

Now we switch to talking about IP, the Internet Protocol. The internet conceptually has three kinds of services layered on top of each other: a connectionless unreliable packet delivery service, a reliable transport service, and application services. IP is the lowest level: the packet delivery. The basic unit of transfer in the Internet is the IP datagram. An IP datagram has a header and data. The header contains internet addresses, and the Internet routes IP datagrams based on the Internet addresses in the header. The Internet makes a best effort attempt to deliver each datagram but does not deal with error cases. In particular, you can have:
• Lost packets
• Duplicated packets
• Out of order packets

Higher level software layered on top of IP deals with these conditions. IP packets always travel from gateway to gateway across physical networks. If the IP packet is larger than the physical network frame size, the IP packet will be fragmented: chopped up into multiple physical packets. IP is designed to deal with this situation and provides for fragmentation. Once a packet has been fragmented, it must be reassembled back into a complete packet. It is usually reassembled only when the fragments reach the final destination, but one could build a system that reassembled fragments when they reached a physical network with a larger frame size. Why is there a need for fragmentation? There is no good way to impose a uniform packet size on all networks. Some networks may support large packets for performance, while others can only route small packets. We should not prevent some networks from using large packets just because there exists a network somewhere in the world that cannot handle large packets. But we must be able to route large packets through a network that only handles small packets: network transparency.

Important fields in the IP header:
• VERS: protocol version.
• LEN: length of the header in 32-bit words.
• TOTAL LEN: total length of the IP packet.
• SOURCE IP ADDRESS: IP address of the source machine.
• DEST IP ADDRESS: IP address of the destination machine.
• TTL: time to live. How many hops the packet may take without getting removed from the Internet. Every time a gateway forwards the packet, it decrements this field. Required to deal with things like cycles in routing.
• IDENT: packet identifier, unique for each source. Typically the source maintains a global counter it increments for every IP datagram sent.
• FLAGS: a do-not-fragment flag (dangerous) and a more-fragments flag; 0 marks the end of the datagram.
• FRAGMENT OFFSET: gives the offset of this fragment in the original datagram.

How do you reassemble a fragmented packet? Allocate a buffer for each packet. Use IDENT and SOURCE IP ADDRESS to identify the original datagram to which the fragment belongs. Use the FRAGMENT OFFSET field to write each fragment into the correct spot in the buffer. Use the more-fragments flag to find the end of the original datagram. Use some mechanism to make sure all fragments have arrived before considering the datagram complete.

Routing IP datagrams. There are multiple possible paths between hosts in an internet. How do we decide which path for which datagram? Routing for hosts on the same network: they realize they are on the same network by looking at the netid of the Internet address, and just use the underlying physical network. Routing for hosts on different networks: gateways pass datagrams from network to network until they reach a gateway connected to the destination network. Each gateway must decide the next gateway to send the datagram to.
• Source routing. The source specifies the route in the datagram. Useful for debugging and other cases in which the Internet should be forced to use a certain route.
• Host-specific routes. Can specify a specific route for each host. Used mostly for debugging.
• Table driven routing. Each gateway has a table indexed by destination network id. Each table entry tells where to send datagrams destined for that network. (Do example on page 82.)
• Default routes. Specify a default next gateway to be used if the other routing algorithms don't give a route.
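The last three routing options above combine naturally into a single lookup, checked in the usual precedence order (host-specific first, then table-driven, then default). A sketch with hypothetical gateway names:

```python
# Next-hop selection: host-specific routes override table-driven routing
# by network id, which in turn falls back to the default gateway.

def next_hop(dest_ip, dest_netid, host_routes, net_routes, default_gw):
    if dest_ip in host_routes:      # host-specific route (mostly debugging)
        return host_routes[dest_ip]
    if dest_netid in net_routes:    # table driven routing by network id
        return net_routes[dest_netid]
    return default_gw               # default route
```

Note that the tables are indexed by network id, not host id; that is what keeps them small enough to store and search.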



Most routers use a combination of table driven routing and default routing. They know how to route some packets and pass the others along to a default router. Eventually all defaults point to a router that knows how to route ALL packets. How are routing tables acquired and maintained? There are a lot of different protocols, but the basic idea is that the gateways send messages back and forth advertising routes. Each advertisement says that a specific network is reachable via N hops. Some protocols also include

information about the different hops. The gateways use the route advertisements to build routing tables. The Internet was originally designed to survive military attacks. It has lots of physical redundancy, and its routing algorithm is very dynamic and resilient to change. If a link goes away, the network should be able to route around the failure and still deliver packets. So routing tables change in response to changes in the network. In practice, it doesn't always work as well as designed. The chief threat to Internet links these days is backhoes, not bombs. A common error is routing all of the links that are supposed to give physical redundancy through the same fiber run, so they are vulnerable to one backhoe.

The original internet partitioned gateways into two groups: core and noncore gateways. Core gateways have complete information about routes. The original core gateways used a protocol called GGP (Gateway to Gateway Protocol) to update routing tables. GGP allows gateways to exchange pairs of messages. Each message advertises that the sender can reach a given network N in D hops. The receiver compares its current route to the new route through the sender and updates its tables to use the new route if it is better. A famous case: the Harvard gateway bug. A memory fault caused it to advertise a 0-hop route to everybody! A problem with GGP: the distributed shortest path algorithm may take a long time to converge. A later algorithm (SPF) replicated a complete database of the network topology in every gateway; the gateway runs a local shortest path computation to build its tables.

In the current Internet there is no longer any central backbone or authority. Instead there are internet providers; the whole system has switched over to private enterprise. A top-down view of the system: there are 4 Network Access Providers. Each NAP is a very fast router connected via high-capacity lines to other gateways and NAPs. Lines may be T3 (about 45 Mb/s) lines. Typically big communications companies (MCI, Sprint, AT&T) own the lines.
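The GGP comparison rule above ("is the new route through the sender better than what I have?") is the core of distance-vector routing, and can be sketched in a few lines:

```python
# GGP-style distance-vector update: a neighbor advertises "I can reach
# network N in D hops". Going through that neighbor costs D + 1 hops;
# adopt the route only if it beats the current one.

def update_route(table, neighbor, network, hops):
    """table maps network -> (next_hop, hop_count)."""
    new_cost = hops + 1
    current = table.get(network)
    if current is None or new_cost < current[1]:
        table[network] = (neighbor, new_cost)
```

The Harvard bug is visible in this rule: a gateway falsely advertising 0 hops to every network wins every comparison, so the whole core redirects its traffic there.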
Lines are typically fiber. Organizations go to internet providers to get access to the Internet. An internet provider buys a bunch of routers (usually from Cisco) and leases a bunch of lines. The internet provider must also buy access to a NAP, or to a gateway that leads to a NAP. The routers talk a route advertisement protocol and implement some routing algorithm. The internet provider can then turn around and sell internet access to whoever wants to buy it. UCSB buys its internet access from CERFNET, and pays tens of thousands of dollars per year for it. All of the UC schools will band together and buy internet access from MCI, getting more bandwidth but at a higher price. Organizations tend to chop their communications up into multiple networks, so there are too many networks in the world to give every network an Internet address. For example, the UCSB CS department alone has more than 40 networks. The solution is subnetting. The Internet views the whole organization as having one network; the organization itself chops the host part of the IP address up into a pair of (local network, local host). For example, UCSB has one class B Internet network. The third byte of every IP address identifies a local network, and the fourth byte is the host on that network.
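The class-B subnetting scheme just described is simple byte arithmetic on the 32-bit address. A sketch, using the byte split from the UCSB example (the function name and sample address are illustrative):

```python
# Split a dotted-quad IPv4 address the way a subnetted class B site would:
# bytes 1-2 = site network, byte 3 = local subnet, byte 4 = local host.

def parse_class_b_subnet(addr):
    b = [int(x) for x in addr.split(".")]
    site_network = (b[0], b[1])   # what the outside Internet routes on
    local_subnet = b[2]           # what routers inside the site route on
    local_host = b[3]             # the host on that local subnet
    return site_network, local_subnet, local_host

print(parse_class_b_subnet("128.111.41.7"))
# -> ((128, 111), 41, 7): the outside Internet sees only 128.111;
#    internal routers deliver to local network 41, host 7.
```

Real subnetting (RFC 950) uses an arbitrary subnet mask rather than a fixed byte boundary; the one-byte split here is just the convention the text describes.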

All IP packets from outside come to one UCSB gateway (by default). As far as the Internet is concerned, all of UCSB has only one network. Inside UCSB there is a set of networks connected by routers. These routers interpret the IP address as containing a local network identifier and a host on that network, and route the packet within the UCSB domain. The routers periodically advertise routes using a protocol called RIP. This is an example of hierarchical routing: the Internet routes to the UCSB gateway based on the Internet network id, then routers within UCSB route based on the subnet id.
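The two-level decision above — route on the Internet network id outside the site, on the subnet id inside — can be sketched as follows. The tables, router names, and site prefix are invented for illustration:

```python
# Hierarchical routing sketch: packets for other sites go to the default
# router; packets for our site are routed on the subnet byte.

def route(addr, site_prefix, internal_routes, default_gw):
    b = [int(x) for x in addr.split(".")]
    if (b[0], b[1]) != site_prefix:
        return default_gw                      # not our network: use default
    # Our network: internal routers look only at the subnet byte.
    return internal_routes.get(b[2], "local-delivery")

internal = {41: "router-cs", 42: "router-engr"}
print(route("128.111.41.7", (128, 111), internal, "border-gw"))  # router-cs
print(route("18.26.0.1",    (128, 111), internal, "border-gw"))  # border-gw
```

The point of the hierarchy is that outside routers never need entries for the internal subnets: one prefix per site keeps the global tables small.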

