Virtual memory pre-final-formatting

1
Virtual Memory (Galvin)
Outline
 Background: About precedingsections, concept ofa process not havingall of its pages inmemory, virtual memoryconcept, virtual
address space, sharedmemoryusing virtual memory
 Demand Paging: Basic concepts, performance
 Copy-on-Write
 Page Replacement: Basic, FIFO, Optimal, LRU (Algorithms:Additional-Reference-Bits, Second-Chance, EnhancedSecond-Chance),
Counting-based, Page-Buffering, Applications and Page Replacement
 Allocation of Frames: Minimumnumber of frames, Allocation Algorithms, Global vs Local Allocation, NUMA
 Thrashing: Cause, LocalityModel, Working-Set Model, Page Fault Frequency
 Memory-Mapped Files: Basic Mechanism, SharedMemoryinthe Win32 API, Memory-Mapped I/O
 Allocating Kernel Memory: Buddysystem, SlabAllocation
 Other Considerations: Prepaging, Page size, TLB Reach, Inverted Page Tables, Program Structure, I/O Interlock andPage Locking
 OS examples (Optional): Windows, Solaris
Contents
 Precedingsections talkedabout howto avoid memoryfragmentation bybreaking process memoryrequirements downintosmaller bites (pages),
and storing the pages non-contiguouslyinmemory.
 Most real processesdo not needalltheir pages, or at least not allat once, for several reasons:Error handling code is not neededunless that
specific error occurs, some ofwhichare quite rare. Arrays are oftenover-sizedfor worst-casescenarios, andonlya smallfractionof the arrays are
actuallyusedin practice. Certain features of certainprograms are rarelyused such as the routine to balance the federalbudget. (Me thinks this
holds the keyto the larger-than-physical virtual memoryconcept)
 The abilityto load onlythe portions ofprocesses that were actuallyneeded(andonlywhen theywere needed) has several benefits:Programs
could be writtenfor a much larger address space (virtual memoryspace) than physicallyexists onthe computer. Because each processis only
using a fractionof their total address space, there is more memoryleft for other programs, improvingCPU utilization and system throughput.
Less I/O is neededfor swappingprocesses inandout of RAM, speeding things up. (Fig9.1 show layout of VM)
 Figure 9.2 shows virtual address space, whichis the programmer’s logical view of process memorystorage. The actual physical layout is
controlledbythe process's page table. Note that the address space shown inFigure 9.2 is sparse - A great hole inthe middle of the address space
is never used, unlessthe stackand/or the heap grow to fill the hole.

 Virtual memoryalsoallows the
sharing offiles andmemorybymultiple processes, withseveral benefits:
#System librariescanbe sharedbymapping them into the virtual address
space of more thanone process. #Processescanalsoshare virtual
memorybymapping the same blockof memoryto more thanone process. #Process pages canbe sharedduring a fork()systemcall,
eliminating the needto copyall of the pages ofthe original (parent) process.
DEMAND PAGING
 The basic idea behind demand paging is that when a process is swappedin, its pages are not swapped
in all at once. Rather theyare swappedinonlywhenthe processneeds them. ( on demand. ) Thisis
termed a lazyswapper.

2
 The basic idea behind paging is that when a process is
swappedin, the pager onlyloads into memorythose pagesthat it
expects the processto need (right away.) Pages that are not loaded
into memoryare markedas invalid in the page table, using the invalid
bit. (The rest of the page table entrymayeither be blankor contain
informationabout where to findthe swapped-out page on the hard
drive.) If the process onlyever accesses pagesthat are loaded in
memory(memoryresident pages), then the process runs exactlyas if all the pages were loaded into memory.
 On the other hand, ifa page is neededthat wasnot originallyloadedup, thena page fault trapis generated, which must be handled ina series of
steps:The memoryaddressrequested is first checked, to make sure it wasa valid memoryrequest. If the reference was invalid, the process is
terminated. Otherwise, the page must be paged in. A free frame is located, possiblyfrom a free-frame list. A diskoperationis scheduled to bring
in the necessarypage from disk. (This willusuallyblock the process onan I/O wait, allowingsome other processto use the CPU in the meantime.)
When the I/O operationis complete, the process's page table is updatedwith the newframe number, and the invalidbit is ch angedto indicate
that this is now a validpage reference. The instruction that causedthe page fault must nowbe restartedfromthe beginning, (as soonas this
process gets another turnon the CPU.)
 In an extreme case, NO pagesare swapped infor a process until theyare requestedbypage faults. Thisis knownas pure demand paging.
 In theoryeachinstruction couldgenerate multiple page faults. Inpractice this is veryrare, due to localityof reference, coveredin section 9.6.1.
 The hardware necessaryto support virtual memoryis the same as for pagingand swapping:A page table andsecondarymemory. (Swap space,
whose allocationis discussedinchapter 12.)
 A crucial part of the processis that the instruction must be restartedfromscratchonce the desired page hasbeen made available in memory. For
most simple instructions this is not a major difficulty. However there are some architectures that allow a single instruction to modifya fairlylarge
block of data, (which mayspana page boundary), andifs ome of the data gets modifiedbefore the page fault occurs, this couldcause problems.
One solutionis to access bothends of the block before executing the instruction, guaranteeing that the necessarypagesget pagedinbefore the
instruction begins.
 Performance of Demand Paging: There are manysteps that occur whenservicing a page fault (see bookfor full details), andsome of the steps
are optional or variable. But just for the sake ofdiscussion, suppose that a normalmemoryaccess requires 200 nanoseconds, andthat servicing a
page fault takes 8 milliseconds. (8,000,000 nanoseconds, or 40,000 times a normal memoryaccess.)Witha page fault rate ofp , (ona scale from 0
to 1), the effective access time is now: (1 - p) * (200) + p * 8000000 = 200 + 7,999,800 * p
which clearlydepends heavilyon p! Even if onlyone access in 1000 causes a page fault, the effective access time drops from200
nanoseconds to 8.2 microseconds, a slowdownof a factor of 40 times. In order to keep the slowdownless than10%, the page fault rate must be
less than0.0000025, or one in399,990 accesses.
 A subtletyis that swapspace is faster to access thanthe regular file system, because it does not have to go throughthe wh ole directorystructure.
For this reasonsome systems will transfer anentire processfrom the file system to swapspace before starting upthe process, so that future
pagingalloccurs fromthe (relatively) faster swapspace.
 Some systems use demandpaging directlyfrom the file system for binarycode (which never changes andhence doesnot have to be storedona
page operation), andto reserve the swapspace for data segments that must be stored. This approachis used byboth Solarisa ndBSD Unix.
Copy-on-Write:
 The idea behinda copy-on-write forkis that the pages for a parent process do not have to be actuallycopiedfor the childuntil one or the other of
the processes changesthe page. Theycanbe simplysharedbetweenthe two processesinthe meantime, with a bit set that the page needs to be
copied if it ever gets writtento. This is a reasonable approach, since the childprocessusuallyissues an exec() system call immediatelyafter the
fork (Last line grey). Obviouslyonlypages that can be modifiedevenneed to be labeledas copy-on-write. Code segments cansimplybe shared.

3
Pages usedto satisfycopy-on-write duplications are typicallyallocated using zero-fill-on-demand, meaning that their previous contents are
zeroed out before the copyproceeds.
 Some systems provide analternative to the fork()systemcall calleda virtual memory fork, vfork(). In this case the parent is suspended, andthe
childuses the parent's memorypages. Thisis veryfast for process creation, but requires that the childnot modifyanyof the sharedmemory
pages before performingthe exec()system call. (Inessence thisaddresses the questionof whichprocessexecutesfirst after a call to fork, the
parent or the child. Withvfork, the parent is suspended, allowing the child to execute first until it calls exec(), sharing pages withthe parent in the
meantime.)
Page Replacement
 In order to make the most use ofvirtualmemory, we loadseveral processesintomemoryat the same time. Since we onlyload the pages that are
actuallyneededbyeach process at anygiventime, there is room to load manymore processes thanif we hadto loadinthe entire process.
 Memoryis alsoneeded for other purposes (suchas I/O buffering), andif
some process suddenlydecides it needs more pages and there aren't
anyfree frames available, thenthere are several possible solutions to
consider:
o Adjust the memoryusedbyI/O buffering, etc., to free upsome
frames for user processes. The decisionof how to allocate
memoryfor I/O versus user processes is a complex one, yielding
different policieson different systems. (Some allocate a fixed
amount for I/O, and others let the I/O system contendfor
memoryalong witheverything else.)
o Put the process requestingmore pagesintoa wait queue until
some free frames become available.
o Swap some processout of memorycompletely, freeing upits
page frames.
o Find some page inmemorythat isn't being usedright now, and
swapthat page onlyout to disk, freeing upa frame that canbe
allocated to the process requesting it. This is knownas page
replacement, andis the most commonsolution. There are manydifferent algorithms for page replacement, whichis the subject of the
remainder of thissection.
Basic Page Replacement:
 The previouslydiscussed page-fault processing assumedthat there would
be free framesavailable on the free-frame list. Nowthe page-fault
handling must be modifiedto free up a frame ifnecessary, as follows:
1. Find the locationof the desired page onthe disk, either in
swapspace or inthe file system.
2. Find a free frame:
a) If there is a free frame, use it.
b)If there is nofree frame, use a page-replacement
algorithmto select anexistingframe to be replaced,
known as the victim frame.
c) Write the victim frame to disk. Change all related
page tables to indicate that this page is nolonger in
memory.
3. Read inthe desiredpage and store it inthe frame. Adjust all
relatedpage andframe tables to indicate the change.
4. Restart the process that was waitingfor this page
 Note that step 2c adds anextra disk write to the page-fault handling, effectivelydoublingthe time requiredto processa page fault. This can be
alleviatedsomewhat byassigninga modify bit, or dirty bit to each page, indicatingwhether or not it has beenchangedsince it was last loadedin

4
from disk. If the dirtybit has not beenset, thenthe page is unchanged,and does not needto be writtenout to disk. Otherwise the page write is
required. It shouldcome as nosurprise that manypage replacement strategiesspecificallylook for pages that do not have their dirtybit set, and
preferentiallyselect cleanpages as victimpages. It should alsobe obvious that unmodifiable code pages never get their dirtybits set.
 There are two major requirements to implement a successful demandpaging system. We must developa frame-allocationalgorithm and a
page-replacement algorithm. The former centers around how manyframes are allocatedto each process (and to other needs), andthe latter
dealswithhow to select a page for replacement whenthere are no free frames available. The overallgoalinselectingandtuningthese
algorithms is to generate the fewest number of overallpage faults. (Because diskaccessis soslowrelative to memory access, evenslight
improvements to these algorithms can yieldlarge improvements inoverall systemperformance.)
 Algorithms are evaluatedusinga given string of memoryaccesses known as a reference string, which canbe generatedinone of ( at least ) three
common ways:
o Randomlygenerated, either evenlydistributed or withsome distributioncurve based onobserved system behavior. This is the
fastest andeasiest approach, but maynot reflect real performance well, as it ignores localityof reference.
o Specificallydesigned sequences. These are useful for illustrating the properties of comparative algorithms inpublished papers and
textbooks, ( and alsofor homeworkandexam problems. :-) )
o Recorded memoryreferences from a live system. Thismaybe the best approach, but the amount ofdata collectedcan be enormous,
on the order of a millionaddresses per second. The volume of collecteddata canbe reducedbymaking two important observations:
 Onlythe page number that was accessedis relevant. The offset within that page does not affect pagingoperations.
 Successive accesses within the same page can be treated as a single page request, because allrequests after the first are
guaranteedto be page hits. ( Since there are nointerveningrequests for other pages that couldremove this page from the
page table. )
**So for example, if pages were ofsize 100 bytes, then the sequence of addressrequests ( 0100, 0432, 0101, 0612, 0634,
0688, 0132, 0038, 0420 ) wouldreduce to page requests ( 1, 4, 1, 6, 1, 0, 4 )
FIFO Page Replacement
 As new pagesare brought in, theyare addedto the tail of a queue, andthe page at the headof the queue is the next victim. Inthe following
example, 20 page requests result in15 page faults:
 Although FIFO is simple andeasy, it is not always optimal, or even efficient. An interesting
effect that canoccur withFIFO is Belady's anomaly, inwhichincreasing the number of frames
available canactuallyincrease the number of page faults that occur! Consider, for example, the
following chart basedonthe page sequence (1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5) and a varying
number of available frames. Obviouslythe maximumnumber of faults is 12 (everyrequest
generates a fault), andthe minimum number is 5 (eachpage loadedonlyonce)...
 In FIFO algorithm, whichever page hasbeen inthe frames the longest is the one that is cleared.
Until Bélády's anomalywas demonstrated, it was believedthat anincrease inthe number of
page frameswouldalways result inthe same number or fewer page faults. Bélády, Nelsonand
Shedler constructedreference strings for which FIFOpage replacement algorithm produced
nearlytwice more page faults ina larger memorythanin a smaller one (wiki).
Optimal Page Replacement
 The discoveryof Belady's anomalylead to the searchfor anoptimalpage-replacement algorithm, whichis simplythat whichyields the lowest ofall
possible page-faults, andwhichdoes not suffer from Belady's anomaly.
 Such an algorithm does exist, and is calledOPT or MIN. This algorithmis simply"Replace the page that will not be used for the longest time inthe
future." (www.youtube.com/watch?v=XmdgDHhx0fg clearlyexplains:Lookahead intothe sequence to see whichnumber won’t be requiredfor
the longest period, page out that number). FIFO can
take 2-3 times more time thanOPT/MIN.
 OPT cannot be implemented in practice, because it
requiresforetellingthe future, but it makes a nice
benchmark for the comparisonandevaluationof real
proposednew algorithms.
 In practice most page-replacement algorithms tryto
approximate OPT bypredicting(estimating) in one
fashionor another what page will not be usedfor the

5
longest periodof time. The basis of FIFO is the predictionthat the page that was brought in the longest time ago is the one that willnot be needed
againfor the longest future time, but as we shall see, there are manyother prediction methods, all strivingto match the performance of OPT.
LRU Page Replacement
 The predictionbehind LRU, the Least RecentlyUsed, algorithm is
that the page that has not beenusedinthe longest time is the
one that will not be usedagaininthe near future. (Note the
distinctionbetweenFIFO andLRU:The former looks at the oldest
loadtime, andthe latter looks at the oldest use time.) Some view
LRU as analogous to OPT, except lookingbackwards intime
insteadof forwards. (OPT has the interestingpropertythat for any
reference string S andits reverse R, OPT will generate the same number of page faults for S and for R. It turns out that LRU hasthis same
property.) Figure 9.15 illustrates LRU for our sample string, yielding12 page faults, (ascompared to 15 for FIFO and9 for OPT.)
 LRU is considereda goodreplacement policy, andis oftenused. The problemis howexactlyto implement it. There are two simple
approaches commonlyused:
o Counters: Everymemoryaccess increments a counter, andthe current value ofthis counter is storedinthe page table entryfor
that page. Then finding the LRU page involves simple searching the table for the page withthe smallest counter value. Note that
overflowing of the counter must be considered.
o Stack:Another approach is to usea stack, and whenever a page is accessed, pullthat page from the middle of the stack andplace it
on the top. The LRU page will always be at the bottomof the stack. Because thisrequiresremovingobjects from the middle of the
stack, a doublylinkedlist is the recommendeddata structure (last line grey).
 Both implementations of LRU require hardware support, either for
incrementing the counter or for managingthe stack, as these operations
must be performedfor everymemoryaccess.
 Neither LRU or OPT exhibit Belady's anomaly. Bothbelongto a class of
page-replacement algorithms calledstackalgorithms, which cannever
exhibit Belady's anomaly. A stackalgorithmis one in whichthe pages
kept in memoryfor a frame set of size N will always be a subset of the
pages kept for a frame size of N + 1. In the case of LRU, (andparticularly
the stack implementationthereof), the topN pages of the stackwill be
the same for all frame set sizes ofN or anythinglarger.
 LRU-Approximation Page Replacement: Full implementationof LRU
requireshardware support, andfew systems provide the full hardware
support necessary. However manysystems offer some degree of HW
support, enough to approximate LRU fairlywell. (Inthe absence of ANY hardware support, FIFO might be the best available choice.)In
particular, manysystems provide a reference bit for everyentryina page table, whichis set anytime that page is accessed. Initiallyall bits
are set to zero, andtheycan alsoall be cleared at a nytime. One bit of precisionis enough to distinguishpages that have beenaccessed since
the last clear fromthose that have not, but does not provide anyfiner grain ofdetail.
 Additional-Reference-Bits Algorithm: Finer grainis possible bystoring the most recent 8 reference bits for each page inan 8-bit byte inthe
page table entry, which is interpretedas anunsignedint. At periodic intervals (clock interrupts), the OS takes over, and right-shifts eachof
the reference bytes byone bit. The high-order (leftmost)bit is then filled inwith the current value of the reference bit, andthe reference bits
are cleared. At anygiventime, the page withthe smallest value for the reference byte is the LRU page. Obviouslythe specific number of bits
usedand the frequencywith whichthe reference byte is updated are adjustable, andare tunedto give the fastest performance on a given
hardware platform.
 Second-Chance Algorithm: Imagine a pointer that moves continuously
from the topmost frame to the bottom and thenback again. If the
pointer is at position Xat a point of time, andthat frame gets filled witha
page fromthe page sequence provided, then the pointer moves/points
to the next frame. The reference bits are set to 0 the first time a new
page is paged in. Anymore reference to that page sets its reference bit
to 1. If the pointer is at a frame whose reference bit is 1, and the next
reference is againto the same page as present in the current frame, then
the bit doesn't become 2! A frame's content is cleanedand replaced only
if the pointer is pointingto it andit's reference bit is 0. If its reference bit
is 1, then the next frame who reference bit is 0 is replaced, but at the
same time, the current frame's reference bit (whichis currently1), is
changed/set to zero before the pointer moves aheadto the next frame
(http://www.mathcs.emory.edu/~cheung/Courses/355/Syllabus/9-
virtual-mem/SC-replace.html). The book's figure is not clear at all for understanding, but nevertheless providingit below.

6
 The second chance algorithm (or ClockAlgorithm)is essentiallya FIFO,
except the reference bit is used to give pagesa secondchance at staying
in the page table. Whena page must be replaced, the page table is
scannedina FIFO (circular queue) manner. If a page is foundwithits
reference bit not set, thenthat page is selectedas the next victim. If,
however, the next page inthe FIFO does have its reference bit set, then
it is given a second chance: The reference bit is cleared, and the FIFO
search continues. If some other page is foundthat didnot have its
reference bit set, thenthat page will be selectedas the victim, and this
page (the one beinggiventhe second chance)will be allowedto stayin
the page table. If, however, there are noother pages that donot have
their reference bit set (to put it simply, all have their bits set), thenthis
page will be selected as the victimwhenthe FIFO search circles back
aroundto this page on the secondpass. If all reference bits in the table
are set, then secondchance degrades to FIFO, but alsorequires a
complete search ofthe table for everypage-replacement. As long as
there are some pageswhose reference bits are not set, thenanypage
referenced frequentlyenoughgets to stayinthe page table indefinitely.
 Enhanced Second-Chance Algorithm: The enhancedsecondchance
algorithmlooks at the reference bit andthe modifybit (dirtybit) as an
orderedpage, and classifiespagesintoone of four classes:(0, 0) - Neither recentlyused nor modified. (0, 1) - Not recentlyused, but
modified. (1, 0) - Recentlyused, but clean. (1, 1) - Recentlyused andmodified. This algorithm searches the page table in a circular fashion (in
as manyas four passes), lookingfor the first page it canfindin the lowest numberedcategory. I.e. it first makes a pass looking for a (0, 0),
and then if it can't find one, it makes another passlooking for a (0, 1), etc. The main difference betweenthis algorithm and the previous one
is the preference for replacingclean pages if possible.
 Counting-Based Page Replacement: There are several algorithms based oncountingthe number ofreferences that have beenmade to a
given page, such as: (A) Least Frequently Used, LFU – Replace the page withthe lowest reference count. A problem can occur if a page is
usedfrequentlyinitiallyandthennot used anymore, as the reference count remains high. A solutionto this problem is to right-shift the
counters periodically, yielding a time-decaying average reference count. (B) Most Frequently Used, MFU – Replace the page withthe
highest reference count. The logic behindthis ideais that pagesthat have alreadybeen referenceda lot have been inthe system a long time,
and we are probablydone withthem, whereas pagesreferenced onlya few times have onlyrecentlybeen loaded, andwe still needthem.
In general counting-based algorithms are not commonlyused, as their implementationis expensive andtheydo not approximate OPT well.
 Page-Buffering Algorithms: There are a number of page-buffering algorithms that canbe usedinconjunctionwith the afore-mentioned
algorithms, to improve overallperformance andsometimes make up for inherent weaknesses in the hardware and/or the underlyin gpage-
replacement algorithms —
o Maintaina certainminimum number of free frames at alltimes. Whena page-fault occurs, goaheadandallocate one ofthe free
frames fromthe free list first, to get the requestingprocess upand running againas quicklyas possible, and thenselect a victim
page to write to disk andfree upa frame as a secondstep.
o Keep a list of modifiedpages, andwhenthe I/O system is otherwise idle, have it write these pagesout to disk, andthenclear the
modifybits, therebyincreasing the chance of finding a "clean" page for the next potential victim.
o Keep a pool of free frames, but remember what page was init before it was made free. Since the data in the page is not actually
clearedout whenthe page is freed, it canbe made anactive page againwithout having to loadinanynew data from disk. This is
useful when analgorithmmistakenlyreplacesa page that in fact is needed again soon.
 Some applications like database programs undertake their ownmemorymanagement overridingthe general-purpose OS for data accessing
and caching needs. Theyare oftengiven a rawdiskpartition to work with, containingrawdata blocks, andno file system structure.
Allocation of Frames
We saidearlier that there were twoimportant tasks invirtualmemorymanagement:a page-replacement strategyanda frame-allocation strategy. This
sectioncovers the secondpart of that pair.
 Minimum Number of Frames: The absolute minimum number of frames that a process must be allocated is dependent onsystem
architecture, and corresponds to the worst-case scenarioof the number of pagesthat could be touched bya single (machine)instruction. If an
instruction (andits operands) spans a page boundary, thenmultiple pagescouldbe needed just for the instruction fetch. Memoryreferences
in an instructiontouchmore pages, andifthose memorylocations can spanpage boundaries, then multiple pages could be neededfor
operandaccess also. The worst case involves indirect addressing, particularlywhere multiple levels of indirect addressinga re allowed. Left
unchecked, a pointer to a pointer to a pointer to a pointer to a . . . couldtheoreticallytoucheverypage inthe virtual address space in a single
machine instruction, requiringeveryvirtual page be loaded into physicalmemorysimultaneously. For this reasonarchitectures place a limit
(say16) on the number of levels of indirectionallowedinan instruction, whichis enforced witha counter initializedto th e limit and
decremented witheverylevel of indirection inaninstruction - Ifthe counter reaches zero, thenan "excessive indirection" trap occurs. This
example would still require a minimum frame allocationof 17 per process.

7
 Allocation Algorithms:
o Equal Allocation - If there are m framesavailable andn processes to share them, eachprocess gets m/nframes, andthe leftovers are
kept in a free-frame buffer pool.
o Proportional Allocation - Allocate the framesproportionallyto the size of the process, relative to the total size ofallprocesses. So if
the size of process i is S_i, and S is the sumof all S_i, thenthe allocationfor process P_i is a_i = m * S_i / S. Variations onproportional
allocationcouldconsider priorityof process rather thanjust their size. Obviouslyallallocations fluctuate over time as the number of
available free frames, m, fluctuates, andall are also subject to the constraints ofminimum allocation. (If the minimumallo cations
cannot be met, thenprocesses must either be swappedout or not allowedto start untilmore free framesbecome available.)
 Global versus Local Allocation: One bigquestion is whether frame allocation(page replacement) occurs ona local or global level. With local
replacement, the number of pages allocated to a process is fixed, andpage replacement occurs onlyamongst the pages allocatedto this
process. With global replacement, anypage maybe a potential victim, whether it currentlybelongs to the process seekinga free frame or not.
Local page replacement allows processesto better control their ownpage fault rates, andleads to more consistent performance of a given
process over different system loadlevels. Global page replacement is overall more efficient, andis the more commonlyusedapproach.
 Non-Uniform Memory Access (Consolidates understanding): The above arguments all assume that all memoryis equivalent, or at least has
equivalent access times. This maynot be the case in multiple-processor systems, especiallywhere each CPU is physicallylocatedona separate
circuit board which also holds some portionof the overall system memory. In these latter systems, CPUs canaccessmemorytha t is physically
locatedon the same board muchfaster thanthe memoryon the other boards. The basic solutionis akin to processor affinity- At the same
time that we tryto schedule processes onthe same CPU to minimize cache misses, we also tryto allocate memoryfor those processeson the
same boards, to minimize access times. The presence of threads complicates the picture, especiallywhen the threads get loadedonto different
processors. Solarisuses anlgroup as a solution, ina hierarchicalfashionbased onrelative latency. For example, all processors andRAMon a
single board would probablybe in the same lgroup. Memory assignments are made within the same lgroup if possible, or to the next nearest
lgroup otherwise. (Where "nearest" is definedas having the lowest access time.)
Thrashing
If a process cannot maintainits minimum required number of frames, then it must be swappedout, freeing upframesfor other processes. This is an
intermediate level of CPU scheduling. But what about a process that cankeep its minimum, but cannot keep all of the frames that it is currentlyusing on
a regular basis? In thiscase, it is forcedto page out pagesthat it will needagaininthe verynear future, leadingto large numbers of page faults. A
process that is spendingmore time pagingthanexecutingis said to be thrashing.
 Cause of Thrashing: Earlyprocess scheduling schemes wouldcontrol the level of
multiprogramming allowedbased onCPU utilization, adding inmore processes
when CPU utilizationwas low. The problem is that whenmemoryfilledup and
processes started spending lots of time waitingfor their pagesto page in, then
CPU utilization would lower, causingthe schedule to add ineven more
processes andexacerbating the problem! Eventuallythe systemwould
essentiallygrindto a halt. Local page replacement policiescanprevent one
thrashing process fromtaking pagesawayfrom other processes, but it still tends
to clog up the I/O queue, therebyslowing downanyother process that needs to
do even a littlebit of paging (or anyother I/O for that matter.)
To prevent thrashing we must provide processes withas manyframes as theyreallyneed"right now", but how dowe know what that is? The
localitymodel notes that processes typicallyaccess memoryreferences ina given locality, makingl ots of referencesto the same general area
of memorybefore moving periodicallyto a new locality, as shown inFigure 9.19 below. If we couldjust keepas manyframes as are involvedin
the current locality, thenpage faultingwouldoccur primarilyon switches fromone localityto another. (E.g. when one functionexits and
another is called.)
 Working-Set Model: The workingset modelis basedon the concept of locality, and defines a working set window, oflengthdelta. Whatever
pages are included inthe most recent delta page references are
said to be inthe processes working set window, andcomprise its
current working set, as illustratedin Figure 9.20:
The selection ofdelta is critical to the success ofthe workingset
model - If it is too small thenit does not encompass all ofthe pages
of the current locality, andif it is too large, thenit encompasses
pages that are nolonger being frequentlyaccessed. The total
demand, D, is the sum of the sizes of the working sets for all processes. If D exceeds the total number of available frames, thenat least one
process is thrashing, because there are not enough framesavailable to satisfyits minimum working set. If D is significantly lessthanthe
currentlyavailable frames, thenadditional processescanbe launched. The hard part of the working-set model is keeping trackof what pages
are in the current workingset, since everyreference adds one to the set and removes one older page. An approximationcanbe made using
reference bits anda timer that goes off after a set interval of memoryreferences:For example, suppose that we set the timer to go off after
every5000 references (byanyprocess), and we canstore twoadditional historical reference bits inadditionto the current reference bit. Every
time the timer goes off, the current reference bit is copied to one of the twohistorical bits, andthen cleared. If anyof the three bits is set, then

8
that page was referenced withinthe last 15,000 references, andis considered to be in that processes reference set. Finer resolutioncanbe
achieved withmore historical bits and a more frequent timer, at the expense of greater overhead.
 Page-Fault Frequency: A more direct approach is to recognize that what we reallywant to control is the page-fault rate, andto allocate frames
basedonthis directlymeasurable value. Ifthe page-fault rate exceeds a certainupper boundthenthat process needs more frames, andif it is
below a givenlower bound, thenit can affordto give upsome ofi ts frames to other processes. (Illinois professor supposes a page-replacement
strategycouldbe devisedthat wouldselect victim framesbasedonthe processwith the lowest current page-fault frequency.). Note that there
is a direct relationshipbetween the page-fault rate andthe working-set, as a process moves fromone localityto another (unnumbered side-
bard-9th Ed).
.
Memory-Mapped Files
Rather thanaccessing data files directlyvia the file system witheveryfileaccess, data files canbe pagedinto memorythe same as process files, resulting
in much faster accesses (except ofcourse when page-faults occur.) Thisis knownas memory-mapping a file.
 Basic Mechanism: Basicallya file is mapped to an address range withina
process's virtual address space, and thenpaged in as neededusing the
ordinarydemandpagingsystem. Note that file writes are made to the
memorypage frames, andare not immediatelywrittenout to disk. (This is the
purpose of the "flush()" systemcall, whichmayalsobe neededfor stdout in
some cases.) Thisis alsowhyit is important to "close()" a file whenone is
done writing to it - So that the data can be safelyflushedout to disk and so
that the memoryframes canbe freedup for other purposes. Some systems
provide specialsystemcalls to memorymapfiles anduse direct disk access
otherwise. Other systems mapthe file to process address space ifthe special
systemcalls are used andmapthe file to kernel address space otherwise, but
do memorymapping in either case. File sharing is made possible bymapping
the same file to the address space of more than one process, as shownin
Figure 9.23 below. Copy-on-write is supported, andmutual exclusion
techniques (chapter 6) maybe neededto avoid synchronization problems.
 Memory-Mapped I/O: All access to devices is done bywriting into(or reading from) the device's registers. Normallythis is done via special I/O
instructions. For certaindevices it makes sense to simplymap the device's registers to addressesinthe process's virtual address space, making
device I/O as fast andsimple as anyother memoryaccess. Videocontroller cards are a classic example ofthis. Serialand parallel devicescan
also use memorymappedI/O, mapping the device registers to specific memorya ddresses known as I/O Ports, e.g. 0xF8. Transferring a series
of bytes must be done one at a time, movingonlyas fast as the I/O device is preparedto processthe data, through one of twomechanisms:
o Programmed I/O (PIO), also known as polling – The CPU periodicallychecks the control bit on the device, to see ifit is readyto
handle another byte of data.
o Interrupt Driven – The device generates aninterrupt when it either hasanother byte of data to deliver or is readyto receive another
byte.
Allocating Kernel Memory
Previous discussions have centeredonprocessmemory, which can be convenientlybrokenup into page-sizedchunks, andthe onlyfragmentation
that occurs is the average half-page lost to internal fragmentationfor each process (segment). There is alsoadditional memoryallocatedto the kernel,
however, which cannot be so easilypaged. Some ofit is usedfor I/O buffering anddirect access bydevices, for example, and must therefore be
contiguous andnot affectedbypaging. Other memoryis usedfor internal kernel data structures of various sizes, andsince kernelmemoryis often
locked(restrictedfrom being ever swapped out), management of thisresource must be done carefullyto avoid internal fragmentation or other waste.

9
(i.e. you wouldlike the kernelto consume as littlememoryas possible, leavingas muchas possible for user processes.)Accord inglythere are several
classic algorithms inplace for allocating kernel memorystructures.
 Buddy System: The BuddySystemallocates memoryusing
a power of twoallocator. Under this scheme, memoryis
always allocatedas a power of 2 (4K, 8K, 16K, etc),
roundingupto the next nearest power of two if
necessary. If a block ofthe correct size is not currently
available, then one is formed bysplittingthe next larger
block in two, forming twomatchedbuddies. (Andifthat
larger size is not available, thenthe next largest available
size is split, and soon.)One nice feature ofthe buddy
systemis that if the address of a block is exclusivelyORed
with the size ofthe block, the resultingaddress is the
address of the buddyof the same size, whichallows for
fast andeasycoalescing of free blocks back intolarger
blocks. Free lists are maintained for everysize block. If the
necessaryblock size is not available uponrequest, a free
block fromthe next largest size is split into twobuddies of
the desiredsize. (Recursivelysplitting larger size blocks if necessary.) When a block is freed, its buddy's addressis calculated, andthe free list
for that size block is checked to see if the buddyis alsofree. If it is, thenthe two buddies are coalescedintoone larger free block, andthe
process is repeated withsuccessivelylarger free lists. See the (annotated) Figure 9.27 below for an example.
 Slab Allocation: Slab Allocationallocatesmemoryto the kernelin
chunks calledslabs, consisting ofone or more contiguous pages. The
kernel thencreatesseparate cachesfor eachtype ofdata structure it
might need from one or more slabs. Initiallythe cachesare marked
empty, andare markedfull as theyare used. Newrequests for space in
the cache is first grantedfrom emptyor partiallyemptyslabs, and if all
slabs are full, thenadditional slabs are allocated. Thisessentially
amounts to allocating space for arrays of structures, in large chunks
suitable to the size ofthe structure beingstored. For example if a
particular structure were 512 bytes long, space for them wouldbe
allocated in groups of8 using 4Kpages. Ifthe structure were 3K, then
space for 4 of them couldbe allocated at one time ina slabof 12K
using three 4Kpages. Benefits ofslab allocationinclude lackof internal
fragmentationandfast allocation ofspace for individual structures
Solaris usesslab allocationfor the kerneland also for certainuser-
mode memoryallocations. Linux used the buddysystemprior to 2.2
and switched to slaballocation since then.
Other Considerations:
 Prepaging: The basic idea behindprepaging is to predict the pages that will be needed inthe near future, and page them inbefore theyare
actuallyrequested. If a process was swapped out andwe know what its workingset wasat the time, thenwhenwe swapit back inwe cango
aheadandpage back in the entire working set, before the page faults actuallyoccur. Withsmall(data)files we cango aheadand prepage allof
the pagesat one time. Prepaging can be ofbenefit ifthe predictionis goodand the pages are neededeventually, but slows the system downif
the predictionis wrong.
 Page Size: There are quite a few trade-offs ofsmall versus large page sizes:Small pages waste less memorydue to internal fragmentation.
Large pages require smaller page tables. For disk access, the latencyandseek times greatlyoutweigh the actual data transfer times. Thismakes
it much faster to transfer one large page of data thantwo or more smaller pages containing the same amount of data. Smaller pages match
localitybetter, because we are not bringing in data that is not reallyneeded. Small pages generate more page faults, with attending overhead.
The physical hardware mayalsoplaya part in determiningpage size. It is hard to determine an"optimal" page size for anyg ivensystem.
Current norms range from4K to 4M, and tendtowards larger page sizes as time passes.
 TLB Reach: TLB Reachis defined as the amount of memorythat canbe reachedbythe pages listedinthe TLB. Ideallythe working set wouldfit
within the reach ofthe TLB. Increasing the size of the TLB is an obvious wayof increasing TLB reach, but TLB memoryis veryexpensive and also
draws lots of power. Increasingpage sizesincreasesTLB reach, but alsoleads to increasedfragmentationloss. Some systems provide multiple
size pagesto increase TLB reachwhile keeping fragmentationlow. Multiple page sizesrequires that the TLB be managed bysoftware, not
hardware.
 Program Structure: Consider a pair of nestedloops to access everyelement ina 1024 x 1024 two-dimensional arrayof 32-bit ints. Arrays in C
are storedinrow-major order, which means that eachrow of the arraywould occupya page of memory. If the loops are nested sothat the

10
outer loopincrements the row and the inner loopincrements the column, thenanentire row canbe processedbefore the next page fault,
yielding 1024 page faults total. On the other hand, if the loops are nested the other way, so that the program worked downth e columns
insteadof across the rows, theneveryaccess would be to a different page, yielding a new page fault for eachaccess, or over a millionpage
faults all together. Be aware that different languagesstore their arrays differently. FORTRAN for example storesarrays incolumn-major format
insteadof row-major. This means that blindtranslation ofcode fromone language to another mayturn a fast programintoa veryslow one,
strictlybecause ofthe extra page faults.
 I/O Interlock and Page Locking: There are severaloccasions whenit maybe desirable to lock pages in memory, and not let them get pagedout
— Certainkerneloperations cannot tolerate having their pages swapped out. IfanI/Ocontroller is doingdirect-memoryaccess, it would be
wrong to change pages in the middle of the I/O operation. Ina prioritybasedscheduling system, lowpriorityjobs mayneed to wait quite a
while before getting their turn onthe CPU, and there is a danger of their pages being pagedout before theyget a chance to use them even
once after paging them in. In this situation pages maybe lockedwhentheyare paged in, until the processthat requested themgets at least
one turn inthe CPU.
Operating-System Examples (Optional)
This section is onlyto consolidate your understanding andhelprevise the concepts inyour mindin a real-life case study. Just read throughit, noneed
to push yourself to memorize anything. Just mapmentallywhat youlearnt intothese realOS examples.
Windows:
 Windows uses demandpaging with clustering, meaning theypage in multiple pages whenever a page fault occurs.
 The working set minimumandmaximum are normallyset at 50 and345 pages respectively. (Maximums can be exceededinrare
circumstances.)
 Free pages are maintained ona free list, witha minimumthresholdindicatingwhenthere are enough free frames available.
 If a page fault occurs andthe processis below their maximum, thenadditional pages are allocated. Otherwise some pagesfrom thisprocess
must be replaced, using a local page replacement algorithm.
 If the amount offree frames falls below the allowable threshold, thenworking set trimming occurs, taking frames awayfroma nyprocesses
which are above their minimum, until all are at their minimums. Thenadditional framescanbe allocated to processesthat need them.
 The algorithmfor selecting victimframes depends onthe type of processor:
o On single processor 80x86 systems, a variationof the clock( secondchance ) algorithm is used.
o On Alpha andmultiprocessor systems, clearing the reference bits mayrequire invalidating entriesinthe TLB on other processors,
which is anexpensive operation. Inthis case Windows uses a variationof FIFO.
Solaris:
 Solaris maintains a list of free pages, andallocatesone to a faulting
thread whenever a fault occurs. It is therefore imperative that a
minimum amount of free memorybe kept on handat all times.
 Solaris hasa parameter, lotsfree, usuallyset at 1/64 of total physical
memory. Solaris checks 4 times per secondto see if the free memory
falls belowthis threshhold, and if it does, then the pageout process is
started.
 Pageout uses a variationof the clock(secondchance) algorithm, with
two hands rotating around throughthe frame table. The first hand
clears the reference bits, andthe secondhandcomesbyafterwards and
checks them. Anyframe whose reference bit hasnot beenreset before
the second handgets there gets pagedout.
 The Pageout methodis adjustable bythe distance betweenthe two hands, (the handspan), andthe speedat which the hands move. For
example, if the hands each check100 frames per second, andthe handspanis 1000 frames, thenthere wouldbe a 10 secondinterval between
the time whenthe leadinghandclears the reference bits and the time when the trailing handchecks them.
 The speedof the hands is usuallyadjusted according to the amount of free memory, as shownbelow. Slowscan is usuallyset at 100 pagesper
second, and fastscanis usuallyset at the smaller of 1/2 of the total physical pages per second and 8192 pages per second.
 Solaris alsomaintains a cache of pages that have beenreclaimedbut whichhave not yet beenoverwritten, as opposedto the free list which
onlyholds pages whose current contents are invalid. If one of the pages fromthe cache is neededbefore it gets movedto the free list, thenit
can be quicklyrecovered.
 Normallypageout runs 4 timesper secondto check if memoryhas fallenbelow lotsfree. However if it falls belowdesfree, thenpageout will
run at 100 times per second inanattempt to keepat least desfree pages free. If it is unable to dothis for a 30-secondaverage, thenSolaris
begins swapping processes, starting preferablywithprocesses that have beenidle for a long time.
 If free memoryfallsbelow minfree, then pageout runs witheverypage fault.
 Recent releases ofSolaris have enhancedthe virtual memorymanagement system, includingrecognizing pages fromsharedlibraries, and
protecting themfrom beingpagedout.

11
 Specifics:
 Linux-specific stuff
o XX
 Hardware-specific:
o XX
To be cleared
 Inverted Page Tables: Invertedpage tables store one entryfor eachframe instead ofone entryfor eachvirtual page. This reduces the memory
requirement for the page table, but loses the information neededto implement virtual memorypaging. A solutionis to keep a separate page
table for each process, for virtualmemorymanagement purposes. These are kept ondisk, andonlypaged inwhena page fault o ccurs. (i.e.
theyare not referencedwitheverymemoryaccess the waya traditional page table would be.)—Greyandinadequate as ofnow
Q’s Later
 XXX

12
Glossary
ReadLater
Further Reading
 Skipped: SharedMemoryinthe Win32 API (Memory-mapped filessection. There’s a figure there that says “Figure 9.26 Consumer reading from
sharedmemoryusingthe Win32 API”)

Grey Areas
 XXX
CHEW
 Whether the logicalpage size is equal to the physicalframe size (Yes!)
 Note that paging is like having a table ofrelocation registers, one for each page of the logicalmemory
 Page table entries (frame numbers) are typically32 bit numbers, allowing access to 2^32 physical page frames. Ifthose frames are 4 KB in size
each, that translates to 16 TB of addressable physical memory. (32 + 12 = 44 bits of physicaladdress space.)
 One optionis to use a set of registers for the page table. For example, the DECPDP-11 uses16-bit addressingand8 KB pages, resultinginonly
8 pages per process. (It takes13 bits to address8 KB of offset, leaving only3 bits to define a page number.)
 On page 12 of the lecture, do the TLB mathunder "(EighthEditionVersion:)". Required.
 More on TLB
 Apropos page 10 second bullet point of lecture, does it implicitlymean that the offset for bothpage number andframer number should be
same?
 Page 15 VAXArchitecture divides 32-bit addresses into 4 equalsized sections, andeachpage is 512 bytes, yieldinganaddress form of:
 What are segmentationunit and paging unit?
 Can parts of a page table/ page directorybe swappedout too?

Virtual memory pre-final-formatting

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (14)

Ähnlich wie Virtual memory pre-final-formatting

Ähnlich wie Virtual memory pre-final-formatting (20)

Mehr von marangburu42

Mehr von marangburu42 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Virtual memory pre-final-formatting