
Resource Management in Linux
BITS ZG629T: Dissertation
by
Balbir Singh
2005HZ12140
Dissertation work carried out at
IBM, Bangalore
BIRLA INSTITUTE OF TECHNOLOGY &
SCIENCE
PILANI (RAJASTHAN)
March 2007
Resource Management in Linux
BITS ZG629T: Dissertation
by
Balbir Singh
2005HZ12140
Dissertation work carried out at
IBM, Bangalore
Submitted in partial fulfillment of M.S.
Software Systems degree programme
Under the Supervision of
Srivatsa Vaddagiri, Software Engineer, IBM,
Bangalore
BIRLA INSTITUTE OF TECHNOLOGY &
SCIENCE
PILANI (RAJASTHAN)
March 2007
CERTIFICATE
This is to certify that the Dissertation entitled Resource Management in
Linux and submitted by Balbir Singh having ID-No. 2005HZ12140 for the
partial fulfillment of the requirements of M.S. Software Systems degree of
BITS, embodies the bonafide work done by him under my supervision.
Signature of the Supervisor
Place :
Date :
Name, Designation & Organization & Location
Abstract
This dissertation is an effort to develop a resource management framework for
Linux. The framework consists of a basic infrastructure that allows resources
such as CPU, memory and disk I/O to be controlled and monitored. It
investigates the development of resource management, the alternative solutions
proposed and the challenges faced in developing such a framework.
The dissertation describes the most essential concepts, such as an
infrastructure, controllers, feedback and monitoring of resources. Each resource
controller has its own set of problems to solve. Since the work is being done
for a large community of Linux users, the addition of a new feature such as
this should not impact existing users or users who are not interested in using
resource management.
The dissertation looks at two commonly used resources, CPU and memory.
The resource control and monitoring mechanisms developed for them are
described in great detail in two separate chapters devoted to them.
Acknowledgments
I am greatly indebted to my supervisor Srivatsa Vaddagiri for reviewing
the project and discussing its progress with me. My additional examiner
Maneesh Soni was very helpful throughout the duration of the project. I
want to thank Dr. Rahul Banerjee for being patient with me and answering
several of my questions about the dissertation.
I want to thank my parents (Mahendra Singh and Jaswant Kaur) and
my wife (Manpreet Kaur) for allowing me time to work on the dissertation,
while they waited for me at the dinner table, served me food, had a cheerful
conversation and encouraged me to finish my dissertation. My brother
(Tajendra Pal) offered encouragement and logistical support to help finish
the project on time.
Finally I want to thank the Linux community, who reviewed many lines
of code (some of which was silly or plain wrong), answered questions and gave
ideas to improve the project. Jamal Hadi, in particular, carried out detailed
reviews of the taskstats code I wrote.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Resource Management Architecture 3
2.1 Accounting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.1 Grouping of tasks . . . . . . . . . . . . . . . . . . . . . 5
2.4.2 Movement of tasks . . . . . . . . . . . . . . . . . . . . 5
2.4.3 Limits and guarantees for resource groups . . . . . . . 5
3 Feedback 9
3.1 top . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 getrusage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Delay Accounting . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 Delay Accounting Architecture . . . . . . . . . . . . . 13
3.3.2 Delay Accounting Performance . . . . . . . . . . . . . 14
4 Infrastructure 19
4.1 Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Implementing a controller . . . . . . . . . . . . . . . . . . . . 22
4.3.1 Callbacks . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 Disadvantages of the Single Filesystem Hierarchy . . . . . . . 22
4.5 Implementation Alternatives . . . . . . . . . . . . . . . . . . . 23
4.5.1 Beancounters . . . . . . . . . . . . . . . . . . . . . . . 23
4.5.2 Class based Kernel Resource Management . . . . . . . 24
4.5.3 Containers . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Linux Internals 33
5.1 Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.1 Memory Allocation . . . . . . . . . . . . . . . . . . . . 36
5.1.2 Process Address Space . . . . . . . . . . . . . . . . . . 36
5.1.3 Memory Reclaim . . . . . . . . . . . . . . . . . . . . . 37
6 Memory Controller 45
6.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2.1 Accounting . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2.2 Shared Pages Accounting . . . . . . . . . . . . . . . . . 49
6.2.3 Per Container Reclaim . . . . . . . . . . . . . . . . . . 49
6.2.4 LRU Behaviour . . . . . . . . . . . . . . . . . . . . . . 50
6.2.5 OOM . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2.6 Freeing A Page . . . . . . . . . . . . . . . . . . . . . . 50
7 Results 55
7.1 System Time Test . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2 %CPU Time Test . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3 Explanation Of Results . . . . . . . . . . . . . . . . . . . . . . 56
7.4 Minor Fault Tests . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.5 Major Fault Tests . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.6 Explanation of Results . . . . . . . . . . . . . . . . . . . . . . 59
8 Summary 61
9 Directions For Future Work 63
List of Tables
3.1 Table of information that can be read from status and statm
files of /proc . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Data returned from getrusage . . . . . . . . . . . . . . . . . 12
3.3 Configurations used for performance analysis . . . . . . . . . . 15
3.4 Lmbench results, Processor, Processes - times in microseconds
- smaller is better . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 Hackbench results, 200 groups, using sockets Elapsed time, in
seconds, lower better . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 Kernbench results, Average of 5 iterations Elapsed time, in
seconds, lower better . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Context switching - times in microseconds - smaller is better . 16
3.8 *Local* Communication latencies in microseconds - smaller is
better . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.9 File & VM system latencies in microseconds - smaller is better 17
5.1 Kernel Memory Allocation Contexts . . . . . . . . . . . . . . . 38
List of Figures
2.1 Resource Management Architecture . . . . . . . . . . . . . . . 3
3.1 Screenshot of the top(1) program . . . . . . . . . . . . . . . . 10
3.2 Architectural View of Delay Accounting . . . . . . . . . . . . 13
3.3 TLV format of data exchange . . . . . . . . . . . . . . . . . . 14
4.1 Infrastructure components . . . . . . . . . . . . . . . . . . . . 20
4.2 Resource Distribution . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 A Bean Counter . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Bean Counter Overview . . . . . . . . . . . . . . . . . . . . . 28
4.5 Aggregated Bean Counters . . . . . . . . . . . . . . . . . . . . 29
4.6 CKRM Overview . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.7 Containers Overview . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 VM overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Zone Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Page Cache Radix Tree . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Prio Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Anonymous Pages Reverse Mapping . . . . . . . . . . . . . . . 44
6.1 Memory Allocation From Within Container Flowchart . . . . 51
6.2 RSS Controller Overview . . . . . . . . . . . . . . . . . . . . . 52
6.3 Per Container LRU List . . . . . . . . . . . . . . . . . . . . . 53
6.4 Page State Transition Diagram . . . . . . . . . . . . . . . . . 53
7.1 System Time Variation With Varying Container Sizes . . . . . 56
7.2 % CPU Time Variation With Varying Container Sizes . . . . . 57
7.3 Minor Page Fault Variation With Varying Container Sizes . . 58
7.4 Major Page Fault Variation With Varying Container Sizes . . 59
List Of Acronyms
CPU Central Processing Unit
I/O Input Output
POSIX IEEE/Open group standard for Portable Operating System
Interfaces
RSS Resident Set Size
PID Process Identifier
UID User Identifier
GID Group Identifier
TGID Thread Group Identifier
TLV Type, Length and Value
TCP Transmission Control Protocol
IP Internet Protocol
VM Virtual Memory
OS Operating System
CKRM Class based Kernel Resource Management
BSS Block Started by Symbol
LRU Least Recently Used
PTE Page Table Entry
RMAP Reverse Mapping
VMA Virtual Memory Area
ABI Application Binary Interface
OOM Out Of Memory
Chapter 1
Introduction
In this chapter we look at the motivation behind implementing resource man-
agement and discuss its scope and application to today’s enterprises.
1.1 Background
Virtualization is a key need for large enterprises. It allows enterprises to par-
tition their workloads without adding the overhead of additional hardware,
space and system administration. Each virtual system has a unique role to
play. For example, an enterprise might configure its system into two virtual
systems and use one for production and the other as a test environment.
by developers working on the next release of the product. The production
system needs to use all available resources and the test system should receive
either a small percentage of available resources or unused resources.
To achieve this goal, resource management is used. In the example above,
the system administrator might configure the system to provide 80% of the
CPU to the production system and 20% to the test system. Resources are
usually administrated by grouping tasks and associating them with resources.
For example all tasks in a system belonging to a particular customer could
be grouped together and then they may be assigned a certain percentage
(say 30%), based on the contract with the customer. Accounting is needed
to monitor report usage. Control consists of providing guarantees and lim-
its on resource usage. A guarantee ensures that a certain percentage of a
resource is always available to the group of tasks. A limit ensures that the
1
2 CHAPTER 1. INTRODUCTION
usage (monitored by accounting the resource usage) does not exceed a certain
amount.
1.2 Motivation
The examples given in section 1.1 point us towards the need for sharing
resources while still achieving isolation. A group of tasks is isolated as a container.
Each container has a set of resources associated with it, examples include
CPU, Memory and Disk I/O. Each of these resources is shared across the
operating system, but in a deliberately unequal manner. A certain container might be
business critical and thus requires more resources than another lower priority
container. The isolation is achieved by setting limits and guarantees for each
resource.
Traditional UNIXes provide limited resource control. They provide a per-
task POSIX interface called rlimit(2). rlimit, however, does not meet the
needs of resource management for the following reasons:
1. It is per-task based; most resource management requirements need to
control a group of unrelated tasks
2. In the case of some resources (such as the Resident Set Size (RSS)), the
rlimit is not enforced
Thus, there is a need for developing a new resource management frame-
work and resource controllers to meet the ever increasing need for resource
management in the enterprise.
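For reference, the traditional per-task interface mentioned above looks as follows.
This is a minimal sketch of standard POSIX usage (not part of the framework
developed in this dissertation) that reads and then lowers the CPU time limit of
the calling task; note that limits such as RLIMIT_RSS are accepted by the kernel
but not enforced.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_CPU, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}
	printf("CPU limit: soft=%lu hard=%lu\n",
	       (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

	/* limit this task (and its future children) to 10 seconds of CPU */
	rl.rlim_cur = 10;
	if (setrlimit(RLIMIT_CPU, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}
	return 0;
}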
Chapter 2
Resource Management
Architecture
This chapter discusses an architecture that is applicable to most resource
management solutions that have been proposed so far.
Figure 2.1: Resource Management Architecture
In figure 2.1, the resource manager is shown in modular form as components.
The components include an accounting subsystem, a control subsystem and a
feedback subsystem. The resource in the diagram represents a managed resource.
2.1 Accounting
The accounting subsystem tracks the resource usage of each container. In
the case of a CPU controller (we shall define a controller later in section 4.1),
the accounting subsystem would track the CPU usage of each task and then
track the usage of the group of tasks (container). For renewable resources
like CPU time, it is required to define an interval over which accounting takes
place. In other words, the accounting must be restarted at the end of the
interval.
2.2 Control
The control subsystem implements resource control for the group of tasks.
If the system administrator limited the CPU usage of a group to say 10%
of the total CPU bandwidth available, the controller would jump into action
and prevent the group of tasks from using more than 10% of the bandwidth.
2.3 Feedback
Feedback tells us whether the group of tasks are making forward progress
as per their resource requirements and allocations. A system administrator
might initially assign 10% of the CPU to a critical business application. A
system administrator might then need feedback from the system indicating
how the tasks (applications) are faring with their current resources. If the
feedback indicates that the tasks end up waiting for CPU most of the time,
the administrator knows that it is time to boost the CPU bandwidth of the
tasks.
The form in which feedback is provided depends on the resource being
controlled. In the case of CPU, the feedback would include the following
parameters
• Time on Runqueue
• Time waiting on the Runqueue
A large run time and minimal waiting time on the runqueue
are good indications of a CPU-hungry task making forward progress. The
subject of feedback is discussed in more detail in chapter 3.
2.4 Requirements
The basic minimum common requirements for resource management were
discussed by Srivatsa in [3]
1. Resource Management should be supported for a group of tasks
2. A task should be able to move across resource groups
3. Setting of resource limits for a group of tasks should be supported
2.4.1 Grouping of tasks
As per requirement 1, it should be possible to group unrelated tasks for
resource management. Traditional UNIXes support task grouping in the
form of parent and child grouping. A session leader has to be the parent
or ancestor of all other tasks in that session. Consider a web server and
database server that host a critical application. Even though these processes
are not related, it should be possible to group them into one task group.
2.4.2 Movement of tasks
Requirement 2 states that tasks be allowed to migrate across task groups.
This is extremely useful for database servers. A query thread can run on
behalf of several databases. Depending on the priority or criticality of the
database instance, the thread might migrate to the appropriate resource
group and execute the query.
2.4.3 Limits and guarantees for resource groups
A limit on a resource in a resource group defines the extent to which the
group might utilize the resource. Limits can be classified as
• Hard Limit
Hard limits do not allow the group to consume more resources than the
specified hard limit value. If the limit for the CPU resource utilization
of a group is 10%, then when the group has reached its limit (the accounting
system is responsible for monitoring the resource usage of every group), it is
preempted.
• Soft Limit
Soft limits can be exceeded as long as the system has idle resources to
spare. In the example above, if the soft limit is 10%, the group might
end up using more than 10% of the CPU bandwidth, provided it does
not impact the resource usage of any other group in the system.
Resource Guarantees ensure that a minimum amount of resources will be
available to a group. A guarantee of 10% CPU bandwidth will ensure that
at least 10% of the CPU bandwidth is available to the group. If the group
does not utilize it’s guaranteed resources, the resource manager is free to
redistribute them to other groups.
It has been shown by Pavel, et al. [4] that guarantees can be implemented
using limits. If a group of tasks g_i requires a guarantee of G_i units of a
resource, then limiting the resource usage of the rest of the groups to
100 - G_i units provides the desired guarantee.
Let R be the total amount of the resource available and let g_i denote group i.
Let L_i be the limit applied to group g_i and G_i its guarantee. Then we have

\[
\left\{
\begin{array}{l}
L_2 + L_3 + \cdots + L_N = R - G_1 \\
L_1 + L_3 + \cdots + L_N = R - G_2 \\
\qquad \vdots \\
L_1 + L_2 + \cdots + L_{N-1} = R - G_N
\end{array}
\right.
\tag{2.1}
\]
In matrix form, the equation becomes
AL = G (2.2)
where
\[
A =
\begin{pmatrix}
0 & 1 & 1 & \cdots & 1 & 1 \\
1 & 0 & 1 & \cdots & 1 & 1 \\
  &   &   & \cdots &   &   \\
1 & 1 & 1 & \cdots & 1 & 0
\end{pmatrix},
\quad
L =
\begin{pmatrix}
L_1 \\ L_2 \\ \vdots \\ L_N
\end{pmatrix},
\quad
G =
\begin{pmatrix}
R - G_1 \\ R - G_2 \\ \vdots \\ R - G_N
\end{pmatrix}
\tag{2.3}
\]
and thus the solution is
\[
L = A^{-1} G
\tag{2.4}
\]
After manipulating rows and columns, the inverse works out to
\[
A^{-1} = \frac{1}{N-1}\,\bigl(A - (N-2)I\bigr)
\tag{2.5}
\]
where I is the identity matrix.
The system administrator can at run-time reassign newer limits to any
of the groups. With this approach to providing guarantees, the main issue
is that modifying the guarantee of a single group takes O(N) work. This is
because a change to any guarantee G_i or limit L_i of a group g_i impacts
the calculated limits of all the other groups.
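To make the calculation concrete, the following small program (an illustration
added here, not taken from [4]) applies equation (2.5) directly: for each group i,
L_i = (sum over j != i of (R - G_j) - (N - 2)(R - G_i)) / (N - 1). With R = 100
and guarantees of 50%, 30% and 10% it produces limits of 55%, 35% and 15%.

#include <stdio.h>

/* compute the limits L[] that realize the guarantees G[] for n groups */
static void limits_from_guarantees(int n, double R, const double *G, double *L)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += R - G[i];	/* sum of all (R - G_j) */

	for (i = 0; i < n; i++)
		L[i] = ((sum - (R - G[i])) - (n - 2) * (R - G[i])) / (n - 1);
}

int main(void)
{
	double G[3] = { 50.0, 30.0, 10.0 };	/* guarantees, in percent */
	double L[3];
	int i;

	limits_from_guarantees(3, 100.0, G, L);
	for (i = 0; i < 3; i++)
		printf("L%d = %.1f%%\n", i + 1, L[i]);	/* 55.0, 35.0, 15.0 */
	return 0;
}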
Chapter 3
Feedback
Feedback is a very important component of resource management. It was
briefly described in section 2.3. Feedback can help the administrator or
a system monitor decide if the resources allocated to a certain group are
sufficient.
Feedback can be obtained from the operating system in several forms.
The most important forms are covered, and a new form of feedback called
delay accounting, developed for resource management, is described. Delay
accounting was written by the author and has been accepted into the mainline
Linux kernel.
3.1 top
A common utility used for tracking and monitoring resource usage is top(1).
top is an interactive utility that displays tasks and information about the
%CPU, %MEM of each task.
As shown in figure 3.1, the utility displays CPU idle time, available memory,
swap usage, the number of running tasks, etc. It can also sort the output by any of
the displayed fields. Although top provides good feedback, it is an interactive
program and requires somebody to monitor the system, understand the
output and then apply the feedback.
One way to automate the feedback is to get the information programmatically
from the same place that top gets its information. Linux provides the
/proc filesystem, where all process-specific information is made available;
top also gets its information from /proc.
Figure 3.1: Screenshot of the top(1) program
The /proc filesystem provides the following information per task —
A program can open the relevant file in /proc and read task statistics.
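As a minimal sketch of this approach, the following program reads its own
memory statistics from /proc/self/statm, which reports seven values in pages:
total program size, resident set size, shared pages, text, library, data and
dirty pages.

#include <stdio.h>

int main(void)
{
	unsigned long size, resident, shared, text, lib, data, dt;
	FILE *f = fopen("/proc/self/statm", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%lu %lu %lu %lu %lu %lu %lu",
		   &size, &resident, &shared, &text, &lib, &data, &dt) == 7)
		printf("total=%lu pages, resident=%lu pages, shared=%lu pages\n",
		       size, resident, shared);
	fclose(f);
	return 0;
}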
3.2 getrusage
getrusage(2) is a system call that provides information about the following
parameters of a task.
An integral value implies that the final value is a product of the execution
time of the task and the value represented. In the case of ru_ixrss, for
example, the total shared memory size is multiplied by the execution time of
the task.
Table 3.2 shows the values returned by calling getrusage(2).
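A minimal sketch of using the call on the current task, printing a few of the
fields from table 3.2:

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0) {
		perror("getrusage");
		return 1;
	}
	printf("user time:   %ld.%06lds\n",
	       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
	printf("system time: %ld.%06lds\n",
	       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
	printf("minor faults: %ld, major faults: %ld\n",
	       ru.ru_minflt, ru.ru_majflt);
	return 0;
}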
3.3 Delay Accounting
We listed several methods of gathering feedback about a task’s execution in
this chapter. These mechanisms are not completely sufficient for the purpose
of providing feedback to the resource management subsystem. The mecha-
nisms have the following limitations —
CPU                                            Memory
Name of the task                               Total program size
State of the task (sleeping, running,          Number of shared pages
stopped, traced, blocked)
PID, parent PID, thread group ID, UID, GID     Code size of the program
Number of threads                              Stack and data size
Signals pending, blocked, ignored              Number of library pages
Task capabilities                              Number of dirty pages

Table 3.1: Table of information that can be read from the status and statm
files of /proc
• getrusage works only on the current task (the task calling the
system call) or on a child task of the task making the call
• The data provides information about the time a resource was used by
a task. No information is provided about the contention for a resource
or the delay incurred waiting for a resource
• The information provided is not event based. A program can read
this data, but there is no mechanism to read the data on specific events,
like a task exiting, a new task being forked, etc.
To address these issues, a delay accounting system was developed for
Linux by Balbir Singh and Shailabh Nagar in [5]. Delay accounting provides
the following benefits —
1. It includes delay information
Delay accounting provides the following statistics
(a) CPU
Data includes CPU run time, wait time on the runqueue and virtualized
CPU usage time
(b) Block IO
Data includes the time spent waiting on synchronous block I/O
and the total number of bytes transferred
Field        Meaning
ru_utime     The time the task spent in user space
ru_stime     The time the task spent in kernel space
ru_maxrss    Maximum resident set size of the task
ru_ixrss     Integral shared memory size
ru_idrss     Integral unshared data size of the task
ru_isrss     Integral unshared stack size of the task
ru_minflt    Page reclaims (minor faults) of the task
ru_majflt    Page faults seen by the task
ru_nswap     Number of pages swapped out by the task
ru_inblock   Block input operations for the task
ru_oublock   Block output operations for the task
ru_msgsnd    Messages sent by the task
ru_msgrcv    Messages received by the task
ru_nsignals  Signals sent to the task
ru_nvcsw     Voluntary context switches of the task
ru_nivcsw    Involuntary context switches of the task

Table 3.2: Data returned from getrusage
(c) Swap
Data includes the total bytes swapped and the time waiting for
swapping to take place
2. It’s event based
It notifies the user of certain system events and the event notification
also includes data. Every time a task exits, the task data is multicast
to all listeners. There is also a provision to explicitly request data of a
particular task
3. Both process and thread data is provided
Delay accounting reports both thread data, and the data of the entire
thread group (commonly referred to as process)
3.3.1 Delay Accounting Architecture
The delay accounting subsystem consists of two independent modules. One
that collects delay information from all tasks and the other that communi-
cates this information to user space. The second component is known as
taskstats and is now the standard interface in Linux to communicate per-
task information. Taskstats is based on the genetlink socket interface avail-
able in the Linux kernel. Genetlink is well described by Jamal Hadi and Paul
Moore in [7].
Figure 3.2: Architectural View of Delay Accounting
Figure 3.2 shows the delay accounting architecture. The user space appli-
cation binds to a particular port (which is usually the pid of the application)
and sends/receives data from the kernel’s genetlink component. The kernel’s
genetlink component binds on port 0, to indicate that it is a kernel compo-
nent. All socket data flow is shown in red color. As shown in the diagram,
multiple applications can simultaneously talk through genetlink to receive
data.
The delayacct component is responsible for data collection, whereas
the taskstats component is responsible for communication with user space.
The application can specify the CPUs and tasks it is interested in. The data
from only those tasks or tasks on those CPUs will be communicated back to
the application.
Figure 3.3: TLV format of data exchange
Figure 3.3 shows the format of the data exchanged between kernel and
user space. This format is used by all netlink applications and provides for
type checking of otherwise unstructured data.
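For illustration, the TLV header used by netlink is struct nlattr from
<linux/netlink.h>: a 16-bit length, a 16-bit type, followed by the payload
padded to a 4-byte boundary. A sketch of walking a buffer of such attributes
(attr_buf and buf_len are assumed inputs) might look like this:

#include <stdio.h>
#include <linux/netlink.h>

static void walk_attrs(const char *attr_buf, int buf_len)
{
	const struct nlattr *na = (const struct nlattr *)attr_buf;

	while (buf_len >= (int)sizeof(*na) && na->nla_len >= sizeof(*na) &&
	       (int)na->nla_len <= buf_len) {
		printf("type=%u len=%u\n", na->nla_type, na->nla_len);
		/* advance past the attribute, including alignment padding */
		buf_len -= NLA_ALIGN(na->nla_len);
		na = (const struct nlattr *)((const char *)na +
					     NLA_ALIGN(na->nla_len));
	}
}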
3.3.2 Delay Accounting Performance
The delay accounting code was subjected to several benchmarks to measure
the overhead of the feature.
Results Highlights
• Configuring delay accounting adds < 0.5% overhead in most cases and
even reduces overhead in some cases
• Enabling delay accounting has similar results with a maximum over-
head of 1.2% for hackbench, most other overheads < 1% and reduction
in overhead, in some cases
The results were collected for the following three configurations:

Base          Vanilla 2.6.16-rc6 kernel without any patches applied
patch         Delay accounting configured but not enabled at boot
patch+enable  Delay accounting enabled at boot but no stats read

Table 3.3: Configurations used for performance analysis
Host       OS             Mhz   null  null  stat  open  selct  sig   sig   fork  exec  sh
                                call  I/O         clos  TCP    inst  hndl  proc  proc  proc
base       Linux 2.6.16-  2783  0.17  0.33  5.17  6.49  13.4   0.64  2.61  146.  610.  9376
+patch     Linux 2.6.16-  2781  0.17  0.32  4.75  5.85  13.0   0.64  2.62  145.  628.  9393
+patch+en  Linux 2.6.16-  2784  0.17  0.32  4.71  6.14  13.4   0.64  2.60  150.  616.  9402

Table 3.4: Lmbench results, Processor, Processes - times in microseconds -
smaller is better
               %Overhead   Time
Base           0           12.468
+patch         0.4%        12.523
+patch+enable  1.2%        12.622

Table 3.5: Hackbench results, 200 groups, using sockets. Elapsed time in
seconds, lower is better

               %Overhead   Time
Base           0           195.776
+patch         0.2%        196.246
+patch+enable  0.3%        196.282

Table 3.6: Kernbench results, average of 5 iterations. Elapsed time in
seconds, lower is better
Host       OS         2p/0K  2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
                      ctxsw  ctxsw   ctxsw   ctxsw   ctxsw   ctxsw    ctxsw
base       Linux 2.6  4.340  4.9600  7.3300  6.5700  30.3    10.4     36.0
+patch     Linux 2.6  4.390  4.9800  7.3100  6.5900  29.7    9.62000  35.8
+patch+en  Linux 2.6  4.560  5.0800  7.2400  5.6900  22.7    10.3     33.8

Table 3.7: Context switching - times in microseconds - smaller is better
Host       OS         2p/0K  Pipe  AF    UDP   RPC/  TCP   RPC/  TCP
                      ctxsw        UNIX        UDP         TCP   conn
base       Linux-2.6  4.340  15.9  12.2  18.3  24.9  21.5  29.1  45.3
+patch     Linux-2.6  4.390  15.7  11.8  18.6  22.2  22.0  29.1  44.8
+patch+en  Linux-2.6  4.560  15.6  12.1  18.9  25.3  21.9  27.1  45.1

Table 3.8: *Local* Communication latencies in microseconds - smaller is
better
Host       OS         0K File         10K File        Mmap     Prot   Page
                      Create  Delete  Create  Delete  Latency  Fault  Fault
base       Linux-2.6  39.8    58.0    112.0   82.6    8417.0   0.838  2.00000
+patch     Linux-2.6  39.6    58.2    111.0   82.3    8392.0   0.864  2.00000
+patch+en  Linux-2.6  39.6    59.1    112.8   83.2    8308.0   0.821  2.00000

Table 3.9: File & VM system latencies in microseconds - smaller is better
Chapter 4
Infrastructure
A very important aspect of resource management is the infrastructure pro-
vided to its users and programmers.
4.1 Controllers
A controller is a generic term used for a manager of a resource. The manager
for CPU is called the CPU controller. Controllers use the basic infrastructure
provided by the resource manager to manage a particular resource. Con-
trollers can actually be seen as plugins; it is possible to plug in a controller
for any resource.
Figure 4.1 shows the architectural view of the infrastructure component of
the resource manager. Each resource has an associated controller (as stated
earlier). This allows for evolutionary or organic development of resource
management. Once the infrastructure is in place, controllers can be added as
and when required.
4.2 User Interface
The user interface allows the user/system administrator to configure the
resource manager. The configuration options are mostly derived from the
requirements stated in section 2.4.
The user interface is usually one of
• File system interface
Figure 4.1: Infrastructure components
The file system interface allows the user to mount a special resource
management filesystem. On mounting this filesystem, a set of files are
exported. These files, when written to, change a particular attribute
of resource management. Some of the exported files are controller specific.
For example, there is usually a file named tasks; writing a task's pid to
this file adds that task to the current container, and reading it will
list all tasks that are grouped under the container. Each controller
usually exports a "statistics" file that shows the resource usage of the
container, its current allocation and many other interesting statistics.
Another file, when written to, controls the resources allocated to the
container; this is called the "control" file.
A file system interface easily lends itself to a hierarchy. It allows files
and directories to be created under each directory, thus allowing
us to create a full fledged tree. In this tree, the directories are nodes
and the individual files are equivalent to the data stored at each node. A
novel use of the hierarchy is to allow each subtree to sub-divide among
themselves the resources available at the root of the subtree.
In figure 4.2 the root container has 100% of the resources allocated to
it. Its child A has been allocated 50% of the resources. If the root had
100 units of resources to begin with, then A would have 50. With the
filesystem interface, by setting appropriate permissions and ownership,
any user could be given access to or control of a subtree of resources. In
this example, user U1 could own A and would then be free
to decide how to distribute the resources available to him/her. In this
example, the owner of A creates A1, A2 and A3 with 10%, 50% and
10% of A's resources. That would mean that A1 has 5 units, A2 has 25
units and A3 has 5 units; 15 units of A's resources are left unutilized.
In the example, irrespective of the resource type, the percentage
allocated to each group is the same: A1 has 5 units for CPU, memory,
IO or any other resource for which a resource controller is provided. A
hedged sketch of driving this interface from a program appears after
this list.
• System call interface
Another mechanism for providing a user interface is through system
calls. A system call can be written for each operation. The advantage
of this approach is that a new filesystem need not be developed for
resource management. The disadvantages are that
– Implementing a resource hierarchy is difficult
– Adding a system call requires changes across all supported archi-
tectures
– A new system call needs to be added for every operation
– A small change in the system call will break the ABI (Application
Binary Interface)
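As a hypothetical sketch of the filesystem interface described in the first bullet
above, the program below mounts the resource management filesystem, creates a
group and places the calling task into it. The filesystem type name ("container"),
the mount point and the directory layout are assumptions made purely for
illustration; the real names depend on the implementation in use.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mount.h>

int main(void)
{
	FILE *f;

	/* mount the resource management filesystem (type name assumed) */
	if (mount("none", "/containers", "container", 0, NULL) != 0)
		perror("mount");

	/* creating a directory creates a new resource group (assumed layout) */
	if (mkdir("/containers/A", 0755) != 0)
		perror("mkdir");

	/* writing a pid to the tasks file moves that task into the group */
	f = fopen("/containers/A/tasks", "w");
	if (f) {
		fprintf(f, "%d\n", getpid());
		fclose(f);
	}
	return 0;
}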
4.3 Implementing a controller
Both CKRM [8] and Containers [9] provide an infrastructure to develop new
controllers. The basic operations supported by the infrastructure are —
1. Registration of a controller. Along with the registration, a set of call-
back operations is provided. These are invoked on each registered controller.
A resource is controlled only once a controller is registered.
2. De-registration of a controller. After deregistration, resource control/
management on that resource is no longer supported.
4.3.1 Callbacks
The registered callbacks are typically invoked when (a sketch of such a
callback table appears after this list)
• A new task is added to the resource group
• A task is removed from the resource group
• The resource limits of the resource group are changed
• A resource group is created or deleted
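A hypothetical sketch of the kind of callback table a controller might register
is shown below; the structure and function names are invented for illustration,
and the real CKRM and container interfaces differ in detail.

struct resource_group;	/* a group of tasks, as defined by the infrastructure */
struct task_struct;

struct controller_callbacks {
	int  (*create)(struct resource_group *grp);
	void (*destroy)(struct resource_group *grp);
	int  (*attach_task)(struct resource_group *grp, struct task_struct *tsk);
	void (*detach_task)(struct resource_group *grp, struct task_struct *tsk);
	int  (*set_limit)(struct resource_group *grp, unsigned long long limit);
};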
4.4 Disadvantages of the Single Filesystem
Hierarchy
The filesystem hierarchy as defined in section 4.2 has the big disadvantage that
all resources associated with a hierarchy have to share the same limits. In
figure 4.2, all resources (CPU, memory, I/O bandwidth) for resource group A2
have 25 units of resources available.
It might be desirable (see [3]) to have a different hierarchy for each re-
source. The CPU resource’s hierarchy might be different from the memory
resource hierarchy. If this is possible, one could group tasks for each resource
group differently.
4.5 Implementation Alternatives
There are several alternatives for implementing the infrastructure; they
implement the various interface mechanisms discussed earlier.
4.5.1 Beancounters
Beancounters were developed and implemented by Alan Cox [10]. They were
later enhanced by the OpenVZ developers [15]. Beancounters implement a small
counter that is reference counted.
Figure 4.3 shows what a beancounter looks like. The bean counter con-
tains accounting information for the resource (how much of it is used, etc.) and
limit information, which dictates the hard and soft limits for the resource.
Figure 4.4 shows the architecture of the beancounter implementation.
The implementation uses a global hash table of all beancounters. Each bean-
counter has a unique identifier (a number). The number is input to the hash-
ing function, which determines where the beancounter will be placed. Each
task on the system has two members, fork_bc and exec_bc, which point to
the beancounters to which the task belongs. fork_bc points to the parent's
beancounter, whereas exec_bc points to the current beancounter which is
being charged for the resources that the task consumes. During fork, both
fork_bc and exec_bc are inherited from the parent.
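A hedged sketch of the data described above is shown below; the structure and
field names are illustrative only and do not match the actual beancounter patches.

#define NR_RESOURCES	8	/* illustrative number of controlled resources */

struct bean_counter {
	unsigned long	id;			/* unique identifier, hash key    */
	struct bean_counter *hash_next;		/* chain in the global hash table */
	unsigned long	usage[NR_RESOURCES];	/* accounting information         */
	unsigned long	soft_limit[NR_RESOURCES];
	unsigned long	hard_limit[NR_RESOURCES];
};

/* per-task members, as described in the text */
struct task_bc_info {
	struct bean_counter *fork_bc;	/* inherited from the parent             */
	struct bean_counter *exec_bc;	/* bean counter currently being charged  */
};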
Bean counters provide system calls to –
1. Set/Change limits of a bean counter
2. Create a new bean counter
3. Get information from the bean counter about its resource usage
Aggregated Beancounters
The limitation of bean counters is that they do not allow movement of tasks (as
discussed in section 2.4). To address this limitation, Balbir [11] developed
aggregated bean counters.
Figure 4.5 shows the architecture of aggregated bean counters. Each
aggregated bean counter, as the name suggests, is an aggregation of other
bean counters. Under this scheme, each task is now associated with one
bean counter. A group of bean counters form an aggregated bean counter.
Each aggregated bean counter like the bean counter has a set of limits and
resource usage statistics for the aggregation.
In the figure, task T1 is associated with bean counter B1 and aggregated
bean counter A1. The figure shows a task T5 of the task group moving. It
moves from aggregated bean counter A1 to A2 (the movement is shown by
dotted lines). When the task moves, its associated bean counter B5 also
moves to A2. Resource control and monitoring is now applied to the aggre-
gated bean counter, instead of the individual bean counter.
4.5.2 Class based Kernel Resource Management
CKRM (Class based Kernel Resource Management) [8] was proposed to the
Linux Kernel Community in 2003. CKRM provides an infrastructure very
similar to the AIX 5L Workload Manager [14]. CKRM consists of an infras-
tructure to write controllers and group tasks. CPU, Memory and Disk I/O
controllers were developed for CKRM.
Figure 4.6 shows the CKRM architecture. CKRM supports file system
based hierarchical configuration, control and monitoring of resource parame-
ters. CKRM also provides an additional component called the Classification
Engine. As stated previously, the container and resource hierarchy is inher-
ited by the child on fork. The classification engine classifies the task to a
class on fork or exec, based on a pre-defined set of rules set by the system
administrator. The system administrator could, for example, classify all cron
jobs to be automatically moved to a particular class.
4.5.3 Containers
The Containers implementation by Paul Menage [9] is based on an existing
linux kernel feature called CPUsets. CPUsets [12] are lightweight objects that
allow the system administrator to partition multiprocessor systems. Tasks
can then be attached to CPUsets and each task would run in its own CPUset
domain. CPUsets provide a pseudo hierarchical filesystem for configuration
and control.
Containers leverage the similarities and features provided by CPUsets.
Containers reuse the CPUset code and provide generic capabilities to
• Group unrelated tasks into a task group
• Register a controller
• Allow each controller to create custom files for control and monitoring
• Add and remove tasks from a container
Figure 4.7 shows the containers architecture. One differentiating factor
between containers and CKRM is the support for multiple hierarchies. As
shown in the figure, each task has a pointer to an array of pointers called
resource heads. Each resource (CPU, memory, etc.) has its own hierarchy
and a task can belong to a different container in each hierarchy. In the
figure, the task belongs to container C2 for CPU bandwidth management
and gets the resources allocated to C2. It belongs to container M1 and gets
the resources of container M1 for its memory management purposes.
This flexible grouping allows for better resource management as each task
might have different resource management needs.
Figure 4.2: Resource Distribution
Figure 4.3: A Bean Counter
Figure 4.4: Bean Counter Overview
Figure 4.5: Aggregated Bean Counters
Figure 4.6: CKRM Overview
Figure 4.7: Containers Overview
Chapter 5
Linux Internals
This chapter explores the internal working of the subsystems that we intend
to control and monitor for the purpose of resource management.
5.1 Virtual Memory
Linux’s virtual memory underwent a major overhaul between 2.4 and 2.6.
Even in 2.6 there are some major changes going into the VM subsystem. Fig-
ure 5.1 shows the top level organization of the VM. The Linux VM subsystem
has two basic data structures, node and zone, to identify and map all
available main memory. Each node is represented by a struct pglist_data
and each zone by a struct zone. The data structures are shown below
struct zone {
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
unsigned long lowmem_reserve[MAX_NR_ZONES];
int node;
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
struct per_cpu_pageset *pageset[NR_CPUS];
spinlock_t lock;
seqlock_t span_seqlock;
struct free_area free_area[MAX_ORDER];
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long nr_active;
unsigned long nr_inactive;
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
atomic_t reclaim_in_progress;
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
int prev_priority;
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
char *name;
} ____cacheline_internodealigned_in_smp;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_NR_ZONES];
int nr_zones;
struct page *node_mem_map;
struct bootmem_data *bdata;
spinlock_t node_size_lock;
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
} pg_data_t;
Figure 5.1: VM overview
A node represents a physical segment of memory; it is usually associated with
NUMA machines (some architectures, like x86_64, support emulation of nodes;
such nodes are called fake nodes). One can think of the node as a topology
mapper. A zone is used to map physical memory ranges; ZONE_DMA, for example,
typically ranges from 0-16MB. Zone setup is very architecture specific and node
setup is very machine specific. As shown in the figure, it is possible for multiple
zones to be present in one node and for one zone to span multiple nodes. Each
zone is associated with two LRU lists, the active and inactive list. The page
reclaim algorithm uses these two lists to free pages. Rik van Riel [16] provides
more details on the page reclamation algorithm in Linux.
5.1.1 Memory Allocation
Linux uses the buddy allocator for page allocation. The basic principles behind
buddy allocation are well described by Knuth in [17]. The buddy allocator is also
known as the power of 2 allocator. Free pages are kept in multiple lists;
each list holds blocks whose size is a power of 2 pages. When a new page is to
be allocated, if the request cannot be satisfied from the current list (because of
fragmentation), the request is passed on to the next list, and a block from the
list which satisfies the request is split into two. Each group of pages is associated
with a buddy group of pages; when the buddy group is freed, the pages are
coalesced to form a bigger grouping. These pages are then moved to a
higher (power of 2) list.
The biggest advantage of the buddy system is the simplicity of calculating page
buddies and splitting page groups. All these operations can be performed using
arithmetic bit manipulation operators quite efficiently.
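A small sketch of that bit arithmetic: for a free block of 2^order pages starting
at page frame index page_idx, the buddy block's index is obtained by flipping a
single bit, and the coalesced block starts at the lower of the two indices. This
mirrors the calculation performed inside the kernel's buddy allocator.

/* index of the buddy of the block of 2^order pages starting at page_idx */
static inline unsigned long buddy_idx(unsigned long page_idx, unsigned int order)
{
	return page_idx ^ (1UL << order);
}

/* starting index of the combined block after the two buddies are coalesced */
static inline unsigned long combined_idx(unsigned long page_idx, unsigned int order)
{
	return page_idx & ~(1UL << order);
}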
Each zone in Linux provides its own buddy allocator. The free_area field of
struct zone contains all the relevant data structures. At system boot-up time,
the kernel frees all pages; this puts all the pages in the buddy system and builds
the buddy information for the system.
Figure 5.2 shows the zone allocator. The Linux kernel optimizes allocation by
using a per-cpu cache of pages. The per_cpu_pageset is further split into hot and
cold caches. The hot cache indicates that the contents of the page are likely to
be in the CPU cache; the cold cache is the opposite. User mode and kernel tasks
usually allocate a page from the hot cache, whereas operations such as DMA are
likely to use the cold cache.
Kernel memory allocation also takes into account the context from which the
allocator is called. Table 5.1 documents the flags that can be passed to the
allocator to control its behaviour.
5.1.2 Process Address Space
The process address space typically consists of the following regions
• A memory map of the code called the text section
• A memory map of the initialized data called the data section
• A memory map of the zero page (used by the BSS section). BSS stands for
Block Started by Symbol. It contains uninitialized globals and uninitialized
static locals.
• A memory map of the zero page used for the user space stack
• Other text, data and bss sections of shared libraries loaded by the program
Figure 5.2: Zone Allocator
• Any other memory mapped files
• Any shared memory segments
• Any anonymous memory (anonymous memory refers to memory that is not
file backed)
A more detailed description of the process address space can be found in Robert
Love [2] and Bovet, et al. [1].
5.1.3 Memory Reclaim
The kernel keeps pages associated with user space in a special list called the LRU
list. The page reclaimer scans this list to free pages at regular intervals. The
Flag Name           Meaning
__GFP_WAIT          Can wait and reschedule
__GFP_HIGH          The kernel should access emergency pools to satisfy
                    the request if required
__GFP_IO            The kernel can start I/O to free some pages if required
__GFP_COLD          Request for a cache-cold page
__GFP_NOWARN        Don't warn in case the page allocation fails
__GFP_REPEAT        Retry the allocation on failure
__GFP_NOFAIL        The allocation cannot fail, keep retrying the allocation
__GFP_DMA           Allocate memory from the DMA zone
__GFP_DMA32         Allocate memory from the DMA32 zone
__GFP_HIGHMEM       Allocate memory from the highmem zone
__GFP_NORETRY       The kernel should not retry the allocation on failure
__GFP_COMP          Add compound page metadata
__GFP_ZERO          Zero out the contents of the page upon successful
                    allocation
__GFP_NOMEMALLOC    The kernel should not use its emergency pools to
                    satisfy the request on failure
__GFP_HARDWALL      Used by CPUsets
__GFP_THISNODE      Used by CPUsets

Table 5.1: Kernel Memory Allocation Contexts
page reclaimer is also activated under memory pressure to free some memory. The
reclaimer works only on user space memory. Kernel memory in Linux is pinned
and therefore cannot be swapped out or moved to disk.
The memory to be reclaimed can be broadly classified into mapped pages and
unmapped pages. Mapped pages are pages that are mapped into the page tables
of processes (pages brought into memory as a result of malloc(3) or the
mmap(2) system call are examples of mapped pages). Unmapped pages refer to
pages that are cached in memory from disk; these pages are not mapped
into the page table of any process (pages read using the read(2) system call are
examples of unmapped page cache pages). Linux maintains a single global LRU
per zone for both mapped and unmapped pages.
Linux collectively refers to all file backed memory as Page Cache. The page
cache is stored in the form of a radix tree.
Figure 5.3 shows a sample radix tree. The radix tree consists of a root and
other nodes. The nodes are organized hierarchically. Nodes use
RADIX_TREE_MAP_SHIFT (typically 6 bits) to indicate the position of the node in
the hierarchy. Each node uses an array of pointers, called slots. The first 6 bits
of the page index select the slot in the first level of the radix tree, the next 6
bits select the slot in the second level of the tree, and so on. The slot pointer is
empty if the corresponding bit is cleared in the bitmask of the node. Each node
in the tree can also be tagged, to indicate that the pages are dirty or under
writeback. The radix tree allows for efficient tagging of pages and bulk retrieval
and I/O write operations. More details on the radix tree implementation can be
found in the article on LWN by Corbet [18].
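As an illustration of the index arithmetic (assuming RADIX_TREE_MAP_SHIFT is 6,
as mentioned above), the slot used at a given level of the tree is just a 6-bit
field of the page index:

#define RADIX_TREE_MAP_SHIFT	6
#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE - 1)

/*
 * Slot number used at a node whose subtree covers the low 'shift' bits of
 * the index; the root uses the largest shift, the leaves use shift == 0.
 */
static inline unsigned int radix_tree_slot(unsigned long index, unsigned int shift)
{
	return (index >> shift) & RADIX_TREE_MAP_MASK;
}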
Pages that are mapped into the page table of the process are called mapped
pages. Pages obtained via malloc(3) and mmap(2) (on a file) calls are examples
of mapped pages. Mapped pages can be further categorized as anonymous and
file backed pages. Anonymous pages are the ones that are swapped out (provided
swapping is configured), while file backed pages are written back to disk and brought in
when required. Reclaiming mapped pages poses a problem. Mapped pages can be
shared (the same page can be mapped in the page tables of different processes or
at different locations in the same process). To solve the problem with reclaiming
shared pages, Linux uses the concept of rmap. rmap stands for reverse mapping.
Linux maintains a mapping from each page to all places where it is mapped. This
can be a one-to-many mapping. A normal (forward) mapping maps a PTE to a page.
Linux has three kinds of page mappings —
1. Anonymous Mapping
2. File Mapping
3. Non Linear Mapping
Each mapping requires a unique solution for reverse mapping. The first two
types of mapping are also referred to as linear mappings. Non-linear mappings
help applications that want to map several pieces of a file into different parts
of memory, as opposed to mmap(2), which maps the full file linearly into memory.
Non-linear mappings are especially useful for database applications which manage
several chunks of data (relational tables, metadata, etc). Non-linear mappings are
exposed through a new system call, remap_file_pages(2).
Linux uses a prio tree for maintaining reverse mapping of file mappings.
Figure 5.4 shows a sample organization of a prio tree (short for priority tree).
The section on the left shows the ranges (in terms of pages) of a file mapped by
different VMA's (a VMA represents a range of virtual addresses used by the
process). The prio tree is a mix of binary search trees and heap trees. The tree
uses three parameters: the [radix index, size, heap index]. The radix index
indicates the starting page offset of the file mapping (see the mmap(2) system
call parameters for more details). The size is the size of the mapping in pages,
and the heap index represents the last page of the mapping.
Reverse mapping for anonymous pages is done using anon_vma.
Figure 5.5 shows the organization of anonymous reverse mapping. Each page
has metadata associated with it; the metadata structure is called the page descrip-
tor. In the example shown in the figure, the page descriptor's mapping field points
to an anon_vma structure. The anon_vma structure has a linked list of all VMA's
that share that page. Thus, to obtain the reverse mapping, one should get to the
page descriptor and the associated mapping and walk through all the linked VMA's.
The forward mapping is done through the page table descriptor, stored in the
mm_struct associated with the VMA. Each process has one dedicated mm_struct.
An mm_struct represents the entire process address space of a process.
Linux uses a variant of the LRU algorithm for page reclaim. The algorithm is
shown in pseudo code below
1. Shrink the LRU in 5 passes
(a) Reclaim from inactive list only
(b) Reclaim from active list but don’t reclaim mapped
(c) 2nd pass of type 1
(d) Reclaim mapped (normal reclaim)
(e) 2nd pass of type 3
2. In each pass on the LRU do
3. For each zone do
(a) shrink the active list
i. Isolate LRU pages
ii. If the page is in use or if we are not allowed to reclaim mapped
pages, put this page back on the active list
iii. Otherwise, put this page on the inactive list
(b) shrink the inactive list
i. Isolate LRU pages
ii. Shrink page list, this step involves unmapping pages from their
mappings, swapping out anonymous pages and writing out dirty
pages back to disk.
iii. Put the unreclaimed pages back to active or inactive list depending
on the state of the page
Figure 5.3: Page Cache Radix Tree
Figure 5.4: Prio Tree
Figure 5.5: Anonymous Pages Reverse Mapping
Chapter 6
Memory Controller
Memory is a unique resource; unlike other resources such as CPU time and disk
I/O bandwidth, it is non-renewable and highly shareable. This brings in
additional complexity, since we now need to:
1. Track each page to its owner
2. Track the history of allocation and freeing
Figure 6.1 shows the flowchart of actions that are taken by the memory con-
troller when a page is allocated. The accounting subsystem checks to see if the
container is over its limit. If so, it tries to bring the container under control by
reclaiming pages belonging to the container. If the reclaim is successful, the allo-
cation proceeds normally. Otherwise, we select a task from the container to kill.
The selection algorithm uses the existing algorithm for selecting a bad task to kill
when the system is running out of memory.
Freeing pages is relatively straightforward. When a page is freed, the account-
ing for the container is updated.
6.1 Design Considerations
While designing the controller, the following constraints need to be kept in mind.
1. Feature Overhead
The overhead of resource control on users not using the feature should be
zero. This constraint in turn has several implications
(a) We cannot change the page descriptor. The page descriptor is a meta-
data structure that keeps information about a page in memory. It
includes information about what zone the page belongs to, what is
the state of the page (I/O, locked, dirty, etc), the LRU lists to which
the page belongs, etc. To implement a resource controller, we need
to change this data structure to add additional information about the
container it belongs to and a per container LRU list. The current page
data structure is shown below
struct page {
unsigned long flags;
atomic_t _count;
atomic_t _mapcount;
union {
struct {
unsigned long private;
struct address_space *mapping;
};
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
spinlock_t ptl;
#endif
};
pgoff_t index;
struct list_head lru;
#if defined(WANT_PAGE_VIRTUAL)
void *virtual;
#endif /* WANT_PAGE_VIRTUAL */
};
Adding a new field would imply that the size of the page descriptor
would go up. There is one page descriptor per page. On a system
with 8GB of RAM with a 4KB page size, there are 2097152 pages. An
addition of 4 bytes to the page descriptor would mean that the page
descriptors would use an additional 8MB of RAM. This situation is
not acceptable to users not using the containers feature.
Even if we decide to use a #ifdef around the new field we add, then the
distributors of Linux (RedHat, SuSE, etc) need to decide at compile
time whether or not they support this feature.
(b) There should be no change in functionality for non-users. Users who
decide to either compile out the feature or not use this feature should
not see any change in the functional behaviour of the system or any
performance hit due to changes in the core memory management logic
2. RSS Accounting
The definition of RSS in Linux was debated in a thread by Peter [19]. Im-
plementing a controller would require that we account for RSS as per its
definition
3. Shared Page Handling
It is important to track all shared pages and charge them appropriately.
Charging the wrong container for a shared page that is used more frequently
by other containers would be an unfair implementation.
6.2 Implementation
The container infrastructure has been used to implement the RSS controller. As
a first step, a controller is registered. Two files called memcontrol_limit and
memcontrol_usage are registered. These files are used to display the current RSS
usage of the container and to set the limit for maximum RSS use of the container.
The container subsystem provides a file system based interface for configuration
of controllers. Containers also support a hierarchy, which means that it is possible
to create several subsystems of control under one parent of the container and
this can be done recursively. For each directory that is created, the container
subsystem calls a create callback of the container. When a directory is deleted
or the container destroyed, the destroy callback is called. Each container has a
tasks file. Tasks can be attached to a container by writing the pid of the task to
this file. A container cannot be destroyed unless it is empty. When a container
is destroyed, it means that all tasks associated with the container are either
dead or have migrated to other containers.
Figure 6.2 shows the structural organization of the RSS controller. The ac-
counting information is maintained in a counter structure; this structure is shown
below
struct res_counter {
atomic_long_t usage; /* The current usage of the resource being */
/* counted */
atomic_long_t limit; /* The limit on the resource */
};
The counter consists of atomic fields represented by atomic_long_t. Two fields
representing the current usage and limit are stored.
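A hedged sketch of how charging against this counter might look at allocation
time is shown below; it mirrors the description in section 6.2.1 rather than
reproducing the exact controller code, and a real implementation would make the
limit check and the update atomic with respect to each other (for example under
the controller's lock).

#include <linux/errno.h>

static int res_counter_charge(struct res_counter *cnt, long nr_pages)
{
	long usage = atomic_long_read(&cnt->usage);
	long limit = atomic_long_read(&cnt->limit);

	if (usage + nr_pages > limit)
		return -ENOMEM;	/* over limit: caller must reclaim or fail */

	atomic_long_add(nr_pages, &cnt->usage);
	return 0;
}

static void res_counter_uncharge(struct res_counter *cnt, long nr_pages)
{
	atomic_long_sub(nr_pages, &cnt->usage);
}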
Each mm_struct and each container have a counter associated with them. The
per-container structure is shown below
struct memcontrol {
        struct container_subsys_state css;      /* links back to the owning container */
        struct res_counter counter;             /* RSS usage and limit */
        spinlock_t lock;                        /* protects the LRU lists below */
        struct list_head active_list;           /* per-container active LRU list */
        struct list_head inactive_list;         /* per-container inactive LRU list */
};
The memcontrol data structure records the container it belongs to in the css
field and the usage and limit information in the counter field. A spinlock
field called lock provides synchronized access to the list fields of this
structure. We maintain a per-container LRU using two linked lists,
active_list and inactive_list.
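Given a container_subsys_state pointer handed out by the container
infrastructure, the controller can recover its memcontrol structure with
container_of(); the helper name below is assumed for illustration.

static inline struct memcontrol *memcontrol_from_css(
                struct container_subsys_state *css)
{
        /* css is embedded in struct memcontrol, so container_of() suffices */
        return container_of(css, struct memcontrol, css);
}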
The changes to mm_struct are shown below
#ifdef CONFIG_CONTAINER_MEMCONTROL
        /*
         * Each mm_struct's usage sums up into its container's counter.
         * We can extend this such that VMA counters sum up into this
         * counter.
         */
        struct res_counter *counter;
        struct container *container;
        rwlock_t container_lock;
#endif /* CONFIG_CONTAINER_MEMCONTROL */
The page descriptor was modified to add a pointer to the page metadata. The
page metadata structure is shown below
struct page_container {
        struct page *page;              /* the page this metadata describes */
        struct rss_container *cnt;      /* the container charged for the page */
        struct list_head list;          /* link into the per-container LRU */
};
Each page metadata structure holds pointers to the page it describes and to the
container that owns the page. It also has a list head, so that it can be added
to the per-container LRU list.
6.2.1 Accounting
The memory controller uses the page fault handler for accounting RSS pages.
Every mapped page in Linux goes through the rmap subsystem; on each mapping
of the page into a page table, the mapcount field of the page descriptor is
incremented. Whenever a page fault occurs, before the page can be mapped into
the task, the container's limit and usage fields are checked to see if the container
is over its limit. If the container is not over its limit, the page metadata structure
(page_container) is allocated and initialized. The page metadata structure is
then added to the per-container LRU active list.
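A simplified sketch of this charging step is shown below. The function name
memcontrol_charge, the lookup helper mem_from_mm() and the reduced error handling
are assumptions for illustration; the real patches differ in detail and attempt
reclaim before failing the fault. The res_counter helpers are the ones sketched
earlier.

static int memcontrol_charge(struct page *page, struct mm_struct *mm,
                             gfp_t gfp_mask)
{
        struct memcontrol *mem = mem_from_mm(mm);       /* assumed lookup helper */
        struct page_container *pc;

        /* refuse (or, in the real controller, reclaim first) when the
         * charge would push the container over its limit */
        if (res_counter_charge(&mem->counter, 1))
                return -ENOMEM;

        pc = kzalloc(sizeof(*pc), gfp_mask);
        if (!pc) {
                res_counter_uncharge(&mem->counter, 1);
                return -ENOMEM;
        }
        pc->page = page;

        spin_lock(&mem->lock);
        list_add(&pc->list, &mem->active_list); /* new pages start on the active list */
        spin_unlock(&mem->lock);
        return 0;
}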
6.2.2 Shared Pages Accounting
One of the most difficult problems is accounting for shared pages. A shared page
is mapped into the page tables of more than one process. The technique used in
this implementation is the first touch approach: the container that first
touches the page is charged for it, and the page is added to the LRU of that
container. If, over a period of time, other containers use this page more than
the container that brought it in, the page will eventually be reclaimed from the
original container and move to the container that uses it most frequently.
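The first touch rule can be expressed in a few lines. In the sketch below,
page_get_container() is an assumed stand-in for reading the metadata pointer
added to the page descriptor, and memcontrol_charge() is the charge sketch
shown earlier.

static int memcontrol_charge_shared(struct page *page, struct mm_struct *mm,
                                    gfp_t gfp_mask)
{
        /*
         * First-touch rule: if the page already carries container metadata,
         * some container has already been charged for it and later mappers
         * are not charged again.
         */
        if (page_get_container(page))
                return 0;

        return memcontrol_charge(page, mm, gfp_mask);
}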
6.2.3 Per Container Reclaim
As stated earlier, when a container is over its limit, we reclaim pages from that
container. Figure 6.3 shows the organization of the active and the inactive list.
The pages at the end of the inactive list are prime candidates for reclamation.
The pages at the head of the active list are the most active pages. When it is
time to reclaim pages from the container, the page reclaimer takes the following
actions —
• It isolates container pages.
• In the first pass, pages are moved from the active list to inactive list. If the
page is in use or actively referenced, it is moved back to the active list.
• In the second pass, pages are isolated from the inactive list and passed on
to the page shrinker. The page shrinker, takes the following actions —
– Unmaps all the mappings of this page. This helps free shared pages
that may be mapped into the page tables of several processes. This
unmapping is done using rmap.
– If the page is dirty, the page is written back to disk (if writeback is permitted).
If the page is in use or actively referenced, the page is not freed; it is added
back to the active or inactive list depending on its state. A condensed sketch
of this two-pass scan is shown below.
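The sketch is deliberately naive: page_recently_used() and shrink_page() are
stand-ins for the real reference checks and the page shrinker, and the locking
is simplified (the real code isolates pages and drops the lock before unmapping
or writing back).

static void memcontrol_reclaim(struct memcontrol *mem)
{
        struct page_container *pc, *tmp;

        spin_lock(&mem->lock);

        /* pass 1: demote pages that were not referenced recently from the
         * active list to the inactive list */
        list_for_each_entry_safe(pc, tmp, &mem->active_list, list) {
                if (!page_recently_used(pc->page))      /* assumed helper */
                        list_move_tail(&pc->list, &mem->inactive_list);
        }

        /* pass 2: hand inactive pages to the shrinker */
        list_for_each_entry_safe(pc, tmp, &mem->inactive_list, list) {
                if (page_recently_used(pc->page)) {
                        /* still in use: promote it back to the active list */
                        list_move(&pc->list, &mem->active_list);
                        continue;
                }
                shrink_page(pc->page);  /* unmap via rmap, write back, free */
        }

        spin_unlock(&mem->lock);
}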
6.2.4 LRU Behaviour
Linux uses a variant of the LRU algorithm. It maintains two lists, the active list
and inactive list as stated earlier. Figure 6.4 shows the page state transition
diagram for a page when it is referenced.
In the per-container LRU, we move pages across the active and inactive lists as
discussed above. We also move pages across these lists when the page is marked
as referenced in the global LRU.
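A minimal sketch of the container-side hook is shown below; the function name
is assumed, and in practice the hook would be driven from the same place where
the global LRU learns of the reference.

static void memcontrol_mark_accessed(struct memcontrol *mem,
                                     struct page_container *pc)
{
        spin_lock(&mem->lock);
        /* a referenced page on the inactive list is promoted to active,
         * mirroring what mark_page_accessed() does for the global LRU */
        list_move(&pc->list, &mem->active_list);
        spin_unlock(&mem->lock);
}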
6.2.5 OOM
When the container fails to reclaim pages from the container LRU and the con-
tainer is over its limit, the container selects a bad process and kills it to free up
its memory. A bad process is selected based on the following criteria —
• The total VM size of the process
• The total VM size of the children it has forked
• The CPU user and system time of the process
• Its importance to the system; the init process is never killed, and system
administrator (root) processes are allowed to consume more mem-
ory
• A configurable parameter called oomkilladj
• The nice value of the process
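A toy version of such a heuristic is sketched below. It is deliberately
simplified: the real kernel function (badness() in mm/oom_kill.c at the time)
weighs the same inputs with different arithmetic, and the handling of children,
root processes, oomkilladj and nice is elided here.

static unsigned long container_badness(struct task_struct *p)
{
        unsigned long points;

        if (!p->mm)
                return 0;               /* tasks without an mm are never killed */

        points = p->mm->total_vm;       /* start with the task's own VM size */

        /*
         * Long-running tasks (large user plus system time) are assumed to be
         * more valuable; they lose points and are killed later.  Children's
         * VM size, oomkilladj and nice adjustments are omitted in this toy.
         */
        points /= int_sqrt((unsigned long)(p->utime + p->stime) + 1);

        return points;
}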
6.2.6 Freeing A Page
Freeing a page is much simpler. Since freeing a page reduces the RSS, it can
never push the container over its limit, so no limit check is required. When a
page is unmapped, we remove it from the per-container LRU list and update the
container's usage to reflect the change.
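A sketch of the corresponding uncharge path, using the structures and helpers
introduced earlier, is shown below; as before, the names are illustrative.

static void memcontrol_uncharge(struct memcontrol *mem,
                                struct page_container *pc)
{
        spin_lock(&mem->lock);
        list_del(&pc->list);                    /* leave the per-container LRU */
        spin_unlock(&mem->lock);

        res_counter_uncharge(&mem->counter, 1); /* usage drops by one page */
        kfree(pc);                              /* release the page metadata */
}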
Figure 6.1: Memory Allocation From Within Container Flowchart
Figure 6.2: RSS Controller Overview
Figure 6.3: Per Container LRU List
Figure 6.4: Page State Transition Diagram (transitions between the Inactive
Unreferenced, Inactive Referenced, Active Referenced and Active Unreferenced states)
Chapter 7
Results
The results of running the memory controller are shown and analysed. These tests
were run on a Linux on Power(TM) box with 6GB of RAM and 2 CPUs. Each of the
CPUs was threaded, so in effect the machine behaved like a 4-CPU system.
Basic testing was done for different types of pages, anonymous and file mapped.
For anonymous pages, malloc(3) was used; for file mapped pages, mmap(2) was used.
In the malloc case, the test program allocated 1GB of RAM and touched all of it
by writing to it. In the mmap case, the test program mapped a 1GB file and
wrote to it (touching all the pages).
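The test programs were simple; a sketch of the kind of program used is shown
below. The file name and the absence of error handling are simplifications,
and the file is assumed to be at least 1GB in size.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (1UL << 30)                /* 1GB */

int main(void)
{
        /* malloc(3) variant: anonymous memory */
        char *buf = malloc(SIZE);
        if (buf) {
                memset(buf, 1, SIZE);   /* touch every page by writing to it */
                free(buf);
        }

        /* mmap(2) variant: file-backed memory */
        int fd = open("testfile", O_RDWR);
        char *map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        if (map != MAP_FAILED) {
                memset(map, 1, SIZE);   /* touch every page by writing to it */
                munmap(map, SIZE);
        }
        close(fd);
        return 0;
}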
7.1 System Time Test
Figure 7.1 shows the system time utilization for both mmap(2) and malloc(3) calls.
The point "0" on the x-axis indicates that the memory available to the container
was unconstrained, i.e. all the memory was made available. The other points on
the x-axis show the limit set on the container and the variation of the system
time used by the test case under the specified limit. There are two interesting
observations:
• At certain points, the system time used for a smaller container is less than
that of a larger container
• mmap tests run faster than malloc tests.
7.2 %CPU Time Test
Figure 7.2 shows the %CPU utilization for both mmap(2) and malloc(3) calls.
Figure 7.1: System Time Variation With Varying Container Sizes (system time for
accessing 1GB vs. memory available in the container, for malloc(3), mmap(2) and
their unconstrained references)
The point "0" on the x-axis indicates that the memory available to the container
was unconstrained, i.e. all the memory was made available. The other points on
the x-axis show the limit set on the container and the variation of the %CPU
used by the test case under the specified limit. There are two interesting
observations:
• At certain points, the %CPU used for a smaller container is less than that
of a larger container
• mmap tests consume more CPU than malloc tests.
7.3 Explanation Of Results
The tests show that the system time and %CPU used for a container limited to
1GB are about the same as for one limited to 100MB. The results also show a
decline in both system time and %CPU used as the size of the container
approaches 500MB.
Figure 7.2: %CPU Time Variation With Varying Container Sizes (%CPU utilization
for accessing 1GB vs. memory available in the container, for malloc(3), mmap(2)
and their unconstrained references)
One possible explanation for this observation is that when the LRU size is close
to 1GB, there is too much work to do scanning the list, while when the container
size is close to 100MB the list is small but there are too few pages to reclaim,
so additional work is needed to reclaim pages.
7.4 Minor Fault Tests
Figure 7.3 shows the number of minor faults incurred for both mmap(2) and
malloc(3) calls. The point "0" on the x-axis indicates that the memory available
to the container was unconstrained, i.e. all the memory was made available. The
other points on the x-axis show the limit set on the container and the variation
of the minor fault count incurred by the test case under the specified limit.
The observation is
• The minor fault rate for both mmap and malloc is constant
Figure 7.3: Minor Page Fault Variation With Varying Container Sizes (minor page
faults for accessing 1GB vs. memory available in the container, for malloc(3),
mmap(2) and their unconstrained references)
7.5 Major Fault Tests
Figure 7.4 shows the number of major faults incurred for both mmap(2) and
malloc(3) calls. The point "0" on the x-axis indicates that the memory available
to the container was unconstrained, i.e. all the memory was made available. The
other points on the x-axis show the limit set on the container and the variation
of the major fault count incurred by the test case under the specified limit.
The observations are
• malloc has fewer major faults than mmap
• The major fault rate for mmap varies
Figure 7.4: Major Page Fault Variation With Varying Container Sizes (major page
faults for accessing 1GB vs. memory available in the container, for mmap(2) and
malloc(3))
7.6 Explanation of Results
Anonymous memory is backed by the swap cache and file-backed memory by the
page cache. In this test run, these were not limited. Under system memory
pressure, page cache pages are reclaimed before mapped pages. When the pages
allocated by malloc were reclaimed, they were pushed onto the swap cache; when
those pages were referenced again, they were brought back quickly into the
container from the swap cache. In the case of mmap, it is quite possible that
the page cache pages were reclaimed from the container and then from the global
page cache. Thus, on some iterations, the pages had to be fetched from disk
(leading to major page faults).
Chapter 8
Summary
Resource Control is going to be a critical requirement in enterprises. Several
operating systems already support this feature, along with a good set of tools
for managing it. With the growth and spread of virtualization, system
administrators will want to differentiate between instances based on the
priority of the hosted container. In addition, not all applications are treated
equally: some applications, or parts of applications, will be prioritized based
on their criticality to the business.
The implementation of resource management in Linux has been discussed ex-
tensively. Some parts of the code (such as delay accounting) are already available.
The other parts are being developed. The wide choice of candidate implementations
for the resource control infrastructure made it possible to discuss the selection
of the most promising one. As the infrastructure stabilizes, controllers such as
the CPU and memory controllers are being developed on top of it.
In this dissertation, we looked at the scope of work and split it into phases
• Infrastructure
• Accounting
• Feedback
• Control
Each of the phases and the work done in it has been examined in detail. Tests
were run and data collected. The data shows some interesting results on the
impact of the container limit on application performance. The results show
that limiting the container to close to 50% of its full memory requirement
yields the best results.
Chapter 9
Directions For Future Work
There is much more work to do beyond the RSS controller; for example, the
memory controller would eventually require
• A Page Cache controller
• A mlock(2) controller
• Kernel Accounting and Control
Patches for the page cache controller were developed by Vaidyanathan [21].
Apart from the memory controller, the CPU controller is under active develop-
ment. Balbir and Menage [20] developed a sample CPU accounting subsystem for
the containers patchset. Once the accounting subsystem is accepted, a full-fledged
controller will be developed.
Beyond the memory and CPU controllers, users have requested additional
controllers such as
1. Disk I/O bandwidth controller
2. Fork rate controller
3. Number of open files controller
Bibliography
[1] Bovet, Daniel P. and Cesati, Marco. Understanding the Linux Kernel. O'Reilly, November 2005.
[2] Love, Robert. Linux Kernel Development. Novell Press, January 2005.
[3] Vaddagiri, Srivatsa. "[RFC] Resource Management - Infrastructure choices". http://lkml.org/lkml/2006/10/30/49. October 2006.
[4] Singh, Balbir and Emelianov, Pavel. "Containers/Guarantees for resources". http://wiki.openvz.org/Containers/Guarantees_for_resources. November 2006.
[5] Singh, Balbir and Nagar, Shailabh. "Delay accounting patches". http://lwn.net/Articles/182133/. May 2006.
[6] Nagar, Shailabh. "Delay accounting performance". http://lkml.org/lkml/2006/3/23/141. March 2006.
[7] Moore, Paul and Hadi, Jamal. "Genetlink documentation". http://lwn.net/Articles/208755/. November 2006.
[8] Seetharaman, Chandra. "Class Based Resource Management". http://ckrm.sourceforge.net/. December 2006.
[9] Menage, Paul. "Generic Process Containers (V6)". http://lkml.org/lkml/2006/12/22/112. December 2006.
[10] Linux Weekly News. "Resource Beancounters". http://lwn.net/Articles/197433/. August 2006.
[11] Singh, Balbir. "Aggregated Beancounters". http://lwn.net/Articles/199938/. September 2006.
[12] Derr, Simon. "CPUSets Documentation". http://lwn.net/Articles/127936/. September 2006.
[13] Nagar, Shailabh; Franke, Hubertus; Choi, Jonghyuk; Seetharaman, Chandra; Kaplan, Scott; Singhvi, Nivedita; Kashyap, Vivek; Kravetz, Mike. "Class-based Prioritized Resource Control in Linux". Ottawa Linux Symposium Proceedings, 2003.
[14] Castro, Sofia; Tezulas, Nurcan; Yu, BooSeon; Berg, Jürgen; Kim, HoHyeon; Gfroerer, Diana. AIX 5L Workload Manager. IBM Corporation (Redbook), 2001.
[15] Derr, Simon. "Server Virtualization Open Source Project". http://lwn.net/Articles/127936/. September 2006.
[16] Riel, Rik van. "Towards an O(1) VM". Ottawa Linux Symposium Proceedings, 2003.
[17] Knuth, Donald E. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Massachusetts: Addison-Wesley, 1997.
[18] Corbet, Jonathan. "Trees I: Radix Trees". http://lwn.net/Articles/127936/. September 2006.
[19] Zijlstra, Peter. "RSS accounting". http://lkml.org/lkml/2006/10/10/130. October 2006.
[20] Singh, Balbir and Menage, Paul. "Simple CPU accounting container subsystem". http://lkml.org/lkml/2007/2/12/90. February 2007.
[21] Srinivasan, Vaidyanathan. "Containers: Page Cache Accounting and Control subsystem (v1)". http://lwn.net/Articles/224815/.
