Memory management explained

An explanation of the virtual memory model used in modern operating systems, from a conceptual and practical perspective.

Introduction

I wrote this article back in 2004, in reply to a question placed by someone at Security Forums Dot Com (now known as WindowSecurity.com). All of the concepts are still very much relevant today, but some of the references may seem a bit anachronistic (e.g. mentions of Windows NT, or the Intel Pentium 4).

The original question described two sample C++ programs. Program 1 would declare an integer variable, store a value in it, print the variable's address and pause before leaving. Program 2 would declare a pointer to the same address (hard-coded in the source for simplicity), read its value and print it.

The poster would run program 1, and while it was paused, he would edit program 2's source code to include the proper address, compile the program and execute it. He expected program 2 to print out the value that program 1 had stored in its variable. However, what would happen is that program 2 would print some completely unrelated value.

Virtual memory vs physical memory

What the poster was witnessing here is Virtual Memory Management at its best. You see, in virtual memory capable OSes (like Win9x/NT and *nix), each process's address space is completely separate from the others. Well, not completely; there are some exceptions.

When the OS launches a given process that runs with a flat memory model (that means most 32 bit programs on an x86 architecture), it creates a linear address space for it (also known as virtual address space). This will be the linear memory range that the new process can access. Now, what does this linear (or virtual) mean?

It means that the addresses you access within your program (e.g. 0x00401234 or whatever) are not real physical addresses. That is, when you do something like:

*((int *)0x12345678) = 9;

you're not really writing to the physical address 0x12345678. You're writing to your virtual address 0x12345678, which is mapped into some physical address by the OS.

Ok, so you may be wondering, what's the point of all this? The point is, among other things, that each process has its own independent address space - and as such, it can use whichever addresses it wants for its internal variables, without having to worry if they are already used by some other process. Can you imagine what it would be like creating a program to run on a multitasking system without any sort of abstraction layer over the physical RAM? You would have to know each and every address used by each and every process the system is ever going to run, to make sure not to use any of them. Either that, or risk having two completely independent processes garble each other because their programmers (or compilers) happened to use the same addresses for their variables. That's kind of hard on the programmers...

Getting down to it

Pages

Now, how does it all really work? Ok, let's get down and dirty with the memory managers. First of all, you need to understand a simple but important concept, which is the concept of paged memory. A page is nothing more than a small consecutive portion of memory. On Intel x86 systems, each page is 4 KBytes (= 4096 bytes). Paged memory is a way of managing memory (another way is segmented memory, which is somewhat harder on the programmer, although simpler to implement at the OS level). Nowadays, the most used by far is paged memory, as it's far more practical for the programmer and much more flexible. Windows supports the two models, as the underlying 32 bit CPUs support them as well (they had to be backwards compatible with DOS, which only used segmented memory management). Basically what Windows does so it doesn't have to bother with the different segments is set the segment registers to include the whole memory range, so they all overlap and mean the same thing.

Ok, enough History. With paged memory, the OS divides your physical memory in pages, just as it divides your physical hard disk in clusters. Why? Well, it's just not practical dealing with individual bytes when you have, say, 256 millions of them. Also it would take up more space to store each byte's individual properties than the bytes we're supposed to be managing (the structures needed to manage 256 million bytes would take up more than 256 million bytes). So, we have the physical memory divided into 4Kb pages, and that's what we manage. Now, we need a way to access individual bytes within a page. We do that with the displacement (or offset) field. Fancy names aside, all of this means: when you access physical address 0x12345678, you're actually accessing byte 0x678 within page 0x12345. Get it? This is obviously for a 32-bit system with 4Kb pages. How did I know which portion of the address was the page and which portion was the displacement? Easy. We're talking about a 4Kb page system, so if each page is 4096 bytes long, how many bits do we need to index it? 12 (because 2¹² = 4096). So, the 12 rightmost bits in an address represent the displacement inside the page. The rest represent the page itself.

Page table

So, now we have a way of dividing addresses memory into pages and back. We're still talking about physical pages, though. Actual RAM addresses. Time to implement the abstraction layer. For that we'll need a little something called a page table. This is just a table of page table entries (if you pardon the redundancy), but I'll go into more detail about that later. What we want to achieve is to have each process with its own page table, with that table holding the mapping between the process's virtual pages and the actual physical pages where they are located.

So, what exactly is a page table entry, or PTE? Basically, it's a struct with several fields. Its index in the page table (PT) indicates the virtual page it represents (so PT[3] would give us the PTE for virtual page 3, just as PT[0x00400] would give us the PTE for virtual page 0x00400). Actually it's a bit more complicated than that, since we can have up to 0xFFFFF pages and we don't want to be indexing a table with 1 million entries. What x86 CPUs (and consequently, Windows) do is have a table of tables, in a way similar to how ext2 inodes handle large files with many blocks by adding layers of indirection. But moving on, we have a PTE for each virtual page that the process can access. Within that PTE are stored several fields, including the actual physical page that the virtual page maps to, its permissions bits (readable, writable, accessible from user mode, etc.), the present bit (which tells the OS whether the page is actually present in physical memory or it's been swapped out to disk), the dirty bit (which tells the OS if the page's contents have changed since the last time it was written to disk), and some reference to the disk sectors in which its contents were stored (in case it has been swapped out).

Basically speaking, as it is the PT is checked for each and every memory read/write every process does. This is done by the CPU itself, assuming it supports paged memory management, otherwise forget about all this (the x86 supports this since the 386). Naturally, this isn't ideal, since it's very slow. We'll be adding 1 or 2 memory reads for each memory read, which means we're effectively multiplying the number of memory reads your process does by at least 2. So a solution came, and it was in the form of an on-die cache, better known as the Translation Lookaside Buffer or TLB. This is a special cache within the CPU that stores the last PTEs it has encountered. Its size depends on the specific CPU model; the Intel Pentium 4 for example has 2 TLB's (one for instruction addresses and one for data addresses), holding 128 entries each. With this we drastically reduce the need for logistical memory reads, since by the principle of reference locality this should cover a good proportion of the cases.

An example

Let's take a look at what happens when a process tries to read/write from a given virtual address (remember, the only thing a process knows about are virtual addresses). Say we have the following example:

*((int *)0x00401234) = 5;

Finding the PTE

When the CPU encounters this instruction, it will have to find out the actual physical address where it has to write. To do that, it will take the virtual address 0x00401234 and find out to which virtual page it belongs. Assuming this is a 4Kb page machine (like the x86), the virtual page will be 0x00401 (just strip away the least significant 12 bits, remember). The displacement within the page is 0x234.

Next, the CPU needs to find out where the virtual page 0x00401 is stored. For that, it will check its TLB and see if it finds the information there (let's assume it doesn't, to make things interesting). Since the information is not in the TLB, the CPU will have to go check the page table of the current process. It gets the base address of the page directory of the current process from one of the CPU registers (CR3) and goes there to check which page table describes this particular page (this is the whole table of tables concept, as it's not practical having to index a table with 1 million entries).

The details of how the CPU reaches the PTE aren't important. The point is it will go search the page table for an entry describing virtual page 0x00401. Of course, I use the word "search" in a broad sense; this isn't an actual linear search, but rather a matter of calculating the index of the PTE for this virtual page, if it exists. If no such PTE exists, the CPU will throw an unresolvable page fault, as the process is trying to access an address that doesn't exist in its context. This will cause the operating system to kill the process, with a nice "This program has committed an access violation" message in the case of Windows, or a SIGSEGV in the case of *nix.

Using the PTE

Let's assume that a PTE does exist for page 0x00401, meaning that it has been allocated by the operating system. Having just found the PTE, the CPU will first check its permissions bits, to see if we have write access. Let's assume that we do. Next, the CPU will check the Present bit, to see if the page is currently loaded in physical memory or if we have to load it from disk. If the Present bit is 0 (meaning the page isn't in physical memory), the CPU will generate a page fault, bringing into action the operating system's memory manager. The memory manager will read the PTE, find out in which disk sectors the page is stored, read the sectors from disk, put the data somewhere in physical memory, and update the PTE to contain the new physical page where the data is stored. Lastly, the memory manager will update the Present bit to 1.

So, now the virtual page is present in memory. Next, the CPU will read the physical page base address from the PTE. Let's assume this virtual page is stored in physical page 0x01F20. Now we have everything we need: a physical base address, and a displacement (0x234, from the original virtual address). All the CPU has to do is join the two and it will have the physical address where it has to write: 0x01F20234. From then on, it just stores the 5 we originally wanted to store in there, and that's it.

Moving away from it

All of this is independent of whether you are running as superuser or not; this is just memory management and the way it works. You could tap into the address space of other processes from kernel space (after all, the memory manager itself has to do it). For example, from a device driver - you either code it by hand (read CR3, do the math to find the PTE, read from the PTE, etc) or use one of the ring 0 memory manager functions that will allow you to do such things. In either case you don't need to go so deep for something as simple as inter process communication (IPC); just use shared memory. For Windows, check out the MS SDK reference for the CreateFileMapping and MapViewOfFile APIs. Or, if you want to peek into the memory of a process that you didn't code yourself (and as such doesn't open your file mapped object), check out the ReadProcessMemory and WriteProcessMemory APIs.

Regarding shared memory, as you can probably guess by now, all it entails is simply having virtual addresses on different processes pointing to the same physical pages. The virtual addresses for each process don't even have to be the same, what matters is the physical page.

There are more interesting things to learn about, such as the copy on write bit (not implemented in Win9x) and the much in vogue now execute permission. Had the Intel architecture actually implemented a verification of the execute permission bit back when they started doing protected memory, a whole set of buffer overflow related attacks would simply not be an issue... The stack has no business being executable for most programs. A normal buffer overflow attempt would just cause an access violation as the CPU tried to execute code from a non-executable page (such as the stack or data). At least AMD was smart enough to include it in their new x86-64 CPU. Of course this still doesn't fix everything: permissions can be changed, for example by using mprotect, or VirtualProtect/VirtualProtectEx, and the attacker could play with the return address on the stack to try to reach one of those system calls.