Part #2 - A lot about Paging, a little about Virtualization
This article is based on extensive research, and I am not an expert in this field. My only intention was to understand how virtualization works and to share it. Please tell me about any mistake you discover; I will correct it as quickly as possible.
So far we have talked about the processor's part of virtualization. But what about the memory? If a guest could do funny things in memory, it's game over, isn't it? Of course it is, so in order to explain how the VMM keeps control over the memory, we'll have to take a look at how memory is organized.
Most explanations describe memory management linearly: you have cells which can store bytes, and those cells are indexed from 0x00000000-0xFFFFFFFF. Those indexes are the (32-bit) addresses used by the CPU.
```
 ______________
|              |> [0x00000000]
|              |
|   ........   |
|              |
|______________|> [0xFFFFFFFF]
```
But this is no longer true for most computers (small embedded systems sometimes address memory like that to save resources).
Instead we have a memory management unit (MMU), which sits in your CPU and translates virtual addresses into physical ones ("virtual" doesn't refer to virtualization here; it's just called that).
Paging is a pretty simple concept, although it takes some time until you get the calculation part right and it feels somewhat familiar.
The idea is that we take 4KB of physical memory and store the physical base address of this 4KB in a table entry.
The 4KB are called a page. The table we store it in is called Page Table (PT).
Intel tells programmers in its manual that a PT can hold 512 pages. 512 pages of 4KB each are exactly 2MB of memory we can address.
But we want to address more memory than 2MB, so we store the physical base addresses of 512 PTs into 512 Page Directory Entries (PDEs), which together form a Page Directory (PD). 512 PTs at 2MB each = 1GB per PD.
1GB per PD is still not enough. So we do it again: we take 512 PDs and store their physical base addresses into the Page Directory Pointer Table Entries (PDPTEs) of a Page Directory Pointer Table, for 512GB per PDPT.
And because we can, we do it one last time: 512 PDPT base addresses go into the entries of one Page Map Level 4 table (each entry being a PML4E), for 256TB in total.
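The arithmetic behind this buildup can be checked with a few lines of Python (a sketch; the variable names are mine):

```python
# Coverage of each paging level: every level multiplies the
# previous one by 512, the number of entries per table.
PAGE = 4 * 1024                 # one page: 4KB

pt   = 512 * PAGE               # one Page Table       -> 2MB
pd   = 512 * pt                 # one Page Directory   -> 1GB
pdpt = 512 * pd                 # one PDPT             -> 512GB
pml4 = 512 * pdpt               # the PML4             -> 256TB

for name, size in [("PT", pt), ("PD", pd), ("PDPT", pdpt), ("PML4", pml4)]:
    print(f"{name:5} covers {size:>16} bytes")
```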
A physical address can be read like "give me the 0xffec1230-th byte in memory". A virtual address instead contains the indexes into the different pointer tables, which lead the MMU to the correct byte:
Virtual 64bit address:

```
 sign extension |PML4     |PDPTE    |PDE      |PT       |Offset into page
0000000000000000|000000000|000000000|000000000|000000000|000000000000
    [16bit]     | [9bit]  | [9bit]  | [9bit]  | [9bit]  |   [12bit]
```

(9 bits encode 0-511. The top 16 bits aren't used for translation; they must all be copies of bit 47, the so-called canonical form.)
Don’t freak out, here comes an example:
Let’s say we have this virtual address: 0x00000000007FC031. Written down in bits it looks like this:
```
                          |PML4     |PDPTE    |PDE      |PT       |Offset into page
Binary:  0000000000000000 |000000000|000000000|000000011|111111100|000000110001
Decimal:                  |    0    |    0    |    3    |   508   |     49
```
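You can reproduce this decomposition with shifts and masks (a minimal sketch; `split_virtual` is a hypothetical helper, not a real API):

```python
def split_virtual(vaddr):
    """Split a 48-bit virtual address into its four 9-bit table
    indexes and the 12-bit page offset."""
    offset = vaddr & 0xFFF           # bits 11:0
    pt     = (vaddr >> 12) & 0x1FF   # bits 20:12
    pde    = (vaddr >> 21) & 0x1FF   # bits 29:21
    pdpte  = (vaddr >> 30) & 0x1FF   # bits 38:30
    pml4   = (vaddr >> 39) & 0x1FF   # bits 47:39
    return pml4, pdpte, pde, pt, offset

print(split_virtual(0x00000000007FC031))  # -> (0, 0, 3, 508, 49)
```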
Based on the virtual address, the MMU looks up the first entry in the PML4 (which has index 0).
This entry holds a physical address pointing to a PDPT.
There it again follows the pointer in the first entry, this time to a PD.
In this PD it selects the fourth entry (index 3), which points to a PT.
In the PT it looks up the pointer in the 509th entry (index 508).
Finally, this last pointer points the MMU to a page. In this page the 50th byte (offset 49) is read or written.
Keep in mind that the actual physical address used at each step consists of two parts: the leading bits of the physical address stored in a table entry, plus the index from the virtual address. Those parts are appended to create a valid physical address.
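The whole walk can be simulated in a few lines of Python (a toy sketch: the tables are plain dicts and every address in it is made up):

```python
# Toy page walk: each table is a dict {index: entry}; an entry's
# bits 47:12 name the base of the next table.  Tables live in a
# fake "physical memory" dict keyed by base address.
PHYS = {
    0x1000: {0:   0x2000 | 0x1},   # PML4: entry 0   -> PDPT at 0x2000
    0x2000: {0:   0x3000 | 0x1},   # PDPT: entry 0   -> PD   at 0x3000
    0x3000: {3:   0x4000 | 0x1},   # PD:   entry 3   -> PT   at 0x4000
    0x4000: {508: 0x5000 | 0x1},   # PT:   entry 508 -> page at 0x5000
}

def walk(vaddr, pml4_base=0x1000):
    base = pml4_base
    for shift in (39, 30, 21, 12):          # PML4, PDPTE, PDE, PT
        index = (vaddr >> shift) & 0x1FF    # 9-bit index from the vaddr
        entry = PHYS[base][index]
        base  = entry & 0x0000FFFFFFFFF000  # keep bits 47:12, drop flags
    return base + (vaddr & 0xFFF)           # append the 12-bit offset

print(hex(walk(0x7FC031)))                  # -> 0x5031
```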
Maybe you already knew this, but the address bus in 64-bit computers is most of the time just 48 bits wide. So in reality processors only use 48 bits to address memory, not 64. This means the pointers to the next table or page aren't 64 bits long either.
And because each table entry is 64 bits long, the pointer stored in it doesn't use all of those bits.
In fact, a table entry stores the upper 36 bits of the final pointer. The 9-bit index from the virtual address gets appended together with 3 extra bits. Those 3 bits are zero because every entry is 8 bytes wide (index * 8 = index << 3). If the appended part identifies a byte inside a page, the full 12-bit offset is used instead and those 3 bits don't appear.
```
 pointer from table entry            |   index   | padding
             [36bits]                |  [9bits]  | [3bits]
000000000000000000000000000000000000 | 000000000 |  000
```
```
 pointer from table entry            | offset of byte
             [36bits]                |   [12bits]
000000000000000000000000000000000000 | 000000000000
```
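Combining the two parts can be sketched like this (hypothetical helper names; the mask keeps bits 47:12 of an entry):

```python
TABLE_MASK = 0x0000FFFFFFFFF000        # bits 47:12 of a table entry

def next_table_address(entry, index):
    """Physical address of the next table's entry: the entry's
    upper 36 bits plus the 9-bit index shifted left by 3
    (each entry is 8 bytes wide -- the '3 extra bits')."""
    return (entry & TABLE_MASK) + (index << 3)

def byte_address(entry, offset):
    """For the final page the full 12-bit offset is appended,
    so no shift is needed."""
    return (entry & TABLE_MASK) + offset
```

For instance, `next_table_address(0x12345003, 508)` yields `0x12345FE0` (508 * 8 = 0xFE0), and `byte_address(0x5003, 0x31)` yields `0x5031`.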
Paging is a tradeoff: The MMU has to translate virtual addresses into physical ones. This means the processor needs more time to access memory. But on the other hand, we can also do way more interesting things.
Because the pointer to the next table doesn't occupy the whole 64 bits of an entry (only 36 of them), we can use the remaining bits to store flags.
As an example for all the other entries, let's take a look at a Page Table Entry and three of its flag bits.
Every page and its prior table entries get a privilege level assigned (the user/supervisor bit). If it is 0, the page, and therefore its stored data, code or table entries, is only accessible by the kernel or the OS. "Only accessible by the OS/kernel" means that the processor must be executing OS/kernel code while requesting the data from memory. Otherwise the MMU will throw a Page Fault exception.
If it is set to 1, the page is accessible by user applications.
The read/write flag is self-explanatory.
And last but obviously not least, there is the execute-disable bit (bit 63), which will prevent any execution of the page's content. This is a security feature to prevent malware from executing code written into a data segment of a process. But don't worry, there are ways around that.
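These flag bits can be picked out of a raw 64-bit entry with simple masks (a sketch; the entry value below is made up, the bit positions are the architectural ones):

```python
# Decode a few flag bits of a 64-bit page-table entry.
def decode_flags(entry):
    return {
        "present":         bool(entry & (1 << 0)),
        "writable":        bool(entry & (1 << 1)),   # read/write flag
        "user_accessible": bool(entry & (1 << 2)),   # 0 = kernel only
        "execute_disable": bool(entry & (1 << 63)),  # XD bit
    }

entry = (1 << 63) | 0x5000 | 0b011   # XD set, present, writable, kernel-only
print(decode_flags(entry))
```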
So why have we gone through all this? Well, first of all it's basic computer science knowledge :P. And secondly, Intel's hardware support for memory virtualization builds on this concept.
Just to make it clear: the MMU only translates a given virtual address into a physical one and delivers the data back to the processor. Setting up all the page tables and assigning them to running processes is still done by the kernel.
So if you boot a virtualized guest on your computer, it will naturally set up its own page tables. But the host has already set up its own paging structures. To prevent a complete jumble in your memory, the guest's memory accesses need to be translated as well.
In the past, this had to be done by the VMM, which held a shadow (i.e. a copy) of the guest's page tables and translated guest pages into physical ones, which is incredibly slow. So Intel added Extended Page Tables (EPT) functionality to its processors, which performs this translation in the MMU and is therefore much quicker.
If the processor is in VMX non-root operation and the guest accesses memory, the MMU will first calculate the guest-physical address as normal.
At this point the MMU looks up a special pointer called the Extended Page Table Pointer.
This pointer is simply 36 bits long. In the next step the MMU appends bits 47:39 of the guest-physical address to it, plus 3 bits.
```
 pointer (EPTP)                      |   index   | padding
             [36bits]                |  [9bits]  | [3bits]
000000000000000000000000000000000000 | 000000000 |  000
```
Looks familiar? It is indeed the same translation procedure as usual. Although this time the MMU isn't reading the indexes from a virtual address, but from the guest-physical address.
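That first EPT lookup can be written out as follows (a sketch; `eptp36` and the helper name are mine, the values in the example are made up):

```python
# First EPT lookup: the 36-bit EPTP forms bits 47:12 of the EPT
# PML4 table's base; bits 47:39 of the guest-physical address
# select one of its 8-byte entries (index << 3 -- the '3 bits').
def first_ept_entry_address(eptp36, guest_phys):
    index = (guest_phys >> 39) & 0x1FF
    return (eptp36 << 12) + (index << 3)

print(hex(first_ept_entry_address(0x1, 1 << 39)))  # -> 0x1008
```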
And that was basically it. The guest's memory gets mapped into the host's memory without overwriting things. The host has every opportunity to manipulate the guest's memory, while the guest is not even aware it is using a different paging structure.
If paging was completely new to you, this was probably a hard read. I tried to make it accessible and haven't covered things like caching or memory segmentation with the GDT.
Please let me know, if there are any unclear statements or mistakes.
In the next part I initially planned to write about VT-d, but I am pretty sick of staring at Intel manuals. So when I have some spare time I will introduce Qubes OS, which uses virtualization heavily. Or do a write-up about an exploit that breaks out of a guest. Or even write a PoC Bluepill-like rootkit.
*Do it yourself*
- https://rayanfam.com/topics/hypervisor-from-scratch-part-1/ (building a kernel module)
- https://software.intel.com/sites/default/files/managed/7c/f1/326019-sdm-vol-3c.pdf (raw information)