Point of no C3 | Linux Kernel Exploitation - Part 0

In the name of Allah, the most beneficent, the most merciful.

“Appreciate the art, master the craft.”

It’s been more than a year, huh? But I’m back, with “Point of no C3”. Its main focus will be kernel exploitation, but that won’t stop it from looking at other things.


  • Chapter I: Environment setup:
    • Preparing the VM
    • Using KGDB to debug the kernel
    • Compiling a simple module
    • What?
    • Few structs
    • Debug a module
  • Chapter II: Overview on security and General understanding:
    • Control Registers
    • SMAP
    • SMEP
    • Write-Protect
    • Paging(a bit of segmentation too)
    • Processes
    • Syscalls
    • IDT(Interrupt Descriptor Table)
    • KSPP
    • KASLR
    • kptr_restrict
    • mmap_min_addr
    • addr_limit

Chapter I: Environment setup

No QEMU for you.

Preparing the VM:

To begin with, we will set up the environment and the VMs we are going to experiment on.
For this, Debian was chosen (core only).
Other choices include SUSE, CentOS, etc.

debian-9.4.0-amd64-netinst.iso			2018-03-10 12:56 291M [X]
debian-9.4.0-amd64-xfce-CD-1.iso		2018-03-10 12:57 646M
debian-mac-9.4.0-amd64-netinst.iso		2018-03-10 12:56 294M

A VM is then created with at least 35GB of disk space. (Hey, it’s for compiling the kernel!)

Installer disc image file (iso):
[C:\vm\debian-9.4.0-amd64-netinst.iso	[▼]]
⚠ Could not detect which operating system is in this disc image.
  You will need to specify which operating system will be installed.

Once you boot it, you can proceed with Graphical Install, and since we only want the core, stop at Software selection and have only SSH server and standard system utilities selected.
And when it’s done, you’ll have your first VM ready.

Debian GNU/Linux 9 Nwwz tty1
Hint: Num Lock on

Nwwz login: root
Linux Nwwz 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@Nwwz:~#

To get the latest stable Linux kernel release (4.17.2 at the time of writing) and run it, we start by installing the necessary packages:

apt-get install git build-essential fakeroot ncurses* libssl-dev libelf-dev ccache gcc-multilib bison flex bc

Downloading the kernel tarball and the patch:

root@Nwwz:~# cd /usr/src
root@Nwwz:/usr/src# wget "https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.gz"
root@Nwwz:/usr/src# wget "https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/patch-4.17.2.gz"

Extracting them:

root@Nwwz:/usr/src# ls
linux-4.17.2.tar.gz patch-4.17.2.gz
root@Nwwz:/usr/src# gunzip patch-4.17.2.gz
root@Nwwz:/usr/src# gunzip linux-4.17.2.tar.gz
root@Nwwz:/usr/src# tar -xvf linux-4.17.2.tar

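Moving and applying the patch (patch-4.17.2 is a diff against the base 4.17 tree, so this step only matters when the extracted sources are plain 4.17; the 4.17.2 tarball already includes it):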

root@Nwwz:/usr/src# ls
linux-4.17.2 linux-4.17.2.tar patch-4.17.2
root@Nwwz:/usr/src# mv patch-4.17.2 linux-4.17.2/
root@Nwwz:/usr/src# cd linux-4*2
root@Nwwz:/usr/src/linux-4.17.2# patch -p1 < patch-4.17.2

Cleaning the tree, copying the running kernel’s configuration to the current working directory as a starting point, and adjusting it through the ncurses menu:

root@Nwwz:/usr/src/linux-4.17.2# make mrproper
root@Nwwz:/usr/src/linux-4.17.2# make clean
root@Nwwz:/usr/src/linux-4.17.2# cp /boot/config-$(uname -r) .config
root@Nwwz:/usr/src/linux-4.17.2# make menuconfig

One must then set up the following fields:

[*] Networking support 	--->
    Device Drivers		--->
	Firmware Drivers	--->
	File systems		--->
[X] Kernel hacking		--->
		printk and dmesg options					--->
	[X] Compile-time checks and compiler options	--->
		[*] Compile the kernel with debug info
	-*- Kernel debugging
	[*] KGDB: kernel debugger
	Do you wish to save your new configuration?
	Press <ESC><ESC> to continue kernel configuration.
			[< Yes >]			< No >

Make sure .config then contains lines similar to these:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y

Before starting the build, you can speed it up by splitting the work into multiple jobs (one per processor). nproc gives you the number of processing units available; pass that number to make with -j.

root@Nwwz:/usr/src/linux-4.17.2# nproc
root@Nwwz:/usr/src/linux-4.17.2# make -j4

It will then automatically go through stage 1 & 2:

Setup is 17116 bytes (padded to 17408 bytes).
System is 4897 kB
CRC 2f571cf0
Kernel: arch/x86/boot/bzImage is ready	(#1)
	Building modules, stage 2.
	MODPOST	3330 modules
	CC		virt/lib/irqbypass.mod.o
	LD [M]	virt/lib/irqbypass.ko
root@Nwwz:/usr/src/linux-4.17.2#

If somehow, there’s no stage two, a single command should be executed before moving on:
(This normally isn’t required.)

make modules

Installing the modules:

root@Nwwz:/usr/src/linux-4.17.2# make modules_install
	INSTALL	sound/usb/usx2y/snd-usb-usx2y.ko
	INSTALL	virt/lib/irqbypass.ko
	DEPMOD	4.17.0
root@Nwwz:/usr/src/linux-4.17.2#

Installing and preparing the kernel for boot:

root@Nwwz:/usr/src/linux-4.17.2# make install
Found linux image: /boot/vmlinuz-4.17.0
Found initrd image: /boot/initrd.img-4.17.0
Found linux image: /boot/vmlinuz-4.9.0-6-amd64
Found initrd image: /boot/initrd.img-4.9.0-6-amd64
root@Nwwz:/usr/src/linux-4.17.2# cd /boot
root@Nwwz:/boot# mkinitramfs -o /boot/initrd.img-4.17.0 4.17.0
root@Nwwz:/boot# reboot

You can then choose the new kernel from the boot screen:

*Debian GNU/Linux, with Linux 4.17.0
 Debian GNU/Linux, with Linux 4.17.0 (recovery mode)
 Debian GNU/Linux, with Linux 4.9.0-6-amd64
 Debian GNU/Linux, with Linux 4.9.0-6-amd64 (recovery mode)

If it fails, however, complaining about running out of memory, you can reduce the size of the boot image by stripping the modules and rebuilding the initramfs.

root@Nwwz:/boot# cd /lib/modules/4.17.0/
root@Nwwz:/lib/modules/4.17.0# find . -name '*.ko' -exec strip --strip-unneeded {} +
root@Nwwz:/lib/modules/4.17.0# cd /boot
root@Nwwz:/boot# mkinitramfs -o initrd.img-4.17.0 4.17.0

It’ll then boot successfully.

root@Nwwz:~# uname -r

Using KGDB to debug the kernel:

The first thing to do is to install net-tools and run ifconfig, to get the VM’s address for the transfer that follows:

root@Nwwz:~# apt-get install net-tools
root@Nwwz:~# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet  netmask  broadcast

Back on the Debian machine, transferring vmlinux to the host is done over SCP (WinSCP in my case).

root@Nwwz:~# service ssh start
..							Parent directory
vmlinux			461 761 KB	File

With this, you’ll have debug symbols ready, but you still need to enable KGDB for the target kernel.

root@Nwwz:~# cd /boot/grub
root@Nwwz:/boot/grub# nano grub.cfg

By editing a single line and adding __setup arguments, we can tune the kernel to our needs, such as disabling KASLR and enabling KGDB.
Search for the first ‘Debian GNU’ occurrence, make sure it’s the wanted kernel, and add the following to the line marked with [X]: kgdboc=ttyS1,115200 kgdbwait nokaslr.

menuentry 'Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b1a66d11-d729-4f23-99b0-4ddfea0af6c5' {
	echo	'Loading Linux 4.17.0 ...'
	[X] linux	/boot/vmlinuz-4.17.0 root=UUID=b1a66d11-d729-4f23-99b0-4ddfea0af6c5 ro quiet kgdboc=ttyS1,115200 kgdbwait nokaslr
	echo	'Loading initial ramdisk ...'
	initrd	/boot/initrd.img-4.17.0

In order to debug the running kernel, another VM similar to the one made previously (Debian) will be created (DebianHOST).
Now shut down both VMs in order to set up the pipe:

  • Debian:
    ⦿ Use named pipe:
    	| \\.\pipe\com_2                        |
    	[This end is the server.             [▼]]
    	[The other end is a virtual machine. [▼]]
    I/O mode
    ⧆ Yield CPU on poll
    	Allow the guest operating system to use this serial
    	port in polled mode (as opposed to interrupt mode).
  • DebianHOST:
    ⦿ Use named pipe:
    	| \\.\pipe\com_2                        |
    	[This end is the client.             [▼]]
    	[The other end is a virtual machine. [▼]]
    I/O mode
    ⧆ Yield CPU on poll
    	Allow the guest operating system to use this serial
    	port in polled mode (as opposed to interrupt mode).

Getting the vmlinux image to DebianHOST after installing necessary packages:

[email protected]:~# apt-get install gcc gdb git net-tools
[email protected]:~# cd /home/user
[email protected]:/home/user# ls
[email protected]:/home/user# gdb vmlinux
GNU gdb (Debian 7.12-6)

Turning the Debian VM back on results in a similar message:

KASLR disabled: 'nokaslr' on cmdline.
[	1.571915] KGDB: Waiting for connection from remote gdb...

Attaching to it from DebianHOST’s GDB is then possible:

(gdb) set serial baud 115200
(gdb) target remote /dev/ttyS1
Remote debugging using /dev/ttyS1
kgdb_breakpoint () at kernel/debug/debug_core.c:1073
1073		wmb(); /* Sync point after breakpoint */
(gdb) list
1068	noinline void kgdb_breakpoint(void)
1069	{
1070		atomic_inc(&kgdb_setting_breakpoint);
1071		wmb(); /* Sync point before breakpoint */
1072		arch_kgdb_breakpoint();
1073		wmb(); /* Sync point after breakpoint */
1074		atomic_dec(&kgdb_setting_breakpoint);
1075	}
1076	EXPORT_SYMBOL_GPL(kgdb_breakpoint);

Know that once you type ‘continue’ in GDB, you won’t be able to control the target again unless you use the magic SysRq key on the Debian VM to force a SIGTRAP:

root@Nwwz:~# echo "g" > /proc/sysrq-trigger

And you can see in DebianHOST that it works.

[New Thread 459]
[New Thread 462]
[New Thread 463]
[New Thread 476]
[New Thread 485]
[New Thread 487]

Thread 56 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 489]
kgdb_breakpoint () at kernel/debug/debug_core.c:1073
1073	wmb(); /* Sync point after breakpoint */

Compiling a simple module:

A simple Hello 0x00sec module will be created.
We need to make a directory in root’s home folder and prepare two files:

root@Nwwz:~# mkdir mod
root@Nwwz:~# cd mod
root@Nwwz:~/mod/# nano hello.c
#include <linux/init.h>
#include <linux/module.h>

/* Clean-up handler, runs on rmmod */
static void hello_exit(void){
	printk(KERN_INFO "Goodbye!\n");
}

/* Initialization handler, runs on insmod */
static int hello_init(void){
	printk(KERN_INFO "Hello 0x00sec!\n");
	return 0;
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");

root@Nwwz:~/mod/# nano Makefile
obj-m += hello.o
KDIR   = /lib/modules/$(shell uname -r)/build

all:
	make -C $(KDIR) M=$(PWD) modules

clean:
	rm -rf *.ko *.o *.mod.* *.symvers *.order

Then, one can start compiling using ‘make’ and insert/remove the module in kernel to trigger both init and exit handlers.

root@Nwwz:~/mod# make
make -C /lib/modules/4.17.0/build M=/root/mod modules
make[1]: Entering directory '/usr/src/linux-4.17.2'
	CC [M]	/root/mod/hello.o
	Building modules, stage 2.
	MODPOST 1 modules
	CC	/root/mod/hello.mod.o
	LD [M] /root/mod/hello.ko
make[1]: Leaving directory '/usr/src/linux-4.17.2'
root@Nwwz:~/mod# insmod hello.ko
root@Nwwz:~/mod# rmmod hello.ko

By then, the messages are saved in the dmesg circular buffer.

root@Nwwz:~/mod# dmesg | grep Hello
[ 6545.039487] Hello 0x00sec!
root@Nwwz:~/mod# dmesg | grep Good
[ 6574.452282] Goodbye!

To clean the current directory:

root@Nwwz:~/mod# make clean

What?:

The kernel doesn’t rely on the C library we’re used to; it simply cannot link against it.
Instead, once a module is linked and loaded into kernel-space (which requires root privileges, duh), it uses the header files available in the kernel source tree. They offer a huge number of functions and macros, such as printk(), which logs a message with a given priority, and module_init() and module_exit(), which declare the initialization and clean-up functions.
And while an application usually runs with little chance of another thread changing its variables, this certainly isn’t the case for LKMs: what they offer can be used by multiple processes at the same time, which could lead (if the data involved is sensitive, i.e. touched inside a critical region) to a panic, or worse (better?), a compromise.

Few structs:

The kernel implements multiple kinds of locks; only semaphores and spinlocks will likely be used here.
When a semaphore is already held, the thread sleeps, waiting for the lock to be released so it can claim it.
That’s why it’s called a sleeping lock, and why it can only be used in process context.

/* Please don't access any members of this structure directly */
struct semaphore {
	raw_spinlock_t		lock;
	unsigned int		count;
	struct list_head	wait_list;
};

It can then be initialized with sema_init() or DEFINE_SEMAPHORE():

#define __SEMAPHORE_INITIALIZER(name, n)				\
{									\
	.lock		= __RAW_SPIN_LOCK_UNLOCKED((name).lock),	\
	.count		= n,						\
	.wait_list	= LIST_HEAD_INIT((name).wait_list),		\
}

static inline void sema_init(struct semaphore *sem, int val)
{
	static struct lock_class_key __key;
	*sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
	lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
}

val is the number of holders that can take the lock at once.
It’s normally set to 1; such a binary semaphore behaves like a mutex (the kernel also has a dedicated struct mutex).
Another type of lock is the spinlock: it keeps the thread spinning instead of sleeping, and for that reason, it can be used in interrupt context.

typedef struct spinlock {
	union {
		struct raw_spinlock rlock;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
		struct {
			u8 __padding[LOCK_PADSIZE];
			struct lockdep_map dep_map;
		};
#endif
	};
} spinlock_t;

#define __RAW_SPIN_LOCK_INITIALIZER(lockname)	\
	{					\
	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
	SPIN_DEBUG_INIT(lockname)		\
	SPIN_DEP_MAP_INIT(lockname) }

#define __RAW_SPIN_LOCK_UNLOCKED(lockname)	\
	(raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)

# define raw_spin_lock_init(lock)				\
	do { *(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock); } while (0)

static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
	return &lock->rlock;
}

#define spin_lock_init(_lock)				\
do {							\
	spinlock_check(_lock);				\
	raw_spin_lock_init(&(_lock)->rlock);		\
} while (0)
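
To make the "spinning instead of sleeping" idea concrete, here is a minimal user-space sketch built on C11 atomics. It is not the kernel's implementation: toy_spinlock_t and the toy_spin_* helpers are hypothetical names made up for this illustration, and a real spinlock_t also deals with preemption, interrupts and lockdep.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for spinlock_t; not the kernel type. */
typedef struct {
	atomic_flag locked;
} toy_spinlock_t;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt at taking the lock: true if we got it, false if it was held. */
static bool toy_spin_trylock(toy_spinlock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

/* Busy-wait until the lock is free: the caller never sleeps. */
static void toy_spin_lock(toy_spinlock_t *l)
{
	while (!toy_spin_trylock(l))
		; /* spin */
}

/* Release the lock so that a spinning waiter can grab it. */
static void toy_spin_unlock(toy_spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A second toy_spin_trylock() on a held lock fails immediately, while toy_spin_lock() would keep burning CPU until the holder calls toy_spin_unlock(); that is the trade-off that makes spinlocks usable where sleeping is forbidden.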

Enough with locks, what about file_operations?
This struct holds the operations that can be called on a device/file/entry.
When creating a character device, either directly with cdev_alloc() or through misc_register(), it has to be provided along with the major (first function only) and the minor.
It is defined as follows:

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
} __randomize_layout;

There are similar structs too, such as inode_operations, block_device_operations and tty_operations.
They all provide the handlers that run when a userspace call targets the file/inode/blockdev/tty.
Attackers sometimes overwrite pointers in such structs, perf_fops and ptmx_fops being classic examples, in order to redirect execution.

The kernel provides some structs for collections with different search times.
The first is the doubly linked list, list_head; its definition is simple, pointing to the next and previous list_head.

struct list_head {
	struct list_head *next, *prev;
};

The second is the red-black tree node, rb_node, which provides better search time.

struct rb_node {
	unsigned long  __rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

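It can be used to find a target value faster: starting at the root, if the value is bigger than the current node’s, go right, otherwise go left.
Since both list_head and rb_node are embedded inside a larger struct, the container_of() macro can then be used to get back to the containing struct.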
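
As a rough illustration of how an embedded list_head and container_of() work together, here is a user-space sketch. The container_of() below is a simplified version without the kernel's type checking, and struct item, list_init(), list_add_tail() and sum_items() are hypothetical helpers written for this example only.

```c
#include <stddef.h>

/* User-space copy of the kernel definition. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified container_of(): step back from the member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical container: the list node is embedded in the payload struct. */
struct item {
	int value;
	struct list_head node;
};

/* An empty list is a head pointing at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'entry' right before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Walk the list and sum the payloads, recovering each item from its node. */
static int sum_items(struct list_head *head)
{
	int sum = 0;
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next)
		sum += container_of(pos, struct item, node)->value;
	return sum;
}
```

The list code never needs to know about struct item; only the caller does, which is why the same list_head can be reused by every subsystem.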
Note: Each device, can have multiple minors, but it’ll necessarily have a single major.

root@Nwwz:/# cd /dev
root@Nwwz:/dev# ls -l
total 0
crw------- 1 root root    [10], 175 Feb  9 09:24 agpgart
                            *-> Same major, different minors.
crw-r--r-- 1 root root    [10], 235 Feb  9 09:24 autofs
drwxr-xr-x 2 root root         160 Feb  9 09:24 block
drwxr-xr-x 2 root root          80 Feb  9 09:24 bsg
[c]rw-rw-rw- 1 root tty      [5], [2] Feb  9 12:06 ptmx
|                             |    |
|                             |    *--> Minor
*---> Character Device        *---> Major
[b]rw-rw---- 1 root cdrom    [11], [0] Feb  9 09:24 sr0
|                             |    |
|                             |    *--> Minor
*---> Block Device            *---> Major
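
The major/minor pair shown by ls -l can also be recovered programmatically. The sketch below uses glibc's major(), minor() and makedev() macros from <sys/sysmacros.h>; format_devnum() is a hypothetical helper name, and in a real program the dev_t would typically come from the st_rdev field filled in by stat().

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Format a device number as "major, minor", the way ls -l prints it. */
static int format_devnum(dev_t dev, char *out, size_t len)
{
	return snprintf(out, len, "%u, %u", major(dev), minor(dev));
}
```

For instance, format_devnum(makedev(5, 2), buf, sizeof(buf)) describes the ptmx entry from the listing above.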

Debug a module:

When we started gdb, the only image it was aware of was vmlinux.
It doesn’t know about the loaded module, nor about where it was loaded.
In order to provide these things and make debugging the module possible, one has to first transfer the target module to DebianHOST.

root@Nwwz:~/mod# service ssh start

Once that’s done, one should find different sections and addresses of the LKM in memory:

root@Nwwz:~/mod# insmod simple.ko
root@Nwwz:~/mod# cd /sys/module/simple/sections
root@Nwwz:/sys/module/simple/sections# ls -la
total 0
drwxr-xr-x 2 root root	   0 Aug 11 06:30 .
drwxr-xr-x 5 root root	   0 Aug  2 17:55 ..
-r-------- 1 root root	4096 Aug 11 06:31 .bss
-r-------- 1 root root	4096 Aug 11 06:31 .data
-r-------- 1 root root	4096 Aug 11 06:31 .gnu.linkonce.this_module
-r-------- 1 root root	4096 Aug 11 06:31 __mcount_loc
-r-------- 1 root root	4096 Aug 11 06:31 .note.gnu.build-id
-r-------- 1 root root	4096 Aug 11 06:31 .orc_unwind
-r-------- 1 root root	4096 Aug 11 06:31 .orc_unwind_ip
-r-------- 1 root root	4096 Aug 11 06:31 .rodata.str1.1
-r-------- 1 root root	4096 Aug 11 06:31 .rodata.str1.8
-r-------- 1 root root	4096 Aug 11 06:31 .strtab
-r-------- 1 root root	4096 Aug 11 06:31 .symtab
-r-------- 1 root root	4096 Aug 11 06:31 .text
root@Nwwz:/sys/module/simple/sections# cat .text
root@Nwwz:/sys/module/simple/sections# cat .data
root@Nwwz:/sys/module/simple/sections# cat .bss

Back to DebianHOST and in gdb:

(gdb) add-symbol-file simple.ko 0xffffffffc054c000 -s .data 0xffffffffc054e000 -s .bss 0xffffffffc054e4c0

And that’s it.
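
Copying three addresses by hand gets old quickly. As a convenience, a small user-space helper could read them from sysfs on the target and print the gdb command; this is only a sketch, and read_section_addr() and format_add_symbol_file() are hypothetical names, not part of any existing tool.

```c
#include <stdio.h>

/*
 * Read one address from /sys/module/<mod>/sections/<section>.
 * Returns 0 on success, -1 on failure (module not loaded, not root, ...).
 */
static int read_section_addr(const char *mod, const char *section,
			     unsigned long long *addr)
{
	char path[256];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path), "/sys/module/%s/sections/%s", mod, section);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llx", addr) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Build the gdb command line from the three section addresses. */
static int format_add_symbol_file(char *out, size_t len, const char *ko,
				  unsigned long long text,
				  unsigned long long data,
				  unsigned long long bss)
{
	return snprintf(out, len,
			"add-symbol-file %s 0x%llx -s .data 0x%llx -s .bss 0x%llx",
			ko, text, data, bss);
}
```

Run as root on the Debian VM, calling read_section_addr() for .text, .data and .bss and feeding the results to format_add_symbol_file() yields a line that can be pasted into gdb on DebianHOST.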

Chapter II: Overview on security and General understanding

Uuuuh, it’s simple?

Control Registers:

CRs are special registers. Invisible to user-space code, they hold important information about the current CPU and the process running on it.

CR0:
Present on both x86_32 and x86_64.
Keep in mind that the sizes differ (64 bits on x86_64, 32 bits on x86_32).

x86_32 and x86_64:
#0:     PE(Protected Mode Enable)
#1:     MP(Monitor co-processor)
#2:     EM(Emulation)
#3:     TS(Task Switched)
#4:     ET(Extension Type)
#5:     NE(Numeric Error)
#6-15:  Reserved
#16:    WP(Write Protect)
#17:    Reserved
#18:    AM(Alignment Mask)
#19-28: Reserved
#29:    NW(Not-Write Through)
#30:    CD(Cache Disable)
#31:    PG(Paging)
x86_64 only:
#32-63: Reserved
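
To get a feel for the bit list above, the following user-space sketch turns a raw register value into flag names. The table only restates the bits listed above; cr0_describe() is a hypothetical helper, and the value used in the note below (0x80050033) is just an illustrative one, not something read from this VM.

```c
#include <stdint.h>
#include <stdio.h>

struct cr0_flag {
	unsigned int bit;
	const char *name;
};

/* Defined (non-reserved) bits, taken from the list above. */
static const struct cr0_flag cr0_flags[] = {
	{ 0, "PE" }, { 1, "MP" }, { 2, "EM" }, { 3, "TS" },
	{ 4, "ET" }, { 5, "NE" }, { 16, "WP" }, { 18, "AM" },
	{ 29, "NW" }, { 30, "CD" }, { 31, "PG" },
};

/* Write the names of the set flags into 'out', separated by spaces. */
static size_t cr0_describe(uint64_t cr0, char *out, size_t len)
{
	size_t used = 0, i;

	if (len == 0)
		return 0;
	out[0] = '\0';
	for (i = 0; i < sizeof(cr0_flags) / sizeof(cr0_flags[0]); i++) {
		if (!(cr0 & (UINT64_C(1) << cr0_flags[i].bit)))
			continue;
		int n = snprintf(out + used, len - used, "%s%s",
				 used ? " " : "", cr0_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break; /* out of room: stop, keeping what fits */
		used += (size_t)n;
	}
	return used;
}
```

With 0x80050033 as input, the helper reports PE MP ET NE WP AM PG: protected mode, paging and write-protect are all on.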

CR2:
It solely contains the PFLA (Page Fault Linear Address), which is later read by do_page_fault() and passed to __do_page_fault() to handle the fault.

dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	unsigned long address = read_cr2(); /* Get the faulting address */
	enum ctx_state prev_state;

	prev_state = exception_enter();
	if (trace_pagefault_enabled())
		trace_page_fault_entries(address, regs, error_code);

	__do_page_fault(regs, error_code, address);
	exception_exit(prev_state);
}

CR3:
This register contains the physical address of the current process’s PGD (Page Global Directory), which (once converted back to a virtual address) links to the next level (P4D on five-level page tables or PUD on four-level page tables). In the end, the whole walk leads to a single physical page, described by a struct page.

static inline unsigned long read_cr3_pa(void)
{
	return __read_cr3() & CR3_ADDR_MASK;
}

static inline unsigned long native_read_cr3_pa(void)
{
	return __native_read_cr3() & CR3_ADDR_MASK;
}

static inline void load_cr3(pgd_t *pgdir)
{
	write_cr3(__sme_pa(pgdir));
}

read_cr3_pa() is called, for example, when an Oops happens and the kernel calls dump_pagetable().

CR4:
x86_32 and x86_64:
#0:     VME(Virtual-8086 Mode Extensions)
#1:     PVI(Protected Mode Virtual Interrupts)
#2:     TSD(Time Stamp Disable)
#3:     DE(Debugging Extensions)
#4:     PSE(Page Size Extensions)
#5:     PAE(Physical Address Extensions)
#6:     MCE(Machine Check Enable)
#7:     PGE(Page Global Enable)
#8:     PCE(Performance-Monitoring Counter Enable)
#9:     OSFXSR(OS Support for FXSAVE and FXRSTOR Instructions)
#10:    OSXMMEXCPT(OS Support for Unmasked SIMD Floating Point Exceptions)
#11:    UMIP(User-Mode Instruction Prevention)
#12:    LA57(57-bit Linear Addresses, five-level paging)
#13:    VMXE(Virtual Machine Extensions Enable)
#14:    SMXE(Safer Mode Extensions Enable)
#15:    Reserved
#16:    FSGSBASE(Enables RDFSBASE, RDGSBASE, WRFSBASE and WRGSBASE)
#17:    PCIDE(PCID Enable)
#18:    OSXSAVE(XSAVE and Processor Extended States Enable)
#19:    Reserved
#20:    SMEP(Supervisor Mode Execution Prevention)
#21:    SMAP(Supervisor Mode Access Prevention)
#22:    PKE(Protection Key Enable)
#23-31: Reserved
x86_64 only:
#32-63: Reserved

CR1 and CR5 to CR7:
Marked as reserved; accessing them raises the Invalid Opcode (#UD) exception.

CR8:
x86_64 only.
Only the first 4 bits are used in this one, while the other 60 bits are reserved (0).
Also called TPR (Task Priority Register). Those 4 bits are used when servicing interrupts, to check whether the task should really be interrupted. It may or may not be, depending on the interrupt’s priority: (IP <= TP ? PASS:SERVICE).
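
The (IP <= TP ? PASS:SERVICE) check can be sketched as follows, assuming the interrupt's priority is its priority class, i.e. the upper four bits of the vector number. irq_priority_class() and tpr_allows() are hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Priority class of an interrupt vector: its upper four bits. */
static unsigned int irq_priority_class(uint8_t vector)
{
	return vector >> 4;
}

/* True when the interrupt is serviced, false when the task priority masks it. */
static bool tpr_allows(uint8_t vector, uint64_t cr8)
{
	unsigned int tpr = cr8 & 0xf;	/* only bits 0-3 are defined */

	return irq_priority_class(vector) > tpr;
}
```

With the register set to 3, vector 0x31 (class 3) is held back while vector 0x41 (class 4) goes through; with it set to 15, nothing is serviced.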

They differ from one architecture to another; the previous examples covered two CISC architectures (x86_32 and x86_64). Windows itself has many similarities at this level.

Things are a bit different on RISC (ARM for this example):
Instead of Control Registers, there are Coprocessors (CP0 to CP15), and each Coprocessor holds 16 registers (C0 to C15). Note, however, that only CP14 and CP15 are really important to the system.
The MCR and MRC instructions are available to deal with data transfer (write/read, respectively).
An example for the TTBR (Translation Table Base Register): MRC p15, 0, r0, c2, c0, 0 reads TTBR0 into r0.

SMAP:

Stands for Supervisor Mode Access Prevention. As its name suggests, it prevents access to user-space memory from a more privileged context, that is, ring zero. However, since such access is still necessary on certain occasions, a flag is dedicated to this purpose (AC in EFLAGS), along with two instructions to set or clear it, STAC and CLAC.

static __init int setup_disable_smap(char *arg)
{
	setup_clear_cpu_cap(X86_FEATURE_SMAP);
	return 1;
}
__setup("nosmap", setup_disable_smap);

It can be disabled with the nosmap boot flag, which clears the CPU’s SMAP capability, or by unsetting the SMAP bit (#21) in CR4.


SMEP:

An abbreviation for Supervisor Mode Execution Prevention: when running in ring zero, the CPU refuses to execute code located in user-space pages. Both SMEP and SMAP thus limit what an attacker can do with user-space memory.

static __init int setup_disable_smep(char *arg)
{
	setup_clear_cpu_cap(X86_FEATURE_SMEP);
	return 1;
}
__setup("nosmep", setup_disable_smep);

Knowing whether it’s on is as simple as checking the flags in /proc/cpuinfo, and it’s the same for SMAP.
This protection can be disabled with the nosmep boot flag; it can also be disabled at runtime by unsetting the SMEP bit (#20) in CR4.
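
In terms of plain bit arithmetic, "unsetting" those bits looks like the sketch below. It only computes values in user-space; actually writing CR4 requires ring zero. The helper names are hypothetical, and the sample value mentioned afterwards (0x3006f0) is an illustrative one rather than a reading from this VM.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP (UINT64_C(1) << 20)
#define CR4_SMAP (UINT64_C(1) << 21)

static bool cr4_smep_enabled(uint64_t cr4)
{
	return (cr4 & CR4_SMEP) != 0;
}

static bool cr4_smap_enabled(uint64_t cr4)
{
	return (cr4 & CR4_SMAP) != 0;
}

/* The value that would have to be written back to CR4 to turn both off. */
static uint64_t cr4_without_smep_smap(uint64_t cr4)
{
	return cr4 & ~(CR4_SMEP | CR4_SMAP);
}
```

For a CR4 of 0x3006f0, both checks return true and cr4_without_smep_smap() gives 0x6f0, every other bit being left untouched.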


Write-Protect:

Code executing at the highest privilege level would normally be able to write to all pages, even those marked RO (Read Only). However, a bit in CR0 (WP, bit #16) stops that from happening: when it is set, supervisor-mode writes to read-only pages fault as well.

Paging(a bit of segmentation too):

Linux does separate privileges. The processor can handle up to 4 different rings, from 0, the most privileged, to 3, the least privileged, with limited access to system resources. However, most operating systems only work with two rings: zero (also called kernel-space) and three (user-space).
Each running process has a struct mm_struct which fully describes its virtual memory space.
But when it comes to segmentation and paging, we’re only interested in a few fields of this struct: context, the linked list mmap and pgd.

typedef struct {

	u64 ctx_id;

	atomic64_t tlb_gen;

	struct rw_semaphore	ldt_usr_sem;
	struct ldt_struct	*ldt;

#ifdef CONFIG_X86_64
	unsigned short ia32_compat;
#endif

	struct mutex lock;
	void __user *vdso;
	const struct vdso_image *vdso_image;

	atomic_t perf_rdpmc_allowed;

	u16 pkey_allocation_map;
	s16 execute_only_pkey;
	void __user *bd_addr;
} mm_context_t;

This struct holds a lot of information about the context, including the Local Descriptor Table (LDT), the vDSO image and its base address (residing in user-space, hence __user), a read/write semaphore and a mutual exclusion lock (it’s a semaphore too, remember?).

struct ldt_struct {

	struct desc_struct	*entries;
	unsigned int		nr_entries;

	int			slot;
};

The first element of ldt_struct is a desc_struct pointer referencing an array of nr_entries entries.
However, know that the LDT usually isn’t set up: most processes only use the Global Descriptor Table, which is enough for them.

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
	[GDT_ENTRY_KERNEL32_CS]		= GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER32_CS]	= GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
#else
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(0xc09a, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(0xc0fa, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(0xc0f2, 0, 0xfffff),
	[GDT_ENTRY_PNPBIOS_CS32]	= GDT_ENTRY_INIT(0x409a, 0, 0xffff),
	[GDT_ENTRY_PNPBIOS_CS16]	= GDT_ENTRY_INIT(0x009a, 0, 0xffff),
	[GDT_ENTRY_PNPBIOS_DS]		= GDT_ENTRY_INIT(0x0092, 0, 0xffff),
	[GDT_ENTRY_APMBIOS_BASE]	= GDT_ENTRY_INIT(0x409a, 0, 0xffff),
	[GDT_ENTRY_APMBIOS_BASE+1]	= GDT_ENTRY_INIT(0x009a, 0, 0xffff),
	[GDT_ENTRY_APMBIOS_BASE+2]	= GDT_ENTRY_INIT(0x4092, 0, 0xffff),

	[GDT_ENTRY_ESPFIX_SS]		= GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
	[GDT_ENTRY_PERCPU]		= GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
#endif
} };

A per-cpu variable gdt_page is initialized using the GDT_ENTRY_INIT macro.

#define GDT_ENTRY_INIT(flags, base, limit)			\
	{							\
		.limit0		= (u16) (limit),		\
		.limit1		= ((limit) >> 16) & 0x0F,	\
		.base0		= (u16) (base),			\
		.base1		= ((base) >> 16) & 0xFF,	\
		.base2		= ((base) >> 24) & 0xFF,	\
		.type		= (flags & 0x0f),		\
		.s		= (flags >> 4) & 0x01,		\
		.dpl		= (flags >> 5) & 0x03,		\
		.p		= (flags >> 7) & 0x01,		\
		.avl		= (flags >> 12) & 0x01,		\
		.l		= (flags >> 13) & 0x01,		\
		.d		= (flags >> 14) & 0x01,		\
		.g		= (flags >> 15) & 0x01,		\
	}

This macro simply takes three arguments and splits them so that each field of the descriptor gets a valid value.
The GDT holds more entries on 32-bit than on 64-bit.

struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));

This says that gdt_page wraps an array of GDT_ENTRIES (32 on x86_32, 16 on x86_64) desc_struct entries, aligned to PAGE_SIZE (usually 4KB (4096)).

struct desc_struct {
	u16	limit0;
	u16	base0;
	u16	base1: 8, type: 4, s: 1, dpl: 2, p: 1;
	u16	limit1: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
} __attribute__((packed));
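
To see what GDT_ENTRY_INIT actually produces, the same splitting can be replayed in user-space. This is a sketch relying on the GCC/Clang packed attribute and bit-field layout; gdt_entry() is a hypothetical function that mirrors the macro, and the struct is a local copy of the kernel layout shown above.

```c
#include <stdint.h>

typedef uint16_t u16;

/* User-space copy of the kernel's 8-byte descriptor layout. */
struct desc_struct {
	u16	limit0;
	u16	base0;
	u16	base1: 8, type: 4, s: 1, dpl: 2, p: 1;
	u16	limit1: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
} __attribute__((packed));

/* Same splitting as GDT_ENTRY_INIT, written as a function. */
static struct desc_struct gdt_entry(uint32_t flags, uint32_t base,
				    uint32_t limit)
{
	struct desc_struct d = {
		.limit0	= (u16)limit,
		.limit1	= (limit >> 16) & 0x0F,
		.base0	= (u16)base,
		.base1	= (base >> 16) & 0xFF,
		.base2	= (base >> 24) & 0xFF,
		.type	= flags & 0x0f,
		.s	= (flags >> 4) & 0x01,
		.dpl	= (flags >> 5) & 0x03,
		.p	= (flags >> 7) & 0x01,
		.avl	= (flags >> 12) & 0x01,
		.l	= (flags >> 13) & 0x01,
		.d	= (flags >> 14) & 0x01,
		.g	= (flags >> 15) & 0x01,
	};
	return d;
}
```

Feeding it the 64-bit user code segment flags (0xa0fb) gives dpl = 3 and l = 1, while the kernel code segment flags (0xa09b) give dpl = 0: the only difference between the two entries is the privilege level.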

When an ELF is about to run and is being loaded by load_elf_binary(), setup_new_exec() and install_exec_creds() are called on bprm, followed by setup_arg_pages(), which picks a randomized stack pointer.
Before returning successfully, it calls finalize_exec() and start_thread(), which update the stack’s rlimit and begin execution respectively:

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}

As you are able to see, this function is just a wrapper around start_thread_common():

static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	WARN_ON_ONCE(regs != current_pt_regs());

	if (static_cpu_has(X86_BUG_NULL_SEG)) {
		/* Loading zero below won't clear the base. */
		loadsegment(fs, __USER_DS);
	}

	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);

	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->cs		= _cs;
	regs->ss		= _ss;
	regs->flags		= X86_EFLAGS_IF;
}

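In conclusion, every process starts with the same default segment registers, but with different GPRs, stack and instruction pointer. Looking at __USER_DS and __USER_CS:

#define GDT_ENTRY_DEFAULT_USER_DS	5
#define GDT_ENTRY_DEFAULT_USER_CS	6

#define __USER_DS			(GDT_ENTRY_DEFAULT_USER_DS*8 + 3)
#define __USER_CS			(GDT_ENTRY_DEFAULT_USER_CS*8 + 3)

We find the segment registers and their values in user-space: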

Initial state:
CS = 6*8+3 = 0x33
SS = 5*8+3 = 0x2b
DS = FS = ES = 0

These values can be checked using GDB and a dummy binary.

(gdb) b* main
Breakpoint 1 at 0x6b0
(gdb) r
Starting program: /root/mod/cs

Breakpoint 1, 0x00005555555546b0 in main ()
(gdb) info reg cs ss
cs		0x33	51
ss		0x2b	43

Also, you should know that CS holds the Current Privilege Level (CPL) in its two least significant bits, while other segment selectors hold the Requested Privilege Level (RPL) there instead.

(gdb) p/t $cs
$1 = 110011
(gdb) p/x $cs & 0b11
$2 = 0x3
# (Privilege Level: User(3) Kernel(0))
(gdb) p/d $cs & ~0b111
$3 = 48
# (Table Offset: 48)
(gdb) p/d $cs & 0b100
$4 = 0
# (Table Indicator: GDT(0) LDT(1))
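
The same decoding can be written once and reused for any selector. A minimal sketch, where struct selector and decode_selector() are names made up for this example:

```c
#include <stdint.h>

struct selector {
	unsigned int index;	/* entry number in the descriptor table */
	unsigned int ti;	/* table indicator: 0 = GDT, 1 = LDT */
	unsigned int rpl;	/* privilege level (the CPL when read from CS) */
};

/* Split a 16-bit segment selector into its three fields. */
static struct selector decode_selector(uint16_t sel)
{
	struct selector s = {
		.index	= sel >> 3,
		.ti	= (sel >> 2) & 1,
		.rpl	= sel & 3,
	};
	return s;
}
```

decode_selector(0x33) gives index 6 in the GDT at privilege level 3, and decode_selector(0x2b) gives index 5, matching the 6*8+3 and 5*8+3 computations above.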

3 stands for ring 3, the least privileged, that is, user-space.
It doesn’t change unless execution moves to kernel-space, so it’s the same for root and for any normal user. Both RPL and CPL can therefore be seen as a limitation when accessing segments with a lower (more privileged) DPL (Descriptor Privilege Level).

When it comes to paging, its bit in CR0 (PG, #31) can only be set when the system is running in protected mode (PE bit in CR0 is set), because in real mode, virtual addresses are equal to physical ones.
Linux moved from four-level page tables to five-level page tables by adding an additional layer (P4D), so the levels now are: PGD, P4D, PUD, PMD, PTE.
PGD, the Page Global Directory, is the first level; it is accessed through a pointer of type pgd_t *, and pgd_t is defined as:

typedef struct { pgdval_t pgd; } pgd_t;

It holds a pgdval_t inside, which is an unsigned long(8 bytes on x86_64, 4 on x86_32):

typedef unsigned long	pgdval_t;

To get to the next level, pgtable_l5_enabled() is called to check whether the CPU has X86_FEATURE_LA57 enabled.

#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)

This can be seen in p4d_offset():

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	if (!pgtable_l5_enabled())
		return (p4d_t *)pgd;
	return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}

If it isn’t enabled, it simply casts the pgd_t * to p4d_t * and returns it; otherwise, it returns the P4D entry, within the table the PGD entry points to, that covers the given address.
P4D itself can then be used to find the next level, the PUD, of type pud_t *. The PUD links to the PMD (Page Middle Directory) and the PMD to the PTE (Page Table Entry), which is the last level: of type pte_t *, it contains the physical address of the page along with some protection flags.
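
Each level is indexed by a slice of the virtual address. The sketch below shows the split in user-space, assuming x86_64 with 4 KB pages and 9 index bits per level; struct pt_indices, level_index() and split_address() are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* 4 KB pages */
#define LEVEL_BITS	9	/* 512 entries per table */
#define LEVEL_MASK	((1u << LEVEL_BITS) - 1)

struct pt_indices {
	unsigned int pgd, p4d, pud, pmd, pte, offset;
};

/* level 0 = PTE, 1 = PMD, 2 = PUD, 3 = P4D (PGD on four levels), 4 = PGD */
static unsigned int level_index(uint64_t addr, unsigned int level)
{
	return (addr >> (PAGE_SHIFT + LEVEL_BITS * level)) & LEVEL_MASK;
}

/* Split a virtual address into one table index per paging level. */
static struct pt_indices split_address(uint64_t addr, int five_level)
{
	struct pt_indices idx = {
		.offset	= addr & ((1u << PAGE_SHIFT) - 1),
		.pte	= level_index(addr, 0),
		.pmd	= level_index(addr, 1),
		.pud	= level_index(addr, 2),
		/* With four levels the P4D is folded into the PGD. */
		.p4d	= five_level ? level_index(addr, 3) : 0,
		.pgd	= five_level ? level_index(addr, 4)
				     : level_index(addr, 3),
	};
	return idx;
}
```

Taking the address of main from the earlier gdb session (0x00005555555546b0) with four levels, the walk goes through PGD entry 170, PUD entry 341, PMD entry 170 and PTE 340, ending at offset 0x6b0 inside the page.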

Each process has its own virtual space (mm_struct, vm_area_struct and pgd_t).

struct vm_area_struct {

	unsigned long vm_start;
	unsigned long vm_end;

	struct vm_area_struct *vm_next, *vm_prev;

	struct rb_node vm_rb;

	unsigned long rb_subtree_gap;

	struct mm_struct *vm_mm;
	pgprot_t vm_page_prot;
	unsigned long vm_flags;	

	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	struct list_head anon_vma_chain;
	struct anon_vma *anon_vma;

	const struct vm_operations_struct *vm_ops;

	unsigned long vm_pgoff;
	struct file * vm_file;
	void * vm_private_data;

	atomic_long_t swap_readahead_info;
#ifndef CONFIG_MMU
	struct vm_region *vm_region;
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

typedef struct { pgdval_t pgd; } pgd_t;

Duplicating all of this for every new process would be very expensive. Copy-on-Write(COW) helps here: the child starts as a clone that shares the parent's pages, which are marked read-only, and a page is only copied when a write happens to it.
This happens on fork, more specifically in copy_process(), which duplicates the task_struct and does specific operations depending on the flags passed to clone(), before copying all the parent's information: credentials, filesystem, files, namespaces, IO, Thread Local Storage, signals and address space.
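
The copy itself can't be observed from userland, but its effect can. A minimal sketch, assuming a POSIX system; cow_demo() is a name made up for this example:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the parent's copy of `value` after the child
 * has written to its own copy, or -1 on error.
 */
static int cow_demo(void)
{
	volatile int value = 1;
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* The write is what makes the child get a private copy of the page. */
		value = 1337;
		_exit(value == 1337 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	/* Unchanged here: the child's write never reached the parent's page. */
	return value;
}
```

Calling cow_demo() from the parent gives back 1, even though the child saw 1337.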
As an example, the following module walks the VMAs in search of a user-specified address; once found, it gets its physical address and flags by walking the page tables.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <asm/pgtable.h>
#include <linux/highmem.h>
#include <linux/slab.h>

#define device_name "useless"
#define SET_ADDRESS 0x00112233

char *us_buf;
unsigned long address = 0;

long do_ioctl(struct file *filp, unsigned int cmd, unsigned long arg){
	switch(cmd){
		case SET_ADDRESS:
			address = arg;
			return 0;
		default:
			return -EINVAL;
	}
}

ssize_t do_read(struct file *filp, char *buf, size_t count, loff_t *offp){
	int res;
	unsigned long phys, flags;
	struct vm_area_struct *cmap;
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *ptep;

	/* Find corresponding VMA */
	cmap    = current->mm->mmap;
	while(cmap != NULL){
		if(cmap->vm_start <= address && address < cmap->vm_end){
			break;
		}
		cmap     = cmap->vm_next;
	}
	if(cmap  == NULL){
		return -1;
	}

	/* Walking Page-tables for fun */
	pgd     = pgd_offset(current->mm, address);
	p4d     = p4d_offset(pgd,         address);
	pud     = pud_offset(p4d,         address);
	pmd     = pmd_offset(pud,         address);
	ptep    = pte_offset_kernel(pmd,  address);
	/* Read the whole 64-bit entry, an int would truncate it */
	phys    = pte_val(*ptep);
	flags   = phys & 0xfff;
	phys   &= PTE_PFN_MASK;
	snprintf(us_buf, 64, "PhysAddr(%lx) VMAStart(%lx) Flags(%lx)", phys, cmap->vm_start, flags);

	if(count > 64)
		count = 64;
	res = copy_to_user(buf, us_buf, count);
	return res;
}

struct file_operations fileops = {
					.owner = THIS_MODULE,
					.read  = do_read,
					.unlocked_ioctl = do_ioctl,
};

static int us_init(void){
	struct proc_dir_entry *res;

	us_buf = kmalloc(64, GFP_KERNEL);
	if(us_buf == NULL){
		printk(KERN_ERR "Couldn't reserve memory.");
		return -ENOMEM;
	}
	res = proc_create(device_name, 0, NULL, &fileops);
	if(res == NULL){
		printk(KERN_ERR "Failed allocating a proc entry.");
		kfree(us_buf);
		return -ENOMEM;
	}
	return 0;
}

static void us_exit(void){
	remove_proc_entry(device_name, NULL);
	kfree(us_buf);
}

MODULE_LICENSE("GPL");
module_init(us_init);
module_exit(us_exit);


To communicate with this proc entry, the following was written:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define device_path "/proc/useless"
#define SET_ADDRESS 0x00112233

int main(void){
	int fd;
	char *ok;
	char c[64] = {0};

	fd = open(device_path, O_RDONLY);
	ok = malloc(512);
	memcpy(ok, "Welp", sizeof(int ));
	ioctl(fd, SET_ADDRESS, ok);

	read(fd, c, sizeof(c) - 1);
	printf("%s\n", c);
	return 0;
}

The flags come back as 0x867, which in binary is 100001100111:
Present: 1 (The page is present)
R/W: 1 (The page has both read and write permissions)
U/S: 1 (The page can be accessed by both user and supervisor)
Accessed: 1 (Set if the page has been accessed)
Dirty: 1 (Set if the page was written to since the last writeback)
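
The same decoding can be done in a few lines of C. This is only a userland sketch: the PTE_* names and pte_flag_set() are made up here, but the bit positions are the ones x86 uses for the low PTE bits.

```c
#include <stdio.h>

/* Low bits of an x86 page table entry. */
#define PTE_PRESENT  (1UL << 0)
#define PTE_RW       (1UL << 1)
#define PTE_USER     (1UL << 2)
#define PTE_ACCESSED (1UL << 5)
#define PTE_DIRTY    (1UL << 6)

/* Hypothetical helper: 1 if `bit` is set in `flags`, 0 otherwise. */
static int pte_flag_set(unsigned long flags, unsigned long bit)
{
	return (flags & bit) != 0;
}

/* Hypothetical helper: print the flags the same way as the list above. */
static void print_pte_flags(unsigned long flags)
{
	printf("Present: %d\n",  pte_flag_set(flags, PTE_PRESENT));
	printf("R/W: %d\n",      pte_flag_set(flags, PTE_RW));
	printf("U/S: %d\n",      pte_flag_set(flags, PTE_USER));
	printf("Accessed: %d\n", pte_flag_set(flags, PTE_ACCESSED));
	printf("Dirty: %d\n",    pte_flag_set(flags, PTE_DIRTY));
}
```

print_pte_flags(0x867) prints a 1 for each of the five flags.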

Note that the necessary checks on the validity of return values were ignored in this example; they could be performed with p??_none() and p??_present(). Many other things could have been done as well, such as playing with the PFN or the struct page, or reading from the physical address using void __iomem *, ioremap() and memcpy_fromio(), or struct page * and kmap().

Translating an address from virtual to physical takes time, so caching is implemented with the TLB(Translation Lookaside Buffer) to improve performance, in the hope that the next access lands a cache hit, which hands back the translation much faster than a miss, where memory accesses are forced to happen to walk the page tables. The TLB gets flushed (entirely or partially) from time to time; an example would be after a page fault has been raised and handled.


Processes:

The kernel sees each process as a struct task_struct, a huge struct with many fields that we can’t cover entirely. Some are used to guarantee (almost) fair scheduling, others hold the task’s state(runnable, unrunnable or stopped), its priority, the parent process, a linked list of child processes, the address space it holds, and many more.
We are mainly interested in const struct cred __rcu *cred; which holds the task’s credentials.

struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;
	kgid_t		gid;
	kuid_t		suid;
	kgid_t		sgid;
	kuid_t		euid;
	kgid_t		egid;
	kuid_t		fsuid;
	kgid_t		fsgid;
	unsigned	securebits;
	kernel_cap_t	cap_inheritable;
	kernel_cap_t	cap_permitted;
	kernel_cap_t	cap_effective;
	kernel_cap_t	cap_bset;
	kernel_cap_t	cap_ambient;
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;
	struct key __rcu *session_keyring;
	struct key	*process_keyring;
	struct key	*thread_keyring;
	struct key	*request_key_auth;
#endif
#ifdef CONFIG_SECURITY
	void		*security;
#endif
	struct user_struct *user;
	struct user_namespace *user_ns;
	struct group_info *group_info;
	struct rcu_head	rcu;
} __randomize_layout;

This struct holds the capabilities, the (effective) user and group IDs, the keyrings, rcu (Read-Copy-Update, for synchronization), user (tracks the user’s usage of the system by keeping counts) and user_ns (holds the U/G ID mappings and the privileges for them).
In order to better understand this structure, a simple proc entry was created which extracts the task_struct of the process that uses it(current) and reads the effective UID and GID.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/cred.h>
#include <linux/uidgid.h>

#define device_name "useless"
#define SD_PRIV     0x10071007

struct {
	kuid_t ceuid;
	kgid_t cegid;
	spinlock_t clock;
} us_cd;

long do_ioctl(struct file *filp, unsigned int cmd, unsigned long arg){
	int res;

	switch(cmd){
		case SD_PRIV:
			spin_lock(&us_cd.clock);
			current_euid_egid(&us_cd.ceuid, &us_cd.cegid);
			spin_unlock(&us_cd.clock);
			res = copy_to_user((void *)arg, &us_cd, 8);
			return res;
		default:
			return -EINVAL;
	}
}

struct file_operations fileops = {
					.owner = THIS_MODULE,
					.unlocked_ioctl = do_ioctl,
};

static int us_init(void){
	struct proc_dir_entry *res;

	spin_lock_init(&us_cd.clock);
	res = proc_create(device_name, 0, NULL, &fileops);
	if(res == NULL){
		printk(KERN_ERR "Failed allocating a proc entry.");
		return -ENOMEM;
	}

	return 0;
}

static void us_exit(void){
	remove_proc_entry(device_name, NULL);
}

MODULE_LICENSE("GPL");
module_init(us_init);
module_exit(us_exit);


The initialization starts by preparing the spinlock and creating a proc entry named “useless”, with a file_operations struct containing only the necessary owner and unlocked_ioctl entries.
The ioctl handler simply checks whether the command passed is SD_PRIV, and if so extracts the effective UID and GID with the current_euid_egid() macro, which in turn calls current_cred() to get current->cred:

#define current_euid_egid(_euid, _egid)		\
do {						\
	const struct cred *__cred;		\
	__cred = current_cred();		\
	*(_euid) = __cred->euid;		\
	*(_egid) = __cred->egid;		\
} while(0)
#define current_cred() \
	rcu_dereference_protected(current->cred, 1)

Then, we create a tasktry.c to interact with /proc/useless.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define device_path "/proc/useless"
#define SD_PRIV     0x10071007

struct {
	unsigned int uid;
	unsigned int gid;
} data;

int main(void){
	int fd;

	fd = open(device_path, O_RDONLY);
	ioctl(fd, SD_PRIV, &data);

	printf("UID: %d GID: %d\n", data.uid, data.gid);
	return 0;
}

Two binaries are then created in the /tmp directory: tasktry_root, compiled by root with the setuid bit set, and tasktry_user, compiled by a normal user.

root@Nwwz:~# cd /tmp
root@Nwwz:/tmp# gcc tasktry.c -o tasktry_root; chmod u+s tasktry_root
root@Nwwz:/tmp# cd /root/mod
root@Nwwz:~/mod# make
make -C /lib/modules/4.17.0/build M=/root/mod modules
make[1]: Entering directory '/usr/src/linux-4.17.2'
	CC [M]	/root/mod/task.o
	Building modules, stage 2.
	MODPOST 1 modules
	CC	/root/mod/task.mod.o
	LD [M] /root/mod/task.ko
make[1]: Leaving directory '/usr/src/linux-4.17.2'
root@Nwwz:~/mod# insmod task.ko
root@Nwwz:~/mod# su - user
user@Nwwz:~$ cd /tmp
user@Nwwz:/tmp$ gcc tasktry.c -o tasktry_user
user@Nwwz:/tmp$ ls
tasktry_user tasktry_root tasktry.c
user@Nwwz:/tmp$ ./tasktry_root
UID: 0 GID: 1000
user@Nwwz:/tmp$ ./tasktry_user
UID: 1000 GID: 1000

As you can see, the effective UID of tasktry_root is 0, which gives it high privileges. Overwriting the effective creds is therefore one way to escalate privileges(prepare_kernel_cred() and commit_creds() are used for this purpose in most exploits, instead of getting the stack base and overwriting them directly); another is to change the capabilities.
On Windows, one way to escalate privileges would be to steal the token of the System process(ID 4) and assign it to the newly spawned cmd.exe after changing the reference count.


Syscalls:

Processes running in userspace can still communicate with the kernel, thanks to syscalls.
Each syscall is defined as follows:

SYSCALL_DEFINE0(getpid)
{
	return task_tgid_vnr(current);
}

With multiple arguments:

SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
	return ksys_lseek(fd, offset, whence);
}

So, in general:

SYSCALL_DEFINE[ARG_COUNT]([SYSCALL_NAME], [ARG_TYPE], [ARG_NAME]){
	/* Passing the argument to another function, for processing. */
	return call_me([ARG_NAME]);
}

A few tries, aaand :)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void){
	printf("ID: %d\n", getuid());
	return 0;
}

Running this sample under GDB and putting a breakpoint on getuid in the x64 libc, we can see that it sets the EAX register to 0x66(the syscall number on x64) before the syscall instruction.

(gdb) x/i $rip
=> 0x555555554704 <main+4>:		callq 0x5555555545a0 <getuid@plt>
(gdb) x/x getuid
0x7ffff7af2f30 <getuid>: 		0x000066b8
(gdb) b* getuid
Breakpoint 2 at 0x7ffff7af2f30: file ../sysdeps/unix/syscall-template.S, line 65.
(gdb) c

Breakpoint 2, getuid () at ../sysdeps/unix/syscall-template.S:65
65		../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) disas $rip
Dump of assembler code for function getuid:
=> 0x00007ffff7af2f30 <+0>:		mov		$0x66,%eax
   0x00007ffff7af2f35 <+5>:		syscall
   0x00007ffff7af2f37 <+7>:		retq
 End of assembler dump.
(gdb) shell
root@Nwwz:~# echo "g" > /proc/sysrq-trigger

We can invoke a shell from GDB to force SysRQ, and see what this offset in the kernel links for:

[New Thread 756]
[New Thread 883]
[New Thread 885]

Thread 103 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 889]
kgdb_breakpoint () at kernel/debug/debug_core.c:1073
1073			wmb(); /* Sync point after breakpoint */
(gdb) p &sys_call_table
$1 = (const sys_call_ptr_t (*)[]) 0xffffffff81c00160 <sys_call_table>
(gdb) x/gx (void *)$1 + 0x66*8
0xffffffff81c00490 <sys_call_table+816>:	0xffffffff8108ec60
(gdb) x/i 0xffffffff8108ec60
0xffffffff8108ec60 <__x64_sys_getuid>:		nopl	0x0(%rax,%rax,1)

So, it’s the global sys_call_table, which holds __x64_sys_getuid at index 0x66.

"The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls"
is written in syscall_64.tbl that contains all the syscalls
available to the kernel.
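
The number and the table slot can be double-checked from userland. A rough sketch, assuming Linux with glibc; raw_getuid() and slot_offset() are names made up for this example, and the 8-byte slot size assumes a 64-bit kernel:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getuid by number, bypassing the libc wrapper. */
static uid_t raw_getuid(void)
{
	return (uid_t)syscall(SYS_getuid);
}

/* Byte offset of a syscall's slot in sys_call_table (8-byte pointers assumed). */
static unsigned long slot_offset(unsigned long nr)
{
	return nr * 8UL;
}
```

slot_offset(0x66) is 816, which matches the sys_call_table+816 shown by GDB above, and raw_getuid() returns the same value as getuid().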

This is similar to the nt!KiServiceTable on Windows.

kd> dps nt!KeServiceDescriptorTable
82b759c0  82a89d9c nt!KiServiceTable
82b759c4  00000000
82b759c8  00000191
82b759cc  82a8a3e4 nt!KiArgumentTable
82b759d0  00000000
82b759d4  00000000
kd> dd nt!KiServiceTable
82a89d9c  82c85c28 82acc40d 82c15b68 82a3088a
82a89dac  82c874ff 82b093fa 82cf7b05 82cf7b4e
82a89dbc  82c0a3bd 82d11368 82d125c1 82c00b95
kd> ln 82c85c28
(82c85c28)   nt!NtAcceptConnectPort   |  (82c85ca5)   nt!EtwpRundownNotifications
Exact matches:
    nt!NtAcceptConnectPort = <no type information>
kd> ln 82acc40d 
(82acc40d)   nt!NtAccessCheck   |  (82acc43e)   nt!PsGetThreadId
Exact matches:
    nt!NtAccessCheck = <no type information>
kd> ln 82d125c1
(82d125c1)   nt!NtAddDriverEntry   |  (82d125f3)   nt!NtDeleteDriverEntry
Exact matches:
    nt!NtAddDriverEntry = <no type information>

Disassembling it gives us:

(gdb) disas __x64_sys_getuid
Dump of assembler code for function __x64_sys_getuid:
	0xffffffff8108ec60 <+0>:	nopl	0x0(%rax,%rax,1)
	0xffffffff8108ec65 <+5>:	mov		%gs:0x15c00,%rax
	0xffffffff8108ec6e <+14>:	mov		0x668(%rax),%rax
	0xffffffff8108ec75 <+21>:	mov		0x4(%rax),%esi
	0xffffffff8108ec78 <+24>:	mov		0x88(%rax),%rdi
	0xffffffff8108ec7f <+31>:	callq	0xffffffff8112d4a0 <from_kuid_munged>
	0xffffffff8108ec84 <+36>:	mov		%eax,%eax
	0xffffffff8108ec86 <+38>:	retq

With a basic understanding of ASM and a very limited knowledge of the kernel (AT&T haha, too lazy to switch the syntax ;).), one can tell that it first fetches the current task, loads the pointer it holds at offset 0x668 into RAX, then dereferences it again to pass the contents at +0x88(in RDI) and +0x4(in ESI) as arguments to from_kuid_munged(). The final mov %eax,%eax zero-extends the 32-bit result before it returns(the q suffix there stands for qword).
We can verify this either by looking at the source:

SYSCALL_DEFINE0(getuid)
{
	return from_kuid_munged(current_user_ns(), current_uid());
}

uid_t from_kuid_munged(struct user_namespace *targ, kuid_t kuid)
{
	uid_t uid;
	uid = from_kuid(targ, kuid);

	if (uid == (uid_t) -1)
		uid = overflowuid;
	return uid;
}

Or checking in GDB(maybe both?):

(gdb) b* __x64_sys_getuid
Breakpoint 1 at 0xffffffff8108ec60: file kernel/sys.c, line 920.
(gdb) c
[New Thread 938]
[Switching to Thread 938]

Thread 122 hit Breakpoint 1, __x64_sys_getuid () at kernel/sys.c:920
920		{
(gdb) ni
get_current () at ./arch/x86/include/asm/current.h:15
15		return this_cpu_read_stable(current_task);
(gdb) x/i $rip
=> 0xffffffff8108ec65 <__x64_sys_getuid+5>:		mov		%gs:0x15c00,%rax
(gdb) p ((struct task_struct *)0)->cred
Cannot access memory at address 0x668
(gdb) p ((struct cred *)0)->uid
Cannot access memory at address 0x4
(gdb) p ((struct cred *)0)->user_ns
Cannot access memory at address 0x88
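
The NULL-cast trick used in GDB above is nothing more than offsetof(). A tiny sketch with a mock, heavily trimmed struct (NOT the real struct cred; it only assumes that atomic_t and kuid_t are 4 bytes wide, as on x86_64):

```c
#include <stddef.h>

/* Mock struct for illustration only, not the kernel's struct cred. */
struct mock_cred {
	int          usage; /* atomic_t wraps a 4-byte int */
	unsigned int uid;   /* kuid_t wraps a 4-byte uid_t */
	unsigned int gid;
};

/* Same idea as "p ((struct cred *)0)->uid": the address is the offset. */
static size_t mock_uid_offset(void)
{
	return offsetof(struct mock_cred, uid);
}
```

mock_uid_offset() gives 4, the same +0x4 the disassembly uses to reach the uid.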

The sys_call_table is residing in a RO(read only) memory space:

(gdb) x/x sys_call_table
0xffffffff81c00160 <sys_call_table>:	0xffffffff81247310
(gdb) maintenance info sections
 [3]	0xffffffff81c00000->0xffffffff81ec1a42 at 0x00e00000: .rodata ALLOC LOAD RELOC DATA HAS_CONTENTS

But a kernel module can overcome this protection and place a hook on any system call.
For that, two example modules will be given:
=] Disabling the previously discussed WP(write-protect) bit in CR0(control register #0), using read_cr0() and write_cr0() to achieve that.

#include <linux/fs.h>
#include <asm/pgtable.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>
#include <linux/kallsyms.h>
#include <linux/miscdevice.h>
#include <asm/special_insns.h>

#define device_name "hookcontrol"
#define ioctl_base    0x005ec
#define ioctl_enable  ioctl_base+1
#define ioctl_disable ioctl_base+2

int    res;
int  (*real_getuid)(void);
void **sys_call_table;
unsigned long const *address;

static int hooked_getuid(void){
	printk(KERN_INFO "Received getuid call from %s!", current->comm);
	if(real_getuid != NULL){
		return real_getuid();
	}
	return 0;
}

long do_ioctl(struct file *filp, unsigned int cmd, unsigned long arg){
	unsigned long cr0 = read_cr0();

	switch(cmd){
		case ioctl_enable:
		printk(KERN_INFO "Enabling hook!");
		write_cr0(cr0 & ~0x10000);
		sys_call_table[__NR_getuid] = hooked_getuid;
		write_cr0(cr0 |  0x10000);
		printk(KERN_INFO "Successfully changed!");

		return 0;
		case ioctl_disable:
		printk(KERN_INFO "Disabling hook!");
		write_cr0(cr0 & ~0x10000);
		sys_call_table[__NR_getuid] = real_getuid;
		write_cr0(cr0 |  0x10000);
		printk(KERN_INFO "Successfully restored!");
		return 0;
		default:
		return -EINVAL;
	}
}

struct file_operations file_ops = {
									.owner          = THIS_MODULE,
									.unlocked_ioctl = do_ioctl
};

struct miscdevice hk_dev = {
									.minor = MISC_DYNAMIC_MINOR,
									.name  = device_name,
									.fops  = &file_ops
};

static int us_init(void){
	res = misc_register(&hk_dev);
	if(res){
		printk(KERN_ERR "Couldn't load module!");
		return -1;
	}
	sys_call_table = (void *) kallsyms_lookup_name("sys_call_table");
	real_getuid    = sys_call_table[__NR_getuid];
	address        = (unsigned long *) &sys_call_table;
	printk(KERN_INFO "Module successfully loaded with minor: %d!", hk_dev.minor);
	return 0;
}

static void us_exit(void){
	misc_deregister(&hk_dev);
}

MODULE_LICENSE("GPL");
module_init(us_init);
module_exit(us_exit);

=] OR’ing _PAGE_RW into the protection mask of the page where it resides(__pgprot(_PAGE_RW), set_memory_rw() & set_memory_ro()), or directly modifying the PTE.

static inline pte_t pte_mkwrite(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_RW);
}

static inline pte_t pte_wrprotect(pte_t pte)
{
	return pte_clear_flags(pte, _PAGE_RW);
}

Looking at these functions, one can safely assume that the manipulation can be achieved with simple OR and AND operations with _PAGE_RW on the pte_t.
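
On a raw PTE value this boils down to the following userland sketch. The mock_* names are made up for this example; only the bit position (bit 1, same as _PAGE_RW on x86) comes from the real thing.

```c
#include <stdint.h>

#define MOCK_PAGE_RW (1ULL << 1) /* same bit position as _PAGE_RW */

/* Userland stand-in for pte_mkwrite(): OR the R/W bit in. */
static uint64_t mock_pte_mkwrite(uint64_t pte)
{
	return pte | MOCK_PAGE_RW;
}

/* Userland stand-in for pte_wrprotect(): AND the R/W bit out. */
static uint64_t mock_pte_wrprotect(uint64_t pte)
{
	return pte & ~MOCK_PAGE_RW;
}
```

Taking the 0x867 flags from earlier, mock_pte_wrprotect() gives 0x865 and mock_pte_mkwrite() brings it back to 0x867.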

pte_t *lookup_address(unsigned long address, unsigned int *level)
{
	return lookup_address_in_pgd(pgd_offset_k(address), address, level);
}

Since it’s a kernel address, pgd_offset_k() is called, which makes use of &init_mm, instead of a mm_struct belonging to some process of one’s choice.

pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
			     unsigned int *level)
{
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	*level = PG_LEVEL_NONE;

	if (pgd_none(*pgd))
		return NULL;

	p4d = p4d_offset(pgd, address);
	if (p4d_none(*p4d))
		return NULL;

	*level = PG_LEVEL_512G;
	if (p4d_large(*p4d) || !p4d_present(*p4d))
		return (pte_t *)p4d;

	pud = pud_offset(p4d, address);
	if (pud_none(*pud))
		return NULL;

	*level = PG_LEVEL_1G;
	if (pud_large(*pud) || !pud_present(*pud))
		return (pte_t *)pud;

	pmd = pmd_offset(pud, address);
	if (pmd_none(*pmd))
		return NULL;

	*level = PG_LEVEL_2M;
	if (pmd_large(*pmd) || !pmd_present(*pmd))
		return (pte_t *)pmd;

	*level = PG_LEVEL_4K;

	return pte_offset_kernel(pmd, address);
}

So, the ioctl handler looks like this:

long do_ioctl(struct file *filp, unsigned int cmd, unsigned long arg){
	unsigned int level;
	pte_t *pte = lookup_address(*address, &level);

	switch(cmd){
		case ioctl_enable:
		printk(KERN_INFO "Enabling hook!");
		pte->pte |= _PAGE_RW;
		sys_call_table[__NR_getuid] = hooked_getuid;
		pte->pte &= ~_PAGE_RW;
		printk(KERN_INFO "Successfully changed!");

		return 0;
		case ioctl_disable:
		printk(KERN_INFO "Disabling hook!");
		pte->pte |= _PAGE_RW;
		sys_call_table[__NR_getuid] = real_getuid;
		pte->pte &= ~_PAGE_RW;
		printk(KERN_INFO "Successfully restored!");
		return 0;
		default:
		return -EINVAL;
	}
}

(Know that these are only examples. Usually, the replacement should take place at init and the original be restored at exit; also, the definitions of both the hook and the original handler should carry asmlinkage(arguments passed on the stack, unlike fastcall, where they go in registers). Since the syscall here takes no arguments, this was ignored.)
Running an application from user-space that interacts with /dev/hookcontrol(enabling the hook, then disabling it after a while) and taking a look at dmesg should show the messages printed by the module.
This can be used to put a layer on top of a syscall, to block it or manipulate its return value: kill to prevent a process from being killed, getdents to hide some files, unlink to prevent a file from being deleted, et cetera…
And it doesn’t stop here, even without syscall hooking, one can play with processes(hide them as an example…) with task_struct elements and per-task flags, or change the file_operations in some specific struct, and many other possibilities.

IDT(Interrupt Descriptor Table):

In order to handle exceptions, this table exists: by linking a specific handler to each exception, it helps deal with those raised from userspace(a transition to ring zero is required first) and from kernelspace.
It is first initialized during early setup. This can be seen in setup_arch(), which calls multiple functions, some of them to set up the IDT; the most important to us is idt_setup_traps():

void __init idt_setup_traps(void)
{
	idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
}

It makes use of the default IDTs array(def_idts).

static const __initconst struct idt_data def_idts[] = {
	INTG(X86_TRAP_DE,		divide_error),
	INTG(X86_TRAP_NMI,		nmi),
	INTG(X86_TRAP_BR,		bounds),
	INTG(X86_TRAP_UD,		invalid_op),
	INTG(X86_TRAP_NM,		device_not_available),
	INTG(X86_TRAP_OLD_MF,		coprocessor_segment_overrun),
	INTG(X86_TRAP_TS,		invalid_TSS),
	INTG(X86_TRAP_NP,		segment_not_present),
	INTG(X86_TRAP_SS,		stack_segment),
	INTG(X86_TRAP_GP,		general_protection),
	INTG(X86_TRAP_SPURIOUS,		spurious_interrupt_bug),
	INTG(X86_TRAP_MF,		coprocessor_error),
	INTG(X86_TRAP_AC,		alignment_check),
	INTG(X86_TRAP_XF,		simd_coprocessor_error),

#ifdef CONFIG_X86_32
	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
#else
	INTG(X86_TRAP_DF,		double_fault),
#endif
	INTG(X86_TRAP_DB,		debug),

#ifdef CONFIG_X86_MCE
	INTG(X86_TRAP_MC,		&machine_check),
#endif

	SYSG(X86_TRAP_OF,		overflow),
#if defined(CONFIG_IA32_EMULATION)
	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_compat),
#elif defined(CONFIG_X86_32)
	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
#endif
};

On x86_32 as an example, when an int 0x80 is raised, the following happens:

static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
	struct thread_info *ti = current_thread_info();
	unsigned int nr = (unsigned int)regs->orig_ax;

#ifdef CONFIG_IA32_EMULATION
	ti->status |= TS_COMPAT;
#endif

	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
		nr = syscall_trace_enter(regs);
	}

	if (likely(nr < IA32_NR_syscalls)) {
		nr = array_index_nospec(nr, IA32_NR_syscalls);
#ifdef CONFIG_IA32_EMULATION
		regs->ax = ia32_sys_call_table[nr](regs);
#else
		regs->ax = ia32_sys_call_table[nr](
			(unsigned int)regs->bx, (unsigned int)regs->cx,
			(unsigned int)regs->dx, (unsigned int)regs->si,
			(unsigned int)regs->di, (unsigned int)regs->bp);
#endif
	}

	syscall_return_slowpath(regs);
}

__visible void do_int80_syscall_32(struct pt_regs *regs)
{
	enter_from_user_mode();
	local_irq_enable();
	do_syscall_32_irqs_on(regs);
}

do_int80_syscall_32() calls enter_from_user_mode() to let context tracking know that we left user mode, then enables Interrupt Requests(IRQs) on the current CPU.
do_syscall_32_irqs_on() then takes the syscall number(EAX) from the saved registers and uses it as an index into the ia32_sys_call_table array.
Arguments are passed to the handler in registers, in the following order: EBX, ECX, EDX, ESI, EDI, EBP.
However, the first entry in the idt_table is X86_TRAP_DE(divide error).
GDB shows that the first gate within idt_table holds the offset_high, offset_middle and offset_low referencing divide_error, which deals with division by 0 exceptions.

(gdb) p idt_table
$1 = 0xffffffff82598000 <idt_table>
(gdb) p/x *(idt_table + 0x10*0)
$2 = {offset_low = 0xb90, segment = 0x10,
      bits = {ist = 0x0, zero = 0, type = 14, dpl = 0, p = 1},
	  offset_middle = 0x8180, offset_high = 0xffffffff, reserved = 0x0}
(gdb) x/8i 0xffffffff81800b90
	0xffffffff81800b90 <divide_error>:		nopl	(%rax)
	0xffffffff81800b93 <divide_error+3>:	pushq	$0xffffffffffffffff
	0xffffffff81800b95 <divide_error+5>:	callq	0xffffffff81801210 <error_entry>
	0xffffffff81800b9a <divide_error+10>:	mov		%rsp,%rdi
	0xffffffff81800b9d <divide_error+13>:	xor		%esi,%esi
	0xffffffff81800b9f <divide_error+15>:	callq	0xffffffff81025d60 <do_divide_error>
	0xffffffff81800ba4 <divide_error+20>:	jmpq	0xffffffff81801310 <error_exit>

You can see that its DPL is zero, that is, an int $0x00 from a userland process wouldn’t help reaching it(unlike int $0x03, int $0x04 or int $0x80). Gate descriptors are initialized in idt_setup_from_table(), which calls idt_init_desc():

static void
idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
{
	gate_desc desc;

	for (; size > 0; t++, size--) {
		idt_init_desc(&desc, t);
		write_idt_entry(idt, t->vector, &desc);
		if (sys)
			set_bit(t->vector, system_vectors);
	}
}

And here it is.

static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
{
	unsigned long addr = (unsigned long) d->addr;

	gate->offset_low	= (u16) addr;
	gate->segment		= (u16) d->segment;
	gate->bits		= d->bits;
	gate->offset_middle	= (u16) (addr >> 16);
#ifdef CONFIG_X86_64
	gate->offset_high	= (u32) (addr >> 32);
	gate->reserved		= 0;
#endif
}

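This could be used by an attacker, for example by getting the IDT address with the SIDT instruction and looking for a specific handler in the table: since offset_high is 0xffffffff for kernel handlers, incrementing it wraps it around to 0, which makes the handler address point into userspace.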

As we said above, we're going to use the IDT and overwrite one of its
entries (more precisely a Trap Gate, so that we're able to hijack an
exception handler and redirect the code-flow towards userspace).
Each IDT entry is 64-bit (8-bytes) long and we want to overflow the
'base_offset' value of it, to be able to modify the MSB of the exception
handler routine address and thus redirect it below PAGE_OFFSET
(0xc0000000) value.

~ Phrack
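
To see how the handler address is spread over the gate, and what bumping offset_high does to it, here is a userland sketch. struct mock_gate is a trimmed stand-in made up for this example (the real gate_desc also carries the segment, the bits and a reserved field).

```c
#include <stdint.h>

/* Trimmed 64-bit gate: only the fields that carry the handler address. */
struct mock_gate {
	uint16_t offset_low;
	uint16_t offset_middle;
	uint32_t offset_high;
};

/* Rebuild the handler address from the three pieces. */
static uint64_t gate_offset(const struct mock_gate *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}

/* Split an address the same way idt_init_desc() does. */
static struct mock_gate gate_from_addr(uint64_t addr)
{
	struct mock_gate g;

	g.offset_low    = (uint16_t)addr;
	g.offset_middle = (uint16_t)(addr >> 16);
	g.offset_high   = (uint32_t)(addr >> 32);
	return g;
}
```

With the values GDB printed for divide_error (0xb90, 0x8180, 0xffffffff), gate_offset() gives back 0xffffffff81800b90. After a single offset_high++ the field wraps to 0 and the handler becomes 0x81800b90, an address a userland process can map.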


KSPP:

This is a protection that appeared starting from 4.8; its name is short for “Kernel Self Protection Project”. It provides additional checks in copy_to_user() and copy_from_user() to prevent classic buffer-overflow bugs, by checking the buffer size known at compile time and making sure the requested copy fits. If it doesn’t, the copy is aborted, preventing any possible exploitation.

root@Nwwz:~/mod# cd /usr/src
root@Nwwz:/usr/src# cd linux-4.17.2
root@Nwwz:/usr/src/linux-4.17.2# cd include
root@Nwwz:/usr/src/linux-4.17.2/include# nano uaccess.h

We can directly see a check that’s likely to be 1, before proceeding to the copy operation:

static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
	if (likely(check_copy_size(to, n, false)))
		n = _copy_from_user(to, from, n);
	return n;
}

static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	if (likely(check_copy_size(from, n, true)))
		n = _copy_to_user(to, from, n);
	return n;
}

The check function is as follows. It first compares the compile-time size against the requested size; if an overflow seems possible(which is unlikely, of course… or not?), it calls copy_overflow() when the requested size isn’t a compile-time constant, or __bad_copy_from() or __bad_copy_to() depending on the boolean is_source when it is, and then returns false.
Otherwise, it calls check_object_size() and returns true.

extern void __compiletime_error("copy source size is too small")
__bad_copy_from(void);
extern void __compiletime_error("copy destination size is too small")
__bad_copy_to(void);

static inline void copy_overflow(int size, unsigned long count)
{
	WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
}

static __always_inline bool
check_copy_size(const void *addr, size_t bytes, bool is_source)
{
	int sz = __compiletime_object_size(addr);
	if (unlikely(sz >= 0 && sz < bytes)) {
		if (!__builtin_constant_p(bytes))
			copy_overflow(sz, bytes);
		else if (is_source)
			__bad_copy_from();
		else
			__bad_copy_to();
		return false;
	}
	check_object_size(addr, bytes, is_source);
	return true;
}

This function is simply just a wrapper around __check_object_size().

#ifdef CONFIG_HARDENED_USERCOPY
extern void __check_object_size(const void *ptr, unsigned long n,
					bool to_user);

static __always_inline void check_object_size(const void *ptr, unsigned long n,
					      bool to_user)
{
	if (!__builtin_constant_p(n))
		__check_object_size(ptr, n, to_user);
}
#else
static inline void check_object_size(const void *ptr, unsigned long n,
				     bool to_user)
{ }
#endif

Additional checks are done in __check_object_size(): as its comments say, the object must not be a bogus address, must be a safe heap or stack object, and must not be a kernel .text address.

void __check_object_size(const void *ptr, unsigned long n, bool to_user)
{
	if (static_branch_unlikely(&bypass_usercopy_checks))
		return;

	if (!n)
		return;

	check_bogus_address((const unsigned long)ptr, n, to_user);

	check_heap_object(ptr, n, to_user);

	switch (check_stack_object(ptr, n)) {
	case NOT_STACK:
		break;
	case GOOD_FRAME:
	case GOOD_STACK:
		return;
	default:
		usercopy_abort("process stack", NULL, to_user, 0, n);
	}

	check_kernel_text_object((const unsigned long)ptr, n, to_user);
}

With this, it provides enough to block and kill classic buffer-overflow bugs. Since these helpers are inlined from the headers, this can be disabled by commenting out the check and recompiling a module.
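
Stripped of the kernel plumbing, the size comparison in check_copy_size() can be modeled like this. It's only a userland sketch: check_copy_size_model() is a made-up name, and known_size stands in for what __compiletime_object_size() would return (the object's size, or -1 when the compiler can't tell).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userland model of the size check in check_copy_size().
 * known_size: size of the kernel buffer in bytes, or -1 if unknown.
 * bytes:      number of bytes the caller asked to copy.
 */
static bool check_copy_size_model(long known_size, size_t bytes)
{
	if (known_size >= 0 && (size_t)known_size < bytes)
		return false; /* the copy would overflow the object: refuse it */
	return true;          /* size unknown, or large enough */
}
```

A 64-byte buffer passes for a 64-byte copy and fails for 65 bytes; when the size is unknown (-1) this first check lets the copy through and the runtime checks in __check_object_size() take over.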


KASLR:

Stands for Kernel Address Space Layout Randomization.
It’s similar to ASLR in userspace, which keeps the stack and heap addresses from being at the same location in two different runs(unless the attacker gets lucky :P), and to PIE, which targets the main binary’s segments: text, data and bss.

This protection randomizes the kernel segments(Exception table, text, data…) at each restart(boot); we’ve previously disabled it by passing nokaslr on the kernel command line.
In order to experiment on it, this was removed and specific symbols in /proc/kallsyms were then fetched on two different runs.
First run:
Second run:
This shows that addresses are randomly assigned at boot time to _stext and _sdata, whereas their end is just the start address plus a size that doesn’t change here(0x21dc0 for .data, 0x6184d1 for .text). Note that .data sits at a constant distance from .text.
So if the attacker gets the .text base address(through a leak), he knows the location of all the kernel symbols even with no access to kallsyms, using RVAs(or offsets), but he’ll have to compile the target kernel on his box to get them.
This is used, for example, when SMEP is on and one has to go for ROP to disable it first, then redirect execution to a shellcode placed in userspace(< TASK_SIZE).
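
The arithmetic behind this is trivial, but worth spelling out. In this sketch sym_offset() and rebase() are made-up helper names, 0xffffffff81000000 is assumed to be the text base of the nokaslr boot, and the "leaked" base in the usage note is an invented value.

```c
#include <stdint.h>

/* Offset of a symbol from the kernel text base: the same on every boot. */
static uint64_t sym_offset(uint64_t text_base, uint64_t sym_addr)
{
	return sym_addr - text_base;
}

/* Address of that symbol once the randomized text base has been leaked. */
static uint64_t rebase(uint64_t leaked_text_base, uint64_t offset)
{
	return leaked_text_base + offset;
}
```

Taking sys_call_table at 0xffffffff81c00160 from the GDB session above, its offset is 0xc00160; with a hypothetical leaked base of 0xffffffff9a000000 it would sit at 0xffffffff9ac00160.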


kptr_restrict:

This protection prevents kernel addresses from being exposed to the attacker. It stops the %pK format from dumping an address, and how it works depends on the kptr_restrict value(0, 1 or 2).

Kernel Pointers:

	%pK	0x01234567 or 0x0123456789abcdef

	For printing kernel pointers which should be hidden from unprivileged
	users. The behaviour of %pK depends on the kptr_restrict sysctl - see
	Documentation/sysctl/kernel.txt for more details.

This can be seen in kprobe_blacklist_seq_show() which performs a check with a call to kallsyms_show_value(), depending on it, it would or would not print the start and end addresses.

static int kprobe_blacklist_seq_show(struct seq_file *m, void *v)
{
	struct kprobe_blacklist_entry *ent =
		list_entry(v, struct kprobe_blacklist_entry, list);

	if (!kallsyms_show_value())
		seq_printf(m, "0x%px-0x%px\t%ps\n", NULL, NULL,
			   (void *)ent->start_addr);
	else
		seq_printf(m, "0x%px-0x%px\t%ps\n", (void *)ent->start_addr,
			   (void *)ent->end_addr, (void *)ent->start_addr);
	return 0;
}

What kallsyms_show_value() does is shown here:

int kallsyms_show_value(void)
{
	switch (kptr_restrict) {
	case 0:
		if (kallsyms_for_perf())
			return 1;
	/* fallthrough */
	case 1:
		if (has_capability_noaudit(current, CAP_SYSLOG))
			return 1;
	/* fallthrough */
	default:
		return 0;
	}
}

If the kptr_restrict value is 0, it calls kallsyms_for_perf() to check whether the sysctl_perf_event_paranoid value is smaller than or equal to 1, and returns 1 if so; if not, it falls through to the next case.
If it’s 1(or we fell through), it checks whether CAP_SYSLOG is within the user’s capabilities, and returns 1 if so.
Otherwise, it returns 0.
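
The decision can be replayed in userland to see which combinations expose addresses. A sketch only: show_value_model() is a made-up name, and its three parameters stand in for the kptr_restrict sysctl, the perf_event_paranoid sysctl and the CAP_SYSLOG capability check.

```c
#include <stdbool.h>

/* Userland model of kallsyms_show_value(): 1 = show addresses, 0 = hide. */
static int show_value_model(int kptr_restrict, int perf_event_paranoid,
			    bool has_cap_syslog)
{
	switch (kptr_restrict) {
	case 0:
		if (perf_event_paranoid <= 1)
			return 1;
		/* fall through */
	case 1:
		if (has_cap_syslog)
			return 1;
		/* fall through */
	default:
		return 0;
	}
}
```

Note that with kptr_restrict at 0 and perf_event_paranoid at 2, an unprivileged user still gets nothing, which is the case to watch out for.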

Disabling this protection can be done by setting /proc/sys/kernel/kptr_restrict content to 0.
Or using sysctl to do that:

sysctl -w kernel.kptr_restrict=0

But watch out for perf_event_paranoid too: if it’s > 1, then it needs to be adjusted.
This is an example on the default kernel run by my Debian VM:

user@Nwwz:~$ cd /proc/self
user@Nwwz:/proc/self$ cat stack
[<ffffffff81e7c869>] do_wait+0x1c9/0x240
[<ffffffff81e7d9ab>] SyS_wait4+0x7b/0xf0
[<ffffffff81e7b550>] task_stopped_code+0x50/0x50
[<ffffffff81e03b7d>] do_syscall_64+0x8d/0xf0
[<ffffffff8241244e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff

However, in the 4.17 kernel, we get this, because of perf_event_paranoid:

root@Nwwz:~# cd /proc/self
root@Nwwz:/proc/self# cat stack
[<0>] do_wait+0x1c9/0x240
[<0>] kernel_wait4+0x8d/0x140
[<0>] __do_sys_wait4+0x95/0xa0
[<0>] do_syscall_64+0x55/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
root@Nwwz:/proc/self# cat /proc/sys/kernel/kptr_restrict
root@Nwwz:/proc/self# cat /proc/sys/kernel/perf_event_paranoid

mmap_min_addr:

The mm_struct within task_struct holds a function pointer called get_unmapped_area.

struct mm_struct {
	/* ... */
	unsigned long (*get_unmapped_area) (struct file *filp,
			unsigned long addr, unsigned long len,
			unsigned long pgoff, unsigned long flags);
	/* ... */
};

It is then fetched in get_unmapped_area(), which first takes it from the mm(mm_struct), then checks the file and its file_operations for a handler of their own, or, if the MAP_SHARED flag is set, uses shmem_get_unmapped_area() instead.
Within the mm_struct, however, the default value of get_unmapped_area is the arch-specific function.
This function searches for a memory block large enough to satisfy the request, but before returning the addr, it checks that it’s bigger than or equal to mmap_min_addr, which means that no address below it will be handed out. This prevents NULL pointer dereference attacks(no mmaping the NULL address, so nothing can be stored there: shellcode, pointers…).

Disabling this protection can be done by setting /proc/sys/vm/mmap_min_addr content to 0, or using sysctl like before.

sysctl -w vm.mmap_min_addr=0
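
How a low address hint gets pushed up can be modeled in a few lines. This is a userland sketch loosely modeled on the kernel's round_hint_to_min() as I remember it; round_hint_model() and the MOCK_* macros are made up for this example, and a 4 KiB page size is assumed.

```c
#define MOCK_PAGE_SIZE 4096UL
#define MOCK_PAGE_MASK (~(MOCK_PAGE_SIZE - 1))

/*
 * Userland model: page-align the hint, and if it is non-NULL but below
 * min_addr, replace it with min_addr rounded up to a page boundary.
 */
static unsigned long round_hint_model(unsigned long hint, unsigned long min_addr)
{
	hint &= MOCK_PAGE_MASK;
	if (hint != 0 && hint < min_addr)
		return (min_addr + MOCK_PAGE_SIZE - 1) & MOCK_PAGE_MASK;
	return hint;
}
```

With min_addr at 65536, a hint of 0x1000 is moved up to 0x10000, while with min_addr at 0 (protection disabled) the same hint is left alone.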

addr_limit:

The thread(thread_struct) within the task_struct contains some important fields, and addr_limit is one of them.

typedef struct {
	unsigned long		seg;
} mm_segment_t;

struct thread_struct {
	/* ... */
	mm_segment_t		addr_limit;

	unsigned int		sig_on_uaccess_err:1;
	unsigned int		uaccess_err:1;
	/* ... */
};

This can be read with a call to get_fs(), changed with set_fs():

#define MAKE_MM_SEG(s)	((mm_segment_t) { (s) })

#define KERNEL_DS	MAKE_MM_SEG(-1UL)
#define USER_DS 	MAKE_MM_SEG(TASK_SIZE_MAX)

#define get_ds()	(KERNEL_DS)
#define get_fs()	(current->thread.addr_limit)
static inline void set_fs(mm_segment_t fs)
{
	current->thread.addr_limit = fs;
}

When userspace hands an address to the kernel, it is checked against this limit first, so overwriting it with -1UL(KERNEL_DS) would let you access(read or write) kernelspace.
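
A simplified model of that check shows why KERNEL_DS is the jackpot. This is a userland sketch: range_ok() and the MOCK_* values are made up here, with MOCK_USER_DS assumed to be TASK_SIZE_MAX on x86_64 with 4-level paging.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_USER_DS   0x00007ffffffff000ULL /* assumed TASK_SIZE_MAX */
#define MOCK_KERNEL_DS UINT64_MAX            /* -1UL */

/* Simplified model: is [addr, addr + size) entirely below the limit? */
static bool range_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
	if (size > limit)
		return false;
	return addr <= limit - size; /* written this way to avoid overflow */
}
```

A kernel address such as 0xffffffff81c00160 is refused under MOCK_USER_DS, but sails through once the limit is MOCK_KERNEL_DS.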

This was the introduction, I’ve noticed that it has grown bigger than I expected, so I stopped, and removed parts about protections, side-channel attacks and others.

Starting this was possible, thanks to: @_py(DA BEST), @pry0cc, @anon79434934, @4w1il, @ricksanchez and @Leeky.
See y’all in part 1, peace.

“nothing is enough, search more to learn more”.
~ exploit

(The_Cat) #2

First four lines are hilarious^^^
