Yesterday, I joined the 0x00sec IRC channel and, as many other times, @dtm come up with an interesting concept… and I had to try. The idea was pretty challenging and I have not completely come up with a full implementation but I manage to get a minimal Proof of Concept program to illustrate the concept and, maybe, to be used as a starting point for other people more interested in the topic than me.
The IBI Crypter
Sure, we were talking about crypters and I named mine IBI crypter. IBI stands for Instruction by Instruction Crypter, and that is the concept. Instead of decrypting the whole application at once or even a block, this crypter tries to decrypt just the instruction that has to be executed at any time. After running that instruction, the code is immediately crypted again so a live memory dump of the process will still be encrypted. This is why I’m referring to this concept as a JIT (Just in Time) crypter.
During our discussion we realised that the problem is not trivial, and I have to admit that it is more challenging that I initially expected. I’ll let you know about the issues at the end of this paper.
Before diving into the code, let me explain you how the crypter works.
An Embedded Debugger
The idea is pretty simple. We start with a program with some crypted sections. The program starts tracing itself and sets a break point at the first function to run. Once that function is executed, the program execution will stop and the embedded debugger will take control. Then it will decrypt the next instruction to run and, after that, it will execute that instruction.
Once the instruction is executed, the debugger will take control again of the process, it will crypt again the instruction just executed and repeat the process until the code is completely executed.
The concept is pretty simple but the implementation is not that trivial as we will see in a while.
To follow the rest of the paper, it may be useful for you to also read this other text:
Breakpoints on Intel Processors
As we already know how to trace a program and access its memory and registers (yes, you should read the link I have just mentioned), we only need to know how to set a break point. Again the concept itself is pretty simple, at least for the traditional way of setting breakpoints.
Breakpoints make use of the processor instruction int 3
(opcode 0xcc). This is a 1 byte processor instruction and stops the current processor execution and runs some code defined by us. So, the process to set a break point is as follows:
- Store the content at the memory address where we want to set our break point.
- Write the
int 3
instruction in that address. - Run the program
Whenever the program reaches the int 3
instruction, the program will stop at that point and our code will take control and it can do whatever it needs to do, in our case, decrypt the code to execute.
Once we are done, we have to do a couple more things to restore the execution of the original program:
- Copy back the original byte (the one we stored when we set the break point) in their original position.
- Decrease the
IP
(Instruction Pointer). As you know, theIP
always points to the next instruction to be executed. In this case, this is the address of our breakpoint plus 1 (the size of theint 3
instruction). We want to decrease it in order to run the original instruction (that we broke injecting theint 3
opcode). - Give control back to the original process, or ask the process to just run the next instruction.
Overall the concept is pretty straightforward.
SingleStep Execution
Fortunately for us, the ptrace
interface offer a function to execute a single instruction, Otherwise we would have to add some code to figure out the size of the current instruction (it can be 1 to 15 bytes), in order to know where to set up our next breakpoint.
The ptrace
PTRACE_SINGLESTEP
will do that for us. It will just run one instruction and give us control back. This is actually the last piece of the puzzle to build our simple proof of concept.
Let’s look into it.
The Proof of Concept
We have chosen to illustrate the technique a very basic application to check the validity of a user provided code. The function that does the check will be crypted (and only that function) and it will be executed using the awesome IBI Crypter :P.
Let’s start with the main program and our check function. We have call it target.c
#include <stdio.h>
#include "stub.h"
#define CRYPT_ME __attribute__((section(".secure")))
// This function is crypted
CRYPT_ME int
check_key (unsigned char *str)
{
int i;
unsigned char *p = str;
while (*p) {*p -= '0'; p++;};
if (str[0] + str[1] != 5) return 1;
if (str[2] * str[3] != 10) return 1;
return 0;
}
int
main (int argc, char*argv[])
{
_stub (check_key); // Setup run environment
printf ("Code is %s\n", check_key (argv[1]) ? "INCORRECT": "CORRECT");
}
As you can see, the check_key
function in the code above, do a couple of stupid checks on the key it receives as parameter, and returns 0 if the key is valid or 1 otherwise. The main function is also pretty simple. It first runs our _stub
and then just prints CORRECT
or INCORRECT
based on the result of the check_key
function.
To off-line crypter I will be using is the same described in the paper:
We are pushing the functions to secure into a separated section for easy identification (see CRYPT_ME
macro). Then we XOR it, as described in the paper I have just mentioned.
The interesting stuff is in the _stub
function. Let’s look at it
The _stub
Function
The _stub
function has two main parts. First, we set up a breakpoint in the crypted function we want to run (check_key
in this case). Once we get there, we will start stepping over the function instruction by instruction.
Let’s go with the breakpoint
int
_stub (void *ep)
{
void *bp_ip;
long ip1, op1, op2;
struct user_regs_struct regs;
int status, cnt;
printf ("%s", "0x00pf IbI Crypter Stub\n");
// Start debugging!!!
if ((_pid = fork ()) < 0) PERROR("fork:");
if (_pid == 0) return 0; // Child process just keeps running
else
{
// Father starts debugging child
if ((ptrace (PTRACE_ATTACH, _pid, NULL, NULL)) < 0) PERROR ("ptrace_attach:");
printf ("%s", "+ Waiting for process...\n");
wait (&status);
bp_ip = ep;
// Set breakpoint at get there...
op1 = ptrace (PTRACE_PEEKTEXT, _pid, bp_ip);
DPRINTF("BP: %p 1 Opcode: %lx\n", bp_ip, op1);
if (ptrace (PTRACE_POKETEXT, _pid, bp_ip,
(op1 & 0xFFFFFFFFFFFFFF00) | 0xcc) < 0) PERROR ("ptrace_poke:");
// Run until breakpoint is reached.
if (ptrace (PTRACE_CONT, _pid, 0, 0) < 0) PERROR("ptrace_cont:");
wait (&status);
ptrace (PTRACE_GETREGS, _pid, 0, ®s);
DPRINTF ("Breakpoint reached: RIP: %llx\n", regs.rip);
regs.rip--;
ptrace (PTRACE_SETREGS, _pid, 0, ®s);
// REstore opcode
ptrace (PTRACE_POKETEXT, _pid, bp_ip, op1);
Hope the code is easy to understand. It creates a new process and starts debugging it. Immediately we set the breakpoint (opcode 0xcc
) in the address received as parameter, that in our case is the check_key
function (check the main
function above).
Once the break point is set, we just let the program run using the PTRACE_CONT
and we just wait for the program to hit the break point… i.e. we wait until the function we want to decrypt gets executed.
The program will eventually call the check_key
function (that is actually the next line in the main
function) and the _stub
code will take control back. Then we have to get the IP
register, decreased in one byte (as explained above), set the register value back and also restore the opcode where the int3
was inserted.
Time to run the function.
Decrypting, Running, Encoding and again
At this point we have stopped the application just at the beginning of the check_key
function and in order to continue the execution we have to decrypt it as we go.
This is the code that does the trick
// Start step by step debugging
ip1 = (long) ep;
cnt = 0;
while (WIFSTOPPED (status))
{
cnt ++;
// Read up to 16 bytes to get the longest instruction possible
// Decode and write back the decoded code to execute it
op1 = ptrace (PTRACE_PEEKTEXT, _pid, ip1);
op2 = ptrace (PTRACE_PEEKTEXT, _pid, ip1 + 8);
DPRINTF ("%lx :: OPCODES : %lx %lx\n", ip1, op1, op2);
XOR(op1);
XOR(op2);
DPRINTF ("%lx :: DOPCODES: %lx %lx\n", ip1, op1, op2);
ptrace (PTRACE_POKETEXT, _pid, ip1, op1);
ptrace (PTRACE_POKETEXT, _pid, ip1 + 8, op2);
/* Make the child execute another instruction */
if (ptrace(PTRACE_SINGLESTEP, _pid, 0, 0) < 0) PERROR ("ptrace_singlestep:");
wait(&status);
// Re-encode the instruction just executed so we do not have
// to count how many bytes got executed
XOR(op1);
XOR(op2);
ptrace (PTRACE_POKETEXT, _pid, ip1, op1);
ptrace (PTRACE_POKETEXT, _pid, ip1 + 8, op2);
// Get the new IP
ptrace (PTRACE_GETREGS, _pid, 0, ®s);
ip1 = regs.rip;
// If code is outside .secure section we stop debugging
if ((void*)ip1 < secure_ptr || (void*)ip1 > secure_ptr + secure_len)
{
printf ("Leaving .secure section... %d instructions executed\n", cnt);
break;
}
}
ptrace (PTRACE_CONT, _pid, 0, 0);
wait (&status);
}
printf ("DONE\n");
exit (1);
}
The function is a bit verbose but conceptually very simple. The XOR
macro just applies the XOR encoding to a long
(8 bytes) with a predefined key. You can check the details in the full source code (check at the end). At this point, it is not relevant.
If we check the format of the Intel opcodes, you will find out that a instruction, for a 64bits architecture may take up to 15 bytes. As we do not know (and we do not really want to know) anything about the next instruction to run, or in other words, we do not want to decode the opcodes ourselves in the code, then we have to decrypt up to 16 bytes to cover the longest possible opcode. Yes, in general, at a given time there are more than 1 single instruction decoded in memoru.
This is why we do two PTRACE_PEEK
s to read the current address and that address plus 8 bytes (longs are 8 bytes long). Once we have read the 16 bytes that contains the next instruction to run, we just decrypt it applying our XOR
macro, and we update the memory using PTRACE_POKE
so, the next program instruction is now correct.
At this point we can just run the next instruction using PTRACE_SINGLESTEP
and wait until the instruction is executed and control gets back to us.
Then we just need to encode again the 16 bytes of memory we decoded. This is not just to keep the program encrypted most of the time, but also to avoid some tricky logic to keep at least 16 bytes decoded in the memory program.
The final check in the while
loop checks if the current IP
is still in the .secure
section of it has moved into other executable section… more on this at the end of the text.
Testing
Testing the program is not that straightforward. For the time being we have to do some manual tasks to make it work. It is not hard to fully automatise the process so I leave it as an exercise to you
First we have to compile the target program:
$ gcc -o target target.c stub.c
Then we have to crypt the .secure
section, using the crypter_rt
tool (provided with the code).
$ ./crypter_rt target
Finally we need to manually fix the section information in stub.c
and redo the process. To get the information stub
needs about the section just run this command:
$ readelf -S target | grep secure
[14] .secure PROGBITS 0000000000400e22 00000e22
The two last number in this line has to go to the variables secure_ptr
and secure_len
in stub.c
. This information is used to figure out when the execution leaves the secure section and matches not crypted code.
Recompile and rebuild and you should be able to run the program like this:
$ ./target 1425
0x00pf B3Crypt Stub (Byte By Byte)
+ Waiting for process...
Leaving .secure section... 74 instructions executed
Code is CORRECT
DONE
./target 1426
0x00pf B3Crypt Stub (Byte By Byte)
+ Waiting for process...
Leaving .secure section... 75 instructions executed
Code is INCORRECT
DONE
The Tricky part
This example is very simple on purpose. Making a usable version will require some effort and I do not really have a need for this tool so I do not think I will go further on this implementation. However these are a couple of things to do, in order to extend this PoC into a usable tool
- The first thing to do is to extend the
stub
to access the ELF header and get the information associated to the.secure
section, so you do not need to update the source code to re-compile - Second is more tricky. You have to detect jumps/calls to functions outside the
.secure
section, as for instance the standard C library (have you seen a singleprintf
in thecheck_key
function ?). In those cases we have to set a break point just after the call in order to run the function without decoding it (those functions are not encoded) and to restart the decoding when the function returns… The last check in the function may give you some hints on how to proceed. - I haven’t extensively tested the program. I just made it work on my machine, it is not optimised and it may have some timing issues.
- This only works on 64bits platforms… should be easy to make it work on 32bits… but in any case it is x86 specific. For ARM or MIPS you need to figure out how to set breakpoints.
Well, this is it. I think this proves the concept is feasible and it is up to you to make it work. I see this more as a SW protection mechanism than as a malware development technique… WTF… they are the same thing
As usual you can get the complete source code from my github repo
Any comment is welcomed