IBI Crypter. A JIT Crypter PoC

Yesterday, I joined the 0x00sec IRC channel and, as many other times, @dtm come up with an interesting concept… and I had to try. The idea was pretty challenging and I have not completely come up with a full implementation but I manage to get a minimal Proof of Concept program to illustrate the concept and, maybe, to be used as a starting point for other people more interested in the topic than me.

The IBI Crypter

Sure, we were talking about crypters and I named mine IBI crypter. IBI stands for Instruction by Instruction Crypter, and that is the concept. Instead of decrypting the whole application at once or even a block, this crypter tries to decrypt just the instruction that has to be executed at any time. After running that instruction, the code is immediately crypted again so a live memory dump of the process will still be encrypted. This is why I’m referring to this concept as a JIT (Just in Time) crypter.

During our discussion we realised that the problem is not trivial, and I have to admit that it is more challenging that I initially expected. I’ll let you know about the issues at the end of this paper.

Before diving into the code, let me explain you how the crypter works.

An Embedded Debugger

The idea is pretty simple. We start with a program with some crypted sections. The program starts tracing itself and sets a break point at the first function to run. Once that function is executed, the program execution will stop and the embedded debugger will take control. Then it will decrypt the next instruction to run and, after that, it will execute that instruction.

Once the instruction is executed, the debugger will take control again of the process, it will crypt again the instruction just executed and repeat the process until the code is completely executed.

The concept is pretty simple but the implementation is not that trivial as we will see in a while.

To follow the rest of the paper, it may be useful for you to also read this other text:

Breakpoints on Intel Processors

As we already know how to trace a program and access its memory and registers (yes, you should read the link I have just mentioned), we only need to know how to set a break point. Again the concept itself is pretty simple, at least for the traditional way of setting breakpoints.

Breakpoints make use of the processor instruction int 3 (opcode 0xcc). This is a 1 byte processor instruction and stops the current processor execution and runs some code defined by us. So, the process to set a break point is as follows:

  • Store the content at the memory address where we want to set our break point.
  • Write the int 3 instruction in that address.
  • Run the program

Whenever the program reaches the int 3 instruction, the program will stop at that point and our code will take control and it can do whatever it needs to do, in our case, decrypt the code to execute.

Once we are done, we have to do a couple more things to restore the execution of the original program:

  • Copy back the original byte (the one we stored when we set the break point) in their original position.
  • Decrease the IP (Instruction Pointer). As you know, the IP always points to the next instruction to be executed. In this case, this is the address of our breakpoint plus 1 (the size of the int 3 instruction). We want to decrease it in order to run the original instruction (that we broke injecting the int 3 opcode).
  • Give control back to the original process, or ask the process to just run the next instruction.

Overall the concept is pretty straightforward.

SingleStep Execution

Fortunately for us, the ptrace interface offer a function to execute a single instruction, Otherwise we would have to add some code to figure out the size of the current instruction (it can be 1 to 15 bytes), in order to know where to set up our next breakpoint.

The ptrace PTRACE_SINGLESTEP will do that for us. It will just run one instruction and give us control back. This is actually the last piece of the puzzle to build our simple proof of concept.

Let’s look into it.

The Proof of Concept

We have chosen to illustrate the technique a very basic application to check the validity of a user provided code. The function that does the check will be crypted (and only that function) and it will be executed using the awesome IBI Crypter :P.

Let’s start with the main program and our check function. We have call it target.c

#include <stdio.h>
#include "stub.h"

#define CRYPT_ME __attribute__((section(".secure")))

// This function is crypted
CRYPT_ME int
check_key (unsigned char *str)
{
  int  i;
  unsigned char  *p = str;

  while (*p) {*p -= '0'; p++;};
  if (str[0] + str[1] != 5) return 1;
  if (str[2] * str[3] != 10) return 1;
  return 0;
}


int
main (int argc, char*argv[])
{
  _stub (check_key); // Setup run environment
  printf ("Code is %s\n", check_key (argv[1]) ? "INCORRECT": "CORRECT");
}

As you can see, the check_key function in the code above, do a couple of stupid checks on the key it receives as parameter, and returns 0 if the key is valid or 1 otherwise. The main function is also pretty simple. It first runs our _stub and then just prints CORRECT or INCORRECT based on the result of the check_key function.

To off-line crypter I will be using is the same described in the paper:

We are pushing the functions to secure into a separated section for easy identification (see CRYPT_ME macro). Then we XOR it, as described in the paper I have just mentioned.

The interesting stuff is in the _stub function. Let’s look at it

The _stub Function

The _stub function has two main parts. First, we set up a breakpoint in the crypted function we want to run (check_key in this case). Once we get there, we will start stepping over the function instruction by instruction.

Let’s go with the breakpoint

int
_stub (void *ep)
{
  void                    *bp_ip;
  long                    ip1, op1, op2;
  struct user_regs_struct regs;
  int                     status, cnt;

  printf ("%s", "0x00pf IbI Crypter Stub\n");

  // Start debugging!!!
  if ((_pid = fork ()) < 0) PERROR("fork:");

  if (_pid == 0) return 0;  // Child process just keeps running
  else
    {
      // Father starts debugging child
      if ((ptrace (PTRACE_ATTACH, _pid, NULL, NULL)) < 0) PERROR ("ptrace_attach:");
      printf ("%s", "+ Waiting for process...\n");
      wait (&status);

      bp_ip = ep;

      // Set breakpoint at get there...
      op1 = ptrace (PTRACE_PEEKTEXT, _pid, bp_ip);
      DPRINTF("BP: %p 1 Opcode: %lx\n", bp_ip, op1);
      if (ptrace (PTRACE_POKETEXT, _pid, bp_ip, 
		  (op1 & 0xFFFFFFFFFFFFFF00) | 0xcc) < 0) PERROR ("ptrace_poke:");

      // Run until breakpoint is reached.
      if (ptrace (PTRACE_CONT, _pid, 0, 0) < 0) PERROR("ptrace_cont:");
      wait (&status);
      
      ptrace (PTRACE_GETREGS, _pid, 0, &regs);
      DPRINTF ("Breakpoint reached: RIP: %llx\n", regs.rip);
      regs.rip--;
      ptrace (PTRACE_SETREGS, _pid, 0, &regs);

      // REstore opcode
      ptrace (PTRACE_POKETEXT, _pid, bp_ip, op1);

Hope the code is easy to understand. It creates a new process and starts debugging it. Immediately we set the breakpoint (opcode 0xcc) in the address received as parameter, that in our case is the check_key function (check the main function above).

Once the break point is set, we just let the program run using the PTRACE_CONT and we just wait for the program to hit the break point… i.e. we wait until the function we want to decrypt gets executed.

The program will eventually call the check_key function (that is actually the next line in the main function) and the _stub code will take control back. Then we have to get the IP register, decreased in one byte (as explained above), set the register value back and also restore the opcode where the int3 was inserted.

Time to run the function.

Decrypting, Running, Encoding and again

At this point we have stopped the application just at the beginning of the check_key function and in order to continue the execution we have to decrypt it as we go.

This is the code that does the trick

     // Start step by step debugging
      ip1 = (long) ep;
      cnt = 0;
      while (WIFSTOPPED (status))
	{
	  cnt ++;
	  // Read up to 16 bytes to get the longest instruction possible
	  // Decode and write back the decoded code to execute it
	  op1 = ptrace (PTRACE_PEEKTEXT, _pid, ip1);
	  op2 = ptrace (PTRACE_PEEKTEXT, _pid, ip1 + 8);
	  DPRINTF ("%lx :: OPCODES : %lx %lx\n", ip1, op1, op2);
	  
	  XOR(op1);
	  XOR(op2);

	  DPRINTF ("%lx :: DOPCODES: %lx %lx\n", ip1, op1, op2);

	  ptrace (PTRACE_POKETEXT, _pid, ip1, op1);
	  ptrace (PTRACE_POKETEXT, _pid, ip1 + 8, op2);

	  /* Make the child execute another instruction */
	  if (ptrace(PTRACE_SINGLESTEP, _pid, 0, 0) < 0) PERROR ("ptrace_singlestep:");
	  wait(&status);
	  
	  // Re-encode the instruction just executed so we do not have
	  // to count how many bytes got executed
	  XOR(op1);
	  XOR(op2);

	  ptrace (PTRACE_POKETEXT, _pid, ip1, op1);
	  ptrace (PTRACE_POKETEXT, _pid, ip1 + 8, op2);

	  // Get the new IP
	  ptrace (PTRACE_GETREGS, _pid, 0, &regs);
	  ip1 = regs.rip;

	  // If code is outside .secure section we stop debugging 
	  if ((void*)ip1 < secure_ptr || (void*)ip1 > secure_ptr + secure_len)
	    {
	      printf ("Leaving .secure section... %d instructions executed\n", cnt);
	      break;
	    }
	}

      ptrace (PTRACE_CONT, _pid, 0, 0);
      wait (&status);      
    }

  printf ("DONE\n");
  exit (1);
}

The function is a bit verbose but conceptually very simple. The XOR macro just applies the XOR encoding to a long (8 bytes) with a predefined key. You can check the details in the full source code (check at the end). At this point, it is not relevant.

If we check the format of the Intel opcodes, you will find out that a instruction, for a 64bits architecture may take up to 15 bytes. As we do not know (and we do not really want to know) anything about the next instruction to run, or in other words, we do not want to decode the opcodes ourselves in the code, then we have to decrypt up to 16 bytes to cover the longest possible opcode. Yes, in general, at a given time there are more than 1 single instruction decoded in memoru.

This is why we do two PTRACE_PEEKs to read the current address and that address plus 8 bytes (longs are 8 bytes long). Once we have read the 16 bytes that contains the next instruction to run, we just decrypt it applying our XOR macro, and we update the memory using PTRACE_POKE so, the next program instruction is now correct.

At this point we can just run the next instruction using PTRACE_SINGLESTEP and wait until the instruction is executed and control gets back to us.

Then we just need to encode again the 16 bytes of memory we decoded. This is not just to keep the program encrypted most of the time, but also to avoid some tricky logic to keep at least 16 bytes decoded in the memory program.

The final check in the while loop checks if the current IP is still in the .secure section of it has moved into other executable section… more on this at the end of the text.

Testing

Testing the program is not that straightforward. For the time being we have to do some manual tasks to make it work. It is not hard to fully automatise the process so I leave it as an exercise to you :stuck_out_tongue:

First we have to compile the target program:

$ gcc -o target target.c stub.c

Then we have to crypt the .secure section, using the crypter_rt tool (provided with the code).

$ ./crypter_rt target

Finally we need to manually fix the section information in stub.c and redo the process. To get the information stub needs about the section just run this command:

$ readelf -S target | grep secure
  [14] .secure           PROGBITS         0000000000400e22  00000e22

The two last number in this line has to go to the variables secure_ptr and secure_len in stub.c. This information is used to figure out when the execution leaves the secure section and matches not crypted code.

Recompile and rebuild and you should be able to run the program like this:

$ ./target 1425
0x00pf B3Crypt Stub (Byte By Byte)
+ Waiting for process...
Leaving .secure section... 74 instructions executed
Code is CORRECT
DONE
 ./target 1426
0x00pf B3Crypt Stub (Byte By Byte)
+ Waiting for process...
Leaving .secure section... 75 instructions executed
Code is INCORRECT
DONE

The Tricky part

This example is very simple on purpose. Making a usable version will require some effort and I do not really have a need for this tool so I do not think I will go further on this implementation. However these are a couple of things to do, in order to extend this PoC into a usable tool

  • The first thing to do is to extend the stub to access the ELF header and get the information associated to the .secure section, so you do not need to update the source code to re-compile
  • Second is more tricky. You have to detect jumps/calls to functions outside the .secure section, as for instance the standard C library (have you seen a single printf in the check_key function ?). In those cases we have to set a break point just after the call in order to run the function without decoding it (those functions are not encoded) and to restart the decoding when the function returns… The last check in the function may give you some hints on how to proceed.
  • I haven’t extensively tested the program. I just made it work on my machine, it is not optimised and it may have some timing issues.
  • This only works on 64bits platforms… should be easy to make it work on 32bits… but in any case it is x86 specific. For ARM or MIPS you need to figure out how to set breakpoints.

Well, this is it. I think this proves the concept is feasible and it is up to you to make it work. I see this more as a SW protection mechanism than as a malware development technique… WTF… they are the same thing :stuck_out_tongue:

As usual you can get the complete source code from my github repo

Any comment is welcomed

16 Likes

Nice concept! Maybe you could put it on our 0x00sec Gitlab? :smiley:

1 Like

Thanks @SmartOne.

All my code is GPL so anybody can use/modify/redistribute it. Be free to put it anywhere you want and modify it as you wish. Just, please, honour the GPL license and release publicly any modification you do to the code also as a GPL module. That’s it.

2 Likes

Hell yea, getting jiggy with the jit.

1 Like

This is going to take some serious re-reading. Great post, as usual… I wish I was there for the IRC brainstorming.

2 Likes

I’m gonna have to read this like 4 more times to digest it, but I did have one concern.

There is a very limited set of instructions. This small vocabulary presents a very small search space for attacking the cipher text. Further, repeatedly encrypting and decrypting the same message (the instruction) over and over again makes the cipher increasingly vulnerable to attack. How do you manage the fact that there is such a small search space and that the cipher is used over and over to reencrypt the same few words?

2 Likes

Hey @fraq,

Thanks for reading and for the interesting comments.

First of all, let me say that this was just a PoC. I didn’t though much further, but we can do it now. So these are my first thoughts to kick off the discussion.

I do not think the search space is that small, but I haven’t checked it so I may be wrong. The set of instruction is limited, but for a CISC processor as intel it may be around a few thousands (opcode wise) that is probably more that the number of words many people use in their everyday communication (and it is definitely longer that the number of letters we use in a normal message). On top of that, many instructions have data (offsets to jump to, values to store in registers,…) that, I think, actually increases the search space.

Said that, I think we have to look at this within its context. This was intended to be used as a crypter. Normal crypters need a plain decrypting function (at least the first one, for a multi-level crypter) and they also need the key stored somewhere or at least a way to figure it out/generate it without a user entering it. The program has to run on its own. In this context we should compare to other crypters not to a general cryptographic algorithms.

If we move to the forensics side, one typical way to get the decoded payload (the program part that is crypted), is to run the program, let the decryption function do its job and then dump the memory. That way you do not have to care about the cryptography behind and you do not have to reverse a potentially heavily obfuscated stub. In that sense, this approach avoids such an analysis and, cryptographically wise, I do not see, at first glance, a big difference when compared to a classical crypter.

I cannot assess if this re-encoding approach makes a cryptoanalysis attack more successful. I cannot figure out how to perform it. Haven’t tried but in principle you cannot attach more than one debugger to a binary (that is a classical anti-debug technique), the fact that the crypter uses the debugger functions makes difficult a dynamic analysis, at least a traditional/straightforward one, or even checking the process memory to get the information required for a cryptoanalysis. There is probably a way but it might not be as easy as running the program step by step with gdb,

I hope I have properly understood your question (even when I’m aware I haven’t explicitly answer it) and that this text shows some light on the topic. I would be curious about knowing how could you implement the attack/analysis you mentioned. Just because I’m not an expert myself on cryptography and I cannot figure it out myself.

3 Likes

One more thing:

Does this PoC make a new process for every 16 bytes? If so: Isn’t that costly? Is there a way around this?

1 Like

No. It creates a process at the beginning and starts “debugging” it. Not sure if that can be avoided. After that is works like a normal debugger when you run the program step by step, but decrypting the next 16 bytes (starting at current RIP value) in memory before running each step and crypting them back once the instruction is executed.

2 Likes

Thank you for sharing , great post

1 Like