ARM Stack Unwinding

Languages like C++ and Java have very useful facilities that allow a stack trace to be collected and displayed in a variety of ways. In Java, a snapshot of the current stack trace can be taken simply by constructing a Throwable object, and the trace can be displayed using the printStackTrace() method.

    Throwable t = new Throwable();

    t.printStackTrace(System.out);

C++ offers similar facilities, and because of this, both these languages can provide useful information when an unexpected failure occurs and is detected. For example, assertions maybe placed in the code with the failure action being to display the current stack trace to help the programmer debug the cause.

Unfortunately C offers no such inbuilt luxury, and as such, debugging without a debugger or other logging mechanism maybe a little more difficult in the first instance. This interests me as I've worked as an embedded engineer creating software for consumer devices which are extensively field tested containing code that makes extensive use of assertion macros. I decided to investigate stack unwinding in order to enable better information to be gathered from devices that have failed in the field, with a goal to supporting stack tracing without the need for expensive or cumbersome supporting hardware.

Therefore I decided to make an ARM stack unwinder that would be suitable to run on an embedded target to provide stack trace capabilities for C, similar to those already enjoyed by other languages.

Design Consideration

Since the target is an embedded processor in a consumer device, there are some restrictions on how the solution can be engineered and what options are available. At the same time, knowing the target processor is likely to be an ARM7 or similar processor using the ARM and Thumb Instruction Set Architectures allows the solution to be targeted specifically to this family of RISC processors.

Since storage is at a premium on these devices, I decided to forego use of debug tables as these would need embedding on the target and would be large, even if compressed. Since code runs from FLASH, I also have to be careful that the solution does not try and patch any code - one stack unwinding approach described in the ARM APCS document is to 'patch' function entries and exits and to then execute code to cause the stack adjustments to be made and stack frames unwound. (Patching code in FLASH is not easy since FLASH can generally only be erased in blocks and then written once i.e. it is not truly random access. Additionally erasing any block of FLASH could potentially permanently damage the device until it is reprogrammed, so that is a risk and complexity best avoided if at all possible).

Finally the form factor devices usually don't bring out connectors for debugging such as JTAG. This may sound weird, but all the pins on the package inside the device have a use and JTAG is usually multiplexed with something that is generally more useful, and exposing debug interfaces is also considered a security issue and frowned upon in the industry. Additionally having less external connectors makes Electrostatic Discharge (ESD) protection simpler as well as having other small benefits. And in any case, Java requires no debugger to grab a stack trace, so why should C?

Method

Given the above restrictions, I decided that the best approach is to write a small model of the ARM processor which can interpret the code and look for the tell-tale signs of functions returning to divine the call stack. Yup, I decided to write a model ARM processor to run on the ARM to interpret the code with the aim of unwinding the stack, as opposed to producing bit exact interpretation of the code. Implementing the model ARM provides a couple of challenges that are laid out in the following subsections.

Function Epilogs

All functions look pretty much the same. They have a prologue, a main body and an epilogue. The compiler generates the prologue and epilogue on almost all functions (IRQ handling functions can be an explicit exception to this) to setup and teardown the stack frame used during the function body for things such as local variables. Debug tables can give details about the location of function prologues and epilogues to assist debuggers in interpreting the stack at any time, but I don't have these, so have to locate them automatically.

Since this is targeted at ARM processors, only ARM functions need to be considered. Looking at a few epilogues, it can be seen that they generally look the same and take the format shown in the following examples:

ADD      sp,#0x28
POP      {r4-r6,pc}

ADD      sp,sp,#0x28
LDMFD    sp!,{r4-r6,pc}

The examples show a stack adjustment to remove storage allocated for locals, then a restoration of the corrupted registers as required by the ARM ABI, and the restoration of the program counter (PC) from the stack effectively executing the return. The return is very similar and can be detected easily regardless of processor operating mode (Thumb or ARM), although this raises an interesting question. The examples show a return that is only suitable if the function is always called from code in the same operating mode as the function itself i.e. the Thumb return code can only be used if returning to Thumb code, and the same is also true for ARM code. Practically it is sometimes the case that ARM code may call Thumb functions and vice versa, and this is known as interworking. Fortunately the ARM ABI also describes interworking and more examples can be generated to show function epilogues used when interworking is required.

ADD      sp,#0x28
POP      {r4-r6}
POP      {r3}
BX       r3

ADD      sp,sp,#0x28
LDMFD    sp!,{r4-r6,lr}
BX       lr

With interworking, the BX instruction is used to enable the processor mode to be changed at the same time as returning from the function. In each case, the return address is restored to a register before being used with the BX instruction to cause the return, where the least significant bit of the return address is used to indicate the desired processor mode once the branch has been taken. So we now have a good idea of what we need to detect in order to determine where a function exits - the reading of a return address from the stack, it being loaded into the PC either directly by the load, or via a BX instruction.

The first thing in the model ARM is therefore to store not only register contents, but a bit of data about where the contents originated from. To do this, I've made a couple of types for my model ARM to use.

typedef enum
{
    REG_VAL_INVALID      = 0x00,
    REG_VAL_FROM_STACK   = 0x01,
    REG_VAL_FROM_MEMORY  = 0x02,
    REG_VAL_FROM_CONST   = 0x04,
    REG_VAL_ARITHMETIC   = 0x80
}
RegValOrigin;

typedef struct
{
    Int32              v;
    RegValOrigin       o;
}
RegData;

Creating an array of RegData structures then allows the register file to be emulated. The Program Counter (PC) and Stack Pointer (SP) can be added to the register file to give the model a basis to interpret code.

At this point, two basic loops are added to the model - one to interpret Thumb code, and one to interpret ARM code and decoding for the POP and LDMFD instructions are added respectively. When a value is loaded to a register, the RegValOrigin can now be updated to indicate that the data originated on the stack, REG_VAL_FROM_STACK, making it easy to spot a function returning when a BX is encountered for such a tracked register. Interpretation of BX is therefore added to both the ARM and Thumb modes, faithfully checking the LSB of the branch address to change the interpretation mode between ARM and Thumb as needed.

At this point, the model ARM is capable of detecting the return from a function. This is a good start, but it needs to handle the stack adjustment if it is to be able to unwind more than one stack frame. In the examples seen so far, the stack adjust has just been the addition of 0x28 to the stack pointer, although this will not always be the case. Depending on the amount of stack data utilised by a function, the stack adjust maybe for a different value, and unfortunately not all adjustments can be accommodated by a single instruction. The following shows an odd C function and the generated assembly.

int testStackResize(void)
{
    char biggie[0x81111];
    char *c = biggie;
    int  t;

    sprintf(biggie, "Hello");

    t = 0;

    while(*c)
    {
        t += *c;
        c++;
    }

    runFunc();
    return t;
}

testStackResize PROC
        LDR      r3,|L1.364| + 56
        PUSH     {r4,r5,lr}
        ADD      sp,r3
        MOV      r0,sp
        MOV      r4,sp
        ADR      r1,|L1.364| + 60
        BL       __0sprintf
        MOV      r5,#0
        B        |L1.202|
|L1.198|
        ADD      r5,r0,r5
        ADD      r4,#1
|L1.202|
        LDRB     r0,[r4,#0]
        CMP      r0,#0
        BNE      |L1.198|
        LDR      r0,|L1.364| + 68
        LDR      r0,[r0,#0]  ; runFunc
        BL       __call_via_r0
        LDR      r3,|L1.364| + 56
        MOV      r0,r5
        NEG      r3,r3
        ADD      sp,r3
        POP      {r4,r5}
        POP      {r3}
        BX       r3

This fictional function is far from pretty, and the epilogue is somewhat more complicated. The important instructions are the LDR into r3 from a constant memory address, then the NEG operation before the stack adjust. Interlaced with this is an instruction to move the function's return value into r0 in accordance with the ABI. This requires the model ARM not only to interpret a number of new instructions, but also provokes thought about how the compiler generates awkward constant values.

In ARM and Thumb mode there are a number of ways in which to generate a constant value, and the model ARM will need to be able to interpret them all. Worse still is the possibility that the desired constant, or some part of it, has already been created for use by the function body an that the compiler will use this already constructed value. This means that the model not only needs to be able to interpret any instructions that can be used to generate constant values, but that it needs to be able to look outside the function epilogue and into the function body too. Therefore the model ARM becomes yet more sophisticated and now attempts to interpret every instruction of the program, starting at the PC and SP values from where the stack trace is required, and stopping when some to be determined criteria is met.

Clearly it is desirable for the model ARM is to remain small, and so only the subset of instructions that are needed for stack unwinding should be implemented. The required instructions that have been identified so far are those that are involved in constant value generation, stack adjusting and returning. The question is what to do when an instruction that is not understood is encountered. The solution to this is simple - invalidate all state and continue the interpretation! As seen earlier, the registers also have a status attached to their value, one status being REG_VAL_INVALID. Upon an uninterpreted instruction being found, all registered values, with the exception of the PC and SP are therefore invalidated.

Now that register values can be invalid, the rules of arithmetic also have to change to propagate this meta data. For example, a simple addition of two registers to yield a value in a third should produce a result with status REG_VAL_INVALID if either of the inputs is invalid. Additionally a register MOV should copy not only the register value, but the status too. The following code fragment from the ARM interpreting loop shows how the propagation of the register status data is handled for Data Processing instructions.

/* Propagate register validity */
switch(arithOp)
{
    case  0: /* AND: Rd := Op1 AND Op2 */
    case  1: /* EOR: Rd := Op1 EOR Op2 */
    case  2: /* SUB: Rd:= Op1 - Op2 */
    case  3: /* RSB: Rd:= Op2 - Op1 */
    case  4: /* ADD: Rd:= Op1 + Op2 */
    case 12: /* ORR: Rd:= Op1 OR Op2 */
    case 14: /* BIC: Rd:= Op1 AND NOT Op2 */
        if(!M_IsOriginValid(state->regData[rn].o) ||
           !M_IsOriginValid(op2origin))
        {
            state->regData[rd].o = REG_VAL_INVALID;
        }
        else
        {
            state->regData[rd].o = state->regData[rn].o;
            state->regData[rd].o |= op2origin;
        }
        break;
    case  5: /* ADC: Rd:= Op1 + Op2 + C */
    case  6: /* SBC: Rd:= Op1 - Op2 + C */
    case  7: /* RSC: Rd:= Op2 - Op1 + C */
        /* CPSR is not tracked */
        state->regData[rd].o = REG_VAL_INVALID;
        break;

    case  8: /* TST: set condition codes on Op1 AND Op2 */
    case  9: /* TEQ: set condition codes on Op1 EOR Op2 */
    case 10: /* CMP: set condition codes on Op1 - Op2 */
    case 11: /* CMN: set condition codes on Op1 + Op2 */
        break;


    case 13: /* MOV: Rd:= Op2 */
    case 15: /* MVN: Rd:= NOT Op2 */
        state->regData[rd].o = op2origin;
        break;
}

Code 2: Propagation of register state in interpretation of ARM Data Processing instruction.

Now that the model is attempting to interpret all instructions, the handling of conditional code needs to be considered. Specifically the model must meet the following requirements:

ARM instructions employ conditional guarding, meaning that condition codes can be attached to most instructions such that they are only executed if the condition is met. Thumb mode uses conditional branch instructions, BNE BEQ etc..., to achieve a similar goal, and in both cases the Status Register (SR) holds the condition flags which determine if a branch is taken or an ARM instruction executed. Tracking the SR would apply overhead and make the model more complex, as will any sort of branch analysis to find infinite loops and function exits. Therefore I make the following assumptions, to simplify the model ARM.

It seems highly unlikely that the function epilogue will contain conditional code and the 'stack moves once' rule of the ABI means that there is no risk of needing to conditionally correct the stack depending on the path taken through the function. Unconditional branches must always be taken since without them it is possible that interpretation could wander into another function or data area. Ignoring conditional branches also greatly simplifies the interpreter, but introduces a risk that some loops may appear infinite, as the following example shows.

int loop()
{
    while(1)
    {
        int v = getch();

        if(v == EOF)      { break; }
        else if(v == 10)  { printf("\n"); }
        else              { printf("%c", v); }
    }

    return funcB();
}

loop PROC
        PUSH     {r4,lr}
|L1.2|
        BL       getch
        CMP      r0,#0
        BEQ      |L1.32|
        CMP      r0,#0xa
        BNE      |L1.22|
        ADR      r0,|L1.36|
        BL       __0printf
        B        |L1.2|
|L1.22|
        MOV      r1,r0
        ADR      r0,|L1.36| + 4
        BL       __0printf
        B        |L1.2|
|L1.32|
        POP      {r4,pc}
        DCW      0000
|L1.36| DATA
        DCB      "\n\0\0\0"
        DCB      "%c\0\0"
        ENDP

In the above example, the BEQ needs to be taken in order to reach the function epilogue. Without understanding of the Status Register, the model ARM cannot do this, so instead gets stuck in the loop - this is the first caveat of the scheme. Accepting for the moment that this type of construct may occur (although I personally would try to avoid writing such C code as it misses the purpose of the while statement!), a scheme for detecting an infinite loop is required. I opt simply to count the number of instructions since a function return was discovered, and to stop the interpretation if some predefined limit is exceeded. This is very simple, and has little overhead, although may take longer to determine that unwinding is stuck than a more analytical approach would allow.

At this point, the model ARM should be capable of unwinding most stacks, has the ability to interpret all the code that could appear in a function epilogue and can interpret and detect returning to a register value sourced from the stack. The interpreter has a simple method to blunder into most function epilogues and also has a primitive method of detecting when it is stuck in an infinite loop. A little polish can be applied to trap cases such as the branching to a register whose value is invalid and the scheme is basically working. However, there are a couple of surprises yet...

Function Prologues

So far the model ARM has been built to concentrate on unwinding the stack frames by interpretation of code leading up to and including the function epilogues - the prologue has not needed consideration. However, there is an optimisation that can be supplied by a compiler that causes prologues to become significant. The optimisation is 'tail calling' but is not specific to ARM architectures.

Tail calling is when a function always calls another function as the last thing that it does before returning. The compiler can spot this pattern and instead of generating return code from the first function, it can call the second function in such a way that its return code will return to the original caller.

void tailCall(int v)
{
    v *= v;
    printf("%d", v);
    tailFunc(v);
}

tailCall PROC
        STMFD    sp!,{r4,lr}
        MUL      r4,r0,r0
        MOV      r1,r4
        ADR      r0,|L1.524|
        BL       __0printf
        MOV      r0,r4
        LDMFD    sp!,{r4,lr}
        B        tailFunc
        ENDP

In this case, the Link Register (LR) is restored from the stack, but instead of the commonly seen BX, an unconditional branch is made to tailFunc(), such that tailFunc() will instead return to the value placed in the LR. It's a small optimisation that saves a word and a few cycles, but it complicates the interpretation performed by the model ARM.

To accommodate this, the model ARM must either be aware of tail calling and ignore it, or must additionally be able to interpret function prologues. Detecting the tail call would not be impossible, the pattern or restoring the Link Register from the stack and then unconditionally branching is detectable, but there could be a small risk of misdetection if the LR were used in a function body for any purpose such as temporary storage or arithmetic.

Interpreting a function prologue is much the same as interpreting an epilogue and the same instructions will be used to generate the constant value for the stack adjust. However, whereas previously all values were being read from memory and the stack, some values are now stored to the stack to save state before the function executes. This could potentially damage the stack on an executing system, so a small hash table to store memory addresses and their values is implemented to store stack data instead; before reading from memory the hash table is inspected, and if a value is found it is used in place of the value from the device memory. Since function prologues are likely to start with a PUSH or STMFD instruction, there is also the possibility of an invalid register value being stored to memory, so the memory hash has to allow for storage of some state data to prevent an invalid register value becoming valid if it is PUSHed and then POPed from the stack. Finally, to prevent the memory hash needing to be large or risk overflowing, it is periodically purged of data stored at addresses that are above the current top of the stack.

Caveats

And there we have it - a scheme performing a kind of abstract interpretation of ARM or Thumb code in order to unwind the stack frames. However, while small, this method of abstract interpretation is not quite perfect. There are a number of situations where it will not work, although the general case so far suggests that it works very well in practice. Still, there are limitations, and these are best listed.

The problem of infinite loops could be dealt with by adding a random element to interpretation. For example, if the model suspects itself to be stuck in an infinite loop (a large number of instructions have been interpreted with no function epilogue being found), it may start randomly taking conditional branches in an attempt to 'chance upon' the function epilogue. While more sophisticated methods of exiting infinite loops, such as tracking the PC and marking branch history, could be implemented, they would require more memory and complexity for something that has rarely been found to cause a problem in practical usage of the unwinder.

A greater problem with no solution is that of tail calling masking functions from the unwound stack. If the interpretation is started from a function that was tail called, the function that made the tail call will be omitted from the unwound stack. In example 8, unwinding started from tailFunc() or a sub-function thereof would omit to report tailCall() since the return path would not pass through that function. In hindsight it may have been better to run the model ARM backwards, although this presents different problems.

Implementation

The following shows the amount of code and data occupied by the unwinder when compiler using RVCT2.1, compiled to Thumb code with -O2. The totals show that under 3k of ROM is used to implement the model ARM, which is very acceptable for my application.

Code	(inc. data)	Library Member Name
44	0	unwarminder.o
182	0	unwarm.o
904	68	unwarm_arm.o
1198	60	unwarm_thumb.o
300	0	unwarmmem.o
2628	128	Totals

The unwinding code is implemented in a handful of files, and a single header file named unwarminder.h needs to be included to access the functionality. Accessing to the system memory from the unwinder is abstracted through callbacks that must be implemented by the 'client' code, and this allows reads to be validated such that unwinding is stopped if alignment or the address being read is at fault. The client code passes a small structure of function pointers to the unwinder to equip it with the callbacks required to read memory and report return addresses.

UnwResult UnwindStart(Int32                  spValue,
                      const UnwindCallbacks *cb,
                      void                  *data);

The cb structure gives the callbacks to allow memory accesses and return address reporting, while the data pointer can take any value and is passed to the reporting function (cb->report()) such that it may store state if required.

The spValue gives the stack pointer value at which to start unwinding; the PC value is determined automatically as it is effectively passed to the function via the Link Register and so can be retrieved. When using RVCT, the compiler intrinsic function __current_sp() allows the SP value to be read into a variable, so a call to start unwinding typically looks something like the following:

const UnwindCallbacks cliCallbacks = { ... };
CliStack              results;
Int8                  t;
UnwResult             r;

results.frameCount = 0;
r = UnwindStart(__current_sp(), &cliCallbacks, &results);

Code 4: Typical call to start unwinding, passing a pointer to some structure that lists all the callbacks, as well as a pointer to local storage.

Finally, the implementation is not aware of any OS or memory protection or management schemes that maybe in use on the target. The system on which this has been tested is simply configured with a flat memory map and has few restrictions on memory access, and the RTOS used poses no restrictions either. It maybe the case that to use this on other targets the MMU or MPU has to be reconfigured or disabled before unwinding is started, or the functions used by the unwinder to access the memory specially constructed to ensure that memory protections will not cause a problem. Should the unwinder request access to addresses that are genuinely invalid, the client functions for memory access can return FALSE to indicate that the memory cannot be accessed, and unwinding will terminate.

Licence and Download

The source code for the stack unwinder is available for free download and I'm making it PUBLIC DOMAIN. This means that there is no copyright and anyone is able to take a copy for free and use it as they wish, with or without modifications, and in any context they like, commercially or otherwise. The only limitation is that I don't guarantee that the software is fit for any purpose or accept any liability for its use or misuse - the software is without warranty.

Having said all this, the software has been ran under Valgrind and tested both in PC simulations (using ARMSD) and on ARM7TDMI and ARM920T targets.

This package contains the source code for the unwinder as well as two 'clients' that allow the unwinder to be exercised. The first client is contained in two files, client.c and client.h and can be built to produce an image that can be executed either on an ARM target or in an emulator and demonstrates the unwinding of the stack from which the unwinder is called. The second client is the 'simulation' client, simclient.c and simclient.h, which uses two memory images that are also supplied and contain a snapshot of a call stack and executable code which allows interpretation by the unwinder on a PC, where PC tools can also be used to debug the unwinder. The memory images supplied cannot however be ran on a target since I've zero'd the areas of code that are not needed to demonstrate the unwinder such that the ARM runtime is not present in binary form.