The Case of the Missing Third Parameter
In my write-up on porting Windows assembly to OS X, I mentioned an easier way than actually combing through your code and making sure the stack is properly aligned. It turns out gcc has a nice compiler flag, called -mstackrealign, with the following effect:
Realign the stack at entry. On the Intel x86, the -mstackrealign option will generate an alternate prologue/epilogue that realigns the runtime stack. This supports mixing legacy codes that keep a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility. The alternate prologue and epilogue are slower and bigger than the regular ones, and they require one dedicated register for the entire function. This also lowers the number of registers available if used in conjunction with the "regparm" attribute. Nested functions encountered while -mstackrealign is on will generate warnings, and they will not realign the stack when called.
So this sounds great! For a small performance hit, you can use your legacy code without modification. OS X doesn't use registers to pass parameters, so there's no conflict with regparm. However, I tried to use this option in my project, and it still crashed. Not on the movdqa instruction as before, but in seemingly random places. And only in Release mode. It sounded like this option doesn't play nicely with compiler optimizations, and I just had to investigate.
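As a reminder of what "aligned" means here, the following is a minimal sketch of the arithmetic involved. The helper names is_aligned() and align_down16() are made up for illustration; they are not part of gcc or OS X. An instruction like movdqa needs its memory operand on a 16-byte boundary, which is the same as saying the low four bits of the address are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on an `alignment`-byte boundary.
   `alignment` must be a power of two. */
static int is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: rounds an address down to the nearest 16-byte
   boundary by masking off the low four bits. */
static uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

A legacy routine that only keeps the stack 4-byte aligned will pass is_aligned(p, 4) but can fail is_aligned(p, 16), and that is the gap a realigning prologue is meant to close.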
I was able to narrow it down to a very simple bit of code. Take the following and put it in a new Xcode command line tool project:
#include <stdio.h>

#define NOINLINE __attribute__((noinline))

NOINLINE static void foo1(int i1)
{
    printf("foo1: %d\n", i1);
}

NOINLINE static void foo2(int i1, int i2)
{
    printf("foo2: %d, %d\n", i1, i2);
}

NOINLINE static void foo3(int i1, int i2, int i3)
{
    printf("foo3: %d, %d, %d\n", i1, i2, i3);
}

NOINLINE static void foo4(int i1, int i2, int i3, int i4)
{
    printf("foo4: %d, %d, %d, %d\n", i1, i2, i3, i4);
}

NOINLINE static void foo5(int i1, int i2, int i3, int i4, int i5)
{
    printf("foo5: %d, %d, %d, %d, %d\n", i1, i2, i3, i4, i5);
}

NOINLINE static void foo6(int i1, int i2, int i3, int i4, int i5, int i6)
{
    printf("foo6: %d, %d, %d, %d, %d, %d\n", i1, i2, i3, i4, i5, i6);
}

int main(int argc, char **argv)
{
    foo1(1);
    foo2(1, 2);
    foo3(1, 2, 3);
    foo4(1, 2, 3, 4);
    foo5(1, 2, 3, 4, 5);
    foo6(1, 2, 3, 4, 5, 6);
    return 0;
}
You would expect the output to be:
foo1: 1
foo2: 1, 2
foo3: 1, 2, 3
foo4: 1, 2, 3, 4
foo5: 1, 2, 3, 4, 5
foo6: 1, 2, 3, 4, 5, 6
And if you run it in Debug mode, that's what you get. But run it in Release mode with optimizations turned on, and you get:
foo1: 1
foo2: 1, 2
foo3: 1, 2, 1
foo4: 1, 2, 4, -1881117246
foo5: 1, 2, 4, 5, 0
foo6: 1, 2, 4, 5, 6, 0
What the? All of a sudden, the 3rd parameter has gone missing. The 4th and higher parameters get shifted over. And the last parameter is random garbage. This was quite interesting, and I decided to dig deeper.
I was able to narrow the problem down to a specific optimization known as unit-at-a-time. Read the gcc manual for the full description, but the relevant portion is:
Static functions now can use non-standard passing conventions that may break asm statements calling functions directly.
Ah ha... let's look at the generated assembly code in main() where it calls foo4(), when using -Os, but without -mstackrealign:
    movl    $4, (%esp)
    movl    $3, %ecx
    movl    $2, %edx
    movl    $1, %eax
    call    _foo4
Okay. The optimizer is being clever. Since foo4() is static, it knows it is only called within this one module. Thus, it doesn't have to follow the usual calling conventions, and will pass parameters in registers instead of the stack for efficiency. So it takes 3 registers, %eax, %edx and %ecx, and uses them for the first three parameters, in that order. The fourth parameter and up go on the stack. The corresponding code in foo4() naturally expects the first three parameters in registers, too:
# void foo4(int i1, int i2, int i3, int i4)
_foo4:
    pushl   %ebp                    # Save old frame pointer on the stack
    movl    %esp, %ebp              # Setup new frame pointer
    pushl   %esi                    # Save %esi on the stack
    pushl   %ebx                    # Save %ebx on the stack
    subl    $32, %esp               # Allocate space for 5 4-byte arguments
                                    # to printf(), plus 12 bytes of padding
    call    ___i686.get_pc_thunk.bx # Position independent magic
"L00000000004$pb":
    movl    8(%ebp), %esi           # Move i4 parameter to %esi
    movl    %esi, 16(%esp)          # Push %esi on the stack
    movl    %ecx, 12(%esp)          # Push i3 parameter on the stack
    movl    %edx, 8(%esp)           # Push i2 parameter on the stack
    movl    %eax, 4(%esp)           # Push i1 parameter on the stack
    leal    LC3-"L00000000004$pb"(%ebx), %eax # Push printf() format
    movl    %eax, (%esp)            # string on the stack
    call    L_printf$stub           # Call printf()
    addl    $32, %esp               # Free stack space for parameters
    popl    %ebx                    # Restore %ebx
    popl    %esi                    # Restore %esi
    popl    %ebp                    # Restore frame pointer
    ret                             # We're done
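For comparison, here is roughly what asking for the same convention by hand looks like. This is only a sketch: pack4() and the REGPARM3 macro are hypothetical names, and the regparm attribute is specific to gcc on 32-bit x86, so the macro expands to nothing everywhere else and the function falls back to the default convention.

```c
#include <assert.h>

/* regparm is a gcc attribute for 32-bit x86; on other targets this
   hypothetical macro is empty so the sketch still compiles. */
#if defined(__GNUC__) && defined(__i386__)
#define REGPARM3 __attribute__((regparm(3)))
#else
#define REGPARM3
#endif

/* Hypothetical example function. With regparm(3) on 32-bit x86, a, b and c
   arrive in %eax, %edx and %ecx, and d is passed on the stack. Each argument
   lands in its own decimal digit, so a dropped or shifted argument shows up
   in the result. */
REGPARM3 int pack4(int a, int b, int c, int d)
{
    assert(a >= 0 && a <= 9 && b >= 0 && b <= 9);
    assert(c >= 0 && c <= 9 && d >= 0 && d <= 9);
    return a * 1000 + b * 100 + c * 10 + d;
}
```

The difference from the listing above is that here the convention is part of the function's declaration, so caller and callee are both told about it, whereas the optimizer picked it silently for the static functions.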
The problem comes in when we look at the generated assembly code with -mstackrealign in effect. The code for main() doesn't change. It passes 3 parameters in registers and 1 on the stack. However, we don't have to get past the prologue of foo4() to see the problem:
_foo4:
    leal    4(%esp), %ecx           # Special prologue to realign
    andl    $-16, %esp              # stack to 16-bytes
    pushl   -4(%ecx)                # (cont.)
    pushl   %ebp                    # Save old frame pointer on the stack
    movl    %esp, %ebp              # Setup new frame pointer
This option added 3 new instructions to the prologue to realign the stack. In fact, the problem lies in the very first instruction, which clobbers %ecx. Remember main() passed the 3rd parameter, i3, in %ecx. Well this explains where the 3rd parameter disappeared to... the bit bucket. Looking at the rest of the code, it's apparent that foo4() expected the 3rd and 4th parameters to be on the stack. This also explains why the 4th parameter got shifted over to the 3rd, and why the last parameter was random garbage. This section in the -mstackrealign description has come home to roost:
The alternate prologue and epilogue [...] require one dedicated register for the entire function. This also lowers the number of registers available if used in conjunction with the "regparm" attribute.
We're not using regparm, but the effect is the same. The new prologue steals %ecx for stack realignment, but apparently it forgot to tell unit-at-a-time. Thus we have one of those miscommunications I alluded to in part 1, when the calling conventions are not followed. In this case, the caller is using 3 registers for parameter passing, whereas the callee is using 2 registers.
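The mismatch is easy to model in plain C. The following is a toy simulation, not anything gcc generates: the struct, the function names and the GARBAGE sentinel are all made up for illustration, and GARBAGE stands in for whatever happens to be in memory past the last stacked argument.

```c
#include <assert.h>
#include <stddef.h>

#define GARBAGE   (-1)  /* stand-in for leftover memory or register contents */
#define MAX_STACK 4     /* enough stacked arguments for foo6() */

/* Toy model of the machine state at the call instruction. */
struct call_state {
    int eax, edx, ecx;      /* argument registers */
    int stack[MAX_STACK];   /* stacked arguments, first one at index 0 */
};

/* Caller as compiled with unit-at-a-time: the first three arguments go in
   %eax, %edx and %ecx, the rest on the stack. */
static struct call_state caller_pass(const int *args, size_t n)
{
    struct call_state s = { GARBAGE, GARBAGE, GARBAGE,
                            { GARBAGE, GARBAGE, GARBAGE, GARBAGE } };
    size_t i;

    assert(n <= 3 + MAX_STACK);
    if (n > 0) s.eax = args[0];
    if (n > 1) s.edx = args[1];
    if (n > 2) s.ecx = args[2];
    for (i = 3; i < n; i++)
        s.stack[i - 3] = args[i];
    return s;
}

/* Callee as compiled with -mstackrealign: the prologue overwrites %ecx, so
   only %eax and %edx carry arguments, and everything from the third
   parameter on is read from the stack. */
static void callee_receive(struct call_state s, int *out, size_t n)
{
    size_t i;

    assert(n <= 2 + MAX_STACK);
    s.ecx = GARBAGE;            /* clobbered by the realigning prologue */
    if (n > 0) out[0] = s.eax;
    if (n > 1) out[1] = s.edx;
    for (i = 2; i < n; i++)
        out[i] = s.stack[i - 2];
}
```

Feeding the foo4() arguments 1, 2, 3, 4 through caller_pass() and then callee_receive() gives 1, 2, 4, GARBAGE, which is the same pattern as the Release output: the third value is lost, the fourth slides into its place, and the last one is junk.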
Fine, mystery solved. But can we work around it? It turns out unit-at-a-time is enabled for -O2, -O3, and -Os. So one solution is to not optimize at these levels. But that's, pardon the pun, suboptimal.
It's possible to disable unit-at-a-time individually by passing -fno-unit-at-a-time. I've verified this does fix the problem, even with -O3 and -mstackrealign. So this is your best bet. You'll lose some optimization, but at least your code will work, both with legacy Windows assembly code and with itself. I tested all of this in gcc 4.0.1, build 5367. rdar://problem/4861528 has been filed.
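If you want a build to tell you whether it is affected, a small self-check along these lines might do it. Treat it as a sketch under assumptions: encode3() and static_calls_ok() are hypothetical names, and it assumes that a noinline static function with three int parameters is enough to hit the problem, as foo3() was in the test program.

```c
#include <assert.h>

#define NOINLINE __attribute__((noinline))

/* Hypothetical probe: gives each argument its own decimal position so a
   missing or shifted argument changes the result. */
NOINLINE static long encode3(int a, int b, int c)
{
    return a * 10000L + b * 100L + c;
}

/* Hypothetical check: returns 1 if a static call with three arguments
   delivers all of them intact, 0 otherwise. */
static int static_calls_ok(void)
{
    /* volatile keeps the compiler from folding the call away */
    volatile int a = 1, b = 2, c = 3;
    return encode3(a, b, c) == 10203L;
}
```

Calling static_calls_ok() once at startup, for example inside an assert(), should flag a build where the third parameter goes missing, and it costs next to nothing in a build that works.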