Where the top of the stack is on x86 (2011)

TrianguloY · on May 7, 2021

It's easier to understand if your diagram show the starting memory positions (address 0x00000000) at the top, and final positions (address 0xFFFFFFFF) at the bottom. This way the top of the stack is, precisely, at the top.

It doesn't seem to be an standard and there are almost the same number of diagrams with memory from 0 to F than from F to 0. Some of them are more useful in specific contexts, but most are simply drawn the way the author is used to.

Personally I prefer increasing numbers go down, like lines in a file, numbers on a numbered list, timestamp on a log, etc. Having the highest memory value 0xFF on the top seems...odd.

roelschroeven · on May 7, 2021

> Having the highest memory value 0xFF on the top seems...odd

Dunno ... the highest being on top feels very normal ... . Isn't that more or less the definition of "highest" and "top"?

We have conflicting conventions in all kinds situations:

Numbers increase towards the top in a class Cartesian coordinate system, in the numbering of floors in buildings, on calculator keypads, ...

Numbers increase towards the bottom in coordinate systems in many computer graphics contexts, in numbered lists, on telephone keypads, ...

Sometimes these conventions meet and clash and there is no one right way to handle that.

moonchild · on May 7, 2021

From a footnote of TFA:

> You may try to fix the confusion by viewing memory with its low addresses at the top and high addresses at the bottom. While this would indeed make stack movement more natural, it would also mean that increasing some memory address would take it down in the graphical representation, which is probably even more counter-intuitive.

Ultimately, my approach is to avoid directional terms such as 'top' and 'bottom' or 'high' and 'low' in the first place; they only cause confusion. Prefer 'greater' or 'smaller' addresses.

(Similarly 'left' and 'right' when applied to bits, which gets especially confusing if endianness is involved; prefer 'more significant' and 'less significant'.)

efaref · on May 7, 2021

I never understood why people ever put memory upside down on diagrams.

In fact, I think the only time I see diagrams with F at the top and 0 at the bottom is in explanations of the stack, or in explanations of how "the stack is so weird it grows the wrong way!".

When describing, e.g., a memory map for a custom ASIC, you would always put the low addresses first and the high addresses later. That's just how numbers work. The "stack grows the wrong way" issue seems to be an invented problem.

toast0 · on May 7, 2021

If you're writing a diagram for little endian, you end up with the bytes going right to left, if you have multibyte values, so you may as well read bottom to top, right to left.

derefr · on May 7, 2021

Little-endian is fundamentally confusing, because the bits within the bytes still go the "other" way. Best to just keep the diagram oriented to bits-within-bytes order, and make it clear that bytes-within-words are being plopped into memory backwards.

Just like, back when graphics bitplanes were a thing, it was much less confusing if the diagram just showed (virtual) memory interleaved, the way you'd actually have to deal with it as a programmer, rather than showing the "clean" physical-memory diagrams for what ends up on the "even" and "odd" chips.

gizmo686 · on May 8, 2021

I have had to work with data that uses little-endian for bit order as well (although we call it "least significant nit first", and reserve the word "endian" to describe byte order).

We actually dealt with this by adding support to a general purpose open source parsing tool we maintain [0]; where we had to add a feature to the standard specifically to accommodate this [1].

Even though least-significant-bit-first makes sense with littleEndian actually working with it is a giant pain, because approximately no tooling actually thinks that way.

[0] https://daffodil.apache.org/

[1] https://daffodil.apache.org/docs/dfdl/#_Toc62570139

amluto · on May 8, 2021

The bits within the bytes have no order because they are not addressable on most hardware. You can imagine them going whichever way you like, even in a different dimension. On an architecture without bit pointers, a byte is a number from 0 to 255, not a sequence of bits.

(x86 does indeed have the bt family of instructions that sort of address bits. I would argue that the existence of memory-operand bit accessors like this that accept larger-than-bits-per-word offsets is bizarre, but the instructions are not backwards. On a big endian system, these instructions would have different behavior depending on word size, which would be odd.)

derefr · on May 8, 2021

Bits within bytes have a defined order when you do things like pushing the CPU flags register to memory, reading it back into a numerical register, and then bit-shifting/rotating through it. That's something you can do on basically any architecture that supports preemptive scheduling; there's usually either a PUSHF or PUSHALL there to assist.

But, even without the ability to address the memory at the bit level, bits-within-bytes ordering still matters on any architecture where bits are striped between memory channels, RAID-0 wise. (Just like those graphics bitplanes!) The processor being LSB-first vs. MSB-first on how it interprets the data bus, determines which sockets you put the LoROM and HiROM firmware chips into your Apple II :)

toast0 · on May 8, 2021

I think it makes a difference when you've got bit packed fields. For example a 32-bit word with a 16-bit field a 4-bit field and a 12-bit field.

If you want that 12-bit value to work nice, it's probably going to be all the bits of byte 0 and the lower half of byte 1. (Or the upper half of byte 0 and all of byte 1). That looks best if you write bits and bytes right to left from zero. That 12-bit value uses bits 0-11 and you can mask it from your 32-bit read with 0xfff. The 4-bit value you can mask with 0xf00 and then shift (or shift first, whatever)

userbinator · on May 8, 2021

Personally, the association of "top", "highest", "up", and "increase" is strongest, and similarly "bottom", "lowest", "down", and "decrease"; so from that perspective it's "top" and "bottom" when applied to the stack which is the inconsistent one.

I think it's better to just stop drawing memory diagrams vertically, and show it horizontally instead. At least most people seem to agree that left is lower and right is higher.

albntomat0 · on May 7, 2021

That way also makes reading/writing a buffer go from top to bottom, similar to reading/writing normally.

billforsternz · on May 7, 2021

Exactly. After all we write programs by starting at the top of the screen and working our way down, and happily of course program execution increments in the same direction. So small addresses at the top and big addresses at the bottom is the most natural way of presenting any memory map.

anyfoo · on May 7, 2021

We call things "base of memory" and "top of memory" though, with the latter one being at the highest address of memory.

ketralnis · on May 7, 2021

I've heard both go in both directions. Hell, I've said both depending on the diagram I'm pointing at when I say it. It's not that universal

ellis-bell · on May 7, 2021

but then the heap would grow down in such a diagram...

btw my understanding is that the heap came first in the logical development of C and other systems programming. so it made sense to have .text and other program data at the lowest virtual memory addresses and then have the heap grow towards higher memory addresses.

then when the stack became a thing it had to grow down...

monocasa · on May 7, 2021

The stack might have actually come before the heap in history, and actually proceeds any systems programming language that was a higher level than assembly. They were seen in the Z4, and Turing wrote about them in 40s.

xanathar · on May 7, 2021

Bonus (?): the stack going down rather than up means that overflowing a stack-allocated buffer will overwrite the contents that are already on the stack (as they have a higher address than the last item in the buffer), likely changing the return address of the function and thus making arbitrary code execution a breeze (see: https://en.wikipedia.org/wiki/Return-to-libc_attack).

Most operating systems and standard libraries have checks and countermeasures to make this a lot harder nowadays, but still.

ajross · on May 7, 2021

In fact the mitigations have been so effective that in practice stack smash attacks are mostly historical at this point. But yes: having the direction of "natural memory copies" be the same direction as "back into the caller memory on the stack" was clearly a really bad mistake in hindsight.

I actually don't know where it came from. It was true in original PDP-11 Unix for sure: you had the program text followed by a grows-up heap, with the grows-down stack placed at the top of the program segment. Interestingly PDP-11 addressing was general enough to have implemented a grows-up stack, so this is clearly a mistake Unix could have corrected. I just don't know if this was the original use of the convention or if it inherited it from elsewhere.

vlovich123 · on May 7, 2021

Interestingly, per my reading of [1], these attacks are now easily available again in WASM due to misplaced confidence in the security of the language. It's a fantastic paper.

[1] https://www.unibw.de/patch/papers/usenixsecurity20-wasm.pdf

xanathar · on May 7, 2021

I THINK that the reason is: 8086 inherited it from the 8085 which inherited it from the 8080. The next parent in line would be the 8008, but that has a small call stack in the CPU registers rather than RAM, so the ancestor would be the 8080.

The 8080 had 64KB address space. I bet the rationale was to partition memory so that classic memory usage goes upwards from 0000 to FFFF and the stack goes downwards from FFFF to 0000. This removes the need to define a boundary between the two beforehand.

Of course this is totally speculation on my part, I might be super wrong.

bonzini · on May 7, 2021

The 8008 instruction set was not designed by Intel IIRC, so the lineage ends at the 8080.

Independently the 6502 also had a downward stack. I think the only modern machine with am upward-growing stack is the HP PA-RISC.

saagarjha · on May 8, 2021

Stack smashing is still possible if you have a way to leak data (specifically, the cookie) beforehand.

rdhatt · on May 7, 2021

Remember:

"The x86 architecture is the weirdo" --Raymond Chen

https://devblogs.microsoft.com/oldnewthing/20040914-00/?p=37...

https://devblogs.microsoft.com/oldnewthing/20130320-00/?p=48...

Narishma · on May 7, 2021

In this case, it's not a weirdo, is it? I don't know of any popular ISA with a stack that grows upward.

anyfoo · on May 7, 2021

I can't think of any either right now[1], with the exception maybe of very old ISAs where the stack is not in regularly addressable memory at all, and instead fixed depth formed by effectively a bunch of registers. But it might be hard to even define any notion of "upwards" or "downwards" on those.

[1] EDIT: This comment mentions PA-RISC: https://news.ycombinator.com/item?id=27079832

userbinator · on May 8, 2021

8051. Quite a popular microcontroller.

eatonphil · on May 7, 2021

My confusion about statements like this are understanding where the convention begins. Isn't it libc that sets up the stack? Couldn't it decide to set up the stack starting from the bottom of memory and put heap allocations at the top? I guess instructions like PUSH/POP and derivatives wouldn't be useful anymore so you'd have to recreate them. So I guess that means that the convention starts at the processor? You could store memory in the opposite way but it would just be slower since you wouldn't be able to use built in operations. Do I have that right?

ajross · on May 7, 2021

> Isn't it libc that sets up the stack?

In coordination with the kernel, yes. Different OSes do this differently, but many will set up the stack pointer for you. Linux historically has not, the new thread keeps the parent stack pointer and has to implement careful entry code to switch it without messing up the parent thread.

> Couldn't it decide to set up the stack starting from the bottom of memory and put heap allocations at the top?

In theory, but as you mention the architecture makes some clear demands here. On x86 CALL/RET and PUSH/POP both require a SP/ESP/RSP register with free space below it. Likewise x86 interrupts are handled on the stack and so you need to keep the stack pointer initialized for them (modern CPUs will automatically switch the stack for you from whatever user code is running, but they still need to switch it to a grows-down area you initialized for them).

Broadly, sure, you can come up with a software abstraction that acted as a "stack", but it would have to be in addition to the CPU stack you already need. In effect you'd have to burn a register for this extra stack pointer, which has performance implications.

astrobe_ · on May 7, 2021

Yes, you clearly see that when you implement a stack-based virtual machine (as in bytecode interpreter, not VMWare and the likes, although it is on the principle the same thing) : your bytecode (or whatever technique you are using except perhaps JIT) for push/pop/call/ret must agree on how to use the stack.

It is particularly the case in a language like Forth, which has two stacks that can be manipulated directly by the user (actually the user has to if they want to get anything done...) : one for the parameters/arguments, and one for the return addresses. This deviates from the usual single hardware stack processors. Forth processors (there are still some in operation) fully support those two stacks.

When you implement a Forth interpreter in assembler, you generally use the hardware stack for either the parameters or for the return addresses, while the other is managed manually (usually using another addressing-capable register such as EBP, ESI or EDI on x86).

If for instance you dedicate one segment (either in the x86 sense or in the common meaning), you typically use the push down "hardware" stack that you initialize at the top of memory, while the return addresses stack is a "push up" software-managed stack and initialized at the bottom of the memory.

toast0 · on May 7, 2021

> Likewise x86 interrupts are handled on the stack and so you need to keep the stack pointer initialized for them (modern CPUs will automatically switch the stack for you from whatever user code is running, but they still need to switch it to a grows-down area you initialized for them).

If your kernel is re-entrant, you need to keep platform stack conventions in the kernel, or an interrupt (or exception) during the kernel will overwrite your backwards stack.

I don't know for sure, but I think aignal handing in user processes runs on the user stack too (but I could easily be wrong).

If you really wanted, you could use the platform stack for call/ret only and have a separate data stack with whatever conventions you like.

monocasa · on May 7, 2021

> Linux historically has not, the new thread keeps the parent stack pointer and has to implement careful entry code to switch it without messing up the parent thread.

Well, for calls to fork(2) and clone(2), sure, but the kernel will setup a stack for you on exec(2). It has to have somewhere to stick the command line args, env, and some other extra bits of information like a random seed.

eatonphil · on May 7, 2021

That's helpful! Then why is it that libc is the one setting up the stack if this is a convention basically required by the processor?

ajross · on May 7, 2021

Because the runtime linker (e.g. ld-linux.so) is implemented in the C library, as is the user-callable implementation of pthread_create() or whatever[1]. It's definitely the application's job to decide on its own memory layout in any case. The kernel doesn't tell you where your stack should be, it's your address space.

[1] vs. the clone() Linux syscall, which isn't really useful to regular code because of the complexity.

ylyn · on May 7, 2021

Yes, more or less. I'm not sure if push and pop is significantly faster than a mov and add. But there are also instructions like leave and ret that follow this same fully descending stack convention. So if you deviate from that you can't use any of those.

It's worth mentioning what fully descending means here. I'm surprised the article didn't mention it.

"Fully" means that the stack pointer points to the last entry on the stack (the top). "Empty" means it points to the next entry just after the last (top) entry. That is, it points to where the next push would be written to.

Anyway, you can have whatever calling convention you want, really. All you need is a way to pass arguments, return values, and to know where to return to. You could have a linked list of frames allocated by malloc() with the head pointer in EBP, for example. Say the return address and caller frame pointer is stored in the start of the frame. Then you would return by mov ebx, [ebp]; mov ebp, [ebp+4]; jmp ebx. Or something lile that. This is a pretty ridiculous calling convention though, but it'd work.

AshamedCaptain · on May 7, 2021

In ARM you do have pop/push (and call/ret) instructions that can go in either direction, and most platforms I know still have a stack that grows to 0.

TBH, dunno where the difficulty is.

monocasa · on May 7, 2021

Sort of. The ARM stuff was allowed other uses because they were pretty generic load and store multiple instructions with a lot of increment/decrement options to make up for the original Acorn's lack of a DMA engine.

On anything resembling a recent ARM core though, using the stack pointer register with anything other than a descending stack has you falling off the perf wagon as it's backed by a hardware stack engine.

amelius · on May 7, 2021

In any privilege level?

JoeAltmaier · on May 7, 2021

There is an argument for making the call/return stack a separate non-addressable region from argument passing. So a malicious app can't overwrite return addresses and execute arbitrary code. Intel considered this early on, and rejected it.

Why? Because of Fortran. There are (were?) cases where a Fortran app would reach back on the stack to retrieve values from before the last call or some such. Rather than find some other way of making that work for those folks, the separate-return-stack idea was shelved.

And we all live with this debacle for decades.

Someone · on May 7, 2021

I don’t understand that argument. All function arguments still would be on a single stack (just as in Forth). That would only be a valid argument if some FORTRAN code inspected the return address, for example to behave differently, depending on the caller.

I would guess it’s more because early systems had small address spaces. If your heap grows up from the bottom of memory, and one stack grows down from top of your address space, how do you know where to place that second stack, especially in a system with, say, a 64kB address space?

JoeAltmaier · on May 7, 2021

Yes, exactly. Some FORTRAN code did exactly that.

And the stack could go in a page not addressable by general purpose instructions, save call and return instructions (and perhaps some kernel debug).

Some limited-RAM embedded devices have dedicated stack RAM. But my point was about Intel x86

stevemk14ebr · on May 7, 2021

Checkout the shadow stack used for CET (control flow enforcement technology). It is literally this idea

monocasa · on May 7, 2021

If that was an issue, they managed to fix it decades ago. Itanium had separate return address and data stacks, and scientific computing was one of it's few competitive strongsuits.

userbinator · on May 8, 2021

There are architectures with a separate (and limited) call stack. They also tend to be Harvards, with the data and program memory completely separate. In other words, go down that path and you end up with what amounts to a restricted microcontroller instead of a general-purpose computer... which unfortunately seems to be where things are heading with the rising authoritarian technocracy.

titzer · on May 7, 2021

You cannot address the call stack in WebAssembly, thus there can be no stack-smashing attacks (that overwrite return addresses). For languages that pass pointers into the stack, they must use a "shadow" stack allocated as a separate region and managed with an explicit stack pointer.

amelius · on May 7, 2021

Shouldn't there be two stacks? One to contain stuff like return addresses, and the other to contain data? And the stack containing addresses in a separate address space that can only be accessed through stack manipulation/return instructions.

kccqzy · on May 7, 2021

That's known as a shadow stack. Some newer Intel processors even have hardware support for it.

amelius · on May 8, 2021

Interesting:

https://en.wikipedia.org/wiki/Shadow_stack

But it seems a shadow stack is a slightly different concept which requires less changes in software.

moonchild · on May 7, 2021

This has been suggested, including by the proposed mill and forwardcom architectures. It's also the programming model used by stack-based languages such as forth.

Another nice side effect aside from security is that it simplifies the return predictor.

sesuximo · on May 8, 2021

1. Seems like you’d have extra cache contention?

2. How would coroutines/threads handle this? Switch two stack pointers?

amelius · on May 8, 2021

1. Not necessarily. You have two spaces now, but the same size in total.

2. Yes, they'd have to switch two stack pointers. I think a CPU could have a special instruction for this case.

sesuximo · on May 8, 2021

Re 1, I think this is a worse layout even if the size is the same. Less locality

Denvercoder9 · on May 7, 2021

Is there a reason why a stack growing towards the beginning of memory is better than one growing towards the end of memory, or is that just an arbitrary choice the x86 designers made?

bonzini · on May 7, 2021

It's mostly because it allows you to put code at a fixed address at the beginning of the memory and the stack at the opposite end. That lets the OS place the stack at a constant address (for MS-DOS COM files the stack pointer starts at 0xFFFE, with a zero already pushed so that a RET instruction exits the program; a similar convention existed in CP/M).

sgtnoodle · on May 8, 2021

It's got a lot of feng shui for simple computer architectures.

For one thing, you can have your code, static memory, and then heap at the beginning of the address space, and the stack at the end of the address space, and not have to worry about explicitly sizing the stack. On some architectures you could even just zero-initialize the stack pointer, and it will roll over to the end of memory on the first push.

You can also use the same machine code instructions that make array indexing efficient to access variables on the stack (load/store with literal offset,) without having to support signed offsets which would have effectively wasted a precious bit in a very common opcode.

When you overrun a buffer, it is more likely to cause your program to crash by corrupting the stack. This is bad in modern times for obvious security reasons, but pragmatically it would have been a useful "feature" in the early days of programming.

It pairs nicely with the semantics of a semaphore, where you decrement to acquire and increment to release.

The stack pointer conceptually points to valid memory, rather than pointing to invalid memory adjacent to valid memory. This likely has some favorable traits when it comes to how an architecture implements function calls. For example, RET could load the value at the stack pointer into the program counter, then increment the stack pointer. Loading from memory often takes multiple cycles compared to incrementing a register, and this starts the load a cycle earlier than otherwise for the same amount of combinatorial logic.

vlmutolo · on May 7, 2021

I wonder if it's at all related to the reasoning in this post about writing a bump allocator.

https://fitzgeraldnick.com/2019/11/01/always-bump-downwards....

pcwalton · on May 7, 2021

I don't know if this is the real reason, but it's more natural for stack-relative addressing to have the stack grow downwards, because you can write e.g. [rsp+10] instead of [rsp-10].

Apparently on PA-RISC (hppa) the stack grows the other way, so it is arbitrary.

tom_mellior · on May 7, 2021

Why is rsp+10 more natural? If rsp is the top of the stack, everything is below it, and "below" and "minus" go well together.

Though I would agree that for the topmost value specifically, rsp+0 feels more natural than rsp-4 or rsp-8.

glhaynes · on May 7, 2021

Easy to get confused similarly about trees that grow down in computer scientists' diagrams versus growing up in nature.

Camas · on May 8, 2021

Felt similar when I learned what inverting/reversing a tree meant.

colejohnson66 · on May 7, 2021

It also doesn’t help that debuggers show the call stack with the current function at the top and the entry point at the bottom (the opposite of how it works in memory).

andrey_utkin · on May 8, 2021

I really like the nomenclature described by my coworker Gareth Rees: stack has hot and cold ends. The terms of top and bottom just confuse everything.

https://garethrees.org/2018/03/12/stack/

timonoko · on May 8, 2021

Real processors did not have stack or subroutines and CALL and RET. Cosmac did not. Nova 800 did not.

Nova had write PC to memory address and jump to the next address. But that is not CALL. And subroutine could not be bigger than 255, because there was no RET but short jump via the stored address.

acchow · on May 7, 2021

Just flip your diagram upside down and there's no confusion.

jonsen · on May 7, 2021

And where does a branch go? It leaves the stem of consecutive memory numbers into nowhere and mysteriously comes back to the stem at an arbitrary number.

azhenley · on May 7, 2021

The title should be: Where the top of the stack is on x86 (2011)