ARM Cortex-M4 Mutex Lock. DMB Instruction

Issue

This Content is from Stack Overflow. Question asked by J Klein

Background: CCL aka OpenMCL is a very nice, venerable, lightweight but fairly fast Common Lisp compiler. It’s an excellent match for the RPi because it runs on 32 bit models, and isn’t too memory intensive. In theory, unlike the heavier SBCL lisp compiler, it supports threads on 32 bit RPi. But it has a long-standing mutex bug.

However, this is an ARM machine language question, not a Lisp question. I’m hoping an ARM expert will read this and have an Aha! moment from a different context.

The problem is that CCL suffers a fatal flaw on the Raspberry Pi 2 and 3 (and probably 4), as well as other ARM boards: when using threading and locking it will fail with memory corruption. The threading failure has been known for years.

I believe I isolated the issue further, to a locking failure: when a CCL thread grabs a lock (mutex) and checks to see if it owns the lock, it sometimes turns out that another thread owns the lock. Threads seem to steal each other’s locks, which would be fatal for garbage collection, among other things. It seems to me that the information that one core has taken control of a lock does not percolate through to the other cores before they grab it themselves (a race condition). This bug does not happen on one-core RPis, like the Pi Zero.

I’ve explored this bug in this github repo. The relevant function is (threadtest2), which spawns threads, performs locks, and checks lock ownership. I initially thought that the locking code might be missing a DMB instruction; DMB “ensures that the exclusive write is synchronized to all processors”. Thus I put DMB instructions all over the locking code (but upon looking carefully, DMB was already there in a few spots, so the original compiler author had thought of this).

In detail, I put DMBs into just about every locking routine of arm-misc.lisp called from the futex-free version of %lock-recursive-lock-ptr in l0-misc.lisp, with no luck.

Obviously, I’m not asking anyone to diagnose this compiler. My question is

Does this lock-stealing behavior ring a bell? Has anyone else seen this problem of lock-stealing or race-condition on the ARM in another context, and have they found a solution? Is there something I’m missing about DMB, or is there another instruction needed?
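For reference, a lock that leaves barrier placement to the compiler can be sketched with C11 atomics. This is only an illustration, not CCL's locking code: `toy_spinlock`, `toy_trylock`, `toy_lock` and `toy_unlock` are made-up names. On ARMv7, compilers typically lower these acquire/release operations to an LDREX/STREX loop plus DMB, so the barriers are not written by hand.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock for illustration; not CCL's implementation. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt. Returns true if this caller took the lock.
 * Acquire ordering keeps the critical section's memory accesses
 * from being reordered before the point where the lock is taken. */
static inline bool toy_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is taken. */
static inline void toy_lock(toy_spinlock *l)
{
    while (!toy_trylock(l)) {
        /* busy-wait */
    }
}

/* Release ordering makes the critical section's writes visible
 * before the lock is seen as free by another thread. */
static inline void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A second `toy_trylock` on a held lock returns false until `toy_unlock` is called, which is the ownership property the failing test checks for.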



Solution

I think you are over complicating it. Look at the amba/axi spec (and also where did you find a multi-core cortex-m4?). ldrex/strex are for sharing a resource across processors in a multi-processor chip. They have been incorrectly used for other things for some time now. ARM unfortunately did an unusually bad job of documenting all of this correctly.

The exclusive part of the ldrex is that the processorid and the address (range) are saved in a table. When a strex happens, the processorid for that address (range) is checked: if it matches, the response is EXOKAY and the store is done; if not, the response is OKAY and the store is not done. Strex does not clear anything; interestingly, they have this clrex instruction, which I assume sets the processorid to some value that won’t hit or, depending on how they build their tables, frees up a table entry.
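The bookkeeping described above can be sketched as a toy model in C. Everything here is hypothetical (`toy_monitor`, `TOY_MAX_CPUS`, the `toy_*` functions), and it models plain variables, not a real bus. One rule is an added assumption taken from how the architecture is specified to behave, not from the paragraph above: a successful exclusive store ends every reservation on that address, which is what stops a second processor from also succeeding.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_CPUS 4u  /* hypothetical number of processor ids */

/* One table entry per processor id: the address tagged by its last LDREX. */
typedef struct {
    bool      valid[TOY_MAX_CPUS];
    uintptr_t addr[TOY_MAX_CPUS];
} toy_monitor;

/* Exclusive load: record (processor id, address) and return the value. */
static inline uint32_t toy_ldrex(toy_monitor *m, unsigned cpu, uint32_t *p)
{
    m->valid[cpu] = true;
    m->addr[cpu]  = (uintptr_t)p;
    return *p;
}

/* Exclusive store: returns 0 if the store was done (EXOKAY),
 * 1 if it was not (OKAY), like the STREX status register. */
static inline uint32_t toy_strex(toy_monitor *m, unsigned cpu,
                                 uint32_t *p, uint32_t value)
{
    if (!m->valid[cpu] || m->addr[cpu] != (uintptr_t)p)
        return 1;
    *p = value;
    /* Assumption: a successful store drops all reservations on this address. */
    for (unsigned i = 0; i < TOY_MAX_CPUS; i++)
        if (m->valid[i] && m->addr[i] == (uintptr_t)p)
            m->valid[i] = false;
    return 0;
}

/* CLREX: drop this processor's reservation. */
static inline void toy_clrex(toy_monitor *m, unsigned cpu)
{
    m->valid[cpu] = false;
}
```

With two ids both doing `toy_ldrex` on the same word, only the first `toy_strex` returns 0; the loser has to loop back to `toy_ldrex` and re-read the lock word.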

I may try this after writing this, but you can just as easily do ldrex, then strex, then strex; fairly certain I have done it on full-sized ARMs. I will try ldrex, strex, strex, clrex, strex on a cortex-m4 and see what happens.

In a uniprocessor system, ldrex/strex are expected to work within ARM’s logic, but the chip vendor is not required to support them and may simply return OKAY (instead of EXOKAY). The L1 certainly, and probably the L2, are ARM logic; beyond that you get into chip vendor logic (do cortex-ms have an L2?). Normally you are not going to have to worry about hitting the chip vendor code; you can run a long time, if not indefinitely, without knowing any of this, as you will remain in one of the caches. And disabling both caches in Linux, for example, is a royal PITA; they may make it seem like it is a compile-time option, but dig in and see the reality. And with only one processor, how do you get a different processor id?

In multi-processor chips, the chip vendor is supposed to support it correctly beyond the caches, if you can even get there with an exclusive access. The way ldrex/strex are normally used, you are most likely to stay within your L1 cache and never get exposed to what the chip vendor has provided, but it can happen if you get interrupted in between, and then you are likely saved by the L2. And in this case having more than one processorid in the chip makes sense, as there is more than one processor.

This is nice (from the Cortex-M4 TRM):

The Cortex-M4 processor implements a local exclusive monitor. The
local monitor within the processor has been constructed so that it
does not hold any physical address, but instead treats any access as
matching the address of the previous LDREX. This means that the
implemented exclusives reservation granule is the entire memory
address range.

The m7 trm says the same thing.

Not having multiple cores, how could/would one generate a different ID?
The docs use the term processorid to indicate which processor is being used. How many processors are in a cortex-m? Perhaps it is documented elsewhere under a different name, but at this time I don’t know how the processorid in a cortex-m is generated or, it being a uniprocessor, whether there is more than one. I don’t have access to a core to know for sure.

So even though the logic does not support a per-address exclusive access, they didn’t say they don’t check the processorid; they simply consider every strex access to memory marked as shared to be checked against the processorid of the last ldrex, independent of its address.

EDIT

PUT32(0x01000600,0x600);
PUT32(0x01000700,0x700);
PUT32(0x01000800,0x800);
CLREX();
hexstring(STREX(0x20000600,0x12345678));
hexstring(STREX(0x20000700,0x12345678));
hexstring(STREX(0x20000800,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000600,0x6666));
hexstring(STREX(0x20000700,0x12345678));
hexstring(STREX(0x20000800,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000700,0x7777));
hexstring(STREX(0x20000800,0x12345678));
hexstring(GET32(0x20000600));
hexstring(GET32(0x20000700));
hexstring(GET32(0x20000800));
CLREX();
hexstring(0xAABBCCDD);
hexstring(LDREX(0x20000600));
CLREX();
hexstring(STREX(0x20000600,0x2222));
hexstring(GET32(0x20000600));

producing

00000001 
00000001 
00000001 
00000600 <-- ldrex
00000000 <-- strex pass
00000001 <-- strex fail
00000001 
00006666 
00000000 
00000001 
00006666 
00007777 
00000800 
AABBCCDD 
00006666 
00000001 
00006666 

So it looks like what they did here is that the next strex after an ldrex passes, independent of address, and any strex after that fails until there is another ldrex. So, using your terms, the strex “clears the lock”.

And note that putting a clrex between the ldrex and strex does make the strex fail.
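The results above fit a monitor that is a single flag with no address attached. Here is a minimal C sketch of that reading; the `m4_monitor_model` type and the `model_*` functions are hypothetical names, and this models the observed output on ordinary variables rather than the actual Cortex-M4 logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: the local monitor is one flag, no address stored. */
typedef struct {
    bool armed;
} m4_monitor_model;

/* LDREX: arm the flag, whatever the address. */
static inline uint32_t model_ldrex(m4_monitor_model *m, const uint32_t *p)
{
    m->armed = true;
    return *p;
}

/* STREX: succeeds (returns 0) only if the flag is armed, to any address.
 * The flag is disarmed either way, so a second STREX returns 1. */
static inline uint32_t model_strex(m4_monitor_model *m, uint32_t *p,
                                   uint32_t value)
{
    if (!m->armed)
        return 1;
    m->armed = false;
    *p = value;
    return 0;
}

/* CLREX: disarm the flag. */
static inline void model_clrex(m4_monitor_model *m)
{
    m->armed = false;
}
```

Replaying the test sequence against three words initialised to 0x600, 0x700 and 0x800 gives the same status values and final contents as the output listed above.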

Hitting the same address or not doesn’t matter; it is one ldrex to one strex:

hexstring(LDREX(0x20000900));
hexstring(STREX(0x20000900,0x2222));
hexstring(STREX(0x20000900,0x2222));

3EEDCC1B 
00000000 
00000001 

Turning the data cache on didn’t change the results.

Test functions:

.thumb_func
.globl LDREX
LDREX:
    ldrex r0,[r0]
    bx lr

.thumb_func
.globl CLREX
CLREX:
    clrex
    bx lr

.thumb_func
.globl STREX
STREX:
    strex r0,r1,[r0]
    bx lr

Unlike the big brother ARMs:

CLREX();
hexstring(STREX(0x20000600,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000600,0x6666));
hexstring(LDREX(0x20000600));
PUT32(0x20000600,0x11);
hexstring(STREX(0x20000600,0x6666));

00000001 
00000600 
00000000 
00006666 
00000000 

The strex survives the non-exclusive access in between; at least based on the document you posted, a non-exclusive store should spoil the prior ldrex (on an armv7-a).

Note the above is on a cortex-m4 r0p1 CPUID 0x410FC241


This question was asked on Stack Overflow by GNA and answered by old_timer. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
