Implementing cross-process locks (pijul.org)
35 points by g0xA52A2A on Jan 16, 2021 | hide | past | favorite | 13 comments



One problem with that is that if one of the participating processes exits uncleanly while holding a lock (crash, OOM killer, ...), everything is deadlocked forever.

File locks, by contrast, are released by the kernel when the holder dies. And if you use a separate lock-holding process as described in the post, it is isolated enough that the chance of it crashing or being killed by the kernel is low; if that does happen, you can detect it and crash everything.

In general, there is no robust solution for inter-process synchronization on Linux, except file locks, which have very limited functionality. It's quite sad: this is the kind of functionality you'd expect the kernel to provide, but instead we have to resort to workarounds like the one described here, which add operational complexity.


Linux has support for a futex "robust list", which handles the exact case you're talking about: https://www.kernel.org/doc/Documentation/robust-futexes.txt

(Ironic that you said there is no "robust" solution. They are also exposed via pthread_mutex as a robust attribute.)

But no one really uses robust futexes. Why? I think the problem of distributed co-ordination is fundamentally complex, and people prefer to deal with complexity they create themselves rather than try to understand someone else's complex solution. (I'm not saying that this bias is wrong.)

So it's a bit sad that you're sad about the kernel not providing something that it does indeed provide!


TBH this is the third or fourth time I've come across robust futexes, tried to get my head around them, given up, and forgotten about them.

Futexes are a low-level primitive; I was talking about high-level primitives such as mutexes, condvars, and semaphores. If nobody has succeeded in implementing those on top of robust futexes, then for practical purposes it's as if they don't exist.


Did you miss where I noted they are supported by pthread? I think you’re giving up way too quickly :)


Couldn't you signal the deadlocked process by monitoring a pidfd of the crashed one, then break the deadlock in the signal handler?


Who is "you" here? There may be several processes waiting on that lock. Even if they have a reliable crash detection mechanism, they'll have to coordinate to decide who's in charge of repairing the state. What if that process dies during the repair?

I'm not saying it's impossible, but you're basically implementing a distributed, embedded version of the lock daemon. And as with all distributed things, fault-tolerance is hard. It's easier in a dedicated process or in the kernel because they're centralized (and if the kernel panics, all processes stop; that can be emulated in a daemon).


What we have to do in Pijul is even worse: in addition to all this, it must (1) be cross-platform and (2) avoid any user interaction, because Pijul is meant to be beginner-friendly.


Funny you should say cross-platform, because one of my honest concerns with something like this is what will happen if multiple computers try to operate on a shared file system simultaneously. A different kind of cross-platform, if you will. For example, I have definitely run Git commands on the same repository from both Windows and (via WSL) Linux simultaneously in the past (I used to use Windows gVim and Linux everything else, and both could invoke their platform’s Git). That’s where cross-process locking concerns me. And WSL is hardly the only type of file system that may be accessed from multiple machines simultaneously.


What functionality do you need that isn't covered by file locks?


Examples are in the post, and from my experience, FIFO locks and semaphores.


The semantics described in the article go further than a single mutex, so they can't use just pthread or just flock.

I wonder if using a pthread mutex in shared memory is faster than flock'ing the file (I'd assume yes, though I don't know if the difference is significant).


"On any platform, pages are synchronised to disk atomically: even though larger chunks might only be partially synchronised, pages are guaranteed to be synchronised atomically."

Someone is being hopelessly optimistic about how reliable disk writes are.



