After a long, bumpy road with many revisions, it appears that the futex2 work sponsored by Valve is finally heading into the upstream Linux kernel. Initially much larger, the work was slimmed down so that the most important parts could land and be enabled first, with the rest to follow later.

So what is it? As developer André Almeida previously described it: "The use case of this syscall is to allow low level locking libraries to wait for multiple locks at the same time. This is specially useful for emulating Windows' WaitForMultipleObjects. A futex_waitv()-based solution has been used for some time at Proton's Wine (a compatibility layer to run Windows games on Linux). Compared to a solution that uses eventfd(), futex was able to reduce CPU utilization for games, and even increase frames per second for some games. This happens because eventfd doesn't scale very well for a huge number of read, write and poll calls compared to futex. Native game engines will benefit of this as well, given that this wait pattern is common for games.".
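
For the technically curious, here's a rough sketch of what waiting on multiple futexes looks like from user space. The syscall number and struct layout below are the ones that ended up shipping in Linux 5.16 (they may differ on other architectures or with older headers), so treat it as an illustration rather than reference code:

/* Rough sketch of the futex_waitv() interface that landed in Linux 5.16:
 * wait on several 32-bit futex words at once and learn which one was
 * woken. Constants and layout are mirrored here for older headers. */
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef __NR_futex_waitv
#define __NR_futex_waitv 449       /* syscall number used by Linux 5.16; check your headers */
#endif
#ifndef FUTEX_32
#define FUTEX_32 2                 /* entry is a 32-bit futex word */
#endif

/* Same layout as struct futex_waitv in the 5.16 uapi headers. */
struct waitv_entry {
    uint64_t val;                  /* value the word is expected to hold */
    uint64_t uaddr;                /* address of the futex word */
    uint32_t flags;                /* FUTEX_32 */
    uint32_t __reserved;
};

static uint32_t lock_a, lock_b;    /* two independent futex words */

static void *signaller(void *arg)
{
    (void)arg;
    usleep(100 * 1000);
    __atomic_store_n(&lock_b, 1, __ATOMIC_SEQ_CST);
    /* A plain FUTEX_WAKE on the individual word wakes the vectored waiter. */
    syscall(SYS_futex, &lock_b, FUTEX_WAKE, 1, NULL, NULL, 0);
    return NULL;
}

int main(void)
{
    struct waitv_entry waiters[2] = {
        { .val = 0, .uaddr = (uintptr_t)&lock_a, .flags = FUTEX_32 },
        { .val = 0, .uaddr = (uintptr_t)&lock_b, .flags = FUTEX_32 },
    };

    pthread_t t;
    pthread_create(&t, NULL, signaller, NULL);

    /* Absolute timeout so the example always terminates. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    ts.tv_sec += 2;

    /* Sleeps while every word still holds its expected value; returns the
     * index of the entry that was woken (1 here), or -1 on timeout/error. */
    long woken = syscall(__NR_futex_waitv, waiters, 2, 0, &ts, CLOCK_MONOTONIC);
    printf("futex_waitv returned %ld (errno %d)\n", woken, woken < 0 ? errno : 0);

    pthread_join(t, NULL);
    return 0;
}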

Speaking on Twitter, Valve developer Pierre-Loup Griffais said "It's amazing news that futex_waitv() seems to be on its way to the upstream kernel! Many thanks to the continued efforts of our partners at Collabora, CodeWeavers, and to the upstream community.".

Ideally then this will help Windows games in Proton on Linux run better. But that's not all!

Also interesting is the follow-up post from Griffais that mentions "Beyond Wine/Proton, we are also excited to bring those multi-threaded efficiency gains to Linux-native game engines and applications through some variant of the following primitive, pending more discussion with the glibc community:" with a link to some glibc work.

14 comments

ShabbyX Oct 11, 2021
Quoting: 3zekiel
Short answer: they kinda already tried that at first, and judged it to be a dead end.

Fair enough.

Quoting: 3zekiel
Long answer:
Modifying an existing (set of) syscall(s) is extremely constrained. You cannot break compatibility in any way, since that would break thousands of apps, with no way for users to fix it. Unlike with libraries, you cannot just install another kernel or use a lightweight container to fix a kernel ABI breakage. So all issues with that syscall set are pretty much set in stone.

Sure, but that doesn't mean you cannot provide the same functionality more efficiently. The point was eventfd doesn't scale, and making that scale doesn't necessarily have to interfere with its functionality.
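
For reference, the eventfd pattern being discussed boils down to the standard eventfd(2) calls sketched below; the relevant point is that every signal is a write(2) and every wait is a read(2) or poll(2), with no user-space fast path:

/* The eventfd signalling pattern under discussion: every post and every
 * wait is a full syscall, which is what limits scaling compared to a
 * futex's user-space fast path. Sketch only, no error handling. */
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, EFD_CLOEXEC);   /* kernel-maintained 64-bit counter */
    if (efd < 0)
        return 1;

    uint64_t one = 1;
    write(efd, &one, sizeof(one));       /* "signal": adds 1 to the counter */

    uint64_t value;
    read(efd, &value, sizeof(value));    /* "wait": returns and resets the counter */
    printf("counter was %llu\n", (unsigned long long)value);

    close(efd);
    return 0;
}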

Quoting: 3zekiel
More generally, it seems more natural and clean to use a tool that is actually made to fix your issue. File descriptors (which eventfd is based on) are made to deal with file-like stuff (that's a lot of stuff in Linux). Futexes are made to deal with synchronization issues. Futexes are also made to be used in large numbers, file descriptors... not that much (the overhead in memory and so on).

Everything is a file. In fact the few things that Unix didn't make a file turned out to be the most problematic areas (pids and signals notably). At least the pid problem is remedied with fds (pidfd), and if signals aren't already, I'm sure they will be turned into fds too.

I said all that to say that given how central fds are, it's worthwhile to make sure eventfd is actually efficient, rather than keep trying to work around it.
3zekiel Oct 12, 2021
Quoting: ShabbyX
Sure, but that doesn't mean you cannot provide the same functionality more efficiently. The point was eventfd doesn't scale, and making that scale doesn't necessarily have to interfere with its functionality.

I think the point here is more that eventfd does not scale for that particular purpose. And I guess that applies to thread synchronization in general, because all synchronization primitives (mutexes, semaphores) in Linux have been built around futexes for quite a while.

Quoting: ShabbyX
Everything is a file. In fact the few things that Unix didn't make a file turned out to be the most problematic areas (pids and signals notably). At least the pid problem is remedied with fds (pidfd), and if signals aren't already, I'm sure they will be turned into fds too.

Well, you can see futexes as an extremely low-overhead fd too. As for signals... they are a different beast. They basically break the illusion, for user-space applications, that all of their context is safe at every point and that nothing will come along and trash their current state. Basically, they bring kernel-like issues into user space. They have their uses sometimes, but the only real way to fix them is not to use them. Turning them into fds won't fix anything. Read the comments about signals in the kernel code and you will see how much the kernel devs love them :)

Quoting: ShabbyX
I said all that to say that given how central fds are, it's worthwhile to make sure eventfd is actually efficient, rather than keep trying to work around it.

eventfd is actually efficient enough for its purpose, I would expect. But the issue is that the inner counter is, by spec, maintained by the kernel. So it means many round trips between kernel and user space, which will limit performance, and I guess that is why you cannot just poll it as much as you want. And that is the syscall spec; you can't do much about it.
Futexes, on the other hand, were made so that you only go to kernel space if you cannot take the lock ownership / if the semaphore is at 0 (basically a yield). So you have far fewer round trips with futexes. They are also stored as a simple `intptr` (just an address), whereas eventfd looks like this:
struct eventfd_ctx {
    struct kref kref;
    wait_queue_head_t wqh;
    /*
     * Every time that a write(2) is performed on an eventfd, the
     * value of the __u64 being written is added to "count" and a
     * wakeup is performed on "wqh". A read(2) will return the "count"
     * value to userspace, and will reset "count" to zero. The kernel
     * side eventfd_signal() also, adds to the "count" counter and
     * issue a wakeup.
     */
    __u64 count;
    unsigned int flags;
    int id;
};

From the look of it, eventfd will be more real-time (you will wake up as soon as something happens, if you have the priority), whereas on the futex side you will clearly wake up at your next quantum (I only see a mechanism unsuspending you, nothing scheduling you). Futexes also do not hold a list of who is waiting; they are just a counter. So the first one who comes along and retakes the lock wins, it seems. That is coherent, since the scheduler is fair anyway.
So I would say they simply serve orthogonal purposes. I would typically use eventfd for I/O-related waits, or if I need something a bit more real-time, and futexes for all the rest.
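
To make the round-trip point concrete, here is a deliberately simplified sketch of a futex-backed lock in the spirit of Drepper's "Futexes Are Tricky" paper (illustrative only, not code from the kernel or from Proton): the whole lock is a single 32-bit word in user space, the uncontended paths are one atomic operation each, and the futex() syscall is only reached under contention.

/* Simplified futex-backed lock: 0 = unlocked, 1 = locked, 2 = locked with
 * waiters. Uncontended lock/unlock never enter the kernel; FUTEX_WAIT and
 * FUTEX_WAKE are reached only under contention. No error handling. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static atomic_uint lock_word;      /* the entire lock is one 32-bit word */

static void lock(void)
{
    unsigned int c = 0;
    /* Fast path: 0 -> 1 with a single atomic, no syscall. */
    if (atomic_compare_exchange_strong(&lock_word, &c, 1))
        return;
    /* Slow path: mark the lock contended (2) and sleep until it is free. */
    if (c != 2)
        c = atomic_exchange(&lock_word, 2);
    while (c != 0) {
        futex(&lock_word, FUTEX_WAIT, 2);   /* sleep only while word == 2 */
        c = atomic_exchange(&lock_word, 2);
    }
}

static void unlock(void)
{
    /* Fast path: if the word was 1 nobody is waiting, so no syscall. */
    if (atomic_exchange(&lock_word, 0) == 2)
        futex(&lock_word, FUTEX_WAKE, 1);   /* wake one waiter */
}
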
ShabbyX Oct 12, 2021
Quoting: 3zekiel
eventfd is actually efficient enough for its purpose, I would expect. But the issue is that the inner counter is, by spec, maintained by the kernel. So it means many round trips between kernel and user space, which will limit performance, and I guess that is why you cannot just poll it as much as you want. And that is the syscall spec; you can't do much about it.
Futexes, on the other hand, were made so that you only go to kernel space if you cannot take the lock ownership / if the semaphore is at 0 (basically a yield). So you have far fewer round trips with futexes. They are also stored as a simple `intptr` (just an address), whereas eventfd looks like this:
struct eventfd_ctx {
    struct kref kref;
    wait_queue_head_t wqh;
    /*
     * Every time that a write(2) is performed on an eventfd, the
     * value of the __u64 being written is added to "count" and a
     * wakeup is performed on "wqh". A read(2) will return the "count"
     * value to userspace, and will reset "count" to zero. The kernel
     * side eventfd_signal() also, adds to the "count" counter and
     * issue a wakeup.
     */
    __u64 count;
    unsigned int flags;
    int id;
};

From the look of it, eventfd will be more real-time (you will wake up as soon as something happens, if you have the priority), whereas on the futex side you will clearly wake up at your next quantum (I only see a mechanism unsuspending you, nothing scheduling you). Futexes also do not hold a list of who is waiting; they are just a counter. So the first one who comes along and retakes the lock wins, it seems. That is coherent, since the scheduler is fair anyway.
So I would say they simply serve orthogonal purposes. I would typically use eventfd for I/O-related waits, or if I need something a bit more real-time, and futexes for all the rest.

Ack. I don't know eventfds well enough to actually have a proposal to improve them. But agreed, having futexes go through the kernel unconditionally is completely at odds with them being futexes, so a new syscall is reasonable.
HariboKing Jan 2, 2022
Hello folks,

Not really a Linux gamer, but a real-time software engineer (and contributor to the Linux kernel). Not sure how I stumbled on this post, but I just wanted to clarify a couple of points made by 3zekiel regarding futex functionality.

Quoting: 3zekiel
Futexes also do not hold a list of who is waiting; they are just a counter. So the first one who comes along and retakes the lock wins, it seems. That is coherent, since the scheduler is fair anyway.

Futexes *do* hold a list of waiters.

Before Pierre Peiffer's 'futex priority based wakeup' patch, non-Priority Inheritance (PI) futexes made use of a simple linked list to store the tasks waiting on a futex. With the old scheme, tasks were enqueued on this list when they needed to wait. Waking was then done by dequeueing the required number of waiting tasks from the list of waiters and making them runnable again. However, this scheme did not take the waiting tasks' priorities into account - waiters were woken in first-come, first-served (FIFO) order.

PI futexes, however, do not behave in the same way due to their very nature - they are priority aware (and more than that, they temporarily alter task priorities under certain conditions to avoid Priority Inversion).

Pierre's patch made changes to non-PI futexes such that futex wakeups are orchestrated with respect to the priority of awaiting tasks. See futex_wake() within kernel/futex.c for any mainline kernel version > v2.6.21.
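
A small user-space sketch of that waiter list in action (using the raw futex(2) syscall directly; illustrative only): several threads block on the same word, and because the kernel queues them per futex, each FUTEX_WAKE releases exactly the number of waiters it asks for.

/* Several threads block on one futex word; the kernel queues them per
 * futex, so FUTEX_WAKE can release exactly the requested number of
 * waiters (here: one per call). Sketch only, no error handling. */
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_uint word;           /* waiters sleep while this is 0 */

static void *waiter(void *arg)
{
    long id = (long)arg;
    /* Sleep as long as the word still holds the expected value 0. */
    while (atomic_load(&word) == 0)
        syscall(SYS_futex, &word, FUTEX_WAIT, 0, NULL, NULL, 0);
    printf("waiter %ld woken\n", id);
    return NULL;
}

int main(void)
{
    enum { N = 3 };
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);
    sleep(1);                                  /* let everyone block */

    atomic_store(&word, 1);                    /* waiters may now leave */
    for (int i = 0; i < N; i++) {
        /* Wake exactly one queued waiter per call. */
        syscall(SYS_futex, &word, FUTEX_WAKE, 1, NULL, NULL, 0);
        sleep(1);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}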

Quoting: 3zekiel
From the look of it, eventfd will be more real-time (you will wake up as soon as something happens, if you have the priority), whereas on the futex side you will clearly wake up at your next quantum (I only see a mechanism unsuspending you, nothing scheduling you). Futexes also do not hold a list of who is waiting; they are just a counter. So the first one who comes along and retakes the lock wins, it seems. That is coherent, since the scheduler is fair anyway.

On !CONFIG_PREEMPT configured kernels, the required number of waiting tasks is made runnable on a futex wake event. There is no invocation of schedule(). However, on CONFIG_PREEMPT configured kernels, a reschedule is invoked. This ultimately happens within the call chain of wake_up_q():

futex_wake()
--> wake_up_q()
    --> wake_up_process()
        --> try_to_wake_up()
            --> preempt_enable() << This function invokes a reschedule on CONFIG_PREEMPT kernels in the subsequent call chain (details below here omitted).


So the reschedule invocation depends on the build-time configuration of the kernel and is buried relatively deep within the call chain (hence why I think 3zekiel missed it - I did too, the first time I looked).

Jack