
box86 and box64 get Steam Play Proton working much better on Arm devices

Last updated: 18 Apr 2022 at 3:05 pm UTC

Are you planning to do some gaming on Arm devices? You need to take a look at the box86 and box64 projects, which are really quite impressive. Since the majority of software (especially games) is built for x86 processors (the usual Intel / AMD crop), Arm needs something stuck in between to get it running, and that's exactly what these projects do.

Split across 32bit and 64bit, box86 and box64 both recently had major updates. A fun highlight on both is that the developer noted they have plenty of fixes and improvements to get both Steam and Steam Play Proton working, so Linux gaming on Arm is about to see a nice boost with these.

It's a good way to try out some more gaming on devices like the Raspberry Pi, or other popular tiny computing systems.

Many other improvements came along, see the release notes on both:

At some point soon, I'm definitely going to need to see how well it runs on my own RPi4. Sounds like it's getting quite exciting!

Article taken from GamingOnLinux.com.
About the author -
I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly checked on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly. You can also follow my personal adventures on Bluesky.
The comments on this article are closed.

elgatil 18 Apr 2022
This! This is what I think the future of the Steam Deck is gonna be. With a powerful enough ARM SoC (something similar to the Apple M1 for example, if the competition ever manages to catch up *sigh*) we could get a big expansion in battery life.

And actually, I have been following these projects for some time and I have the impression they are picking up a lot of speed. I wouldn't be surprised if in a couple of months it is revealed the developers are being funded by Valve, like it happened with DXVK. Pure speculation here of course.


On a different note, the stack of Linux gaming is getting pretty funny:

x86 win game -> proton -> pressure vessel -> box86 -> the actual OS
I wonder how many more layers we manage to put in between :D
elmapul 18 Apr 2022
i don't think we will have the raw power to do those translations for cutting-edge games any time soon.
but who knows, most of the heavy processing will be done by the gpu anyway.

one thing is for sure, accuracy would be dead
https://arstechnica.com/gaming/2011/08/accuracy-takes-power-one-mans-3ghz-quest-to-build-a-perfect-snes-emulator/

on a side note, maybe it's a bad idea for valve, they don't want to harm the good relationship they're having with AMD.


Last edited by elmapul on 18 Apr 2022 at 10:30 pm UTC
elmapul 18 Apr 2022
"It's not like arm is new in gaming. Mobile phones have been doing it for a long time, the Switch uses arm cores."
speaking of it, arm processors would be much better to run emulators for portable consoles.
hell, it's possible to run psp (or vita?) apps on a switch without emulators!
3zekiel 19 Apr 2022
It's the Vita which can run without a full emulator, the PSP is using MIPS.
One problem though, at least for older portable consoles, is that they use the 32 bit ARM ISA, which has been dropped from newer cores. Also, emulating RISC over modern CISC tends to work very well because it reduces instruction cache bloat - one x64 instruction might cover 3 or more ARM instructions (think of LEA vs a multiplication, a shift and an addition), keeping the generated code small. So it's not 100% sure that emulating ARM 32 over ARM 64 will be faster than emulating on top of x64.

As for emulating x64 over ARM, it is quite costly... The best way to do it is to go semi hardware like Apple did with the M1 (implement a bunch of x64 instructions in hw - mostly memory related -, use x64 memory ordering etc etc). Without that, I'm afraid taking a big overhead is mostly unavoidable, making recent games unplayable.
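
The LEA comparison above can be sketched in a few lines of Python. This is only an illustrative model, not emulator code: the names `lea` and `risc_sequence` are made up for the example, and registers are assumed to be 64 bits wide with wraparound. Real ARM can often fold the shift into the add, so the three-step form is just the generic case described in the comment.

```python
MASK64 = (1 << 64) - 1  # model 64-bit register wraparound


def lea(base: int, index: int, scale: int, disp: int) -> int:
    """One x64-style LEA: base + index*scale + disp as a single instruction."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x64 addressing only allows a scale of 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


def risc_sequence(base: int, index: int, shift: int, disp: int):
    """Same value built from simple steps; returns (value, instruction count)."""
    tmp = (index << shift) & MASK64  # shift stands in for the multiply
    tmp = (tmp + base) & MASK64      # add the base address
    tmp = (tmp + disp) & MASK64      # add the constant displacement
    return tmp, 3


one_step = lea(0x1000, 3, 4, 8)
many_steps, count = risc_sequence(0x1000, 3, 2, 8)
```

Both paths compute the same address; the only difference the sketch tracks is how many guest instructions a translator would have to cover.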
elmapul 19 Apr 2022
i saw a video explaining it, and it was quite the opposite!
arm is better to emulate x86 than x86 to emulate arm!
the video is in portuguese so i'm not sure it's gonna be useful here, but the explanation was something like:
you can draw a square by drawing 4 lines, but you waste a lot of processing power if you have to draw an entire window with 1px of width every time you want a vertical line, and an entire window with 1px of height every time you want a horizontal line.
the x86 complex instruction set is only useful when most of those instructions get used often, but that simply is not the case. many instructions were put there to cheat on benchmarks, or because hardware patents don't last forever and intel's priority was not being copied instead of designing an efficient chip. in fact, most x86 instructions are already "emulated" using micro architecture or something like that in plain x86 chips.
(i say x86 but i mean both x86 and x86/64, it's just laziness)

i don't remember the exact explanation of why arm was better, but it was something like x86 has a number of instructions that vary too much to be predictable or anything like that.
the processor spends a lot of time trying to figure out the instruction instead of executing it.
anyway, i hope someone else who works in the area can figure out what i'm talking about and explain it in better/more precise words. =p
3zekiel 19 Apr 2022
i saw a video explaining it, and it was quite the opposite!
arm is better to emulate x86 than x86 to emulate arm!
the video is in portuguese so i'm not sure it's gonna be useful here, but the explanation was something like:

Hmm I do not speak Portuguese, but I did work on that quite a lot, and it goes completely against the benchmarks I did (and hell I did a lot). My thought is that the video is confusing power efficiency and performance. I will answer point by point and try to explain.

you can draw a square by drawing 4 lines, but you waste a lot of processing power if you have to draw an entire window with 1px of width every time you want a vertical line, and an entire window with 1px of height every time you want a horizontal line.

That's true, but in the end, what you do with memory accesses and vector computation is fairly predictable and standard, so what the modern CISCs do is concentrate on packing those operations. So the cases where you overwork will be rare. The result is that you end up with more compact instructions, which are more cache efficient, and potentially give more context to the backend hw optimizer - allowing it to perform better.
If we're talking old i386 instructions then yeah, that would be a valid point, but not on modern x64.

Btw, ARM has had some CISC sides for years now, be it the way it handles register save and restore, predicated instructions, or some level of offset load. I did not look at the most recent ISAs, but I would bet it got more CISC-y rather than less. In the end, when you go for performance, you hardly have any choice.

the x86 complex instruction set is only useful when most of those instructions get used often, but that simply is not the case. many instructions were put there to cheat on benchmarks, or because hardware patents don't last forever and intel's priority was not being copied instead of designing an efficient chip. in fact, most x86 instructions are already "emulated" using micro architecture or something like that in plain x86 chips.
(i say x86 but i mean both x86 and x86/64, it's just laziness)

Indeed x86 instructions are "emulated", like in most super-scalar architectures, but it is actually to obtain better performance. I did not check, but I guess server grade ARM is too. When you want to achieve very high throughput, it's pretty much the only way.
What essentially happens is that an ISA is exposed (ARM/x64/POWER), made in a way that it is retro-compatible with older chips, and user-friendly to some level. But the CPU actually executes "micro instructions" which are made to be executable more efficiently / faster. This helps an insane lot with resource allocation too (floating point units, integer ALUs, "real" registers). Thus it allows the CPU to execute as many instructions in parallel as it possibly can. As such, this is actually positive in terms of performance, even if it's a bit counter intuitive.
You can take a look at the work which was done on the "Dynamo" JIT, which does the same in SW for older RISC CPUs, resulting in faster code even though you have a JIT in the middle. Nvidia with their "Denver" ARM arch made a half HW half SW solution doing just that too.
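
To make the "micro instructions" idea a bit more concrete, here is a toy Python sketch. It assumes a made-up decode table (`MICRO_OPS`) and a made-up helper (`crack`); the mnemonics and the way each instruction is split are purely illustrative and not taken from any real CPU.

```python
# Hypothetical decode table: visible ISA instruction -> internal micro-ops.
# The splits below are illustrative assumptions, not a real microarchitecture.
MICRO_OPS = {
    "add reg, [mem]": ["load tmp, [mem]", "add reg, tmp"],
    "add [mem], reg": ["load tmp, [mem]", "add tmp, reg", "store [mem], tmp"],
    "add reg, reg": ["add reg, reg"],
}


def crack(program):
    """Expand each visible instruction into the micro-ops the backend would run."""
    return [uop for instr in program for uop in MICRO_OPS[instr]]


uops = crack(["add reg, [mem]", "add reg, reg"])
```

The backend then only has to schedule simple, uniform operations, which is what makes the resource allocation mentioned above tractable.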

Also, on the point of instructions not being used, well, then they only cost a few transistors here and there. Looking at what takes space in a CPU, it is NOT the decoder. Caches and register files largely dominate.

Overall, all of this does cost power and area. Duplicating pipelines, resources and co will not come for free. But truth is, there are no real alternatives. Intel and others have tried to switch to more bare architectures, in particular with "VLIW" (Very Long Instruction Word) or "EPIC" (Explicitly Parallel Instruction Computing) - as Intel calls it - style ISA/CPUs, where all the work is done at compile time instead of dynamically, but truth is, it just flat out does not work. Dynamic optimization of resources is always better on general code. Always. Such static approaches only work on a restricted set of program types.
And yes, if you are very constrained, then the pure RISC approach will actually win, but this is less and less true as lithographies get better and we can pack more and more transistors per mm squared.
Also, anyway, at high throughput, the prefetchers and branch prediction that mirv talked about are vital to RISC too, this is mostly due to needing deeper pipelines at high frequency, and this blasted memory latency wall that poisons us all ...

i don't remember the exact explanation of why arm was better, but it was something like x86 has a number of instructions that vary too much to be predictable or anything like that.

That's for the HW decoder yep, varying size instructions are kinda harder to decode. Bad news is, even in the RISC world they exist to a point and are a necessary evil.
In short, it's true that a strict RISC gives you very regular instructions to decode. Say each instruction is 8 bytes wide on a 64 bit CPU, each has a 16 bit opcode at the start, the source register is at bit 24, the dest at bit 32, and the immediate, if there is one, is in the rest. Of course, you can decode that faster.
BUT, and there is a big BUT, if you just want to push a register on the stack, then you only need an opcode and a register, so you would barely use 24 bits out of those 64 you reserved. Thus you are wasting a lot of space. Also, since your instructions are very strict, saying that you want to do an offset memory access requires you to do
 
mov rx, SOME_ADDR
addi rx, SOME_IMMEDIATE
load ry, rx

Where each time you will use the full 64 bit instruction,
whereas on CISC that would be
mov ry, SOME_IMMEDIATE[SOME_ADDR]
where you have only one opcode, only one dest register, and the same ADDR/IMMEDIATE pair as before.
On Intel, pushing a register in x64 is only one byte (!!), where on a pure RISC like the one above this will be 8 bytes.
Considering the price, both in terms of area and power, of each bit of instruction cache, I think you see why most high throughput arches go the way of superscalar / more CISC-like stuff. Once again, Apple's M1 is actually borrowing Intel/CISC-like instructions for these things.
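
The size argument above can be put into numbers with a small Python sketch. It assumes the hypothetical strict RISC from the comment (every instruction is 8 bytes) and invents field sizes for the variable-length encoding (1 byte opcode, 1 byte register/addressing, 4 byte address, 4 byte offset); none of these figures describe a real ISA.

```python
# Hypothetical encodings for illustration only, not a real ISA.
FIXED_WIDTH_BYTES = 8  # strict RISC from the comment: every instruction is 8 bytes

# Offset load on the fixed-width machine: three full-width instructions.
risc_sequence = ["mov rx, SOME_ADDR", "addi rx, SOME_IMMEDIATE", "load ry, rx"]

# Same work on the variable-length machine: one instruction, sized as
# opcode + register/addressing byte + 4-byte address + 4-byte offset (assumed).
cisc_sequence = {"mov ry, SOME_IMMEDIATE[SOME_ADDR]": 1 + 1 + 4 + 4}


def fixed_width_size(instructions):
    """Total code bytes when every instruction has the same width."""
    return len(instructions) * FIXED_WIDTH_BYTES


def variable_size(instructions):
    """Total code bytes when each instruction carries its own length."""
    return sum(instructions.values())


risc_bytes = fixed_width_size(risc_sequence)
cisc_bytes = variable_size(cisc_sequence)
```

Under these assumptions the fixed-width sequence takes 24 bytes of instruction cache and the single variable-length instruction takes 10.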

the processor spends a lot of time trying to figure out the instruction instead of executing it.
anyway, i hope someone else who works in the area can figure out what i'm talking about and explain it in better/more precise words. =p

I kinda see the point, but it is only valid if you have a very tight power/transistor budget, and can't afford deep / multi pipeline CPU backends.
As soon as you have a multi issue CPU with deep pipelines, the decode stage becomes negligible. Not to mention that the big CPUs are able to decode whole cache lines in parallel anyway, making that price even less important.
If we are talking IoT, or embedded CPUs, then yes, valid point.

To summarize,
Performance wise: pure RISC is very efficient when you are in full control of what you execute, think of very compute intensive stuff on a very dedicated subject, where you can do an insane amount of static optimizations. However, as soon as you have something which is more general and dynamic, the superscalar / CISC approaches win over the static VLIW-style ones.

Power efficiency / Area efficiency wise: In constrained scenarios, RISC wins. As soon as you have enough area/power to go wide-issue (>= 4 wide issue) with large parallel decoders, the difference will be low.

Emulation wise: A good JIT will see the patterns of mov / add / shift / load and translate them to single instructions on the host, allowing it to keep the instruction cache cost very low. And that is where the gain is. Conversely, ARM to x64 would have a big inflation in terms of code (I measured as high as x3 inflation on pure execution code when it had a lot of control, and about 70 to 100% on more compute code, completely neglecting the emulator's control code). On pure performance, I saw a lower performance hit on the ARM to x64 side than on the x64 to ARM side. But it's hard to validate that measurement, as it's hard to compare smaller ARM cores to full fledged x64 cores. M1 is cheating as it borrows some HW emulation too. The inflation on the other hand is a good metric, as it will lead to many more cache misses, prefetch cost and so on.
Which leads me to a last note: if you use some level of HW emulation, well, who cares which ISA you use for that purpose, by definition you implemented the problematic parts in HW.
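
The pattern matching a JIT does for the mov / add / load case can be sketched as a tiny peephole pass in Python. Everything here is hypothetical: the tuple format, the `load_off` host opcode and the function name `fuse_offset_loads` are invented for the example, and the pass assumes the temporary register is not read again afterwards, which a real translator would have to check.

```python
from typing import List, Tuple

Instr = Tuple[str, ...]  # (opcode, operand, ...) in a made-up guest format


def fuse_offset_loads(guest: List[Instr]) -> List[Instr]:
    """Fuse `mov r, addr; addi r, imm; load d, r` into one host instruction.

    Assumes register r is dead after the load (not verified here).
    """
    host: List[Instr] = []
    i = 0
    while i < len(guest):
        window = guest[i:i + 3]
        if (len(window) == 3
                and window[0][0] == "mov"
                and window[1][0] == "addi" and window[1][1] == window[0][1]
                and window[2][0] == "load" and window[2][2] == window[0][1]):
            addr = window[0][2]
            imm = window[1][2]
            dest = window[2][1]
            host.append(("load_off", dest, addr, imm))  # single host instruction
            i += 3
        else:
            host.append(guest[i])  # no pattern: pass the instruction through
            i += 1
    return host


guest_code = [
    ("mov", "r1", "0x1000"),
    ("addi", "r1", "8"),
    ("load", "r2", "r1"),
    ("add", "r3", "r2"),
]
host_code = fuse_offset_loads(guest_code)
```

Three guest instructions collapse into one host instruction, which is the code-size effect described above; going the other way, each compact host-style instruction would have to be expanded back into several simple ones.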

Hope that I was clear.
elmapul 19 Apr 2022
The designs of each have become far too complex to be summarised properly in just a short paragraph or two, but some highlights:
  • arm is a design, which can be customised. Specific instructions can be added to the hardware - not all "arm" chips are equal!

  • arm is risc (reduced instruction set), x86/x86_64 is cisc (complex instruction set)

  • These days (well, last I checked) cisc is sort of emulated with microcode - that is, smaller instructions are used to build and run the complex ones. This is done to reduce chip size and complexity, but makes the instruction decode somewhat more entertaining from an engineering standpoint.

  • Branch prediction, long pipelines, decode units, and a whole host of extras are generally somewhat more beefy on x86/64 chips to get raw performance out of it, and all of that takes up an awful lot of chip space. One of the reasons arm can go smaller, more power efficient, is by minimising or removing some of that - at the cost of performance.

  • As manufacturing technologies improve, there are fewer gains to be had by x86/64, and arm can catch up in those areas, but it's still going to lack certain instructions that help make software run a lot faster (if the software uses them!).

Basically I can see arm right now being able to take over on normal desktop usage (Apple has certainly shown it's quite possible), but it's still a very, very long way away from what x86_64 can do when the thermal and power designs are more permissive. Only it's not quite so simple as that because hardware aside, there's quite a good deal of back & forth with software as well (just like with GPUs). All the fancy instructions in the world are useless if software never uses them - but x86/x86_64 has had a very long time in the spotlight, and compilers can make various assumptions about what is supported or not. Apple get away with it on the M1 because they control everything (they can add instructions and know they'll be used).

Both arm and x86_64 serve different purposes, but requirements change over time and the lines between purposes are blurring more all the time and there's space for a range of options.

this^ give that man a cookie, not only did he explain 99% of what i said better than myself, but he also added some extras.
elmapul 19 Apr 2022
My thought is that the video is confusing power efficiency and performance. I will answer point by point and try to explain.

"nope, the guy really knows his stuff, he even made a video to talk about trade offs.
often you exchange processing power with energy efficiency, size, etc"
he said other examples instead of etc.
he knows that often one tech is not better than the other, it's just better at a specific thing.
more often than not.
3zekiel 19 Apr 2022
My thought is that the video is confusing power efficiency and performance. I will answer point by point and try to explain.

"nope, the guy really knows his stuff, he even made a video to talk about trade offs.
often you exchange processing power with energy efficiency, size, etc"
he said other examples instead of etc.

OK, then my misunderstanding.

he knows that often one tech is not better than the other, it's just better at a specific thing.
more often than not.

Yup, it kinda summarizes it all. In the case of ISAs, you could also say that the answer is often in the middle ... A pure RISC, except in constrained cases, isn't going to get you very far. And a pure CISC (as in, CPUs that desperately try to implement every last special case, super complex "zero overhead" whatever instruction) is going to be inefficient as hell - I did say that a decoder, in the case of x64, can withstand useless instructions, but as you can imagine, that's only true to a point. Then depending on your use case you will also want to take from more esoteric approaches (DSP stuff that has a whole vector manipulation lib in HW as an example).
And at the very end, unless you completely screwed up your ISA (which is rare considering the guys that design ISAs are usually good at what they do), once you go higher in power, the backend is going to count more and eventually dominate (I'd say in the 10~15w+ scenario, with modern lithographies, you already won't see that much difference anymore).

I was also reacting because I have seen a lot of mixups since the M1 chip came out between what is ARM and what is Apple. The M1 chip is insane, but it has little to do with ARM in fact. The guys at Apple did an insane amount of work on everything around the core, like crazy interconnects that can be exposed outside the chip so as to be able to basically stack chips, RAM, and everything you need right on the die - which brings insane advantages in terms of performance and scalability. They also extended the ARM ISA a fair bit, and cut parts here and there to make it more efficient (that makes it not very interoperable though ...), and especially better at emulation. And they are also helped by having the best available lithography out there (and an insane load of cash to pay themselves 600mm squared dies ...).
And I do see a lot of trashing of x64 here and there (visibly not from your guy), because people mix up the low power side, at which x64 does suck, with the whole area of computing - including higher power computing, at which x64 is suddenly much better.

ARM wise, they actually added some very CISC-y stuff lately, in particular for matrix manipulation, but they made the choice of keeping pure 32 bit size instructions. I am actually curious why they did that, as it ties their hands on multiple things: register count, having to drop some instructions to free up some space ... Well, I guess they did have their reasons, just very curious what they are. They used to have a dual mode (compressed 2 byte, and full 4 byte instructions) which I personally found very useful too, but dropped it, likely due to lack of opcode width. But overall it's a good arch for embedded (I include phones in that) / specialized use cases. More excited about RISC-V though, but mostly due to its openness :)