While doing some comparative benchmarks between my RX 470 and GTX 1060 on a Ryzen 1700 CPU and an i7-2700k CPU, I encountered odd behaviour with Shadow of Mordor.
At 1080p with the high preset, this benchmark is almost exclusively CPU-bound on both a Ryzen 1700 (3.75 GHz) and an i7-2700k (4.2 GHz). So when I got 30 to 40% better performance on the i7 than on the Ryzen with the GTX 1060, I was shocked and began to investigate what was causing such a performance drop on Ryzen.
Interesting to note is that, on Ryzen, the performance of the GTX 1060 and the RX 470 was identical in CPU-bound parts of the benchmark, even though AMD’s open source driver (Mesa 17.2-git in this case) still has a significantly higher CPU overhead than Nvidia's proprietary driver. So this pointed to a driver-independent bottleneck on the game side itself.
With that information, I started suspecting a thread allocation problem, either from the Linux kernel (4.12rc1) or from the game (if it forces the scheduling through CPU affinity).
You see, Ryzen has a specific architecture, quite different from Intel's i5 and i7. Ryzen is a bit like some sort of CPU Lego, with the CCX being the base building block. A CCX (core complex) comprises 4 CPU cores with SMT (simultaneous multithreading) and the associated memory caches (level 1 to 3). So a mainstream Ryzen CPU is made of 2 CCXes linked by AMD’s Infinity Fabric (a high speed communication channel). Even the 4 core Ryzen CPUs are made this way (on these CPUs, two cores are disabled in each CCX).
If you’re interested in the subject, you can find more in-depth information here: Anandtech.com review of Ryzen
So how does this all relate to Shadow of Mordor? Well, AMD’s architecture is made to scale efficiently to high core numbers (up to 32), but it has a drawback: communication between CPU cores that are not on the same CCX is slower because it has to go through the Infinity Fabric.
On a lot of workloads this won’t be a problem because threads don’t need to communicate much (for example in video encoding, or serving web pages) but in games threads often need to synchronize with each other. So it’s better if threads that are interdependent are scheduled on the same CCX.
This is not happening with Shadow of Mordor, so performance takes a huge hit, as you can see in the graph below.
This graph shows the FPS observed on a Ryzen 1700 @ 3.75 GHz and an RX 470 during the automated benchmark of Shadow of Mordor. The blue line shows the FPS with the default scheduling and the red line with the game forced onto the first CCX. The yellow line shows the performance increase (in %) going from default to manual scheduling.
As you can see, manual scheduling yields roughly a 30% performance improvement in CPU-bound parts of the benchmark. Quite nice, eh?
So how does one manually schedule Shadow of Mordor on a Ryzen CPU?
It’s quite simple really. Just edit the launch options of the game in Steam like this:
taskset -c 0-7 %command%
This command will force the game on logical cores 0-7 which are all located on the first CCX.
Note: due to SMT, there are twice as many logical cores as physical cores. This is because SMT allows two threads to run simultaneously on each physical core (though not both at full speed).
The above command is for an 8 core / 16 threads Ryzen CPU (model 1700 and higher).
On a 6 core Ryzen (models 1600/1600X), the command would be:
taskset -c 0-5 %command%
and on a 4 core Ryzen (models 1400/1500X):
taskset -c 0-3 %command%
Caveat: on a 4 core Ryzen limiting the game to the first CCX will only give it 2 cores / 4 threads to work with. This may prove insufficient and counter-productive compared to running the game with the default scheduling. You’ll have to try it for yourself to see what option gives the best performance.
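Rather than trusting the hard-coded ranges, you can ask the kernel which logical CPUs share a level 3 cache. The sketch below is only an illustration: it assumes that each L3 cache domain reported by sysfs corresponds to one CCX (which matches the description of first-generation Ryzen above), and the function name `list_l3_domains` and its optional directory argument are made up for this example.

```shell
#!/bin/sh
# Print one line per distinct L3 cache domain, listing the logical CPUs
# that share it. Assumption: on first-generation Ryzen, one L3 domain
# corresponds to one CCX.
# $1 is an optional sysfs CPU directory (hypothetical override, handy for
# trying the function against a copy of the tree).
list_l3_domains() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cache/index*; do
        # Skip entries without a readable cache level (or an unmatched glob).
        [ -r "$dir/level" ] || continue
        if [ "$(cat "$dir/level")" = "3" ]; then
            cat "$dir/shared_cpu_list"
        fi
    done | sort -n -u   # every CPU reports its domain, so drop the repeats
}
```

Calling `list_l3_domains` with no argument reads the live system. If it prints two ranges, the first one is the list to pass to `taskset -c`.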
Due to its specific architecture, Ryzen needs special care in thread scheduling from the OS and games. If you think a game does not have the performance level it should have you can try forcing the scheduling on the first CCX and see if it improves performance. In my (admittedly limited) experience though, Shadow of Mordor is the only game where manual scheduling mattered. The Linux scheduler does a pretty good job usually.
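If you try this on another game, it is worth checking that the affinity was really applied before comparing frame rates. The kernel exposes the allowed CPUs of a process in its `/proc/<pid>/status` file, on the `Cpus_allowed_list` line. A minimal sketch follows; the helper name `allowed_cpus` is hypothetical.

```shell
#!/bin/sh
# Print the CPU list a process is allowed to run on.
# $1 = path to a /proc/<pid>/status file (defaults to the reader's own).
allowed_cpus() {
    awk -F':[ \t]*' '/^Cpus_allowed_list/ { print $2 }' "${1:-/proc/self/status}"
}
```

For a game started with `taskset -c 0-7 %command%`, running `allowed_cpus /proc/<pid>/status` (with the game's process ID filled in) should print `0-7`. `taskset -cp <pid>` reports the same information.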
So instead:
taskset -c 8-15 %command%
Last edited by octra on 28 May 2017 at 3:52 am UTC
A few days ago I read on another site about a game that only uses the first 4 CPU threads.
That's fine if you have a Celeron, Pentium, Core i3 or even a Core i5. But on a Core i7, where every two logical threads belong to one physical core (same as on a Core i3), the game ends up on 4 logical threads and only two physical cores. So the Core i7 performs about like a Core i3 (which has just 2 physical cores), or about half as well as a Core i5 (where the game gets 4 physical cores).
You should also know that this is unnecessary if you can afford high speed memory. The Infinity Fabric runs at the same speed as the RAM, so higher RAM clocks == higher Infinity Fabric performance. Now you finally have a reason to spend money on fast memory :)
It's worth mentioning that you can use this for other great things. For example, on a Ryzen 7 you can run the game on the first 8 threads and OBS on the second 8 (CCX1 and CCX2). That way the game and the recording never compete for the same CPU cores. This works amazingly well.
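The split described above comes down to simple arithmetic on the logical CPU numbers. The sketch below assumes logical CPUs are numbered contiguously per CCX, as in the article's examples. The helper `ccx_range` is made up for illustration, and `obs` is assumed to be the name of the OBS binary on your system.

```shell
#!/bin/sh
# Compute the logical CPU range of one CCX, assuming contiguous numbering.
# $1 = total logical CPUs, $2 = number of CCXes, $3 = zero-based CCX index
ccx_range() {
    per_ccx=$(( $1 / $2 ))
    first=$(( $3 * per_ccx ))
    last=$(( first + per_ccx - 1 ))
    printf '%s-%s\n' "$first" "$last"
}

# Print (not run) the two commands for an 8 core / 16 thread Ryzen.
echo "game launch options: taskset -c $(ccx_range 16 2 0) %command%"
echo "recorder:            taskset -c $(ccx_range 16 2 1) obs"
```

The same helper gives the 0-5 and 0-3 ranges from the article when called with 12 or 8 logical CPUs.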
CPUs will always have some architectural bottlenecks, so fixing such issues on-die will not be an option (as fixing this issue will cause others...). µcode will not help, as it's a scheduling issue, and µcode has nothing to do with scheduling.
The right place to "fix" this issue is the task scheduler of the OS, and I would be very surprised if we didn't see patches from AMD that address such issues soon.
Edit: As far as I know there are (were?) the same issues on windows. People were even suggesting to treat Ryzen as NUMA due to the relatively long access times for cache on the other CCX.
Last edited by soulsource on 28 May 2017 at 7:56 am UTC
Do game developers have to act, or does the kernel have to fix this issue?
Pretty much as expected by the looks of it.
Core 1 handles logical CPUs 1 and 2, and so on.
To stop the stuttering, it may be better to pin the game to the real cores rather than to the first 8 logical ones.
[pete@com1 ~]$ cat /proc/cpuinfo |egrep "processor|physical id|core id" | sed 's/^processor/\nprocessor/g'
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
processor : 2
physical id : 0
core id : 1
processor : 3
physical id : 0
core id : 1
processor : 4
physical id : 0
core id : 2
processor : 5
physical id : 0
core id : 2
processor : 6
physical id : 0
core id : 3
processor : 7
physical id : 0
core id : 3
processor : 8
physical id : 0
core id : 4
processor : 9
physical id : 0
core id : 4
processor : 10
physical id : 0
core id : 5
processor : 11
physical id : 0
core id : 5
processor : 12
physical id : 0
core id : 6
processor : 13
physical id : 0
core id : 6
processor : 14
physical id : 0
core id : 7
processor : 15
physical id : 0
core id : 7
Edit: this works better for me:
taskset -c 0,3,5,7,9,11,13,15 %command%
Last edited by pete910 on 28 May 2017 at 11:33 am UTC
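A list with one logical CPU per physical core can also be derived from the `/proc/cpuinfo` fields shown above instead of being typed by hand. This is a rough sketch, not a tested tool: it picks the first logical CPU it sees for each (physical id, core id) pair, so on the layout above it would print the even-numbered CPUs rather than the exact list used in the comment. The function name `one_cpu_per_core` is hypothetical.

```shell
#!/bin/sh
# Print a comma-separated list with one logical CPU per physical core.
# $1 = cpuinfo-style file (defaults to /proc/cpuinfo).
one_cpu_per_core() {
    awk -F': *' '
        /^processor/   { cpu = $2 }     # current logical CPU number
        /^physical id/ { pkg = $2 }     # socket the CPU belongs to
        /^core id/ {
            key = pkg ":" $2
            if (!(key in seen)) {       # first logical CPU of this core
                seen[key] = 1
                if (list == "") list = cpu
                else list = list "," cpu
            }
        }
        END { print list }
    ' "${1:-/proc/cpuinfo}"
}
```

The printed list can then be pasted after `taskset -c` in the launch options.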
Unfortunately I can't test with higher speeds for now, as 2666 MHz results in a boot loop and I need to remove the BIOS battery / clear CMOS to get the computer to boot again.
This is probably because I have dual rank memory that isn't playing well with Ryzen for now (I bought it really cheap last summer, when we didn't know Ryzen would be that picky with RAM).
I expect the new BIOS updates in June to improve things (especially with the ability to choose the 2T command rate), so I'll test again if I can get the memory to run faster.