BIOS 1.6 wasn't buggy at all for me; it was 1.7 that was bad, with the slow BIOS for example. So they already fixed that in 1.8? That's good to hear. I do wonder about the performance though, with the microcode updates.
About the RAM: funnily enough, my G.Skills can't really do CR1T either at 3600 CL16 with otherwise optimized timings (well, they can, but then the RTLs/IOLs are trained all wonky).
CR1T:
tRTL CHA D1R0: 64
tRTL CHB D1R0: 59
tIOL CHA: 14
tIOL CHB: 7
CR2T:
tRTL CHA D1R0: 59
tRTL CHB D1R0: 60
tIOL CHA: 7
tIOL CHB: 6
RTLs/IOLs being too far apart after memory training is always a bad sign, as you may know. It's like the board is saying "I don't like this". I've tried everything to get them to train more cleanly, like they do at CR2T, to no avail. Raising VDIMM, VCCSA and VCCIO all led nowhere, "IO Compensation" was a dead end too, as were fixed values and fixed RTL/IOL init values. So I just went to CR2T. CR1T was stable, but the weird values were a bad sign.
I did extensive testing with all of the secondary timings and some of the tertiary timings. One thing you should try is tFAW 20 instead of 16. This is a bit of a secret: everyone recommends setting tFAW to the lowest value of 16. But set it to 20 and compare the "copy" performance.
As you may know, tRRD and tFAW are connected: tRRD is minimum 4, and tFAW is minimum 4x tRRD, so minimum 16. But most people are unaware of what these options actually do.
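Those floors are easy to check mechanically. Here's a tiny Python helper that rejects combinations below them (an illustrative function of my own, not any real BIOS or JEDEC API):

```python
def validate_timings(tRRD, tFAW):
    """Reject tRRD/tFAW combinations below the minimums described above.

    tRRD must be at least 4, and tFAW at least 4x tRRD (so 16 at tRRD 4).
    Purely illustrative; real boards enforce this during POST.
    """
    if tRRD < 4:
        raise ValueError("tRRD below minimum of 4")
    if tFAW < 4 * tRRD:
        raise ValueError(f"tFAW must be at least 4 x tRRD = {4 * tRRD}")
    return True

validate_timings(4, 16)  # minimum legal combination
validate_timings(4, 20)  # the "secret" setting, also legal
```

So tFAW 20 at tRRD 4 is perfectly in spec, just slightly above the floor.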
tRRD (Row-To-Row Activation Delay) and tFAW (Four Activate Window) were introduced to limit the maximum power consumption and temperature of a RAM module. A row activation draws a lot of power, far more than the actual reading of data from the memory bank. For a row ACT, thousands of transistors spring into action, causing a spike in power draw each time. Only a REF (refresh cycle) draws even more power than an ACT, which is why no ACTs are allowed during a refresh cycle.
Now, if there were no delay between row activations, all the power draw spikes would overlap, and the total power draw would exceed the limits of the DDR4 spec. So the first countermeasure is tRRD, which defines the minimum time between two consecutive ACT commands.
tFAW furthermore defines a time window in which at most four consecutive ACTs are allowed. If tFAW is set to 4x tRRD, it imposes no further delay. If tFAW is increased, there will be an additional delay after every four ACT commands.
Here's the configuration with tRRD 4 & tFAW 20: four ACT commands are processed, then the fifth ACT is held back, because the tFAW window, in which only four ACTs are allowed, has not yet expired.
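That sequencing can be sketched in a few lines of Python (a toy model of my own, not real memory-controller logic; times are in cycles):

```python
def act_issue_cycles(n_acts, tRRD, tFAW):
    """Earliest cycle at which each of n_acts ACT commands can issue,
    respecting tRRD (min gap between consecutive ACTs) and tFAW
    (any four consecutive ACTs must span at least tFAW cycles)."""
    issued = []
    for _ in range(n_acts):
        # tRRD: wait at least tRRD after the previous ACT.
        t = issued[-1] + tRRD if issued else 0
        # tFAW: the window opened by the ACT four commands ago must expire.
        if len(issued) >= 4:
            t = max(t, issued[-4] + tFAW)
        issued.append(t)
    return issued

print(act_issue_cycles(6, 4, 16))  # [0, 4, 8, 12, 16, 20]
print(act_issue_cycles(6, 4, 20))  # [0, 4, 8, 12, 20, 24]
```

With tFAW 16, the ACTs stream out every tRRD cycles with no extra pause; with tFAW 20, the fifth ACT has to wait until cycle 20, exactly the "declined fifth ACT" described above.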
Of course, if tFAW is set unnecessarily high, it will hurt bandwidth. But why is the copy performance better with a slight additional delay, i.e. tFAW 20 instead of 16, without affecting other things negatively? It might be that the delay allows other commands to be processed sooner. After all, data is not only read and written, but also copied within the RAM.
This is just my theory, based on my observations. But you might wanna try that little secret.
