Turns out that switching my LAV Filters decoder from DXVA2 (copy-back) to NVIDIA CUVID gives better performance... but that alone isn't quite enough.
Chainik wrote:also limiting frc to x2
Oh I'm an idiot; I know that x5 is quite a bit more GPU-intensive than x4, so I should have expected that x2.5 would be more demanding on the GPU than x2! That makes 24 & 25fps content work great and I can crank up the settings to much higher levels (uniform + complicated + one pixel + 8px; half pixel is too CPU-intensive).
However, 720p 30fps AVC content seems to still be too demanding...but I've got something for that as well.
You see, to really determine whether I was in fact GPU-limited, I underclocked the GPU shaders, and my SVP performance dropped accordingly. That got me thinking: since my temperatures are definitely cool enough (I've got a 120mm fan blowing directly onto the GPU and I even tested with FurMark), what if I overclocked the GPU?
Sure enough, once I set the shader clock to around 1650MHz, 720p 30fps AVC "just worked" (sometimes as low as 1610MHz would work, but with my test video 1650MHz always worked). Not only that, but the GPU seems stable even above 1800MHz (maybe because it's a Quadro and therefore binned higher?), so something like 1700MHz should be well within the stable range while still giving me a decent amount of headroom. Oh, and this is all without even altering the voltage (which cannot be changed anyway).
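To put rough numbers on that headroom, here's a quick back-of-the-envelope sketch in Python. The function name is made up for illustration, and the clocks are just the ones I quoted above; it only does the percentage arithmetic and says nothing about actual stability.

```python
def margin_percent(lower_mhz, upper_mhz):
    """Relative gap between two clocks, as a percentage of the lower one."""
    return (upper_mhz - lower_mhz) / lower_mhz * 100

# Clocks quoted in this post (shader clock, MHz).
WORKS_AT = 1650      # 720p 30fps AVC always worked here with my test video
PLANNED = 1700       # what I'm thinking of running day to day
STABLE_UP_TO = 1800  # still looked stable at this point

# Performance margin: how far the planned clock sits above the minimum that worked.
print(f"performance margin: {margin_percent(WORKS_AT, PLANNED):.1f}%")

# Stability margin: how far the planned clock sits below the highest clock tested.
print(f"stability margin:   {margin_percent(PLANNED, STABLE_UP_TO):.1f}%")
```

So 1700MHz leaves a few percent of margin on both sides, assuming the test video is representative.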
But there is still some bad news: 1080p 30fps downscaled to 720p is still too demanding. For reference, using SVP to crop the video to 720p works perfectly fine, but downscaling does not. Is there any way to just do some light-and-fast CPU-based bilinear downscaling or something? (You know, similar to what MPC-HC's non-PS2.0 bilinear resizer does.)
Chainik wrote:setting interpolation mode to "1m" should help
This doesn't seem to make a difference when using x2, or if it does then it's very minor (both uniform and 1m lagged at 1600MHz GPU shader but worked at 1650MHz), and using 1m + x2.5 interpolation is too slow even with my GPU shader overclocked to 1800MHz.
Since my display can't handle 48Hz but can handle 50Hz (and 60Hz), I may just look into using Reclock or similar to speed up 24fps content to 25fps.
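The frame-rate arithmetic behind that idea, as a small Python sketch. The helper names are my own for illustration, and I'm treating "24fps" as exactly 24 (a lot of content is really 23.976, which would shift the numbers slightly).

```python
from fractions import Fraction

# Refresh rates my display accepts, per the above; 48Hz is not one of them.
SUPPORTED_HZ = {50, 60}

def output_rate(source_fps, multiplier):
    """Frame rate after frame-rate conversion by a fixed multiplier."""
    return Fraction(source_fps) * Fraction(multiplier)

def speedup_percent(source_fps, target_fps):
    """How much faster playback must run to turn source_fps into target_fps."""
    return float((Fraction(target_fps) / Fraction(source_fps) - 1) * 100)

for fps in (24, 25, 30):
    out = output_rate(fps, 2)
    status = "ok" if out in SUPPORTED_HZ else "not supported"
    print(f"{fps}fps x2 -> {out}Hz ({status})")

# Speeding 24fps up to 25fps, so that x2 lands on 50Hz instead of 48Hz.
print(f"24 -> 25fps speed-up: {speedup_percent(24, 25):.2f}%")
```

In other words, x2 only lands on a supported refresh rate for 25 and 30fps sources, and getting 24fps content there means playing it roughly 4% fast.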
Lastly, may I ask for more details on just what those two override.js settings you provided are doing? In particular, if I comment out "smooth.linear = false;" the performance is still fine with my shader at 1650MHz, but if I instead comment out "smooth.cubic = 0;" then performance isn't good unless I use a shader clock of around 1775MHz (the same goes if I comment out both settings).