VB_SVP wrote:Async Compute is related to HSA as it enables the GPU to be used to perform computation in lieu of the CPU without tanking the GPU's performance (and it is functionality that Nvidia GPUs don't have). With the iGPUs, so many of the Intel and AMD ones are rather useless for GPU computation as they lack the functionality outright or "pull an Nvidia" and emulate it, which lacks the performance. Tragically, AMD took too long to put GCN on its line of APUs so all pre-2014 APUs (Kaveri) are stuck with non-async compute, non-HSA enabled GPUs based on pre-GCN architecture, even though GCN was available on their dGPUs in 2011/2012. Compounding that brutality, even though they are quite capable of working in some capacity with Vulkan/DX12, AMD has no plans to support those APIs on pre-GCN GPUs.
Firstly, I would just like to clear up any confusion regarding Nvidia's support for asynchronous compute; see here: Nvidia supports Async Compute
I still think this whole iGPU/APU thing is a very bad idea (as long as people keep demanding they be low power as well). Think about it: a CPU has to process all the sequential program instructions (even modern software like new game engines that fully utilize four cores do so with four threads performing a lot of sequential computation), and you want it to be very good at that sort of thing to keep up with modern GPUs (especially since transistor geometry scaling still brings huge performance gains to GPUs).
GPUs, on the other hand, aren't very good at dealing with things like logic expressions, branching ('if', 'else') and nested 'for' loops. To use your example of running SVP's duplicate-frame detection on the iGPU, the basic program flow of the most computationally intensive subroutine would look something like this:
>Get the current frame, the next frame, and their representations as a plane of blocks.
>Now get the 'quality' of the current-to-next and the next-to-current motion-vectors by convolving each of the current frame's blocks over the region of the next frame centered on that block's own location, extended in every direction by the requested motion-vector search range. Do the same with the next frame's blocks convolving over the current frame to get the 'reverse' motion-vectors (a rough code sketch of this step follows after this list).
>Select the 'best' motion-vector for each block, for both the next-to-current and the current-to-next frames.
>Interpolate a frame somewhere between the current and next frames by 'moving' blocks by an amount and direction dictated by a function of how 'good' the next-to-current and current-to-next motion-vectors are, choosing the source block(s) as follows:
#If there is occluded motion, use only the block from the frame with the unoccluded pixels;
#If one block's motion-vector is much 'better' than the other frame's corresponding block's motion-vector, only use the 'good' one;
#If both motion-vectors are of a similar quality, compose the new frame from a weighted combination of both, where the weight takes into account how 'good' each motion-vector is, as well as how close (temporally) the frame that the block is being taken from is to the interpolated frame (both frames will be equally close for simple frame doubling).
(In reality, you should loop through each pixel of the 'new' frame that sits between the current and the next frames, and compute the interpolated value as a function of all the motion-vectors that can reach that pixel from both frames. If you only compare the two corresponding blocks from each frame and then move the best one to the position in the interpolated frame dictated by its motion-vector, then you may get regions of pixels without any value, i.e. 'holes' in your interpolated frame.)
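To make the block-matching step a little more concrete, here is a rough CPU-side C sketch of "find the best motion-vector for one block" (plain SAD instead of whatever quality metric SVP actually uses, single-channel 8-bit frames, exhaustive search, no sub-pixel refinement; the function names are placeholders and not anything from SVP's real code):

#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between a block of 'cur' at (cx,cy) and a
   candidate block of 'next' at (nx,ny); 8-bit luma, frames 'width' pixels wide. */
static unsigned sad_block(const unsigned char *cur, const unsigned char *next,
                          int width, int block,
                          int cx, int cy, int nx, int ny)
{
    unsigned sad = 0;
    for (int y = 0; y < block; y++)
        for (int x = 0; x < block; x++)
            sad += abs((int)cur[(cy + y) * width + (cx + x)] -
                       (int)next[(ny + y) * width + (nx + x)]);
    return sad;
}

/* Exhaustive search over +/- 'range' pixels around the block's own position;
   the displacement with the lowest SAD is taken as the 'best' motion-vector. */
static void find_best_vector(const unsigned char *cur, const unsigned char *next,
                             int width, int height, int block, int range,
                             int bx, int by, int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    *best_dx = 0;
    *best_dy = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int nx = bx + dx, ny = by + dy;
            if (nx < 0 || ny < 0 || nx + block > width || ny + block > height)
                continue; /* candidate block would fall outside the frame */
            unsigned sad = sad_block(cur, next, width, block, bx, by, nx, ny);
            if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
        }
}

Run for every block, in both directions, this is exactly the kind of regular, branch-light number crunching a GPU is great at; it is the decision logic wrapped around it that maps poorly.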
Now, you would like to check for a duplicate frame, which would have the large majority of its motion-vectors simply being (0,0). That check is only possible near the end of the algorithm (once you know the 'best' motion-vectors for each direction), at which point you would 'discard' the next frame and redo the calculation with the current frame and the frame after the 'discarded' one (unless the number of repeated frames remains constant throughout the video, in which case you can simply use something like one or more Avisynth calls to SelectEven() or SelectOdd() before the video gets to SVP's interpolation functions). This loop continues until you find a next frame with some useful motion between it and the current frame. If you then interpolate the number of discarded frames multiplied by your interpolation factor, 'jump over' those 'discarded' frames and continue your interpolation from the 'next' frame you used for the calculation, you end up with the same number of frames as if you had done the interpolation for all those duplicate frames. The computational cost would also come to about the same, but it would require some extra logical comparisons to check whether or not you need to drop a frame. Those checks also have to be done for all frames that actually do contain motion, further decreasing performance.
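Sketched as C-style pseudocode, with made-up helpers standing in for the real motion-estimation and frame-output steps (estimate_vectors(), fraction_of_zero_vectors() and emit_interpolated_frames() are not SVP's actual API, 'factor' is the interpolation multiplier, and the exact frame counts are only approximate), the skip-and-catch-up logic might look roughly like this:

typedef struct Frame Frame;        /* opaque frame handle, placeholder only */

#define DUPLICATE_THRESHOLD 0.95   /* fraction of (0,0) vectors that flags a duplicate */

/* Hypothetical helpers, not SVP's real functions: */
void   estimate_vectors(const Frame *a, const Frame *b);   /* the expensive step     */
double fraction_of_zero_vectors(void);                     /* from the last estimate */
void   emit_interpolated_frames(const Frame *a, const Frame *b, int count);

void interpolate_with_duplicate_skip(Frame **frame, int num_frames, int factor)
{
    int cur = 0;
    while (cur < num_frames - 1) {
        int next = cur + 1;
        int skipped = 0;

        /* Keep discarding 'next' frames whose motion-vectors are almost all (0,0). */
        while (next < num_frames - 1) {
            estimate_vectors(frame[cur], frame[next]);
            if (fraction_of_zero_vectors() < DUPLICATE_THRESHOLD)
                break;                              /* real motion found */
            skipped++;
            next++;
        }

        /* Interpolate enough in-between frames to cover the duplicates that were
           jumped over, so the output keeps the expected total frame count. */
        emit_interpolated_frames(frame[cur], frame[next], (1 + skipped) * factor - 1);

        cur = next;   /* continue from the frame that actually contained motion */
    }
}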
Now, this wouldn't be too bad on a CPU, but here is where the problem with APU processing comes in: GPUs are very good at doing things like massive matrix calculations, image convolutions, interpolated-frame composition, etc., but slow down immensely if you don't frame the problem as a matrix computation. Basically, let's say you compute the absolute difference between two frames, as an unsigned value at least 16 bits in size, pixel by pixel, via two methods.
First method: You do everything on the GPU via two loops (the outer one for, say, the columns and the nested one for the rows), and inside the inner (nested) loop you calculate:
a = pixel1 - pixel2; if (a < 0) return -a; else return a;
Second method: You only do the following on the GPU: a1 = frame1 - frame2; a2 = frame2 - frame1; (each done as one massive SIMD operation over the whole frame).
Thereafter, on the CPU, you loop through every pixel in a1 via two 'for' loops (just like in the above method for the GPU) and for each pixel you do the following:
if (pixel_a1 < 0) pixel_a1 = pixel_a2;
If done on a discrete, separate CPU and GPU, the speedup of method 2 over method 1 can be anywhere from 100% to 1000%. Newer GPU compute architectures (and especially NVIDIA's CUDA compiler) are getting better and better at reducing this difference, but the core principle remains the same.
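Written out, and purely as an illustration of the difference described above (OpenCL C for the GPU parts, plain C for the CPU part; 8-bit input frames, 16-bit output, no real buffer management or launch code), the two methods look roughly like this:

/* Method 1: do everything on the GPU. A naive port where one work-item
   walks the whole frame with nested loops and a branch inside them. */
__kernel void absdiff_method1(__global const uchar *f1, __global const uchar *f2,
                              __global ushort *out, int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            int a = (int)f1[y * width + x] - (int)f2[y * width + x];
            if (a < 0)                    /* the data-dependent branch */
                a = -a;
            out[y * width + x] = (ushort)a;
        }
}

/* Method 2, GPU part: two plain whole-frame subtractions, one work-item
   per pixel, no branches at all. */
__kernel void sub_frames(__global const uchar *a, __global const uchar *b,
                         __global short *out)
{
    size_t i = get_global_id(0);
    out[i] = (short)a[i] - (short)b[i];   /* a1 = frame1 - frame2, or a2 = frame2 - frame1 */
}

/* Method 2, CPU part: wherever a1 went negative, take the value from a2 instead. */
void fixup_on_cpu(short *a1, const short *a2, int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            int i = y * width + x;
            if (a1[i] < 0)
                a1[i] = a2[i];
        }
}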
If, however, you have an iGPU, then porting some calculations over to it from the CPU may or may not bring a performance advantage. Since you are stuck with OpenCL for iGPU programming, you either need to have basically been on the design team for the GPU you are programming (to be able to transform problematic code into the most optimal form for the iGPU you are targeting), or you need to avoid such code entirely (as method 2 does).
{Yes, I know, a lot of people try desperately to justify their purchases by saying things like "OpenCL is just as good as CUDA for X, but supports everything so it's better". Well, technically yes, you can do almost everything in OpenCL that you can do in CUDA; the problem is the difficulty of doing so. GPU programming is DIFFICULT, it's very difficult, which is the main reason why we don't see apps utilizing the enormous potential computational performance inside GPUs more often.
Trying to use OpenCL to match, on the same NVIDIA GPU (or an AMD one with equivalent theoretical computational throughput), the performance you got from a CUDA program just makes an already difficult task so much harder that almost no one manages the same ultimate performance in a real-world application.}
Finally, even with optimal code, an iGPU has to split its very limited power budget with the CPU. Even an optimally coded 50-50 algorithm (one that needs exactly equal amounts of CPU and GPU power) will take anywhere from 50% to even 100% longer to run on an integrated APU than on the same CPU and GPU sold as discrete products. However, if you are against even mildly overclocking your setup (for some strange reason some people, who are not even electrical engineers themselves, seem to think that a moderate overclock reduces the lifespan of your components by an appreciable amount), then the difference will of course be much smaller. You can go even further and underclock the discrete setup to achieve the same performance as the integrated one, but that only serves to further show how power-limited APUs are.
It's also not just limited power that constrains performance. Having to share the CPU's memory means the iGPU has to live with a minuscule fraction of a similar dGPU's memory bandwidth. On-package Crystalwell-style L4 eDRAM caches can help in this regard, but they are so tiny in comparison to discrete GPU memory that doing any sort of video processing entirely out of that cache is impossible.
APUs are a nice idea, don't get me wrong. It's easy to think of an iGPU like an enhanced instruction set, such as AVX or FMA, that can serve to massively improve a CPU's performance in certain tasks. But in practice, having the CPU keep the iGPU on constant life support (with woeful GPU memory bandwidth and power-gating killing entire blocks every few milliseconds) just seems to negate all of the advantages of having it so close to the CPU. If, however, the choice is between two similar processors, one with an iGPU and one without, the picture changes somewhat; I'll come back to that below.
Also, I really don't get this reluctance to develop for a 'closed' architecture. No matter what percentage of the market a certain GPU has, why not develop your application for the one that can actually deliver the required performance? If people really want to use a certain application, then they will buy the required hardware. This is how it works in industry, and also in the consumer market for most non-software things (say you're following a cooking recipe which calls for baking a cake in an oven, but you don't have an oven. Is it really necessary to rewrite the recipe so you can obtain at least something resembling a cake using another, more readily available piece of hardware? A hot stone, perhaps?).
I just think that this pathological fear of 'vendor lock-in' is unnecessarily limiting the applications, and the features within them, that are available to us. I mean, how many programs are there that can, for instance, do what you want (i.e. interpolate a video stream containing many duplicate frames of varying lengths)?
Thinking about this in another way paints iGPUs in a much better light. What about systems that have a discrete GPU, but where the CPU also happens to have an iGPU? With the massive transistor budgets available these days, simply integrating an iGPU onto the CPU die shouldn't sacrifice much potential CPU performance (if any).
Now we have a scenario where people have capable CPUs and discrete GPUs, but where coding in OpenCL allows developers to also tap the second, integrated GPU when and where it is most beneficial to do so.
Through OpenCL, an AMD iGPU and an NVIDIA dGPU can even work simultaneously on the exact same problem, with each GPU doing the work best suited to its compute architecture!
The CPU's available power budget can also, theoretically, be utilized much more completely and effectively by interleaving CPU and GPU commands (such as having the iGPU do some work while the CPU waits for data to come back from the dGPU, for example). If the iGPU's wake-up and sleep sequences are fast enough, it can even be used in place of SIMD instructions like SSE, AVX and FMA. The potential benefits in this scenario are enormous.
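As a minimal host-side sketch of that idea (just enumerating every GPU that OpenCL can see, be it an AMD iGPU or an NVIDIA dGPU, and giving each its own context and command queue so work can then be split between them; error handling omitted, and the actual kernel launches only hinted at):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    /* Walk every platform (AMD, NVIDIA, Intel...) and grab its GPU devices. */
    for (cl_uint p = 0; p < num_platforms; p++) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 8,
                           devices, &num_devices) != CL_SUCCESS)
            continue;

        for (cl_uint d = 0; d < num_devices; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("Found GPU: %s\n", name);

            /* One context and queue per device; enqueue iGPU-friendly or
               dGPU-friendly kernels on whichever queue suits the work best. */
            cl_context ctx = clCreateContext(NULL, 1, &devices[d], NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueue(ctx, devices[d], 0, NULL);

            /* ... clEnqueueNDRangeKernel(q, ...) for this device's share of the work ... */

            clReleaseCommandQueue(q);
            clReleaseContext(ctx);
        }
    }
    return 0;
}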
Maybe I'm just a bit of a pessimist, but it doesn't seem like we will be seeing much use of this second HSA usage scenario. I just don't get why, though. Maybe the lifetime of a GPU compute architecture is just too short for developers to spend the time optimizing an OpenCL workflow for that specific processor?
Or maybe most people still aren't even aware of the potential gains of simultaneous computation with a dGPU and an iGPU?