Add support for a second DMA engine so that returning the previous correlation matrix overlaps with computing the current one.
This will at least double the number of correlation matrices that must be stored, since the correlation matrix transfers have to be pipelined. It also requires reworking the external interface: the algorithm would be fully pipelined, and that pipelining has to be exposed to the calling application.
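
A rough sketch of what the pipelined interface might look like from the caller's side, using two matrix slots (ping-pong buffering). Everything here is hypothetical and for illustration only: the names (corr_pipeline, corr_pipeline_submit, corr_pipeline_flush), the matrix size CORR_DIM, the outer-product stand-in for the correlation compute, and the memcpy stand-in for the second DMA engine. On real hardware the return transfer would run concurrently with the compute; here the two steps run back to back so that only the buffer rotation and the one-call latency are shown.

```c
#include <string.h>

#define CORR_DIM  4   /* hypothetical matrix dimension */
#define NUM_SLOTS 2   /* one slot being computed, one being returned */

typedef struct {
    float slot[NUM_SLOTS][CORR_DIM * CORR_DIM];
    int   current;        /* index of the slot the next compute writes into */
    int   have_previous;  /* nonzero once a finished matrix is waiting */
} corr_pipeline;

void corr_pipeline_init(corr_pipeline *p)
{
    memset(p, 0, sizeof *p);
}

/* Stand-in for the real correlation compute: outer product of x with itself. */
static void compute_correlation(const float *x, float *out)
{
    for (int i = 0; i < CORR_DIM; i++)
        for (int j = 0; j < CORR_DIM; j++)
            out[i * CORR_DIM + j] = x[i] * x[j];
}

/* Stand-in for the second DMA engine: a plain copy back to the caller. */
static void dma_return(float *dst, const float *src)
{
    memcpy(dst, src, sizeof(float) * CORR_DIM * CORR_DIM);
}

/*
 * Submit a new input vector. If a matrix from the previous call is waiting,
 * it is copied into previous_out and 1 is returned; otherwise 0 is returned
 * and previous_out is left untouched. The result for x itself only becomes
 * available on the next submit or on flush.
 */
int corr_pipeline_submit(corr_pipeline *p, const float *x, float *previous_out)
{
    int prev = p->current ^ 1;
    int returned = 0;

    if (p->have_previous) {
        dma_return(previous_out, p->slot[prev]);
        returned = 1;
    }
    compute_correlation(x, p->slot[p->current]);

    p->current = prev;      /* swap roles of the two slots */
    p->have_previous = 1;
    return returned;
}

/* Drain the last matrix still held in the pipeline. Returns 1 if one was returned. */
int corr_pipeline_flush(corr_pipeline *p, float *out)
{
    if (!p->have_previous)
        return 0;
    dma_return(out, p->slot[p->current ^ 1]);
    p->have_previous = 0;
    return 1;
}
```

The point of the sketch is the interface change: the caller gets the result of call N during call N+1, so the first submit returns nothing and a final flush is needed to collect the last matrix. The second slot is the doubled storage mentioned above.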