Thanks for contributing such an efficient library.
I was wondering if it is possible to do an operation like numpy's bitwise_and(x, Y),
where a single vector x is combined bitwise with multiple vectors (stored as rows in Y) simultaneously.
What I observe is that numpy scales very well as Y grows, but it is actually 10x slower when Y is only a single vector.
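
To make the request concrete, here is a minimal sketch of the numpy behavior I mean. The array values and the uint8 dtype are only illustrative assumptions, not taken from my actual data.

```python
import numpy as np

# x: one bit vector packed into bytes (illustrative values).
x = np.array([0b1100, 0b1010], dtype=np.uint8)

# Y: several bit vectors of the same length, stored as rows.
Y = np.array([[0b1111, 0b0000],
              [0b0101, 0b0011],
              [0b1000, 0b1010]], dtype=np.uint8)

# Broadcasting ANDs x against every row of Y in a single call.
result = np.bitwise_and(x, Y)

print(result.shape)  # one output row per row of Y
print(result)
```

The single-vector case I mentioned corresponds to Y having just one row.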