# All-pairs Euclidean distance

Given two arrays of $$d$$-dimensional vectors, $$\mathbf{A}\in\mathbb{R}^{M\times d}$$ and $$\mathbf{B}\in\mathbb{R}^{N\times d}$$, we are interested in efficient vectorized code for obtaining a matrix $$\mathbf{D}\in\mathbb{R}^{M\times N}$$ such that

$(\mathbf{D})_{ij} = \vert\vert\mathbf{a}_i - \mathbf{b}_j\vert\vert_2$

where $$(\mathbf{D})_{ij}$$ is the $$ij$$th entry of $$\mathbf{D}$$, $$\mathbf{a}_i$$ is the $$i$$th row of $$\mathbf{A}$$ and $$\mathbf{b}_j$$ is the $$j$$th row of $$\mathbf{B}$$. In other words, we want to compute the Euclidean distance between all vectors in $$\mathbf{A}$$ and all vectors in $$\mathbf{B}$$. The following numpy code does exactly this:

def all_pairs_euclid_naive(A, B):
#
D = numpy.zeros((A.shape, B.shape), dtype=numpy.float32)
for i in range(0, D.shape):
for j in range(0, D.shape):
D[i, j] = numpy.linalg.norm(A[i, :] - B[j, :])
#
return D


Unfortunately, this code is really inefficient. To rectify the issue, we need to write a vectorized version in which we avoid the explicit usage of loops. To arrive at a solution, we first expand the formula for the Euclidean distance:

$\vert\vert\mathbf{a}_i - \mathbf{b}_j\vert\vert_2= (\mathbf{a}_i - \mathbf{b}_j)^T(\mathbf{a}_i - \mathbf{b}_j)= \mathbf{a}_i^T\mathbf{a}_i - 2\mathbf{a}_i^T\mathbf{b}_j + \mathbf{b}_j^T\mathbf{b}_j$

This leads us to the following equation for $$\mathbf{D}$$:

$\mathbf{D} = \sqrt{\mathbf{S}_A - 2\cdot\mathbf{A}\cdot\mathbf{B}^T + \mathbf{S}_B}$

where $$\mathbf{S}_A\in\mathbb{R}^{M\times N}$$ is such that $$(\mathbf{S}_A)_{ij}=\mathbf{a}_i^T\mathbf{a}_i$$ and $$\mathbf{S}_B\in\mathbb{R}^{M\times N}$$ is such that $$(\mathbf{S}_B)_{ij}=\mathbf{b}_j^T\mathbf{b}_j$$. The square root is taken elementwise.

Each of the three above terms can be obtain with vectorized code.

The first term is obtained for each $$i$$ by squaring the entries of $$\mathbf{A}$$, summing along the second dimension after that and repeating the obtained result $$N$$ times to obtain an $$M\times N$$ matrix. The second term can be computed with the standard matrix-matrix multiplication routine. The third term is obtained in a simmilar manner to the first term.

Without further ado, here is the numpy code:

def all_pairs_euclid_numpy(A, B):
#
sqrA = numpy.broadcast_to(numpy.sum(numpy.power(A, 2), 1).reshape(A.shape, 1), (A.shape, B.shape))
sqrB = numpy.broadcast_to(numpy.sum(numpy.power(B, 2), 1).reshape(B.shape, 1), (B.shape, A.shape)).transpose()
#
return numpy.sqrt(
sqrA - 2*numpy.matmul(A, B.transpose()) + sqrB
)


And the following implementation uses Pytorch:

def all_pairs_euclid_torch(A, B):
#
sqrA = torch.sum(torch.pow(A, 2), 1, keepdim=True).expand(A.shape, B.shape)
sqrB = torch.sum(torch.pow(B, 2), 1, keepdim=True).expand(B.shape, A.shape).t()
#
sqrA - 2*torch.mm(A, B.t()) + sqrB
)


Note that the above code can be executed on the GPU as well.

By setting $$M=2048$$, $$N=2048$$ and $$d=2048$$, we obtain the following timings for the modern 40-thread Intel CPU and the Nvidia 1080Ti GPU:

Implementation Time [ms]
naive 29500
numpy 160
pytorch 20
pytorch (cuda) 15

You can get the time measurements for your hardware by running the following script.