Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.


The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)
b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)