[SOLVED] How to perform an outer product with custom function (pandas/numpy)?

Issue

This content is from Stack Overflow. Question asked by P i

My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.

I need to create an N-row by M-column matrix, where the m-th column holds the result of applying a distance function between the m-th centroid and every row of the dataframe.

My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.

It feels clumsy. But I can’t see clearly how to do it ‘properly’.

    import numpy as np

    def getDistanceMatrix(df, centroids):
        distanceMatrix = np.zeros((len(df), len(centroids)))

        distFunc = lambda centroid, row: sum(centroid != row)

        iCentroid = 0
        for _, centroid in centroids.iterrows():
            distanceMatrix[:, iCentroid] = df.apply(
                lambda row: distFunc(centroid, row),
                axis=1
            )
            iCentroid += 1

        return distanceMatrix

    distanceMatrix = getDistanceMatrix(df, centroids)

It feels like some kind of outer-product-with-a-custom-function.

What’s a good way to write this?



Solution

I mainly work with "vanilla numpy", so I cannot offer a nice pandas-based solution. If these were plain numpy arrays, I would do it as follows; I am not sure whether converting from pandas adds any overhead:

    # Convert to numpy arrays (as I'm not proficient with
    #   pandas dataframes (...yet))
    df_np = df.to_numpy()
    centroids_np = centroids.to_numpy()

    # Broadcast df_np to (M, N, D) and centroids_np to (M, 1, D),
    #   here (2,9,4) and (2,1,4), then subtract the two.
    # The result is a (2,9,4) array, where:
    #   - axis=0 indexes the centroid
    #   - axis=1 indexes the row of the dataframe
    #   - axis=2 indexes the individual coordinates
    diff = np.broadcast_to(
        df_np,
        (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
    ) - centroids_np[:, None, :]

    # Convert to a binary distance (1 where the coordinates differ)
    diff = (diff != 0).astype(df_np.dtype)

    # Now sum along the coordinates and transpose to (N, M)
    distanceMatrix2 = np.sum(diff, axis=-1).T
    # array([[0, 2],
    #        [3, 3],
    #        [2, 2],
    #        [2, 4],
    #        [4, 3],
    #        [2, 0],
    #        [2, 4],
    #        [4, 3],
    #        [3, 3]], dtype=int64)

For reference, your code gives me:

    distanceMatrix = getDistanceMatrix(df, centroids)
    # array([[0., 2.],
    #        [3., 3.],
    #        [2., 2.],
    #        [2., 4.],
    #        [4., 3.],
    #        [2., 0.],
    #        [2., 4.],
    #        [4., 3.],
    #        [3., 3.]])
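
Since the distance in the question is just a count of differing coordinates, the same matrix can probably be obtained by comparing the arrays directly with `!=` instead of subtracting first. The sketch below is one way to do that; the function name `mismatch_counts` and the sample arrays are made up for illustration and are not from the question. It assumes both inputs are 2-D arrays with the same number of columns, for example the results of `df.to_numpy()` and `centroids.to_numpy()`. Because it never subtracts, it should also work for non-numeric columns.

```python
import numpy as np

def mismatch_counts(rows, centroids):
    """Count differing coordinates between every row and every centroid.

    rows:      array of shape (N, D), e.g. df.to_numpy()
    centroids: array of shape (M, D), e.g. centroids.to_numpy()
    returns:   integer array of shape (N, M)
    """
    rows = np.asarray(rows)
    centroids = np.asarray(centroids)
    # (N, 1, D) != (1, M, D) broadcasts to a boolean (N, M, D) array
    mismatches = rows[:, None, :] != centroids[None, :, :]
    # Summing over the coordinate axis counts the mismatches per pair
    return mismatches.sum(axis=-1)

# Hypothetical sample data: 3 rows and 2 centroids with 4 coordinates each
rows = np.array([[1, 2, 3, 4],
                 [1, 0, 3, 0],
                 [5, 6, 7, 8]])
centroids = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 0]])

result = mismatch_counts(rows, centroids)
print(result)
# [[0 4]
#  [2 3]
#  [4 1]]
```

Putting the dataframe rows on axis 0 and the centroids on axis 1 yields the N-by-M layout directly, so no transpose is needed. Note that the intermediate boolean array has N * M * D elements, which may matter for large inputs.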


This question was asked on Stack Overflow by P i and answered by André. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.
