# Issue

This Content is from Stack Overflow. Question asked by P i

My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.

I need to create an N-row by M-column matrix, where the m-th column is the result of applying the m-th centroid to every row of the dataframe.

My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.

It feels clumsy. But I can’t see clearly how to do it ‘properly’.

``````
import numpy as np

def getDistanceMatrix(df, centroids):
    # One row per dataframe row, one column per centroid
    distanceMatrix = np.zeros((len(df), len(centroids)))

    # Distance = number of coordinates where centroid and row differ
    distFunc = lambda centroid, row: sum(centroid != row)

    iCentroid = 0
    for _, centroid in centroids.iterrows():
        distanceMatrix[:, iCentroid] = df.apply(
            lambda row: distFunc(centroid, row),
            axis=1
        )
        iCentroid += 1

    return distanceMatrix

distanceMatrix = getDistanceMatrix(df, centroids)
``````

It feels like some kind of outer-product-with-a-custom-function.

What’s a good way to write this?

# Solution

I mainly work with "vanilla numpy", so I cannot offer a nice pandas-based solution. With plain numpy arrays I would do it as follows, though I am not sure whether converting from pandas adds any overhead:

``````
import numpy as np

# Convert to numpy arrays (as I'm not proficient with
#   pandas dataframes (...yet))
df_np = df.to_numpy()
centroids_np = centroids.to_numpy()

# Broadcast df_np to (2,9,4) and centroids_np to (2,1,4),
#   then subtract the two.
# The result is a (2,9,4) array, where:
#   - axis=0 corresponds to the centroid of the difference
#   - axis=1 corresponds to the element in the dataframe
#   - axis=2 corresponds to the individual coordinates
diff = np.broadcast_to(
    df_np,
    (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
) - centroids_np[:, None, :]

# Convert to a binary distance
diff = (diff != 0).astype(df_np.dtype)

# Now sum along the coordinates
distanceMatrix2 = np.sum(diff, axis=-1).T
# array([[0, 2],
#       [3, 3],
#       [2, 2],
#       [2, 4],
#       [4, 3],
#       [2, 0],
#       [2, 4],
#       [4, 3],
#       [3, 3]], dtype=int64)
``````
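As a variation on the same broadcasting idea, the mismatch count can also be computed by comparing with `!=` directly, instead of subtracting and then testing for non-zero. The following is only a minimal sketch, not part of the original answer: the function name `mismatch_counts` and the sample arrays are made up for illustration, and both inputs are assumed to be 2-D arrays with the same number of columns.

```python
import numpy as np

def mismatch_counts(rows, centroids):
    """Hypothetical helper: count differing coordinates per (row, centroid) pair."""
    # rows[:, None, :] has shape (N, 1, D); centroids[None, :, :] has shape (1, M, D).
    # Broadcasting the comparison yields a boolean array of shape (N, M, D).
    mismatch = rows[:, None, :] != centroids[None, :, :]
    # Summing over the last axis counts the mismatching coordinates.
    return mismatch.sum(axis=-1)

# Made-up example data: 3 rows and 2 centroids with 4 coordinates each.
rows = np.array([
    [1, 0, 2, 1],
    [0, 0, 2, 3],
    [1, 1, 0, 1],
])
centroids = np.array([
    [1, 0, 2, 1],
    [0, 1, 0, 3],
])

distance_matrix = mismatch_counts(rows, centroids)
print(distance_matrix)
# [[0 4]
#  [2 2]
#  [2 2]]
```

Because this never subtracts, it does not depend on the data being numeric. The trade-off is the intermediate boolean array of shape (N, M, D), which can become large when there are many rows and centroids.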

For reference, your code gives me:

``````
distanceMatrix = getDistanceMatrix(df, centroids)
#array([[0., 2.],
#       [3., 3.],
#       [2., 2.],
#       [2., 4.],
#       [4., 3.],
#       [2., 0.],
#       [2., 4.],
#       [4., 3.],
#       [3., 3.]])
``````
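The same kind of comparison can be run on your own data to confirm that a loop-based and a broadcast-based version agree. The sketch below is self-contained and uses made-up data. The names `loop_distances` and `broadcast_distances` are hypothetical, and the loop version is a condensed restatement of the question's approach rather than a verbatim copy.

```python
import numpy as np
import pandas as pd

def loop_distances(df, centroids):
    # Fill the output one column at a time, as in the question.
    out = np.zeros((len(df), len(centroids)))
    for i, (_, centroid) in enumerate(centroids.iterrows()):
        out[:, i] = df.apply(lambda row: sum(centroid != row), axis=1)
    return out

def broadcast_distances(df, centroids):
    # Compare every row with every centroid in one broadcast operation.
    a = df.to_numpy()[:, None, :]         # shape (N, 1, D)
    b = centroids.to_numpy()[None, :, :]  # shape (1, M, D)
    return (a != b).sum(axis=-1)          # shape (N, M)

# Made-up example data (assumption: both frames share the same columns).
df = pd.DataFrame({"a": [1, 0, 1], "b": [0, 0, 1], "c": [2, 2, 0], "d": [1, 3, 1]})
centroids = pd.DataFrame({"a": [1, 0], "b": [0, 1], "c": [2, 0], "d": [1, 3]})

loop_result = loop_distances(df, centroids)
fast_result = broadcast_distances(df, centroids)

# The values agree; only the dtype differs (float vs. integer).
assert np.array_equal(loop_result, fast_result)

# Optionally wrap the result so rows and columns keep their labels.
labelled = pd.DataFrame(fast_result, index=df.index, columns=centroids.index)
print(labelled)
```

Wrapping the result in a `DataFrame` is optional. It only helps if you want to look up distances by the original row and centroid labels.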

This question was asked on Stack Overflow by P i and answered by André. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.