Issue
This Content is from Stack Overflow. Question asked by P i
My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.
I need to create an N-row by M-column matrix, where the m-th column is created by applying the m-th centroid to each row of the dataframe.
My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.
It feels clumsy. But I can’t see clearly how to do it ‘properly’.
def getDistanceMatrix(df, centroids):
    distanceMatrix = np.zeros((len(df), len(centroids)))
    distFunc = lambda centroid, row: sum(centroid != row)
    iCentroid = 0
    for _, centroid in centroids.iterrows():
        distanceMatrix[:, iCentroid] = df.apply(
            lambda row: distFunc(centroid, row),
            axis=1
        )
        iCentroid += 1
    return distanceMatrix

distanceMatrix = getDistanceMatrix(df, centroids)
It feels like some kind of outer-product-with-a-custom-function.
What’s a good way to write this?
Solution
I mainly work with "vanilla" NumPy, so I cannot offer a nice pandas-based solution. With plain NumPy arrays I would do it as follows, though I am not sure whether converting from pandas adds any overhead:
# Convert to numpy arrays (as I'm not proficient with
# pandas dataframes (...yet))
df_np = df.to_numpy()
centroids_np = centroids.to_numpy()
# Broadcast df_np to (2,9,4) and centroids_np to (2,1,4),
# then subtract the two.
# The result is a (2,9,4) array, where:
# - axis=0 indexes the centroid
# - axis=1 indexes the row of the dataframe
# - axis=2 indexes the individual coordinates
diff = np.broadcast_to(
    df_np,
    (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
) - centroids_np[:, None, :]
# Convert to a binary distance
diff = (diff != 0).astype(df_np.dtype)
# Now sum along the coordinates
distanceMatrix2 = np.sum(diff, axis=-1).T
# array([[0, 2],
#        [3, 3],
#        [2, 2],
#        [2, 4],
#        [4, 3],
#        [2, 0],
#        [2, 4],
#        [4, 3],
#        [3, 3]], dtype=int64)
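To make the shape bookkeeping in the comments above concrete, here is a minimal sketch on made-up data (the arrays `rows` and `cents` are hypothetical, not the data from the question). It also assumes that inserting length-1 axes with `None` is enough for the subtraction to broadcast, so the explicit `np.broadcast_to` call is left out here.

```python
import numpy as np

# Hypothetical data: 3 rows and 2 centroids with 4 coordinates each.
rows = np.array([[1, 0, 2, 1],
                 [0, 0, 2, 2],
                 [1, 1, 1, 1]])
cents = np.array([[1, 0, 2, 1],
                  [0, 1, 2, 2]])

# (1, 3, 4) - (2, 1, 4) broadcasts to (2, 3, 4):
# axis 0 = centroid, axis 1 = row, axis 2 = coordinate.
diff = rows[None, :, :] - cents[:, None, :]
print(diff.shape)  # (2, 3, 4)

# Count non-zero differences per (centroid, row) pair, then transpose
# so that rows of the result correspond to rows of the data.
mismatch_counts = (diff != 0).sum(axis=-1).T
print(mismatch_counts)
# [[0 3]
#  [2 1]
#  [2 3]]
```

The subtraction step only makes sense for numeric columns, which is an assumption the question does not confirm.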
For reference, your code gives me:
distanceMatrix = getDistanceMatrix(df, centroids)
# array([[0., 2.],
#        [3., 3.],
#        [2., 2.],
#        [2., 4.],
#        [4., 3.],
#        [2., 0.],
#        [2., 4.],
#        [4., 3.],
#        [3., 3.]])
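Since the distance in the question is just a count of mismatching entries, the subtraction can probably be skipped by comparing with `!=` directly, which should also cover non-numeric columns. The sketch below is one possible way to write that. The helper name `mismatch_matrix` and the sample data are hypothetical, and it assumes the columns of the data and of the centroids are in the same order, because NumPy compares by position whereas pandas aligns by label.

```python
import numpy as np

def mismatch_matrix(data, centroids):
    """Count differing entries between every row and every centroid.

    Hypothetical helper, not part of the original answer. Accepts anything
    np.asarray understands (2-D arrays, DataFrames) and returns an
    (n_rows, n_centroids) integer array.
    """
    data = np.asarray(data)
    centroids = np.asarray(centroids)
    # (n_rows, 1, n_cols) != (1, n_centroids, n_cols)
    # broadcasts to (n_rows, n_centroids, n_cols); summing the last axis
    # counts the mismatches for each (row, centroid) pair.
    return (data[:, None, :] != centroids[None, :, :]).sum(axis=-1)

# Made-up categorical data: 3 rows, 2 centroids, 3 columns.
data = np.array([["a", "x", "p"],
                 ["b", "x", "q"],
                 ["a", "y", "q"]])
centroids = np.array([["a", "x", "q"],
                      ["b", "y", "q"]])

result = mismatch_matrix(data, centroids)
print(result)
# [[1 3]
#  [1 1]
#  [1 1]]
```

Putting the row axis first means no final transpose is needed. The intermediate boolean array has N * M * D entries, so for very large inputs the column-by-column loop from the question may still be the more memory-friendly option.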
This question was asked on Stack Overflow by P i and answered by André. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.