My dataframe has N rows.
I have M centroids. Each centroid is the same shape as a dataframe-row.
I need to create an N-row by M-column matrix, where the m-th column is created by applying the m-th centroid to every row of the dataframe.
My solution involves pre-creating the output matrix and filling it one column at a time as we manually iterate over centroids.
It feels clumsy. But I can’t see clearly how to do it ‘properly’.
    def getDistanceMatrix(df, centroids):
        distanceMatrix = np.zeros((len(df), len(centroids)))
        distFunc = lambda centroid, row: sum(centroid != row)

        iCentroid = 0
        for _, centroid in centroids.iterrows():
            distanceMatrix[:, iCentroid] = df.apply(
                lambda row: distFunc(centroid, row), axis=1
            )
            iCentroid += 1

        return distanceMatrix

    distanceMatrix = getDistanceMatrix(df, centroids)
It feels like some kind of outer-product-with-a-custom-function.
What’s a good way to write this?
I mainly work with "vanilla numpy", so I cannot offer a nice pandas-based solution. With plain numpy arrays I would do it as follows, though I am not sure whether converting from pandas adds any overhead:
    # Convert to numpy arrays (as I'm not proficient with
    # pandas dataframes (...yet))
    df_np = df.to_numpy()
    centroids_np = centroids.to_numpy()

    # Broadcast df_np to (2,9,4) and centroids_np to (2,1,4),
    # then subtract the two.
    # The result is a (2,9,4) array, where:
    # - axis=0 corresponds to the centroid of the difference
    # - axis=1 corresponds to the element in the dataframe
    # - axis=2 corresponds to the individual coordinates
    diff = np.broadcast_to(
        df_np, (centroids_np.shape[0], df_np.shape[0], df_np.shape[1])
    ) - centroids_np[:, None, :]

    # Convert to a binary distance
    diff = (diff != 0).astype(df_np.dtype)

    # Now sum along the coordinates
    distanceMatrix2 = np.sum(diff, axis=-1).T
    # array([[0, 2],
    #        [3, 3],
    #        [2, 2],
    #        [2, 4],
    #        [4, 3],
    #        [2, 0],
    #        [2, 4],
    #        [4, 3],
    #        [3, 3]], dtype=int64)
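The same broadcasting idea can probably be written more compactly by comparing with `!=` directly instead of subtracting, which also avoids the explicit `np.broadcast_to` and the final transpose. The following is only a sketch: the sample frames and the name `mismatch_matrix` are made up for illustration, and it assumes both frames have the same columns in the same order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the frames from the question).
df = pd.DataFrame({"a": [1, 1, 2], "b": [0, 3, 3], "c": [5, 5, 6]})
centroids = pd.DataFrame({"a": [1, 2], "b": [0, 3], "c": [5, 5]})


def mismatch_matrix(df, centroids):
    """Count differing coordinates for every (row, centroid) pair.

    Returns an (N, M) integer array. Columns are matched by position,
    so both frames are assumed to share the same column order.
    """
    rows = df.to_numpy()[:, None, :]          # shape (N, 1, D)
    cents = centroids.to_numpy()[None, :, :]  # shape (1, M, D)
    # Broadcasting the comparison gives an (N, M, D) boolean array;
    # summing over the last axis counts the mismatches per pair.
    return (rows != cents).sum(axis=-1)


distance_matrix = mismatch_matrix(df, centroids)
print(distance_matrix)
```

Because the comparison uses `!=` rather than a subtraction, this variant should also work for non-numeric columns, at the cost of materializing an N x M x D boolean array in memory.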
For reference, your code gives me:
    distanceMatrix = getDistanceMatrix(df, centroids)
    # array([[0., 2.],
    #        [3., 3.],
    #        [2., 2.],
    #        [2., 4.],
    #        [4., 3.],
    #        [2., 0.],
    #        [2., 4.],
    #        [4., 3.],
    #        [3., 3.]])
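If SciPy is available, `scipy.spatial.distance.cdist` may be the closest thing to the "outer product with a custom function" described in the question: it evaluates a metric for every pair of rows from two arrays. This is a sketch on made-up sample data, not something taken from the code above. It shows a callable metric and the built-in `"hamming"` metric, which returns the fraction of differing coordinates and therefore has to be scaled by the number of columns.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical sample data (not the frames from the question).
df = pd.DataFrame({"a": [1, 1, 2], "b": [0, 3, 3], "c": [5, 5, 6]})
centroids = pd.DataFrame({"a": [1, 2], "b": [0, 3], "c": [5, 5]})

rows = df.to_numpy()
cents = centroids.to_numpy()

# Option 1: a custom callable, evaluated once per (row, centroid) pair.
# Flexible, but it runs a Python-level call for every pair.
dist_custom = cdist(rows, cents, metric=lambda u, v: np.count_nonzero(u != v))

# Option 2: the built-in Hamming metric returns the *fraction* of
# mismatching coordinates, so multiply by the number of columns
# and round to recover the integer counts.
n_cols = rows.shape[1]
dist_hamming = np.rint(cdist(rows, cents, metric="hamming") * n_cols).astype(int)

print(dist_custom)
print(dist_hamming)
```

Both results have shape (N, M), matching the layout of `distanceMatrix`. Note that `cdist` returns floats, so the first option gives a float array just like the loop-based version.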