[SOLVED] Pandas to Pyspark environment

Issue

This Content is from Stack Overflow. Question asked by Abhinav bharti

newlist = []
for column in new_columns:
    count12 = new_df.loc[new_df[column].diff() == 1]

new_df2 = new_df2.groupby(['my_id', 'friend_id', 'family_id', 'colleage_id']).apply(len)

There is no direct equivalent in PySpark for getting the length of each group this way.

How can I achieve the same thing in PySpark?

Thanks in advance.



Solution

Here, apply(len) is simply an aggregation that counts the elements in each group produced by groupby. You can do the same thing with basic PySpark syntax:

import pyspark.sql.functions as F

# Group by the same keys and count the rows in each group
(df
    .groupBy('my_id', 'friend_id', 'family_id', 'colleage_id')
    .agg(F.count('*'))
    .show()
)
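As a quick sanity check, here is a minimal, self-contained sketch of the same aggregation. The sample rows and the 'group_len' alias are assumptions for illustration only; the column names mirror those in the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the column names from the question
df = spark.createDataFrame(
    [(1, 10, 100, 1000), (1, 10, 100, 1000), (2, 20, 200, 2000)],
    ['my_id', 'friend_id', 'family_id', 'colleage_id'],
)

# Same as .agg(F.count('*')); 'group_len' is just an assumed alias for readability
(df
    .groupBy('my_id', 'friend_id', 'family_id', 'colleage_id')
    .agg(F.count('*').alias('group_len'))
    .show()
)

The equivalent shorthand df.groupBy(...).count() should produce the same counts, just with the result column named count.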


This question was asked on Stack Overflow by Abhinav bharti and answered by pltc. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
