Split a large DataFrame into smaller DataFrames by Unix timestamp


I have large dataframes stored as CSV files, each about 5 to 6 GB. Each dataframe has a column named Unix_Timestamp, and I need to split it into a set of smaller dataframes, one per timestamp value. Please see the diagram below:

[Diagram: split large dataframe to smaller ones]

The simple solution would be to parse the CSV file line by line, use a dictionary to group the rows that share a timestamp value, and then write each group of rows to its own CSV file.
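The line-by-line idea could look roughly like the sketch below. It is only an illustration under some assumptions: the function name `split_csv_by_timestamp`, the output directory layout, and the `<timestamp>.csv` file naming are made up for this example. Instead of holding all rows in a dictionary (which would not fit in memory for a 5 to 6 GB file), the dictionary maps each timestamp to an open writer, so rows are written out as they are read.

```python
import csv
import os


def split_csv_by_timestamp(src_path, out_dir, key="Unix_Timestamp"):
    """Stream src_path once and write each row to a per-timestamp CSV file."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # timestamp value (as str) -> (file handle, csv writer)
    try:
        with open(src_path, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                ts = row[key]
                if ts not in handles:
                    # First row for this timestamp: create its file and write the header.
                    fh = open(os.path.join(out_dir, f"{ts}.csv"), "w", newline="")
                    writer = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                    writer.writeheader()
                    handles[ts] = (fh, writer)
                handles[ts][1].writerow(row)
    finally:
        # Close every output file, even if reading failed part-way.
        for fh, _ in handles.values():
            fh.close()
    return sorted(handles)
```

This keeps one output file open per distinct timestamp, so it is only practical when the number of distinct timestamps stays below the operating system's limit on open files.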

But I would also like to use Dask or the regular Pandas library to process the files more efficiently. I am not sure whether GroupBy would work for splitting such a large dataframe.
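For the plain Pandas route, one possible approach is to read the CSV in chunks and apply GroupBy to each chunk, appending every group to its own file. The sketch below assumes this pattern fits the data; the function name `split_csv_in_chunks`, the default `chunksize`, and the output file naming are placeholders chosen for illustration, not part of any library.

```python
import os
import pandas as pd


def split_csv_in_chunks(src_path, out_dir, key="Unix_Timestamp", chunksize=1_000_000):
    """Read the CSV in chunks and append each timestamp group to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # timestamps whose output file already has a header row
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        for ts, group in chunk.groupby(key, sort=False):
            out_path = os.path.join(out_dir, f"{ts}.csv")
            # Append mode lets a timestamp span several chunks;
            # the header is written only the first time it is seen.
            group.to_csv(out_path, mode="a", header=ts not in seen, index=False)
            seen.add(ts)
    return seen
```

Because the files are opened in append mode, the output directory should be empty before a run, otherwise rows from an earlier run would be duplicated.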

The code below creates an example dataframe like the ones I would like to split, but I am stuck on using GroupBy to split it. Can anyone provide hints? Thanks

from collections import OrderedDict
import pandas as pd

table = OrderedDict()

col_names = []
num_rows = 10 # number of rows for each timestamp (ts1, ts2)

ts1 = [1641220210] * num_rows
ts2 = [1851220221] * num_rows
timestamps = ts1 + ts2 # concatenate
all_values = []

# print(timestamps)

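for i in range(4):
    # Placeholder column names (col_1 ... col_4); the real names are not shown
    col_names.append('col_' + str(i + 1))
    values = [(i+1) * 100] * (num_rows * 2)
    all_values.append(values)  # keep the values so the column is added below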

table['Unix_Timestamp'] = timestamps
for i in range(len(col_names)):
    table[col_names[i]] = all_values[i]

df = pd.DataFrame(table)

[Image: example dataframe for splitting by Unix_Timestamp]
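For a dataframe that fits in memory, such as the small example, GroupBy can be iterated directly: each iteration yields the group key and the matching sub-frame. The snippet below is a minimal sketch on a stand-in frame; the column name `A` and the `parts` dictionary are placeholders for illustration.

```python
import pandas as pd

# Small stand-in for the example frame; the column name "A" is a placeholder.
df = pd.DataFrame({
    "Unix_Timestamp": [1641220210] * 3 + [1851220221] * 3,
    "A": [100] * 6,
})

# Iterating over a GroupBy yields (key, sub-frame) pairs.
parts = {ts: group.reset_index(drop=True)
         for ts, group in df.groupby("Unix_Timestamp")}

for ts, part in parts.items():
    print(ts, part.shape)
```

Each sub-frame could then be written out with `part.to_csv(...)`. For the full 5 to 6 GB files this in-memory form may not be suitable, since it needs the whole dataframe loaded at once.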


