Split a large DataFrame into smaller DataFrames by Unix timestamp

Issue

This content is from Stack Overflow. Question asked by TonyGW.

I have large dataframes in CSV format; each file is about 5–6 GB. Each large dataframe has a column named Unix_Timestamp, and I need to split it into a set of smaller dataframes by the timestamp values. Please see the diagram below:

[Diagram: splitting the large dataframe into smaller ones by timestamp]

The simple solution would be to parse the CSV file line by line, use a dictionary to group rows that share a timestamp value, and then write each group to its own CSV file (a sketch of this follows below).
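
For reference, here is a minimal sketch of that line-by-line approach (file names such as large.csv and split_{ts}.csv are placeholders, not from the original post). Instead of buffering every row in a dictionary, it keeps a dictionary of open writers and streams each row straight into the output file for its timestamp:

import csv

writers = {}  # maps timestamp -> (file handle, csv.writer)
with open('large.csv', newline='') as src:
    reader = csv.DictReader(src)
    for row in reader:
        ts = row['Unix_Timestamp']
        if ts not in writers:
            # open a new output file the first time a timestamp appears
            f = open(f'split_{ts}.csv', 'w', newline='')
            w = csv.writer(f)
            w.writerow(reader.fieldnames)  # copy the header row
            writers[ts] = (f, w)
        writers[ts][1].writerow(row[name] for name in reader.fieldnames)

# close all per-timestamp output files
for f, _ in writers.values():
    f.close()

Note that this keeps one file handle open per distinct timestamp, which is fine for a handful of timestamps but would need a cache with eviction if there are thousands.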

But I am also trying to use Dask or the regular pandas library to process it more efficiently. I am not sure whether GroupBy would work for splitting the large dataframe.
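
With plain pandas, one hedged sketch (the chunk size and file names below are illustrative assumptions) is to stream the CSV in bounded-memory chunks, split each chunk with groupby, and append every group to a per-timestamp file:

import os
import pandas as pd

# read the 5-6 GB CSV in chunks so memory stays bounded
for chunk in pd.read_csv('large.csv', chunksize=1_000_000):
    # groupby yields one sub-dataframe per distinct timestamp in the chunk
    for ts, group in chunk.groupby('Unix_Timestamp'):
        out = f'split_{ts}.csv'
        # append, writing the header only when the file is first created
        group.to_csv(out, mode='a', index=False,
                     header=not os.path.exists(out))

Dask's dd.read_csv exposes a similar groupby interface, but writing one file per group takes more care there, so the chunked-pandas version is the simpler starting point.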

I have code below that creates an example dataframe like the ones I would like to split, but I am stuck on using GroupBy to split it. Can anyone provide hints? Thanks.

from collections import OrderedDict
import pandas as pd

table = OrderedDict()

col_names = []
num_rows = 10 # number of rows for each timestamp (ts1, ts2)

ts1 = [1641220210] * num_rows
ts2 = [1851220221] * num_rows
timestamps = ts1 + ts2 # concatenate
all_values = []

# print(timestamps)

# build four constant-valued columns: Col1=100, Col2=200, Col3=300, Col4=400
for i in range(4):
    col_names.append(f'Col{i+1}')
    values = [(i+1) * 100] * (num_rows * 2)
    all_values.append(values)

table['Unix_Timestamp'] = timestamps
for i in range(len(col_names)):
    table[col_names[i]] = all_values[i]

df = pd.DataFrame(table)
print(df)

[Screenshot: the example dataframe to be split by Unix_Timestamp]
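
As a hint for the example above (a sketch, not a confirmed answer): df.groupby('Unix_Timestamp') already performs the split, yielding one sub-dataframe per distinct timestamp, so the pieces can be collected into a dictionary or written out directly:

# each iteration yields a timestamp and the sub-dataframe of its rows
sub_frames = {ts: group for ts, group in df.groupby('Unix_Timestamp')}

for ts, group in sub_frames.items():
    print(ts, group.shape)  # e.g. 1641220210 (10, 5)
    # group.to_csv(f'split_{ts}.csv', index=False)  # optional: one CSV per timestamp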



Solution

This question has not yet been answered. Once a confirmed answer is available, it will be published here as the solution.

This question and answer were collected from Stack Overflow and tested by the JTuto community. The content is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
