Issue
I have large dataframes in CSV format; each file is about 5–6 GB. Each dataframe has a column named Unix_Timestamp, and I need to split the large dataframe into a set of smaller dataframes by the timestamp values.
The simple solution would be to parse the CSV file line by line, use a dictionary to group the rows that share a timestamp value, and then write each group to its own CSV file.
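For example, the line-by-line version I have in mind looks roughly like this (the input path and output file names are placeholders, and keeping every row in the dictionary means the whole file sits in memory):

import csv
from collections import defaultdict

INPUT_FILE = 'large_input.csv'  # placeholder path for the 5-6 GB file

# Bucket rows by their Unix_Timestamp value, then write each bucket out.
buckets = defaultdict(list)
with open(INPUT_FILE, newline='') as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    for row in reader:
        buckets[row['Unix_Timestamp']].append(row)

for ts, rows in buckets.items():
    with open(f'split_{ts}.csv', 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)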
But I am also trying to use Dask or the regular pandas library to process it more efficiently. I am not sure whether GroupBy would work for splitting the large dataframe.
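For the Dask route, this is roughly the kind of thing I have been sketching (the input path is a placeholder, and I suspect filtering once per unique timestamp is slow when there are many distinct values):

import dask.dataframe as dd

# Lazily read the large CSV ('large_input.csv' is a placeholder path).
ddf = dd.read_csv('large_input.csv')

# For each distinct timestamp, filter the matching rows and write them to
# their own CSV. Simple, but it re-scans the data once per timestamp.
for ts in ddf['Unix_Timestamp'].unique().compute():
    ddf[ddf['Unix_Timestamp'] == ts].to_csv(f'split_{ts}.csv', single_file=True)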
I have the code below to create an example dataframe that I would like to split, but I am stuck on using GroupBy to split it. Can anyone provide hints? Thanks.
from collections import OrderedDict
import pandas as pd
table = OrderedDict()
col_names = []
num_rows = 10 # number of rows for each timestamp (ts1, ts2)
ts1 = [1641220210] * num_rows
ts2 = [1851220221] * num_rows
timestamps = ts1 + ts2 # concatenate
all_values = []
# print(timestamps)
for i in range(4):
    col_names.append(f'Col{i+1}')
    values = [(i+1) * 100] * (num_rows * 2)
    all_values.append(values)

table['Unix_Timestamp'] = timestamps

for i in range(len(col_names)):
    table[col_names[i]] = all_values[i]
df = pd.DataFrame(table)
df
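To make the goal concrete, splitting the small example dataframe could look roughly like this (output file names are made up for illustration); for the real 5–6 GB files I assume the same idea would have to be applied per chunk with pd.read_csv(..., chunksize=...) and append mode so memory stays bounded:

import os

# Split the example dataframe by timestamp, one CSV per group.
for ts, group in df.groupby('Unix_Timestamp'):
    group.to_csv(f'split_{ts}.csv', index=False)

# Sketch for the real files: process the CSV in chunks and append each
# group to its timestamp's file ('large_input.csv' is a placeholder).
for chunk in pd.read_csv('large_input.csv', chunksize=1_000_000):
    for ts, group in chunk.groupby('Unix_Timestamp'):
        out_path = f'split_{ts}.csv'
        group.to_csv(out_path, mode='a', index=False,
                     header=not os.path.exists(out_path))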