I am trying to sample data based on operational hours. For example, Serial C was introduced on 2014-01-01 and failed on 2014-01-03. Serials B and D never failed. I want to compute the operational hours for each serial number.
I was able to do it using PySpark as follows:
```python
drive_spans = spark.sql("""
    select serial_number,
           max(date) as retired_date,
           min(date) as launched_date,
           count(date) as observed_days,
           min(case when failure = 1 then date end) as failed_date,
           max(smart_187_raw) as max_hours,
           min(case when failure = 1 then smart_187_raw end) as failed_hours,
           max(failure) as failure
    from df
    group by serial_number
""").cache()
drive_spans.count()

dfsurv = spark.sql("""
    select drive_spans.*,
           datediff(coalesce(failed_date, retired_date), launched_date) as duration,
           min(launched_date) over (partition by serial_number) as model_introduced
    from drive_spans
""")
```
However, I couldn't reproduce the same logic in pandas. I'd appreciate suggestions for the dataframe below; I want to produce the same summary table from this data. Thanks!
```python
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta

df = pd.read_csv(
    'https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',
    sep='\t',  # the file is tab-separated
)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
```
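From what I understand, a direct pandas translation might look something like the sketch below, though I'm not confident it is correct or idiomatic. It assumes the CSV actually contains the `serial_number`, `date`, `failure`, and `smart_187_raw` columns referenced in the PySpark query:

```python
# Per-serial aggregates, mirroring the group-by in the PySpark query.
drive_spans = df.groupby('serial_number').agg(
    retired_date=('date', 'max'),
    launched_date=('date', 'min'),
    observed_days=('date', 'count'),
    max_hours=('smart_187_raw', 'max'),
    failure=('failure', 'max'),
).reset_index()

# min(case when failure = 1 then ... end) becomes an aggregation over the
# failed rows only, merged back in (NaN where a drive never failed).
failed = (df[df['failure'] == 1]
          .groupby('serial_number')
          .agg(failed_date=('date', 'min'),
               failed_hours=('smart_187_raw', 'min'))
          .reset_index())
drive_spans = drive_spans.merge(failed, on='serial_number', how='left')

# duration = datediff(coalesce(failed_date, retired_date), launched_date)
end_date = drive_spans['failed_date'].fillna(drive_spans['retired_date'])
drive_spans['duration'] = (end_date - drive_spans['launched_date']).dt.days

# min(launched_date) over (partition by serial_number) is a no-op after
# grouping by serial_number, so this just copies launched_date.
drive_spans['model_introduced'] = drive_spans['launched_date']
```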