This content is from Stack Overflow. Question asked by ForestGump.
I am trying to sample data based on operational hours. For example, serial C was introduced on 2014-01-01 and failed on 2014-01-03. Serials B and D never failed. I want to compute the operational hours as follows:
I was able to do it using PySpark as follows:
drive_spans = spark.sql("""
    select serial_number,
           max(date) as retired_date,
           min(date) as launched_date,
           count(date) as observed_days,
           min(case when failure = 1 then date end) as failed_date,
           max(smart_187_raw) as max_hours,
           min(case when failure = 1 then smart_187_raw end) as failed_hours,
           max(failure) as failure
    from df
    group by serial_number
""").cache()
drive_spans.count()

# Register the aggregated frame so the next SQL query can refer to it by name.
drive_spans.createOrReplaceTempView("drive_spans")

dfsurv = spark.sql("""
    select drive_spans.*,
           datediff(coalesce(failed_date, retired_date), launched_date) as duration,
           min(launched_date) over (partition by serial_number) as model_introduced
    from drive_spans
""")
However, I couldn't reproduce the same PySpark functionality in pandas. I would appreciate suggestions for producing the table above from the following dataframe. Thanks!
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta

df = pd.read_csv(
    'https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',
    sep='\t',  # the file is tab-separated; sep='t' was a typo
)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
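One possible translation of the Spark SQL above into pandas is sketched below. It uses a small inline dataframe with the same column names (`serial_number`, `date`, `failure`, `smart_187_raw`) as a stand-in for the linked CSV, since the exact contents of that file are an assumption here; the grouping and window logic should carry over to the real data unchanged.

```python
import pandas as pd

# Hypothetical sample data mimicking the question's schema; serial C fails
# on 2014-01-03, serials B and D never fail.
df = pd.DataFrame({
    "serial_number": ["B", "B", "C", "C", "C", "D"],
    "date": pd.to_datetime(
        ["2014-01-01", "2014-01-02",
         "2014-01-01", "2014-01-02", "2014-01-03",
         "2014-01-01"]
    ),
    "failure": [0, 0, 0, 0, 1, 0],
    "smart_187_raw": [10, 20, 5, 15, 25, 8],
})

# Per-serial aggregation mirroring the Spark "group by serial_number".
spans = df.groupby("serial_number").agg(
    retired_date=("date", "max"),
    launched_date=("date", "min"),
    observed_days=("date", "count"),
    max_hours=("smart_187_raw", "max"),
    failure=("failure", "max"),
)

# min(case when failure = 1 then ... end) becomes a filtered groupby + join;
# serials with no failure rows get NaT/NaN after the left join.
failed = df[df["failure"] == 1].groupby("serial_number").agg(
    failed_date=("date", "min"),
    failed_hours=("smart_187_raw", "min"),
)
spans = spans.join(failed)

# datediff(coalesce(failed_date, retired_date), launched_date)
end_date = spans["failed_date"].fillna(spans["retired_date"])
spans["duration"] = (end_date - spans["launched_date"]).dt.days

# min(launched_date) over (partition by serial_number): after the groupby
# there is one row per serial, so this is just launched_date itself.
spans["model_introduced"] = spans["launched_date"]

spans = spans.reset_index()
print(spans)
```

Note that the Spark query's window function partitions by the same key as the group-by, so in pandas it collapses to a column copy; if the intent was to partition by a model column instead, `df.groupby("model")["date"].transform("min")` would be the analogue.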