Translating a PySpark SQL operation to Pandas to compute drive ages

Issue

This content is from Stack Overflow; the question was asked by ForestGump.

I am trying to sample data based on operational hours. For example, serial C was introduced on 2014-01-01 and failed on 2014-01-03. Serials B and D never failed. I want to compute the operational hours as follows:

[Image in the original post: a table of the expected operational hours per serial number.]

I was able to do it using PySpark as follows:

# df must be registered as a temp view before spark.sql can query it by name
df.createOrReplaceTempView("df")

drive_spans = spark.sql("""
select
    serial_number,
    max(date) as retired_date,
    min(date) as launched_date,
    count(date) as observed_days,
    min(case when failure=1 then date end) as failed_date,
    max(smart_187_raw) as max_hours,
    min(case when failure=1 then smart_187_raw end) as failed_hours,
    max(failure) as failure
from df
group by serial_number
""").cache()
drive_spans.count()  # action that materializes the cached result

# register the intermediate result so the second query can reference it by name
drive_spans.createOrReplaceTempView("drive_spans")

dfsurv = spark.sql("""
select
    drive_spans.*,
    datediff(coalesce(failed_date, retired_date), launched_date) as duration,
    min(launched_date) over (partition by serial_number) as model_introduced
from drive_spans
""")

However, I couldn't reproduce the same functionality in Pandas. I would appreciate suggestions for the following dataframe; I want to get the above table for this data. Thanks!

import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta

# the file is tab-separated; the bare 't' in the scraped post was a mangled '\t'
df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv', sep='\t')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')



Solution

This question has not been answered yet; be the first to answer in the comments. Once confirmed, the answer will be published here as the solution.
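In the meantime, here is a minimal, unverified pandas sketch of the same two-step computation, assuming the CSV provides the serial_number, date, failure, and smart_187_raw columns used in the Spark queries above:

# step 1: per-serial aggregates, mirroring the first Spark SQL query
drive_spans = (
    df.groupby('serial_number')
      .agg(retired_date=('date', 'max'),
           launched_date=('date', 'min'),
           observed_days=('date', 'count'),
           max_hours=('smart_187_raw', 'max'),
           failure=('failure', 'max'))
      .reset_index()
)

# the conditional aggregates (min(case when failure=1 then ...)) become
# plain aggregates over the failed rows only, merged back with a left join
failed = (
    df[df['failure'] == 1]
      .groupby('serial_number')
      .agg(failed_date=('date', 'min'),
           failed_hours=('smart_187_raw', 'min'))
      .reset_index()
)
drive_spans = drive_spans.merge(failed, on='serial_number', how='left')

# step 2: datediff(coalesce(failed_date, retired_date), launched_date)
end_date = drive_spans['failed_date'].fillna(drive_spans['retired_date'])
drive_spans['duration'] = (end_date - drive_spans['launched_date']).dt.days

# min(launched_date) over (partition by serial_number): with one row per
# serial after the groupby, this reduces to launched_date itself; partition
# by a model column instead if per-model introduction dates are intended
drive_spans['model_introduced'] = drive_spans['launched_date']

Note that the named-aggregation form of groupby(...).agg shown here requires pandas 0.25 or newer.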

This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
