[SOLVED] Removing rows if they occur within a certain time of each other by a group value in R – Stack Overflow

Issue

This Content is from Stack Overflow. Question asked by Aleix Dorca

This is a follow up question to Removing rows if they occur within a certain time of each other by a group value in R.

I have modified the df presented there with a particular case I am trying to filter.

This is the code with a quick way to filter out rows with less than a 5 minutes interval using tidyverse (similar to the solution posted by PaulS in the original question.

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

df %>% 
  group_by(ID) %>% 
  mutate(d = abs(difftime(lag(Timestamp), Timestamp)),
         keep = is.na(d) | d > 5*60) %>% 
  filter(keep) %>% 
  select(-d, -keep) %>% 
  arrange(Row)

This is the result:

# A tibble: 8 × 3
# Groups:   ID [3]
    Row Timestamp           ID   
  <dbl> <chr>               <chr>
1     1 0020-06-29 12:14:00 B    
2     2 0020-06-29 12:27:00 A    
3     3 0020-06-29 12:27:22 B    
4     8 0020-06-29 12:43:00 B    
5     9 0020-06-29 12:44:00 C    
6    11 0020-06-29 12:55:00 A    
7    12 0020-06-29 12:57:00 C    
8    13 0020-06-29 13:04:00 B   

This is not what I want because the time difference is always obtained to the previous row (using lag). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away each other and Row 5. The truth is, though, that Row 6 is more than five minutes away from the first instance in group A. It should be kept and become the first instance for a new interval and then Row 7 would be removed accordingly to the distance to the new first instance in Row 6.

What I have been unable to obtain (without a for loop) is to define this time interval groups of 5 minutes, not from successive rows, but from the first instance after 5 minues.



Solution

Supposing that what I comment above does not occur, a possible solution is the following:

library(tidyverse)
library(lubridate)

elapsed <- function(x)
{
  y <- abs(as.duration(x[2:length(x)] %--% x[1:(length(x)-1)]))
  y >= 5*60
} 

df %>% 
  group_split(ID) %>% 
  map_dfr(~ .[c(T, if (nrow(.) > 1) elapsed(.$Timestamp)),]) %>% 
  arrange(Row)

The output:

# A tibble: 8 × 3
    Row Timestamp           ID   
  <int> <chr>               <chr>
1     1 0020-06-29 12:14:00 B    
2     2 0020-06-29 12:27:00 A    
3     3 0020-06-29 12:27:22 B    
4     5 0020-06-29 12:43:00 B    
5     6 0020-06-29 12:44:00 C    
6     8 0020-06-29 12:55:00 A    
7     9 0020-06-29 12:57:00 C    
8    10 0020-06-29 13:04:00 B    


This Question was asked in StackOverflow by bill_scoresby and Answered by PaulS It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?