Issue
This content is from Stack Overflow. Question asked by e_putyora.
I have several columns in my dataset on which I am trying to use aggregate in order to determine the number of observations, mean and sd for each individual in my study. I then want to determine the mean and sd across all individuals. The first aggregate call works fine, but in the second, the mean calculation excludes any mean values from the first aggregate that were calculated from a single observation. I have made an example data frame below to illustrate the problem.
Firstly here is the sample dataset:
trouble <- structure(list(groupnumber = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), tagnumber = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), disturbance = c("Y",
"Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",
"Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",
"Y", "Y", "Y", "Y", "Y", "Y"), sleepstate = c("WAKE", "SWS",
"REM", "WAKE", "SWS", "REM", "WAKE", "SWS", "REM", "WAKE", "SWS",
"REM", "WAKE", "SWS", "REM", "WAKE", "SWS", "REM", "WAKE", "SWS",
"REM", "WAKE", "SWS", "REM", "WAKE", "SWS", "REM", "WAKE", "SWS",
"REM", "WAKE", "SWS", "REM"), proportion = c(0.25, 0.33, 0.42,
0.18, 0.44, 0.38, 0.11, 0.19, 0.7, 0.55, 0.17, 0.28, 0.42, 0.22,
0.36, 0.6, 0.2, 0.2, 0.42, 0.31, 0.27, 0.21, 0.38, 0.41, 0.65,
0.2, 0.15, 0.11, 0.52, 0.37, 0.28, 0.39, 0.33)), class = "data.frame", row.names = c(NA,
-33L))
Then I run the first aggregate as follows:
troubletablelong<-as.data.frame(aggregate(cbind(trouble$proportion)~trouble$tagnumber+trouble$disturbance+trouble$sleepstate,data=trouble,FUN=function(x)c(nobs(x),mean(x),sd(x))))
Resulting in:
structure(list(`trouble$tagnumber` = c(1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L), `trouble$disturbance` = c("Y", "Y", "Y", "Y", "Y",
"Y", "Y", "Y", "Y"), `trouble$sleepstate` = c("REM", "REM", "REM",
"SWS", "SWS", "SWS", "WAKE", "WAKE", "WAKE"), V1 = structure(c(5,
5, 1, 5, 5, 1, 5, 5, 1, 0.428, 0.28, 0.33, 0.27, 0.322, 0.39,
0.302, 0.398, 0.28, 0.160374561573836, 0.11, NA, 0.113357840487546,
0.134610549363711, NA, 0.180194339533738, 0.236156727619604,
NA), dim = c(9L, 3L))), row.names = c(NA, -9L), class = "data.frame")
And with a tiny bit of clean-up and column headings:
structure(list(tagnumber = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), disturbance = c("Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",
"Y"), sleepstate = c("REM", "REM", "REM", "SWS", "SWS", "SWS",
"WAKE", "WAKE", "WAKE"), nobs = c(5, 5, 1, 5, 5, 1, 5, 5, 1),
mean = c(0.428, 0.28, 0.33, 0.27, 0.322, 0.39, 0.302, 0.398,
0.28), sd = c(0.160374561573836, 0.11, NA, 0.113357840487546,
0.134610549363711, NA, 0.180194339533738, 0.236156727619604,
NA)), row.names = c(NA, -9L), class = "data.frame")
Of note here is the fact that the number of observations for tagnumber 3 for each sleep state is only 1. This results in a mean value calculated using only a single value and of course no sd can be calculated hence the 3 NAs in the last column.
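As a quick check of why the sd is missing here: the sample standard deviation divides by n - 1, which is zero for a single observation, so R returns NA. A minimal sketch, using the single REM value for tagnumber 3 from the sample data (the variable name x is only for illustration):

```r
x <- 0.33   # the only REM observation for tagnumber 3
mean(x)     # the mean of one value is the value itself
sd(x)       # NA, because the n - 1 denominator is zero
```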
I then want to run another aggregate function exactly the same as the first but across all tag numbers so that I only get a single value for each of the 3 sleep states. The code for that is as follows:
newtroubletablelong2<-as.data.frame(aggregate(cbind(newtroubletablelong$mean,newtroubletablelong$sd)~newtroubletablelong$sleepstate+newtroubletablelong$disturbance,data=newtroubletablelong,FUN=function(x) c(nobs(x),mean(x),sd(x))))
Resulting in:
structure(list(`newtroubletablelong$sleepstate` = c("REM", "SWS",
"WAKE"), `newtroubletablelong$disturbance` = c("Y", "Y", "Y"),
V1 = structure(c(2, 2, 2, 0.354, 0.296, 0.35, 0.104651803615609,
0.0367695526217005, 0.0678822509939086), dim = c(3L, 3L)),
V2 = structure(c(2, 2, 2, 0.135187280786918, 0.123984194925629,
0.208175533576671, 0.0356201940881585, 0.0150279345649194,
0.0395713841069094), dim = c(3L, 3L))), row.names = c(NA,
-3L), class = "data.frame")
Again with some clean-up and column headings we get:
structure(list(sleepstate = c("REM", "SWS", "WAKE"), disturbance = c("Y",
"Y", "Y"), meannobs = c(2, 2, 2), meanmean = c(0.354, 0.296,
0.35), meansd = c(0.104651803615609, 0.0367695526217005, 0.0678822509939086
), sdnobs = c(2, 2, 2), sdmean = c(0.135187280786918, 0.123984194925629,
0.208175533576671), sdsd = c(0.0356201940881585, 0.0150279345649194,
0.0395713841069094)), row.names = c(NA, -3L), class = "data.frame")
Unfortunately the number of observations used here is only 2 when it should be 3 as per the previous table and I am unsure why R is omitting this value for this second mean calculation. If anyone could tell me why this is happening and how I could tweak my code to prevent this it would be greatly appreciated. And apologies for any formatting issues as this is my first post!
Solution
Here is a simpler way: two aggregate calls will do it. Notice na.action = na.pass; the default is na.omit. The formula interface applies na.action to whole rows before grouping, and in the second aggregate 1 in 3 values of sd is NA, so with the default setting only two rows are passed on to the aggregation function, for mean as well as for sd.
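# Group by every column except groupnumber; na.pass keeps rows containing NA
agg <- aggregate(proportion ~ ., trouble[-1], \(x) {
  c(nobs = length(x), mean = mean(x), sd = sd(x))
}, na.action = na.pass)
# The summary comes back as one matrix column; spread it into ordinary columns
agg <- cbind(agg[-ncol(agg)], agg[[ncol(agg)]])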
agg
#>   tagnumber disturbance sleepstate nobs  mean        sd
#> 1         1           Y        REM    5 0.428 0.1603746
#> 2         2           Y        REM    5 0.280 0.1100000
#> 3         3           Y        REM    1 0.330        NA
#> 4         1           Y        SWS    5 0.270 0.1133578
#> 5         2           Y        SWS    5 0.322 0.1346105
#> 6         3           Y        SWS    1 0.390        NA
#> 7         1           Y       WAKE    5 0.302 0.1801943
#> 8         2           Y       WAKE    5 0.398 0.2361567
#> 9         3           Y       WAKE    1 0.280        NA
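# Drop tagnumber so the grouping is by disturbance and sleepstate only
agg2 <- aggregate(cbind(nobs, mean, sd) ~ ., agg[-1], \(x) {
  c(meannobs = length(x), meanmean = mean(x, na.rm = TRUE), meansd = sd(x, na.rm = TRUE))
}, na.action = na.pass)
# Keep the summaries of mean (4th column) and sd (5th column); drop those of nobs
agg2 <- cbind(agg2[1:2], agg2[[4]], agg2[[5]])
# Rename the second set of summary columns from mean* to sd*
names(agg2)[6:8] <- sub("^mean", "sd", names(agg2)[6:8])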
agg2
#>   disturbance sleepstate meannobs  meanmean     meansd sdnobs    sdmean
#> 1           Y        REM        3 0.3460000 0.07528612      3 0.1351873
#> 2           Y        SWS        3 0.3273333 0.06017752      3 0.1239842
#> 3           Y       WAKE        3 0.3266667 0.06274817      3 0.2081755
#>         sdsd
#> 1 0.03562019
#> 2 0.01502793
#> 3 0.03957138
Created on 2022-09-19 with reprex v2.0.2
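To see the effect of na.action in isolation, here is a minimal sketch on a toy data frame. The names toy, g, m, s and count_ok are made up for illustration and are not part of the original data or answer. With the default na.omit, every row that has an NA in any formula variable is dropped before grouping, so the row with the missing s is also lost for m; with na.pass the row is kept and the NA has to be handled inside the function.

```r
# Toy data (hypothetical names): one group, three rows, one missing value in s
toy <- data.frame(g = c("a", "a", "a"),
                  m = c(1, 2, 3),
                  s = c(0.5, 0.7, NA))

# Helper that counts the non-missing values it receives
count_ok <- function(x) sum(!is.na(x))

# Default na.action = na.omit: the whole third row is dropped before grouping
res_omit <- aggregate(cbind(m, s) ~ g, data = toy, FUN = count_ok)

# na.action = na.pass: all rows reach FUN, which must cope with the NA itself
res_pass <- aggregate(cbind(m, s) ~ g, data = toy, FUN = count_ok,
                      na.action = na.pass)

res_omit  # m and s are both counted over 2 rows
res_pass  # m is counted over 3 rows, s over its 2 non-missing values
```

Note that in the second aggregate above, length(x) also counts NA entries, which is why sdnobs is reported as 3 even though only two sd values are non-missing. A counter along the lines of count_ok would report 2 there, if that is the number wanted.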
This question was asked on Stack Overflow by e_putyora and answered by Rui Barradas. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.