Remove columns/rows with more than X% missing values
I want to remove all columns or rows with more than 50% NAs in a data frame.
This is my solution:
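# delete columns with more than 50% missing values
miss <- c()
for (i in seq_len(ncol(data))) {
  if (sum(is.na(data[, i])) > 0.5 * nrow(data)) miss <- append(miss, i)
}
# data[, -miss] would drop every column if miss were empty, so guard against that
data2 <- if (length(miss) > 0) data[, -miss, drop = FALSE] else data
# delete rows with more than 50% missing values
miss2 <- c()
for (i in seq_len(nrow(data))) {
  if (sum(is.na(data[i, ])) > 0.5 * ncol(data)) miss2 <- append(miss2, i)
}
# index with miss2 (the row indices), not miss (the column indices)
data2 <- if (length(miss2) > 0) data2[-miss2, , drop = FALSE] else data2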
But I am looking for a nicer / faster solution.
I would also appreciate a dplyr solution.
3 answers
To remove columns with a given share of NAs, you can use colMeans(is.na(...)):
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Remove columns with more than 50% NA (columns at exactly 50% are kept)
dat[, colMeans(is.na(dat)) <= 0.5, drop = FALSE]
## Remove rows with more than 50% NA
dat[rowMeans(is.na(dat)) <= 0.5, ]
## Remove columns and rows with more than 50% NA
dat[rowMeans(is.na(dat)) <= 0.5, colMeans(is.na(dat)) <= 0.5, drop = FALSE]
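Since the question asks about an arbitrary X%, the same idea can be wrapped in a small function. This is only a sketch: the name drop_sparse and its arguments (threshold, rows, cols) are hypothetical and not part of the answer, and the sample data frame is made up for illustration.

```r
# Hypothetical helper (not from the answer above): drop rows and/or columns
# whose share of NA values exceeds `threshold` (a fraction between 0 and 1).
drop_sparse <- function(df, threshold = 0.5, rows = TRUE, cols = TRUE) {
  na_mask <- is.na(df)  # logical matrix, TRUE where a value is missing
  keep_rows <- if (rows) rowMeans(na_mask) <= threshold else rep(TRUE, nrow(df))
  keep_cols <- if (cols) colMeans(na_mask) <= threshold else rep(TRUE, ncol(df))
  # drop = FALSE keeps the result a data frame even if one column remains
  df[keep_rows, keep_cols, drop = FALSE]
}

# Made-up sample data: column a is 75% NA, rows 3 and 4 are two-thirds NA
example_df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(1, 2, NA, 4),
  c = c(1, 2, 3, NA)
)
result <- drop_sparse(example_df, threshold = 0.5)
```

Both masks are computed on the original data before anything is removed, which matches the combined one-liner above. Filtering columns first and then recomputing the row shares can give a different result.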
Author: Rorschach,
2020-02-25 20:45:47
A tidyverse solution that removes columns with x% NAs (50% here):
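library(purrr)  # provides discard() and re-exports %>%

test_data <- data.frame(A = c(rep(NA, 12), 520, 233, 522),
                        B = c(rep(10, 12), 520, 233, 522))
# Remove all columns with %NA >= 50
# (use > 50 instead to keep columns at exactly 50%)
test_data %>%
  discard(~ sum(is.na(.x)) / length(.x) * 100 >= 50)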
Result:
B
1 10
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
10 10
11 10
12 10
13 520
14 233
15 522
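Since the question asked for dplyr specifically, the following is a possible sketch of the same filtering with dplyr verbs. It is not part of this answer. It assumes dplyr 1.0.0 or later (for where()), and the sample data frame is made up for illustration.

```r
library(dplyr)

# Made-up sample data for illustration
df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(1, 2, NA, 4),
  c = c(1, 2, 3, NA)
)

# Keep columns in which at most 50% of the values are NA
cols_kept <- df %>%
  select(where(~ mean(is.na(.x)) <= 0.5))

# Then keep rows in which at most 50% of the remaining values are NA;
# the magrittr pipe substitutes the piped data frame for `.`
rows_kept <- cols_kept %>%
  filter(rowMeans(is.na(.)) <= 0.5)
```

Note that the row filter here runs after the columns have been removed, so the row shares are computed over the remaining columns only.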
Author: NelsonGon,
2019-08-21 13:21:11
Here is another tip for filtering out the columns of a data frame that are more than 50% NA:
## Remove columns with more than 50% NA
rawdf.prep1 = rawdf[, sapply(rawdf, function(x) sum(is.na(x)))/nrow(rawdf)*100 <= 50]
This yields a data frame containing only the columns in which at most 50% of the values are NA.
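Because the filter relies on is.na(), it counts NaN values as missing too. A minimal illustration with a made-up vector:

```r
# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
x <- c(1, NA, NaN)
na_flags  <- is.na(x)   # FALSE TRUE TRUE
nan_flags <- is.nan(x)  # FALSE FALSE TRUE
```

If only NaN should count toward the threshold, is.nan() could be swapped in for is.na() inside the sapply() call.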
Author: abdoulsn,
2020-12-19 13:21:14