Split data frame string column into multiple columns

Question

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced it. It seemed far too complicated to be the best way. I can use strsplit as below, but I'm not clear how to get that back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I haven't quite grokked R lists just yet.

170

dataframe string split r r-faq

Author: David Arenburg, 2010-12-04

15 answers

Another option is to use the new tidyr package.

library(dplyr)
library(tidyr)

before <- data.frame(
  attr = c(1, 30 ,4 ,6 ), 
  type = c('foo_and_bar', 'foo_and_bar_2')
)

before %>%
  separate(type, c("foo", "bar"), "_and_")

##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2
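
When some rows contain more or fewer pieces than the number of target columns, separate() has extra and fill arguments that control what happens. The following is a minimal sketch, assuming tidyr is installed; the ragged input before2 and the column names type_1 and type_2 are made up for illustration and are not part of the original answer.

```r
library(tidyr)

# Hypothetical ragged input: one row has an extra piece, one row has no delimiter
before2 <- data.frame(
  attr = c(1, 30, 4),
  type = c("foo_and_bar", "foo_and_bar_2_and_bar_3", "foo"),
  stringsAsFactors = FALSE
)

# extra = "merge" keeps any surplus pieces in the last column;
# fill = "right" pads missing pieces on the right with NA
after2 <- separate(
  before2, type,
  into = c("type_1", "type_2"),
  sep = "_and_",
  extra = "merge",
  fill = "right"
)
after2
```

Without these arguments separate() still returns a result, but it warns about the rows that do not fit.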

123

Author: hadley,
2014-06-11 16:50:59

5 years later, adding the obligatory data.table solution

library(data.table) ## v 1.9.6+ 
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also make sure that the resulting columns have the correct types, and improve performance, by adding the type.convert and fixed arguments (since "_and_" isn't really a regex)

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]
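
The effect of the two arguments can be seen in base R as well. The sketch below is an illustration under assumptions, not the data.table internals: the input x and the column names id and label are invented, and "." is used as the delimiter because it behaves very differently as a regex than as a literal string.

```r
# Illustrative input only (not from the original question)
x <- c("1.a", "2.b")

# fixed = TRUE matches the delimiter literally instead of as a regex,
# which matters here because "." as a regex matches any character
parts <- strsplit(x, ".", fixed = TRUE)

# Bind the pieces into a character matrix, one row per input string
m <- do.call(rbind, parts)

# type.convert() turns each column into its natural type;
# as.is = TRUE keeps non-numeric columns as character rather than factor
df <- data.frame(
  id = type.convert(m[, 1], as.is = TRUE),
  label = type.convert(m[, 2], as.is = TRUE),
  stringsAsFactors = FALSE
)
str(df)
```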

47

Author: David Arenburg,
2016-08-22 07:47:42

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))  
out <- strsplit(as.character(before$type),'_and_') 
do.call(rbind, out)

     [,1]  [,2]   
[1,] "foo" "bar"  
[2,] "foo" "bar_2"
[3,] "foo" "bar"  
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))
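
One caveat: do.call(rbind, out) assumes every element of out has the same length. If the pieces are ragged, rbind recycles the shorter vectors (with a warning) instead of padding them. A minimal sketch of one way around this, using made-up ragged input and padding each element with NA first:

```r
# Hypothetical ragged input (not from the original question)
type <- c("foo_and_bar", "foo_and_bar_2_and_bar_3", "foo")
out <- strsplit(type, "_and_", fixed = TRUE)

# Longest number of pieces found in any row
n <- max(lengths(out))

# `length<-` extends each character vector to n elements, padding with NA
padded <- lapply(out, `length<-`, n)

# Now every element has the same length, so rbind gives a clean matrix
m <- do.call(rbind, padded)
colnames(m) <- paste0("type_", seq_len(n))
m
```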

42

Author: Aniko,
2010-12-04 00:51:30

Notice that sapply with "[" can be used to extract either the first or second items in those lists, like so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL
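
A variation on the same idea, offered as a sketch rather than as part of the original answer: splitting once and using vapply() instead of sapply() avoids repeating the strsplit call and guarantees that each extraction returns exactly one string per row.

```r
before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2"))

# Split once and reuse the result for both new columns
pieces <- strsplit(as.character(before$type), "_and_", fixed = TRUE)

# vapply() checks that every extracted value is a single character string
before$type_1 <- vapply(pieces, `[`, character(1), 1)
before$type_2 <- vapply(pieces, `[`, character(1), 2)
before$type <- NULL
before
```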

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL
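
The regex approach behaves differently once a string contains the delimiter more than once, or not at all. The sketch below uses invented input to show this; it is an illustration of regex greediness, not a change to the answer above.

```r
# Hypothetical input: two delimiters in one string, none in another
x <- c("foo_and_bar", "foo_and_bar_2_and_bar_3", "no_delimiter")

# Everything before the first "_and_"
first <- sub("_and_.*$", "", x)

# Greedy ".+" consumes up to the LAST "_and_", so only the final piece is left
rest_greedy <- sub("^.+_and_", "", x)

# Lazy ".*?" stops at the FIRST "_and_", keeping the whole remainder
rest_lazy <- sub("^.*?_and_", "", x, perl = TRUE)

data.frame(x, first, rest_greedy, rest_lazy)
```

Strings without the delimiter come back unchanged in every column, so they may need separate handling.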

31

Author: 42-,
2016-09-02 05:20:32

Here's a one-liner along the same lines as Aniko's solution, but using Hadley's stringr package:

library(stringr)
do.call(rbind, str_split(before$type, '_and_'))

27

Author: Ramnath,
2010-12-04 02:09:23

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2

18

Author: A5C1D2H2I1M1N2O1R2T1,
2017-07-12 12:34:58

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix, which needs transposing and casting back to a data frame. After that, a few simple manipulations yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

12

Author: Gavin Simpson,
2010-12-03 23:36:58

Here is a base R one-liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.

7

Author: lmo,
2016-07-22 20:34:38

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version of the problem, where you don't know the number of output columns a priori. So, for example, you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use tidyr's separate() because we don't know the number of result columns before the split, so I created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

library(stringr)
library(tibble)

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  # n = Inf returns as many columns as the longest split needs
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  # name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m'
  # where m = # columns of 'cols'; naming the matrix before conversion
  # avoids tibble's complaint about missing column names
  m <- dim(cols)[2]
  colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
  # as_tibble() replaces the deprecated as.tibble()
  cols <- as_tibble(cols)
  return(cols)
}

We can then use split_into_multiple in a dplyr pipe as follows:

after <- before %>% 
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>% 
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))

>after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>% 
  gather(key, val, -attr, na.rm = TRUE)

   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3
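
In more recent tidyr releases, pivot_longer() is the recommended replacement for gather(). A minimal sketch of the same tidy-up step, assuming tidyr 1.0.0 or later; the after data frame is rebuilt by hand here so the example runs on its own:

```r
library(tidyr)

# The wide result from above, rebuilt by hand for a self-contained example
after <- data.frame(
  attr   = c(1, 30, 4, 6),
  type_1 = rep("foo", 4),
  type_2 = c("bar", "bar_2", "bar_2", "bar"),
  type_3 = c(NA, NA, "bar_3", NA),
  stringsAsFactors = FALSE
)

# values_drop_na = TRUE plays the role of na.rm = TRUE in gather()
long <- pivot_longer(
  after,
  cols = -attr,
  names_to = "key",
  values_to = "val",
  values_drop_na = TRUE
)
long
```

Note that pivot_longer() orders the output row by row rather than column by column, so the rows appear in a different order than in the gather() output.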

5

Author: Yannis P.,
2017-11-01 17:26:04

Another approach, if you want to stick with strsplit(), is to use the unlist() command. Here's a solution along those lines.

tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
   byrow=TRUE)
after <- cbind(before$attr, as.data.frame(tmp))
names(after) <- c("attr", "type_1", "type_2")

4

Author: ashaw,
2010-12-03 23:52:51

As of R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

out <- strcapture(
    "(.*)_and_(.*)",
    as.character(before$type),
    data.frame(type_1 = character(), type_2 = character())
)

cbind(before["attr"], out)
#   attr type_1 type_2
# 1    1    foo    bar
# 2   30    foo  bar_2
# 3    4    foo    bar
# 4    6    foo  bar_2
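
The prototype data frame passed to strcapture() also decides the type of each captured column. The sketch below uses invented input and column names (name, id) to show a capture group being converted to integer; it is an illustration, not part of the original answer.

```r
# Hypothetical input: a name and a numeric suffix separated by "_"
x <- c("foo_1", "bar_22")

# The prototype's column types drive the conversion of each capture group
proto <- data.frame(name = character(), id = integer())

res <- strcapture("([a-z]+)_([0-9]+)", x, proto)
str(res)
```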

4

Author: Rich Scriven,
2017-08-28 19:21:12

This question is pretty old, but I'll add the solution I found to be the simplest at present.

library(reshape2)
before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
newColNames <- c("type1", "type2")
newCols <- colsplit(before$type, "_and_", newColNames)
after <- cbind(before, newCols)
after$type <- NULL
after

3

Author: Swifty McSwifterton,
2017-09-28 20:14:42

Base R, but probably slow:

n <- 1
for(i in strsplit(as.character(before$type),'_and_')){
     before[n, 'type_1'] <- i[[1]]
     before[n, 'type_2'] <- i[[2]]
     n <- n + 1
}

##   attr          type type_1 type_2
## 1    1   foo_and_bar    foo    bar
## 2   30 foo_and_bar_2    foo  bar_2
## 3    4   foo_and_bar    foo    bar
## 4    6 foo_and_bar_2    foo  bar_2

3

Author: Joe,
2018-02-17 03:44:05

tp <- c("a-c","d-e-f","g-h-i","m-n")

temp = strsplit(as.character(tp),'-')

x=c();
y=c();
z=c();

#tab=data.frame()
#tab= cbind(tab,c(x,y,z))

for(i in 1:length(temp) )
{
  l = length(temp[[i]]);

  if(l==2)
  {
     x=c(x,temp[[i]][1]);
     y=c(y,"NA")
     z=c(z,temp[[i]][2]);

    df= as.data.frame(cbind(x,y,z)) 

  }else
  {
    x=c(x,temp[[i]][1]);
    y=c(y,temp[[i]][2]);
    z=c(z,temp[[i]][3]);

    df= as.data.frame(cbind(x,y,z))
   }
}

-5

Author: Soumya Das,
2017-05-26 18:08:56

score 209 · Accepted Answer

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)
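
str_split_fixed() returns a character matrix, so one more step is needed to get the layout asked for in the question. A minimal sketch of that step, assuming stringr is installed; the names parts and after are chosen here for illustration:

```r
library(stringr)

before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2"))

# Character matrix with exactly 2 columns, one row per input string
parts <- str_split_fixed(as.character(before$type), "_and_", 2)
colnames(parts) <- c("type_1", "type_2")

# Keep attr and bind the new columns next to it
after <- cbind(before["attr"], parts)
after
```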

209

Author: hadley,
2010-12-04 04:21:27