I have a data frame with 10 columns and 2000 rows. My sample data looks like this:
rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
I am trying to find the highest combination (A.Ag, B.Ag, C.Ag) for each rs_id and each Code (0, 1 and 2). How can I achieve this? The expected output would be:
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/24:02 3 06:02/06:02 3 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
Answers:
This approach reshapes the data from wide to long format (melting two measure columns simultaneously) and picks the row with the highest Ag value for each unique combination of rs_id, Code and variable. Finally, the result is reshaped back from long to wide format, with the column order rearranged to return the expected result:
library(data.table)
cols <- c("Combination", "Ag")
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
     value.name = cols)[
  , variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
  , .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
  , dcast(.SD, rs_id + Code ~ variable, value.var = cols)][
  , setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))]
   rs_id Code Combination_A Ag_A Combination_B Ag_B Combination_C Ag_C Combination_D Ag_D
1:  rs_1    0    1:01/11:01    2   13:02/49:01    2   03:04/03:04    6     1:01/1:01    1
2:  rs_1    1     1:01/2:01    6   13:02/57:01    1    03:04/7:01    2     1:01/3:01    1
3:  rs_1    2    1:01/24:02    3    13:02/8:01    1   06:02/06:02    3     1:01/4:04    3
4:  rs_2    0     1:01/3:01    1    14:01/7:02    1   06:02/2:02:    1     1:01/4:07    1
5:  rs_2    1    11:01/2:01    4   15:01/15:01    1    06:02/3:03    1     1:01/7:01    2
6:  rs_2    2   11:01/25:01    1   15:01/44:02    2    06:02/4:01    1    10:01/3:01    5
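To make the intermediate long format easier to picture, here is a minimal sketch of the first two steps (melt, then pick the top row per group). It uses a hypothetical toy table, `toy`, with only two (Combination, Ag) column pairs rather than the full data, so the printed result is illustrative only.

```r
library(data.table)

# Hypothetical toy data: two id columns and two (Combination, Ag) column pairs
toy <- data.table(
  rs_id          = c("rs_1", "rs_1"),
  Code           = c(0L, 0L),
  Combination_Ag = c("1:01/1:01", "1:01/11:01"),
  A.Ag           = c(1L, 2L),
  Combination_Bg = c("13:02/13:02", "13:02/49:01"),
  B.Ag           = c(1L, 2L)
)

# Melt both column families at once; 'variable' numbers the pairs (1, 2)
long <- melt(toy, measure.vars = patterns("Combination", "[A-B][.]Ag"),
             value.name = c("Combination", "Ag"))
print(long)

# Keep the row with the largest Ag within each rs_id / Code / variable group
top <- long[, .SD[which.max(Ag)], by = .(rs_id, Code, variable)]
print(top)
```

Note that which.max() returns the first maximum, so when several rows tie on Ag the first one in the data is kept.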
Edit
The OP asked for an explanation of the last of the chained data.table expressions, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))).
This expression reorders the columns of the result by reference, i.e., without copying. When reshaping multiple value.vars, the columns are grouped by value.var:
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
     value.name = cols)[
  , variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
  , .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
  , dcast(.SD, rs_id + Code ~ variable, value.var = cols)]
   rs_id Code Combination_A Combination_B Combination_C Combination_D Ag_A Ag_B Ag_C Ag_D
1:  rs_1    0    1:01/11:01   13:02/49:01   03:04/03:04     1:01/1:01    2    2    6    1
2:  rs_1    1     1:01/2:01   13:02/57:01    03:04/7:01     1:01/3:01    6    1    2    1
3:  rs_1    2    1:01/24:02    13:02/8:01   06:02/06:02     1:01/4:04    3    1    3    3
4:  rs_2    0     1:01/3:01    14:01/7:02   06:02/2:02:     1:01/4:07    1    1    1    1
5:  rs_2    1    11:01/2:01   15:01/15:01    06:02/3:03     1:01/7:01    4    1    1    2
6:  rs_2    2   11:01/25:01   15:01/44:02    06:02/4:01    10:01/3:01    1    2    1    5
whereas the OP expects the output to be grouped by variable. Therefore, the desired column order is c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10).
1 and 2 denote the id.var columns. as.vector(outer(c(0, 4), 3:6, "+")) is just a way to save typing 3, 7, 4, 8, 5, 9, 6, 10.
outer(c(0, 4), 3:6, "+")
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    7    8    9   10
as.vector(outer(c(0, 4), 3:6, "+"))
[1] 3 7 4 8 5 9 6 10
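If positional indices feel brittle, the same interleaved order could also be built from column names. This is only a sketch in base R, assuming the Combination_A ... Ag_D names that dcast() produced above; `new_order` is a name introduced here for illustration.

```r
cols <- c("Combination", "Ag")

# outer() pastes every value.var with every letter into a 2 x 4 character
# matrix; c() then flattens it column by column, which interleaves the names
new_order <- c("rs_id", "Code", outer(cols, LETTERS[1:4], paste, sep = "_"))
print(new_order)
```

Since setcolorder() accepts a character vector of column names as well as numeric positions, `new_order` could be passed to it in place of the numeric vector.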
Edit 2
The code can be simplified further. The call to as.vector() is not necessary inside c() because c() turns matrices into vectors. So, instead of
c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
we can write
c(1:2, outer(c(0, 4), 3:6, "+"))
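A quick way to convince yourself that both forms are equivalent is to compare them directly in base R. The variable names `idx_long` and `idx_short` are introduced here only for illustration.

```r
idx_long  <- c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
idx_short <- c(1:2, outer(c(0, 4), 3:6, "+"))

# Both expressions yield the same plain numeric vector
stopifnot(identical(idx_long, idx_short))

# ... and it matches the column order spelled out by hand
stopifnot(identical(idx_short, c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10)))
```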
Data
Note that I have completed the missing column headers for the last two columns.
library(data.table)
DF <- fread(
"rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag Combination_Dg D.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5"
)