Number similar/duplicated rows in R


By : Ayesh Khalid
Date : October 16 2020, 08:10 PM
Hi, I'm using R and I have data like the example below, in which I want to number similar/duplicated rows. This is your data:
code :
df <- data.frame(a=c(1,1,3,1,3), 
                 b=c(2,2,4,2,4), 
                 c=c(3,1,1,3,1), 
                 d=c(4,2,2,4,2), 
                 e=c(5,2,3,5,3))
# Approach 1: build one factor level per distinct row with interaction()
library(data.table)
i <- interaction(data.table(df), drop=TRUE)
df.out <- cbind(df, id=factor(i,labels=length(unique(i)):1))
#  a b c d e  id
#1 1 2 3 4 5   1
#2 1 2 1 2 2   3
#3 3 4 1 2 3   2
#4 1 2 3 4 5   1
#5 3 4 1 2 3   2
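# Approach 2: let plyr split on all columns and number the groups
# (note that ddply returns the rows sorted by the grouping columns)
library(plyr)
.id <- 0
df.out <- ddply(df, colnames(df), transform, id=(.id<<-.id+1))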
#  a b c d e  id
#1 1 2 1 2 2   1
#2 1 2 3 4 5   2
#3 1 2 3 4 5   2
#4 3 4 1 2 3   3
#5 3 4 1 2 3   3
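
A minimal base R sketch of the same idea, for cases where loading data.table or plyr is not wanted. It is not one of the approaches above: it assumes ids should follow the order in which each distinct row first appears, so the numbering differs from both outputs shown. The name `key` is introduced here for illustration only.

```r
# Same example data as above
df <- data.frame(a = c(1, 1, 3, 1, 3),
                 b = c(2, 2, 4, 2, 4),
                 c = c(3, 1, 1, 3, 1),
                 d = c(4, 2, 2, 4, 2),
                 e = c(5, 2, 3, 5, 3))

# Assumption: build one key string per row, using a separator that
# does not occur in the data, then number keys by first appearance
key <- do.call(paste, c(df, sep = "\r"))
df$id <- match(key, unique(key))
df$id
```

Rows 1 and 4 share one id and rows 3 and 5 share another, because their keys are identical.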



Calculating number of duplicated rows


By : user2358591
Date : March 29 2020, 07:55 AM
Perhaps this is because your first query sums only the groups whose GROUP BY count is exactly 1, whereas your second query returns all counts, whether the group count is one or more.
So if a combination of A, B and C occurs more than once, the first query leaves that group out and the two totals will not be the same.
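
The queries themselves are not shown here, so the following is only a sketch of the effect described, written in R with hypothetical data. The data frame `dat` and the columns `A`, `B`, `C` and `n` are assumptions made for illustration.

```r
# Hypothetical data: the combination (1, "x", TRUE) occurs twice
dat <- data.frame(A = c(1, 1, 2, 3),
                  B = c("x", "x", "y", "z"),
                  C = c(TRUE, TRUE, FALSE, TRUE))

# Count rows per (A, B, C) combination
counts <- aggregate(n ~ A + B + C, data = cbind(dat, n = 1), FUN = sum)

# Analogue of the first query: sum only groups that occur exactly once
only_singletons <- sum(counts$n[counts$n == 1])

# Analogue of the second query: sum every group
all_groups <- sum(counts$n)

c(only_singletons = only_singletons, all_groups = all_groups)
```

The two totals differ by exactly the rows that belong to a repeated combination.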

Exclude duplicated values from different rows of similar id


By : AJ Love
Date : March 29 2020, 07:55 AM
I have a dataset where several values are stored together in one column. I know this is not good practice, but it is beyond my control. The duplicated values can be removed within each v1 group as follows:
code :
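library(data.table)
# Example data reconstructed from the printed output below
DT <- data.table(v1 = c("a", "b", "b"),
                 v2 = c("12,13,12,12,10", "10,10,11,12", "10,10,13,14,12"))
# scan() parses the comma-separated strings and reports "Read N items"
DT[DT[,toString(unique(scan(text = v2,sep = ","))),by=v1],on="v1"]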
Read 5 items
Read 9 items
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14
# quiet = TRUE suppresses the "Read N items" messages
DT[DT[,toString(unique(scan(text = v2,sep = ",",quiet = TRUE))),by=v1],on="v1"]
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

DT[DT[,toString(unique(unlist(strsplit(v2,",")))),by=v1],on="v1"]
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14
# paste(collapse = ",") keeps the original comma-only separator and names the new column V5
DT[DT[,.(V5=paste(unique(unlist(strsplit(v2,","))),collapse=",")),by=v1],on="v1"]
   v1             v2             V5
1:  a 12,13,12,12,10       12,13,10
2:  b    10,10,11,12 10,11,12,13,14
3:  b 10,10,13,14,12 10,11,12,13,14

Find duplicated rows, multiply a certain column by number of duplicates, drop duplicated rows


By : Again
Date : March 29 2020, 07:55 AM
I think this question comes down to getting a count of the occurrences of each unique row. If a row occurs only once, this count is one. If it occurs more often, the count is greater than one. You can then use this count to multiply, filter, and so on.
A one-liner taken from "How to count duplicate rows in pandas dataframe?" creates an extra column with the number of occurrences of each row.
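
The pandas one-liner itself is not reproduced here. As a rough base R sketch of the same idea, assuming a hypothetical data frame `dat` with columns `id` and `value` (none of these names come from the original question):

```r
# Hypothetical data: the row (id = "a", value = 10) occurs three times
dat <- data.frame(id = c("a", "b", "a", "a", "c"),
                  value = c(10, 20, 10, 10, 30),
                  stringsAsFactors = FALSE)

# One key string per row, built before any helper columns are added
key <- do.call(paste, c(dat, sep = "\r"))

# Number of occurrences of each full row, aligned with the original rows
dat$n <- ave(seq_along(key), key, FUN = length)

# Multiply a column by the duplicate count, then keep one copy of each row
dat$value_scaled <- dat$value * dat$n
result <- dat[!duplicated(key), ]
result
```

`ave()` returns the group size for every row, so the count is available both for the multiplication and for any later filtering on `n`.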

Is there an R function to collapse duplicated rows while combining unique columns within these duplicated rows?


By : Zwierzolak
Date : March 29 2020, 07:55 AM
I want to collapse duplicated rows, by unique record ID, in order to consolidate the variables that are spread across them. Some variables are filled in on only one version of a duplicated row, while others are filled in on a different row of the same record. I'm working in R. I'd like each record to exist on a single row without losing any of the columns: one "sum-total" row that collects every value that was filled in on the different rows, so that the final row is not a duplicate.
With base R's aggregate:
code :
# Sum every column except the first within each record, ignoring NA
aggregate(df[2:ncol(df)], by = df["record"], sum, na.rm = TRUE)

#### OUTPUT ####

  record Var1 var2 var3 var4 var5
1      2    1    1    1    1    1
2      3    2    2    2    2    2
3      4    1    1    0    0    0
4      5    0    2    1    1    1
library(dplyr)

# across() replaces the superseded summarize_all()
df %>% group_by(record) %>% summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))


#### OUTPUT ####
# A tibble: 4 x 6
  record  Var1  var2  var3  var4  var5
   <int> <int> <int> <int> <int> <int>
1      2     1     1     1     1     1
2      3     2     2     2     2     2
3      4     1     1     0     0     0
4      5     0     2     1     1     1
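
The input `df` is not shown above, so here is a small self-contained sketch of the aggregate approach on hypothetical data. The data frame `dat` and its values are assumptions. The column names mirror the output above.

```r
# Hypothetical data: each record's values are spread over duplicate rows
dat <- data.frame(record = c(2, 2, 3, 3),
                  Var1   = c(1, NA, NA, 2),
                  var2   = c(NA, 1, 2, NA))

# Sum each column within a record, ignoring NA,
# so that every record ends up on a single row
collapsed <- aggregate(dat[-1], by = dat["record"], FUN = sum, na.rm = TRUE)
collapsed
```

Summing is only a safe way to combine rows when each variable is filled in on at most one row per record. If the same variable is filled in on several rows, the values are added together.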

Duplicated rows: select rows based on criteria and store duplicated values


By : user3444025
Date : March 29 2020, 07:55 AM
Using data.table, a dcast based on rowid(ID, Year) after ordering by Val2 descending gets you there, with the exception of the column names. The "_1" columns are the "keep" columns, and the "_2" columns are the "del" columns.
code :
library(data.table)
setDT(df)

setorder(df, ID, Year, -Val2)

out <- 
  dcast(df, ID + Year ~ rowid(ID, Year), value.var = c('treatment', 'Val', 'Val2'))
out
#       ID Year treatment_1 treatment_2 Val_1 Val_2 Val2_1 Val2_2
# 1: Alpha 1970           B           A     0     0   2.34   0.00
# 2: Alpha 1980           C        <NA>     0    NA   1.30     NA
# 3: Alpha 1990           D        <NA>     1    NA   0.00     NA
# 4:  Beta 1970           E        <NA>     0    NA   0.00     NA
# 5:  Beta 1980           G           F     0     1   3.20   2.34
# 6:  Beta 1990           H        <NA>     1    NA   1.30     NA
# Drop the "_1" suffix from the "keep" columns
setnames(out, function(x) gsub('(.*)_1', '\\1', x))
# Prefix the remaining numbered columns with "del_"
setnames(out, function(x) gsub('(.*_\\d+)', 'del_\\1', x))
out
#       ID Year treatment del_treatment_2 Val del_Val_2 Val2 del_Val2_2
# 1: Alpha 1970         B               A   0         0 2.34       0.00
# 2: Alpha 1980         C            <NA>   0        NA 1.30         NA
# 3: Alpha 1990         D            <NA>   1        NA 0.00         NA
# 4:  Beta 1970         E            <NA>   0        NA 0.00         NA
# 5:  Beta 1980         G               F   0         1 3.20       2.34
# 6:  Beta 1990         H            <NA>   1        NA 1.30         NA
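
The step that does the real work above is the within-group counter from rowid(ID, Year) after sorting. A base R sketch of just that step, assuming hypothetical data (the data frame `dat`, the column `rank` and the values are illustrative, not the original input):

```r
# Hypothetical data with one duplicated (ID, Year) pair
dat <- data.frame(ID   = c("Alpha", "Alpha", "Alpha", "Beta"),
                  Year = c(1970, 1970, 1980, 1970),
                  Val2 = c(0, 2.34, 1.3, 0),
                  stringsAsFactors = FALSE)

# Order by ID, Year, then Val2 descending
dat <- dat[order(dat$ID, dat$Year, -dat$Val2), ]

# Counter within each (ID, Year) group, playing the role of rowid(ID, Year)
dat$rank <- ave(seq_len(nrow(dat)), dat$ID, dat$Year, FUN = seq_along)

# rank 1 marks the row to keep, higher ranks mark the duplicates
keep <- dat[dat$rank == 1, ]
keep
```

Rows with `rank` greater than 1 correspond to the "del" columns in the reshaped output above.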