如何在r中的两个多类别列之间找到差异/setDiff()

发布于 2025-02-01 18:19:22 字数 7678 浏览 2 评论 0原文

我将数据放在两个数据表中,如下(比此处显示的列更多) -

dataTable 1 = data_sale

site idid countryid
1000375476in id canadaug10000000000000000 wisd
country加拿大UGD12895
1000706152瑞士UG10000-WISD
UG80000-1000797366
ITALYUG10000WISD
ITALY>UG12785

DATATABL

-NTCD6
1000797366加拿大UGD12895
10007973661000706152-
1000797366意大利瑞士
G意大利UG12210

我想计算Data_sale中所有站点ID的唯一产品ID的集合差,请保留所有行。

这是我到目前为止所做的 -

  1. 对于两个数据表,我都创建了一个带有所有独特产品的新列。
data_sale <-
  data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","), 
              keyby = c("Site Id")]
data_licenses <-
  data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")), 
              keyby = c("Site Id")]
  1. 现在与数据_LICENS合并的数据_SALE
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)

现在,合并的数据表现 -

网站ID国家产品IDsimel_products.data_salesimel_products.data_licenses
1000375476加拿大UG100000000000000000000-WISD UG1000000000000000000000000-WISD,UGD12895
-WISD128955128954UG10000wist 12895UG10000-WISD ,UGD12895
1000706152瑞士UG10000-WISD-WISD,UG80000-NTCD-GUG80000
-NTCD-GUG100000000000000000000000000-210
1000797366意大利UG12210WISDNA
1000706152瑞士UG10000UG10000-WISD,UG12210UG12785,UG12210

问题在于我的最后一步,我想要一个新的列显示data_sale和data_licenss产品之间的区别,看起来应该像这样 -

站点ID乡村产品IDsimels data_sale_salesimelor_products.data_licenss差异
1000375476加拿大ug10000 wisd ug1000000000000 wisd,ug1000000 wisd,ug1000000 wisd, UGD12895,UGD12895NA
1000375476加拿大UGD12895WISD,UGD12895UGD12895NA
100070706152 1000706152瑞士UGD10000 -WISDUG10000- WISDUG10000
-UG80000-NTCD -GUG10000-WISD,UG80000-NTCD-GNAUG10000-WISD,UG80000-NTCD-G
1000797366ITALY-WISDUG10000-WISD,UG122210UG10000-
UG12785WIG12211211101211121101210 210UG10000UG12210UG10000- WISD

关于如何实现它的任何潜在客户都会有很大的帮助。谢谢!

以下是使用DPUT()的数据

structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152", 
"1000706152", "1000797366", "1000797366"), Country = c("Canada", 
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD", 
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", 
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895", 
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G", 
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895", 
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)

I have my data in two data tables as below (with many more columns than just shown here) -

DataTable 1 = data_sale

Site IdCountryProduct ID
1000375476CanadaUG10000-WISD
1000375476CanadaUGD12895
1000706152SwitzerlandUG10000-WISD
1000706152SwitzerlandUG80000-NTCD-G
1000797366ItalyUG10000-WISD
1000797366ItalyUG12210

DataTable 2 = data_licenses

Site IdCountryProduct ID
1000375476CanadaUG10000-WISD
1000375476CanadaUGD12895
1000797366ItalyUG12785
1000797366ItalyUG12210

I want to calculate the set difference for unique Product ID for all the Site Id in data_sale, keeping all rows.

Here is what I've done so far -

  1. For both of the data tables, I've created a new column with all unique products in it.
data_sale <-
  data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","), 
              keyby = c("Site Id")]
data_licenses <-
  data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")), 
              keyby = c("Site Id")]
  1. Left Merged data_sale with data_licenses
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)

Now the merged datatable look like this -

Site IdCountryProduct IDunique_products.data_saleunique_products.data_licenses
1000375476CanadaUG10000-WISDUG10000-WISD,UGD12895UG10000-WISD,UGD12895
1000375476CanadaUGD12895UG10000-WISD,UGD12895UG10000-WISD,UGD12895
1000706152SwitzerlandUG10000-WISDUG10000-WISD,UG80000-NTCD-GNA
1000706152SwitzerlandUG80000-NTCD-GUG10000-WISD,UG80000-NTCD-GNA
1000797366ItalyUG10000-WISDUG10000-WISD,UG12210UG12785,UG12210
1000797366ItalyUG12210UG10000-WISD,UG12210UG12785,UG12210

The problem is with my final step where I want a new column showing difference between the products of data_sale and data_licenses, it should look like this -

Site IdCountryProduct IDunique_products.data_saleunique_products.data_licensesdifference
1000375476CanadaUG10000-WISDUG10000-WISD,UGD12895UG10000-WISD,UGD12895NA
1000375476CanadaUGD12895UG10000-WISD,UGD12895UG10000-WISD,UGD12895NA
1000706152SwitzerlandUG10000-WISDUG10000-WISD,UG80000-NTCD-GNAUG10000-WISD,UG80000-NTCD-G
1000706152SwitzerlandUG80000-NTCD-GUG10000-WISD,UG80000-NTCD-GNAUG10000-WISD,UG80000-NTCD-G
1000797366ItalyUG10000-WISDUG10000-WISD,UG12210UG12785,UG12210UG10000-WISD
1000797366ItalyUG12210UG10000-WISD,UG12210UG12785,UG12210UG10000-WISD

Any leads on how it can be achieved will be of great help. Thanks!

Below is the data using dput() for the merged datatable

structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152", 
"1000706152", "1000797366", "1000797366"), Country = c("Canada", 
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD", 
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", 
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895", 
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G", 
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895", 
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

触ぅ动初心 2025-02-08 18:19:22

这将使不在data_sale中的产品中获取data_license by 网站ID中的产品。与其串联唯一的产品ID,不如将唯一列作为字符向量工作更容易。

library(data.table)

data_licenses <- data.table(`Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
                            Country = c("Canada", "Canada", "Italy", "Italy"),
                            `Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"))
data_sale <- data.table(`Site Id` = c("1000375476", "1000375476", "1000706152", "1000706152", "1000797366", "1000797366"),
                        Country = c("Canada", "Canada", "Switzerland", "Switzerland", "Italy", "Italy"),
                        `Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", "UG12210"))

data_unique <- data_sale[
  , .(unique_products.data_sale = .(unique(`Product ID`))), c("Site Id", "Country")
][
  data_licenses[, .(unique_products = .(unique(`Product ID`))), "Site Id"],
  unique_products.data_licenses := i.unique_products,
  on = "Site Id"
][
  , difference := lapply(.I, function(i) setdiff(unique_products.data_sale[[i]], unique_products.data_licenses[[i]]))
]
print(data_unique)
#>       Site Id     Country   unique_products.data_sale unique_products.data_licenses                  difference
#> 1: 1000375476      Canada       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                            
#> 2: 1000706152 Switzerland UG10000-WISD,UG80000-NTCD-G                               UG10000-WISD,UG80000-NTCD-G
#> 3: 1000797366       Italy        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD

This will get the products in data_sale that are not in data_license by Site Id. Instead of concatenating the unique product IDs, it's easier to work with the unique columns as character vectors.

library(data.table)

data_licenses <- data.table(`Site Id` = c("1000375476", "1000375476", "1000797366", "1000797366"),
                            Country = c("Canada", "Canada", "Italy", "Italy"),
                            `Product ID` = c("UG10000-WISD", "UGD12895", "UG12785", "UG12210"))
data_sale <- data.table(`Site Id` = c("1000375476", "1000375476", "1000706152", "1000706152", "1000797366", "1000797366"),
                        Country = c("Canada", "Canada", "Switzerland", "Switzerland", "Italy", "Italy"),
                        `Product ID` = c("UG10000-WISD", "UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD", "UG12210"))

data_unique <- data_sale[
  , .(unique_products.data_sale = .(unique(`Product ID`))), c("Site Id", "Country")
][
  data_licenses[, .(unique_products = .(unique(`Product ID`))), "Site Id"],
  unique_products.data_licenses := i.unique_products,
  on = "Site Id"
][
  , difference := lapply(.I, function(i) setdiff(unique_products.data_sale[[i]], unique_products.data_licenses[[i]]))
]
print(data_unique)
#>       Site Id     Country   unique_products.data_sale unique_products.data_licenses                  difference
#> 1: 1000375476      Canada       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                            
#> 2: 1000706152 Switzerland UG10000-WISD,UG80000-NTCD-G                               UG10000-WISD,UG80000-NTCD-G
#> 3: 1000797366       Italy        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
转角预定愛 2025-02-08 18:19:22

可能有一种方法可以尝试结合一些内置功能,但是一个具有简单自定义功能的示例:

find_differences = function(x,y){
  # x: column list of strings we want to compare to
  # y: other column list
  x = strsplit(x,',') # transform strings to lists
  y = strsplit(y,',')
  
  differences = list()

  for(i in seq(1,length(x))){  # for every row (nested-list)
    
    if(identical(x[[i]],y[[i]])){
      row_diff = NA
    }
    else{
    row_diff = paste(x[[i]][ ! x[[i]] %in% y[[i]] ],collapse=',')
    }
    
    differences = c(differences,row_diff)
  }
  return(differences)
}

使用您的示例:

example = rename(example, 
                 unique_products.data_sale = unique_products.x,
                 unique_products.data_licenses = unique_products.y)

example$difference = find_differences(example$unique_products.data_sale, example$unique_products.data_license)

> example
      Site Id     Country     Product ID   unique_products.data_sale unique_products.data_licenses                  difference
1: 1000375476      Canada   UG10000-WISD       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
2: 1000375476      Canada       UGD12895       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
3: 1000706152 Switzerland   UG10000-WISD UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
4: 1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
5: 1000797366       Italy   UG10000-WISD        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
6: 1000797366       Italy        UG12210        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD

Probably there is a way trying to combine some built-in functions, but an example with a simple custom function:

find_differences = function(x,y){
  # x: column list of strings we want to compare to
  # y: other column list
  x = strsplit(x,',') # transform strings to lists
  y = strsplit(y,',')
  
  differences = list()

  for(i in seq(1,length(x))){  # for every row (nested-list)
    
    if(identical(x[[i]],y[[i]])){
      row_diff = NA
    }
    else{
    row_diff = paste(x[[i]][ ! x[[i]] %in% y[[i]] ],collapse=',')
    }
    
    differences = c(differences,row_diff)
  }
  return(differences)
}

With your example:

example = rename(example, 
                 unique_products.data_sale = unique_products.x,
                 unique_products.data_licenses = unique_products.y)

example$difference = find_differences(example$unique_products.data_sale, example$unique_products.data_license)

> example
      Site Id     Country     Product ID   unique_products.data_sale unique_products.data_licenses                  difference
1: 1000375476      Canada   UG10000-WISD       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
2: 1000375476      Canada       UGD12895       UG10000-WISD,UGD12895         UG10000-WISD,UGD12895                          NA
3: 1000706152 Switzerland   UG10000-WISD UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
4: 1000706152 Switzerland UG80000-NTCD-G UG10000-WISD,UG80000-NTCD-G                          <NA> UG10000-WISD,UG80000-NTCD-G
5: 1000797366       Italy   UG10000-WISD        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
6: 1000797366       Italy        UG12210        UG10000-WISD,UG12210               UG12785,UG12210                UG10000-WISD
佞臣 2025-02-08 18:19:22

这种快速方法如何获得每个站点的差异?然后,您可以将结果合并回任何具有站点ID的帧:

data_licenses[, .(licen_p = .(.(`Product ID`))), by = `Site Id`] %>% 
  .[data_sale[, .(sale_p= .(.(`Product ID`))), by=`Site Id`],on=.(`Site Id`)] %>% 
  .[,.(difference = toString(unlist(setdiff(sale_p, licen_p)))), by=`Site Id`]

输出:

      Site Id                   difference
1: 1000375476                             
2: 1000706152 UG10000-WISD, UG80000-NTCD-G
3: 1000797366        UG10000-WISD, UG12210

How about this quick approach to get the difference for each site; you can then merge the result back to any frame that has Site Id:

data_licenses[, .(licen_p = .(.(`Product ID`))), by = `Site Id`] %>% 
  .[data_sale[, .(sale_p= .(.(`Product ID`))), by=`Site Id`],on=.(`Site Id`)] %>% 
  .[,.(difference = toString(unlist(setdiff(sale_p, licen_p)))), by=`Site Id`]

Output:

      Site Id                   difference
1: 1000375476                             
2: 1000706152 UG10000-WISD, UG80000-NTCD-G
3: 1000797366        UG10000-WISD, UG12210
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文