如何在r中的两个多类别列之间找到差异/setDiff()
我将数据放在两个数据表中,如下(比此处显示的列更多) -
dataTable 1 = data_sale
site id | id country | id |
---|---|---|
1000375476 | in id canada | ug10000000000000000 wisd |
country | 加拿大 | UGD12895 |
1000706152 | 瑞士 | UG10000-WISD |
UG80000 | - | 1000797366 |
ITALY | UG10000 | WISD |
ITALY | > | UG12785 |
DATATABL
- | NTCD | 6 |
---|---|---|
1000797366 | 加拿大 | UGD12895 |
1000797366 | 1000706152 | - |
1000797366 | 意大利 | 瑞士 |
G | 意大利 | UG12210 |
我想计算Data_sale中所有站点ID的唯一产品ID的集合差,请保留所有行。
这是我到目前为止所做的 -
- 对于两个数据表,我都创建了一个带有所有独特产品的新列。
data_sale <-
data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","),
keyby = c("Site Id")]
data_licenses <-
data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")),
keyby = c("Site Id")]
- 现在与数据_LICENS合并的数据_SALE
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)
现在,合并的数据表现 -
网站ID | 国家 | 产品ID | simel_products.data_sale | simel_products.data_licenses |
---|---|---|---|---|
1000375476 | 加拿大 | UG100000000000000000000 | -WISD UG1000000000000000000000000 | -WISD,UGD12895 |
- | WISD128955128954 | UG10000 | wist 12895 | UG10000-WISD ,UGD12895 |
1000706152 | 瑞士 | UG10000-WISD | -WISD,UG80000-NTCD-G | UG80000 |
- | NTCD | -G | UG100000000000000000000000000- | 210 |
1000797366 | 意大利 | UG12210 | WISD | NA |
1000706152 | 瑞士 | UG10000 | UG10000-WISD,UG12210 | UG12785,UG12210 |
问题在于我的最后一步,我想要一个新的列显示data_sale和data_licenss产品之间的区别,看起来应该像这样 -
站点ID | 乡村 | 产品ID | simels data_sale_sale | simelor_products.data_licenss | 差异 |
---|---|---|---|---|---|
1000375476 | 加拿大 | ug10000 wisd ug1000000000000 wisd, | ug1000000 wisd,ug1000000 wisd, UGD12895 | ,UGD12895 | NA |
1000375476 | 加拿大 | UGD12895 | WISD,UGD12895 | UGD12895 | NA |
100070706152 1000706152 | 瑞士 | UGD10000 - | WISD | UG10000- WISD | UG10000 |
- | , | UG80000-NTCD -G | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
1000797366 | ITALY | -WISD | UG10000-WISD,UG122210 | UG10000 | - |
UG12785 | WIG12211211101211121101210 210 | UG10000 | , | UG12210 | UG10000- WISD |
关于如何实现它的任何潜在客户都会有很大的帮助。谢谢!
以下是使用DPUT()的数据
structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152",
"1000706152", "1000797366", "1000797366"), Country = c("Canada",
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD",
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD",
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895",
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G",
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895",
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)
I have my data in two data tables as below (with many more columns than just shown here) -
DataTable 1 = data_sale
Site Id | Country | Product ID |
---|---|---|
1000375476 | Canada | UG10000-WISD |
1000375476 | Canada | UGD12895 |
1000706152 | Switzerland | UG10000-WISD |
1000706152 | Switzerland | UG80000-NTCD-G |
1000797366 | Italy | UG10000-WISD |
1000797366 | Italy | UG12210 |
DataTable 2 = data_licenses
Site Id | Country | Product ID |
---|---|---|
1000375476 | Canada | UG10000-WISD |
1000375476 | Canada | UGD12895 |
1000797366 | Italy | UG12785 |
1000797366 | Italy | UG12210 |
I want to calculate the set difference for unique Product ID for all the Site Id in data_sale, keeping all rows.
Here is what I've done so far -
- For both of the data tables, I've created a new column with all unique products in it.
data_sale <-
data_sale[, `unique_products` := paste0(unique(`Product ID`), collapse = ","),
keyby = c("Site Id")]
data_licenses <-
data_licenses[, .(`unique_products` = paste0(unique(`Product ID`), collapse = ",")),
keyby = c("Site Id")]
- Left Merged data_sale with data_licenses
merge(data_sale, data_licenses, by = 'Site Id', all.x = TRUE)
Now the merged datatable look like this -
Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses |
---|---|---|---|---|
1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 |
1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA |
1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA |
1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 |
1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 |
The problem is with my final step where I want a new column showing difference between the products of data_sale and data_licenses, it should look like this -
Site Id | Country | Product ID | unique_products.data_sale | unique_products.data_licenses | difference |
---|---|---|---|---|---|
1000375476 | Canada | UG10000-WISD | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
1000375476 | Canada | UGD12895 | UG10000-WISD,UGD12895 | UG10000-WISD,UGD12895 | NA |
1000706152 | Switzerland | UG10000-WISD | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
1000706152 | Switzerland | UG80000-NTCD-G | UG10000-WISD,UG80000-NTCD-G | NA | UG10000-WISD,UG80000-NTCD-G |
1000797366 | Italy | UG10000-WISD | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
1000797366 | Italy | UG12210 | UG10000-WISD,UG12210 | UG12785,UG12210 | UG10000-WISD |
Any leads on how it can be achieved will be of great help. Thanks!
Below is the data using dput() for the merged datatable
structure(list(`Site Id` = c("1000375476", "1000375476", "1000706152",
"1000706152", "1000797366", "1000797366"), Country = c("Canada",
"Canada", "Switzerland", "Switzerland", "Italy", "Italy"), `Product ID` = c("UG10000-WISD",
"UGD12895", "UG10000-WISD", "UG80000-NTCD-G", "UG10000-WISD",
"UG12210"), unique_products.x = c("UG10000-WISD,UGD12895", "UG10000-WISD,UGD12895",
"UG10000-WISD,UG80000-NTCD-G", "UG10000-WISD,UG80000-NTCD-G",
"UG10000-WISD,UG12210", "UG10000-WISD,UG12210"), unique_products.y = c("UG10000-WISD,UGD12895",
"UG10000-WISD,UGD12895", NA, NA, "UG12785,UG12210", "UG12785,UG12210"
)), sorted = "Site Id", class = c("data.table", "data.frame"), row.names = c(NA,
-6L), .internal.selfref = <pointer: 0x556bb5c10a40>)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这将使不在
data_sale
中的产品中获取data_license
by网站ID
中的产品。与其串联唯一的产品ID,不如将唯一列作为字符向量工作更容易。This will get the products in
data_sale
that are not indata_license
bySite Id
. Instead of concatenating the unique product IDs, it's easier to work with the unique columns as character vectors.可能有一种方法可以尝试结合一些内置功能,但是一个具有简单自定义功能的示例:
使用您的示例:
Probably there is a way trying to combine some built-in functions, but an example with a simple custom function:
With your example:
这种快速方法如何获得每个站点的差异?然后,您可以将结果合并回任何具有
站点ID
的帧:输出:
How about this quick approach to get the difference for each site; you can then merge the result back to any frame that has
Site Id
:Output: