How to obtain an incidence matrix with groupings in R

Posted on 2025-02-04 08:32:52

I have to study the collaboration network between scientific institutes through the number of papers and articles they have published together. Every article has a unique code that identifies it. If two (or more) institutes have the same ArticleCode in their databases, that implies they collaborated on the publication of that article.

Here's how the dataset is organized:

Institute name | ArticleName | ArticleCode | Area | Pages | NumberofCitations | ...

I have around 90 institutes in total, so 90 CSV files like the one above. The result I need is a single table with this information:

Institute#1 | Institute#2 | TotArticles | TotArea#1 | TotArea#2 | TotArea#3 |...

So, for each pair of institutes that have collaborated, I need their names, the total number of articles released together (TotArticles), and the breakdown by Area in the other columns (5 in total: Arts & Humanities; Life Sciences & Biomedicine; Physical Sciences; Social Sciences; Technology).

At the beginning I thought it could be done by merging the CSVs by ArticleCode, but I soon realized that to analyze all the possible combinations between the institutes I would have to repeat the merge step about 4000 times (90 institutes give 90 × 89 / 2 = 4005 pairs), which is a huge waste of time.

Maybe it could be done faster if I merged the publications of every institute into one single CSV and then operated on that to obtain the final table. Adding up all the rows of the 90 CSVs gives something like 1,300,000 rows; I don't know whether it is technically feasible to operate on a table of that size.

An example is below.

I hope I have explained the problem clearly enough; otherwise just let me know in a comment.


Starting from something like this...

Institute | ArticleCode | Area             | Pages | ...  
In.AAA    | articleX    | Arts & Humanities| 90    | ...
In.AAA    | articleP    | Technology       | 10    | ...
In.BBB    | articleZ    | ...              | 907   | ...
In.BBB    | articleX    | Arts & Humanities| 90    | ...
In.CCC    | articleF    | Arts & Humanities| 89    | ...
In.DDD    | articleP    | Technology       | 10    | ...
In.DDD    | articleX    | Arts & Humanities| 90    | ...

to this:

Institute#1 | Institute#2 | TotArticles | TotArts & Humanities | TotTechnology
In.AAA      | In.BBB      | 1           | 1                    | 0
In.AAA      | In.DDD      | 2           | 1                    | 1
In.BBB      | In.DDD      | 1           | 1                    | 0

Comments (1)

百思不得你姐 2025-02-11 08:32:52

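You can convert the Institute column to a factor (so that each institute has an integer code), and then join the table to itself with a non-equi join, using data.table:

library(data.table)

# Convert df to a data.table by reference and turn Institute into a factor.
setDT(df)[, Institute:=as.factor(Institute)]
result = dcast(
  # Self-join on ArticleCode; the strict inequality keeps each pair of institutes once.
  df[df[, Institute2:=Institute], on=.(ArticleCode, Institute>Institute2), nomatch=0],
  # One row per pair of institutes, one count column per Area.
  Institute+Institute2~Area,fun.aggregate = length
)[, TotArticles:=rowSums(.SD), .SDcols = -c(1,2)]  # total over all Area columns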

Output:

   Institute Institute2 Arts & Humanities Technology TotArticles
1:    In.AAA     In.BBB                 1          0           1
2:    In.AAA     In.DDD                 1          1           2
3:    In.BBB     In.DDD                 1          0           1
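
As a cross-check of what the self-join computes, the same pair counts can be reproduced on the toy data with base R only. This is a minimal sketch, not a replacement for the data.table code: the helper objects `left`, `right`, `pairs` and `counts` are names introduced here for illustration, and `merge()` is likely to be slower than data.table on the full data set.

```r
# Toy rows copied from the example in the question.
df <- data.frame(
  Institute   = c("In.AAA", "In.AAA", "In.BBB", "In.BBB", "In.CCC", "In.DDD", "In.DDD"),
  ArticleCode = c("articleX", "articleP", "articleZ", "articleX", "articleF",
                  "articleP", "articleX"),
  Area        = c("Arts & Humanities", "Technology", "...", "Arts & Humanities",
                  "Arts & Humanities", "Technology", "Arts & Humanities"),
  stringsAsFactors = FALSE
)

# Two copies of the table; the second carries the partner institute.
left  <- df[, c("ArticleCode", "Institute", "Area")]
right <- data.frame(ArticleCode = df$ArticleCode, Institute2 = df$Institute,
                    stringsAsFactors = FALSE)

# Self-join on ArticleCode: every combination of institutes sharing an article.
pairs <- merge(left, right, by = "ArticleCode")

# Keep each unordered pair once and drop an institute paired with itself.
pairs <- pairs[pairs$Institute < pairs$Institute2, ]

# Count articles per pair and Area, then add the row total.
tab    <- table(paste(pairs$Institute, pairs$Institute2, sep = " | "), pairs$Area)
counts <- cbind(tab, TotArticles = rowSums(tab))
print(counts)
```

Articles listed by a single institute (articleZ and articleF here) produce no pair, which is why In.CCC does not appear in the result.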

Input:

structure(list(Institute = c("In.AAA", "In.AAA", "In.BBB", "In.BBB", 
"In.CCC", "In.DDD", "In.DDD"), ArticleCode = c("articleX", "articleP", 
"articleZ", "articleX", "articleF", "articleP", "articleX"), 
    Area = c("Arts & Humanities", "Technology", "...", "Arts & Humanities", 
    "Arts & Humanities", "Technology", "Arts & Humanities"), 
    Pages = c("90", "10", "907", "90", "89", "10", "90")), class = "data.frame", row.names = c(NA, 
-7L))
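
The code above assumes the data already sit in a single data frame `df`, while the question starts from about 90 separate CSV files. Below is a hedged sketch of stacking them with data.table. The folder `csv_dir` and the file names are hypothetical, and two tiny files holding a subset of the toy rows are written to a temporary directory only so that the sketch runs on its own.

```r
library(data.table)

# Hypothetical folder holding one CSV per institute (an assumption for this sketch).
csv_dir <- tempfile("institutes_")
dir.create(csv_dir)

# Write two tiny example files so that the sketch is self-contained.
fwrite(data.table(Institute   = "In.AAA",
                  ArticleCode = c("articleX", "articleP"),
                  Area        = c("Arts & Humanities", "Technology")),
       file.path(csv_dir, "In.AAA.csv"))
fwrite(data.table(Institute   = "In.BBB",
                  ArticleCode = "articleX",
                  Area        = "Arts & Humanities"),
       file.path(csv_dir, "In.BBB.csv"))

# Read every CSV in the folder and stack the rows into one table.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
df <- rbindlist(lapply(files, fread, sep = ","), use.names = TRUE, fill = TRUE)
print(df)
```

With `fill = TRUE`, a column missing from some files is filled with NA rather than raising an error. A stacked table of roughly 1.3 million rows should normally fit in memory on an ordinary machine, although that depends on how many columns are kept.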

Update (6/6/22)

OP now wants to add an additional column, say Year, that indicates the year of the article, and to split the rows by that value as well. Only two small changes are needed:

  1. Add Year to the left-hand side of the formula in dcast
  2. Expand the columns that are excluded from the rowSums from c(1,2) to c(1:3) (i.e. from the first two columns to the first three columns)

setDT(df)[, Institute:=as.factor(Institute)]
result = dcast(
  df[df[, Institute2:=Institute], on=.(ArticleCode, Institute>Institute2), nomatch=0],
  # Year now also identifies a row, alongside the two institutes.
  Year + Institute+Institute2~Area,fun.aggregate = length
)[, TotArticles:=rowSums(.SD), .SDcols = -c(1:3)]  # skip the three id columns

Output:

Key: <Year, Institute, Institute2>
    Year Institute Institute2 Arts & Humanities Technology TotArticles
   <num>    <fctr>     <fctr>             <int>      <int>       <num>
1:  2005    In.AAA     In.BBB                 1          0           1
2:  2005    In.AAA     In.DDD                 1          0           1
3:  2005    In.BBB     In.DDD                 1          0           1
4:  2006    In.AAA     In.DDD                 0          1           1

New Input:

structure(list(Institute = c("In.AAA", "In.AAA", "In.BBB", "In.BBB", 
"In.CCC", "In.DDD", "In.DDD"), ArticleCode = c("articleX", "articleP", 
"articleZ", "articleX", "articleF", "articleP", "articleX"), 
    Area = c("Arts & Humanities", "Technology", "...", "Arts & Humanities", 
    "Arts & Humanities", "Technology", "Arts & Humanities"), 
    Pages = c("90", "10", "907", "90", "89", "10", "90"), Year = c(2005, 
    2006, 2005, 2005, 2005, 2006, 2005)), row.names = c(NA, -7L
), class = "data.frame")
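
Since the title asks about an incidence matrix, the pair totals can also be obtained from one. This is a sketch of a different technique from the join above, using base R only: build a binary article-by-institute incidence matrix and multiply its transpose by itself with `crossprod()`. The names `inc` and `co` are introduced here for illustration.

```r
# Toy rows copied from the example in the question (only the columns needed here).
df <- data.frame(
  Institute   = c("In.AAA", "In.AAA", "In.BBB", "In.BBB", "In.CCC", "In.DDD", "In.DDD"),
  ArticleCode = c("articleX", "articleP", "articleZ", "articleX", "articleF",
                  "articleP", "articleX"),
  stringsAsFactors = FALSE
)

# Incidence matrix: one row per article, one column per institute,
# 1 if the institute lists the article and 0 otherwise.
inc <- (unclass(table(df$ArticleCode, df$Institute)) > 0) * 1L

# t(inc) %*% inc: entry [i, j] is the number of articles shared by i and j;
# the diagonal holds each institute's own number of distinct articles.
co <- crossprod(inc)
print(co)
```

The off-diagonal entries match the TotArticles column of the first output above. Splitting by Area would need one such matrix per Area. On the full data a dense article-by-institute table may become large, so a sparse representation could be preferable there.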