r levenshtein-distance cosine-similarity stringdist

R

Posted on 2025-01-26 19:35:13

I have two data frames, DF1 and DF2. One of them is very large.

I've created example versions of DF1 and DF2 like this:

library(tidyverse)

A<-rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000)
DF1<-data.frame(A)

B<-rep(c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics'), 1000)
DF2<-data.frame(B)

I want to compute the cosine similarity and the Levenshtein distance between each word in DF1 and each word in DF2, and store the results in a data frame. To do that in a "tidy" way, I used the fuzzyjoin package. I'm trying something like this:

library(fuzzyjoin)
DF1 <- DF1 %>% stringdist_full_join (DF2, by = c('A' = 'B'), 
                              method = "cosine", 
                              distance_col = "distance Cos")

This works fine with small datasets. The problem is that DF1 and DF2 are large, and R fails with the message: Error: cannot allocate vector of size N Gb.
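To put a number on the allocation failure: a full join has to consider every (A, B) pair, and with the example data that is 5,000,000 × 5,000 = 2.5 × 10^10 pairs. The sketch below is one possible way around it, not a confirmed fix: it assumes the distances depend only on the string values, so they can be computed once on the distinct strings with `stringdist::stringdistmatrix()` and kept as a small lookup table. The names `ua`, `ub`, `cos_mat`, `lookup` and the column name `distance_cos` are made up for illustration.

```r
library(stringdist)

A <- rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000)
B <- rep(c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics'), 1000)

# Number of pairs a full join has to consider (as.numeric avoids integer overflow)
n_pairs <- as.numeric(length(A)) * length(B)

# Assumption: only the distinct strings matter, so reduce both sides first
ua <- unique(A)
ub <- unique(B)

# 5 x 5 matrix of cosine distances on character 1-gram profiles
cos_mat <- stringdistmatrix(ua, ub, method = "cosine", q = 1,
                            useNames = "strings")

# Long-format lookup table; as.vector() walks the matrix column by column,
# so A cycles fastest and B changes once per column
lookup <- data.frame(
  A = rep(ua, times = length(ub)),
  B = rep(ub, each = length(ua)),
  distance_cos = as.vector(cos_mat)
)
```

With real data the lookup table has one row per pair of distinct strings, so this only helps when the columns contain many repeated values, as they do in the example.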

Is there a simple way to solve this problem? Is it possible to calculate the Levenshtein distance too?

I'd appreciate any help. Thanks!
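On the Levenshtein part: the stringdist package, which fuzzyjoin builds on, accepts `method = "lv"` for Levenshtein distance. A minimal sketch, assuming it is acceptable to work on the distinct team names only (`ua`, `ub` and `lv_mat` are illustrative names):

```r
library(stringdist)

# Distinct values from the example columns A and B
ua <- c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns')
ub <- c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics')

# Levenshtein distance between every distinct A and every distinct B;
# rows are named after ua, columns after ub
lv_mat <- stringdistmatrix(ua, ub, method = "lv", useNames = "strings")

# Look up a single pair by name
lv_mat["Suns", "Suns"]
```

Each entry is an integer edit count, and identical strings have distance 0.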
