How to create a matrix of movie ratings in R?

Posted 2024-12-27 19:54:04


Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73

It contains ratings in a file formatted as
userID::movieID::rating::timestamp

Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to a movie (if any).

For example, if the data file contains

1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14

Then the output matrix would look like:

UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4

So, is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.


Comments (3)

︶葆Ⅱㄣ 2025-01-03 19:54:04


You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).

d <- read.delim(
  "u1.base",
  header = FALSE,  # u1.base has no header row
  col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
d <- dcast( d, user ~ film, value.var = "rating" )
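If reshape2 is not available, base R's reshape function can perform the same long-to-wide pivot. A minimal sketch, using a hand-built data frame holding the five sample rows from the question:

```r
# The question's five sample ratings, entered by hand for illustration
d <- data.frame(user   = c(1, 2, 1, 2, 3),
                film   = c(1, 2, 2, 1, 3),
                rating = c(1, 2, 3, 5, 4))

# One row per user, one rating.<film> column per film, NA where unrated
wide <- reshape(d, idvar = "user", timevar = "film", direction = "wide")
```

The resulting columns are named rating.1, rating.2, and so on, after the film IDs.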

If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be only one character.
If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.

d <- read.delim("a", header = FALSE)  # read each line as a single column
d <- as.character( d[,1] )   # vector of strings
d <- strsplit( d, "::" )     # List of vectors of strings of characters
d <- lapply( d, as.numeric ) # List of vectors of numbers
d <- do.call( rbind, d )     # Matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
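An alternative sketch for the same preprocessing, substituting the separator in memory and parsing in one step. Here the sample data is written to a temporary file so the example is self-contained; replace that with your actual ratings file:

```r
# Write the question's sample data to a temporary file for illustration
f <- tempfile()
writeLines(c("1::1::1::10", "2::2::2::11", "1::2::3::12",
             "2::1::5::13", "3::3::4::14"), f)

# Turn "::" into tabs in memory, then parse as a regular table
d <- read.table(text = gsub("::", "\t", readLines(f)), sep = "\t",
                col.names = c("user", "movie", "rating", "timestamp"))
```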
拥有 2025-01-03 19:54:04


From the web site pointed to in a previous question, it appears that you want to represent a matrix of roughly 10,000 users by 72,000 films:

> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb

which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be ok too. But see the end of the response for an important caveat!

I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in

what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)

the 'NULL' drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (original scores are 1 to 5 in increments of 1/2).

x <- list(User=factor(x$User), Film=factor(x$Film),
          Rating=as.integer(2 * x$Rating))

I then allocated the matrix

ratings <- matrix(NA_integer_,
                  nrow = length(levels(x$User)),
                  ncol = length(levels(x$Film)),
                  dimnames = list(levels(x$User), levels(x$Film)))

and used the fact that a two-column matrix can be used to index another matrix

ratings[cbind(x$User, x$Film)] <- x$Rating

This is the step where memory use peaks. I'd then remove the unneeded variable

rm(x)

The gc() function tells me how much memory I've used...

> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    140609    7.6     407500   21.8    350000   18.7
Vcells 373177663 2847.2  450519582 3437.2 408329775 3115.4

... a little over 3 GB, so that's good.

Having done that, you'll now run into serious problems. kmeans (from your response to an earlier answer) will not work with missing values

> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?
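As a footnote, the two-column indexing used above to fill the ratings matrix can be seen on a tiny, self-contained example with made-up numbers:

```r
m <- matrix(NA_integer_, nrow = 3, ncol = 3)
idx <- cbind(c(1L, 2L, 3L),   # row (user) indices
             c(3L, 1L, 2L))   # column (film) indices
m[idx] <- c(10L, 20L, 30L)    # each (row, col) pair receives one value
```

Every other cell of m stays NA, exactly as in the full-size ratings matrix.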

落花随流水 2025-01-03 19:54:04


Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.

Just create a 3-column coordinate object list, i.e. in the form (i, j, value), say in a data.frame named myDF. Then, execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)) - you need to decide the number of rows and columns, otherwise the maximum indices will be used to decide the size of the matrix.
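A minimal sketch with the five sample ratings from the question (the Matrix package is a recommended package that ships with R):

```r
library(Matrix)

# (i, j, x) = (user, movie, rating) triplets from the question's example
myDF <- data.frame(i = c(1, 2, 1, 2, 3),
                   j = c(1, 2, 2, 1, 3),
                   x = c(1, 3, 2, 5, 4))

mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x,
                            dims = c(3, 3))
```

Note that entries absent from the triplet list are stored as structural zeros rather than NA, so a rating of 0 and "not rated" are indistinguishable in this representation.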

It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
