How to create a vector matrix of movie ratings using R?
Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73
It contains ratings in a file formatted as
userID::movieID::rating::timestamp
Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to a movie (if any).
For example, if the data file contains
1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14
then the output matrix would look like:
UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4
Is there a built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.
You can use the dcast function in the reshape2 package, but the resulting data.frame may be huge (and sparse).
If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be a single character. If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.
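As a rough illustration of the read, split, and reshape steps described above, here is a minimal sketch. The sample lines come from the question's example, and the names lines, parts, ratings, and wide are assumptions made for illustration; with a real file you would replace the inline vector with readLines() on your own path. The dcast step only runs if reshape2 is installed.

```r
# Sample lines in the userID::movieID::rating::timestamp format
# (illustrative data; a real file would be read with readLines()).
lines <- c("1::1::1::10", "2::2::2::11", "1::2::3::12",
           "2::1::5::13", "3::3::4::14")

# Split each line on the literal "::" and bind the pieces into a matrix.
parts <- do.call(rbind, strsplit(lines, "::", fixed = TRUE))

# Keep user, movie and rating; the timestamp column is not needed.
ratings <- data.frame(
  user   = as.integer(parts[, 1]),
  movie  = as.integer(parts[, 2]),
  rating = as.numeric(parts[, 3])
)

# Reshape to one row per user and one column per movie.
# Combinations with no rating are filled with NA.
if (requireNamespace("reshape2", quietly = TRUE)) {
  wide <- reshape2::dcast(ratings, user ~ movie, value.var = "rating")
  print(wide)
}
```

On the full data set the wide result has one column per movie, so it is worth trying this on a subset first.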
From the web site pointed to in a previous question, it appears that you want to represent the full user-by-film rating matrix, which should be 'easy' with the 8 GB you mention in another question. Also, the total length is less than the maximum vector length in R, so that should be OK too. But see the end of this response for an important caveat!
I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in; the 'NULL' drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factors, and Rating to integer() by doubling (original scores are 1 to 5 in increments of 1/2). I then allocated the matrix and used the fact that a two-column matrix can be used to index another matrix to fill in the ratings. This is the step where memory use is at its maximum. I'd then remove the unneeded variables.
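A minimal sketch of the factor-and-index approach described above, using a tiny made-up data frame in place of the real file. The names df, m, and idx and the sample values are assumptions for illustration, not the answerer's actual code.

```r
# Illustrative ratings with non-sequential IDs and half-point scores.
# With the real data this would come from read.table() on the tab-delimited
# file, with "NULL" in colClasses to skip the timestamp column.
df <- data.frame(
  User   = c(1, 2, 1, 2, 3),
  Film   = c(10, 20, 20, 10, 35),
  Rating = c(1, 2.5, 3, 5, 4)
)

# IDs are not sequential, so map them to consecutive codes via factors.
df$User <- factor(df$User)
df$Film <- factor(df$Film)

# Half-point scores become whole numbers when doubled,
# so they can be stored as integers.
df$Rating <- as.integer(2 * df$Rating)

# Allocate a user-by-film integer matrix filled with NA.
m <- matrix(NA_integer_,
            nrow = nlevels(df$User), ncol = nlevels(df$Film),
            dimnames = list(levels(df$User), levels(df$Film)))

# A two-column (row, column) matrix addresses one cell per row.
idx <- cbind(as.integer(df$User), as.integer(df$Film))
m[idx] <- df$Rating
print(m)
```

The stored values are doubled scores, so divide by 2 when reading a rating back out.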
The gc() function tells me how much memory I've used: a little over 3 GB, so that's good.
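The memory claims can be checked on your own platform with object.size() and gc(). This is only a small sketch: the vector length n is arbitrary, and the exact figures depend on the platform and R version.

```r
n <- 1e6  # arbitrary length, chosen only for illustration

# Compare storage for the same number of integer and double values.
size_int <- as.numeric(object.size(integer(n)))
size_num <- as.numeric(object.size(numeric(n)))
print(size_num / size_int)  # ratio of double to integer storage

# Drop a large temporary object, then ask R for a memory usage summary.
tmp <- numeric(n)
rm(tmp)
print(gc())
```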
Having done that, you'll now run into serious problems. kmeans (from your response to questions on an earlier answer) will not work with missing values, and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?
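To see the missing-value problem on a small scale, the sketch below passes a tiny made-up matrix containing one NA to kmeans() from the stats package and captures the resulting error. The matrix values and the names x, res, complete, and fit are assumptions for illustration.

```r
set.seed(1)  # kmeans() picks random starting centers

# Tiny made-up matrix with one missing value in the first column.
x <- matrix(c(1, 2, NA, 4, 9, 10,
              5, 6, 7, 8, 1, 2), ncol = 2)

# kmeans() stops with an error when the input contains NA;
# tryCatch() captures the message instead of aborting the script.
res <- tryCatch(kmeans(x, centers = 2),
                error = function(e) conditionMessage(e))
print(res)

# Rows without missing values can be clustered, at the cost of dropping data.
complete <- x[complete.cases(x), , drop = FALSE]
fit <- kmeans(complete, centers = 2)
print(fit$cluster)
```

In a real ratings matrix nearly every row has missing entries, so dropping incomplete rows is not a practical fix; it only demonstrates the failure mode.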
Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.
Just create a three-column list of coordinates, i.e. in the form (i, j, value), say in a data.frame named myDF. Then execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)). You need to decide the number of rows and columns; otherwise the maximum indices will be used to determine the size of the matrix.
It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
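A small runnable version of the call above, assuming the Matrix package is available (it ships with standard R installations). The sample values in myDF are taken from the question's example; numRows and numCols are set by hand here.

```r
library(Matrix)

# Coordinate list: row index (user), column index (movie), value (rating).
myDF <- data.frame(
  i = c(1L, 2L, 1L, 2L, 3L),
  j = c(1L, 2L, 2L, 1L, 3L),
  x = c(1, 2, 3, 5, 4)
)

# Dimensions chosen explicitly; if dims is omitted, the largest indices are used.
numRows <- 3
numCols <- 3

mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x,
                            dims = c(numRows, numCols))
print(mySparseMat)
```

Note that unrated cells read back as 0 rather than NA, so this representation suits ratings that are never 0.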