Should I use a data.frame or a matrix?
When should one use a data.frame, and when is it better to use a matrix?
Both keep data in a rectangular format, so sometimes it's unclear.
Are there any general rules of thumb for when to use which data type?
Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.
Consequently, the choice matrix/data.frame is only problematic if you have data of the same type.
The answer depends on what you are going to do with the data in the data.frame/matrix. If it is going to be passed to other functions, then the argument types those functions expect determine the choice.
Also:
Matrices are more memory efficient.
Matrices are a necessity if you plan to do any linear-algebra type of operations.
Data frames are more convenient if you frequently refer to their columns by name (via the compact $ operator).
Data frames are also, IMHO, better for reporting (printing) tabular information, as you can apply formatting to each column separately.
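The points in the list above can be illustrated with a minimal sketch. The objects `m` and `d` and the 2x2 size are illustrative choices, not part of the original answer, and the exact byte counts depend on the R version and platform, so none are quoted here.

```r
# Small integer matrix and the equivalent data frame (illustrative objects)
m <- matrix(1:4, nrow = 2, ncol = 2)
d <- as.data.frame(m)   # columns are named V1 and V2 by default

# Memory footprint of each object
size_m <- object.size(m)
size_d <- object.size(d)
print(size_m)
print(size_d)

# Linear algebra works directly on the matrix ...
prod_m <- m %*% m

# ... whereas a data frame has to be coerced to a matrix first
prod_d <- as.matrix(d) %*% as.matrix(d)

# Data frame columns can be accessed by name with $
first_col <- d$V1
```

For an object this small the data frame's extra attributes (names, row names, class) dominate its size; how the comparison behaves for larger objects is a separate question.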
Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, but using matrices can also make your code more efficient than using data frames, often considerably so. That is one reason why, internally, a lot of R functions coerce data held in data frames to matrices.
Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.
Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.
In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if any of the factor labels is non-numeric.
I nearly always use a data frame for my data analysis tasks, as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to a matrix and then format the results back out as a data frame. This is because data frames are convenient.
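The difference between as.matrix() and data.matrix() described above can be seen in a short sketch. The data frame `df` and its column names are made up for illustration.

```r
# Illustrative data frame with two factor columns whose labels are non-numeric
df <- data.frame(a = factor(letters[1:3]), B = factor(LETTERS[1:3]))

# as.matrix() keeps the factor labels, so the result is a character matrix
chr <- as.matrix(df)
print(chr)

# data.matrix() replaces each factor by its internal integer codes
num <- data.matrix(df)
print(num)
```

Note that the codes produced by data.matrix() reflect the ordering of the factor levels, not any numeric meaning of the labels, so they are only useful when that ordering is what you want.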
@Michal: Matrices aren't really more memory efficient, unless you have a large number of columns.
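A rough way to check this, assuming object.size() is an adequate measure of memory use, is to compare a tall and a wide layout of the same data. The size `n` and the object names below are illustrative, not the figures from the original answer.

```r
n <- 20000  # illustrative size

# Tall shape: many rows, two columns
m_tall <- matrix(seq_len(2 * n), nrow = n, ncol = 2)
d_tall <- as.data.frame(m_tall)

# Wide shape: two rows, many columns
m_wide <- matrix(seq_len(2 * n), nrow = 2, ncol = n)
d_wide <- as.data.frame(m_wide)

# Size of the data frame relative to the matrix holding the same values
ratio_tall <- as.numeric(object.size(d_tall)) / as.numeric(object.size(m_tall))
ratio_wide <- as.numeric(object.size(d_wide)) / as.numeric(object.size(m_wide))
print(c(tall = ratio_tall, wide = ratio_wide))
```

In the tall case the two objects are nearly the same size. In the wide case every column of the data frame is a separate vector with its own header and its own name, so the per-column overhead adds up.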
A matrix is actually a vector with a dim attribute (and some additional methods), while a data.frame is a list.
The difference comes down to vector vs. list. For computational efficiency, stick with a matrix; use a data.frame only if you have to.
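The vector-versus-list distinction can be inspected directly in base R. This is a small sketch with illustrative objects `m` and `d`.

```r
m <- matrix(1:6, nrow = 2)   # illustrative 2 x 3 integer matrix
d <- as.data.frame(m)        # the same values as a data frame

# A matrix is an atomic vector carrying a dim attribute
m_type  <- typeof(m)         # storage type of the underlying vector
m_attrs <- attributes(m)     # contains only $dim here
m_len   <- length(m)         # number of elements

# A data frame is a list with one element per column
d_type <- typeof(d)
d_len  <- length(d)          # number of columns

print(list(m_type = m_type, m_len = m_len, d_type = d_type, d_len = d_len))
```

Because each data frame column is its own list element, element-wise or row-wise access has to go through the list and its methods, whereas a matrix is one contiguous vector indexed by position.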
I cannot stress enough the efficiency difference between the two! It is true that DFs are more convenient in some cases, especially in data analysis, that they allow heterogeneous data, and that some libraries accept only them. But all of this is really secondary unless you are writing one-off code for a specific task.
Let me give you an example. There was a function that calculated the 2D path of an MCMC method. Basically, this means we take an initial point (x,y) and iterate a certain algorithm to find a new point (x,y) at each step, constructing the whole path this way. The algorithm involves calculating a quite complex function and generating some random variables at each iteration, so when it ran for 12 seconds I thought that was fine given how much it does at each step. That being said, the function collected all points of the constructed path, together with the value of an objective function, in a 3-column data.frame. Three columns is not that large, and the number of steps was also a more than reasonable 10,000 (in this kind of problem, paths of length 1,000,000 are typical, so 10,000 is nothing). So I thought a 10,000x3 DF was definitely not an issue. The reason a DF was used is simple: after calling the function, ggplot() was called to draw the resulting (x,y) path, and ggplot() does not accept a matrix.
Then, at some point, out of curiosity I decided to change the function to collect the path in a matrix. Fortunately the syntax of DFs and matrices is similar, so all I did was change the line specifying df as a data.frame to one initializing it as a matrix. I should also mention that in the initial code the DF was initialized at its final size, so later in the function new values were only written into already allocated space, and there was no overhead from adding new rows to the DF. This makes the comparison even fairer, and it also made my job simpler, as I did not need to rewrite anything else in the function: just a one-line change from the initial allocation of a data.frame of the required size to a matrix of the same size. To adapt the new version of the function to ggplot(), I converted the returned matrix to a data.frame before passing it to ggplot().
After I reran the code I could not believe the result. The code ran in a fraction of a second, instead of about 12 seconds! And again, during the 10,000 iterations the function only read and wrote values in already allocated space in a DF (and now in a matrix). And this difference appears even at the reasonable (or rather small) size of 10,000x3.
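The original MCMC function is not shown, so the following is only a sketch of the comparison under stated assumptions: the hypothetical helpers `fill_df()` and `fill_matrix()` preallocate a 10,000x3 container and write one row per iteration, with a trivial deterministic row standing in for the real sampling step. Timings vary by machine, so none are quoted.

```r
n <- 10000  # number of steps, as in the anecdote

# Hypothetical stand-in: write each step into a preallocated data frame
fill_df <- function(n) {
  path <- data.frame(x = numeric(n), y = numeric(n), value = numeric(n))
  for (i in seq_len(n)) {
    path[i, ] <- c(i, 2 * i, 3 * i)   # placeholder for the real (x, y, objective)
  }
  path
}

# Same loop, with only the allocation line changed to a matrix
fill_matrix <- function(n) {
  path <- matrix(0, nrow = n, ncol = 3,
                 dimnames = list(NULL, c("x", "y", "value")))
  for (i in seq_len(n)) {
    path[i, ] <- c(i, 2 * i, 3 * i)
  }
  path
}

t_df  <- system.time(res_df  <- fill_df(n))
t_mat <- system.time(res_mat <- fill_matrix(n))
print(rbind(data.frame = t_df, matrix = t_mat))

# Convert once at the end for functions that need a data frame (e.g. plotting)
plot_data <- as.data.frame(res_mat)
```

Each `path[i, ] <- ...` on a data frame dispatches to the data frame assignment method, which does far more bookkeeping per call than the same assignment on a matrix; that per-iteration cost is what the loop exposes.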
So, if your only reason to use a DF is compatibility with a library function such as ggplot(), you can always convert to a DF at the last moment; work with matrices for as long as you find it convenient. If, on the other hand, there is a more substantial reason to use a DF, then you should not worry about efficiency. Examples: you use a data analysis package that would otherwise require constant conversion from matrices to DFs and back; you do not do any intensive calculations yourself and only use standard packages (many of which internally convert a DF to a matrix, do their job, and convert the result back, so they do all the efficiency work for you); or you are doing a one-off job, so you do not care and feel more comfortable with DFs.
Or a different, more practical rule: if you have a question like the one in the OP, use matrices. You would then use DFs only when you do not have such a question (because you already know you have to use DFs, or because you do not really care, as the code is one-off, etc.).
But in general, always keep this efficiency point in mind as a priority.
Here is an interesting result. Working with matrices is faster than working with tibbles, but the difference shrinks as the matrix gets larger. Note that the two conversions between a tibble and a matrix do not carry the same overhead. As a broad generalization, working exclusively in matrix space (no type coercion) is about 20% faster than working in dplyr space.
Created on 2023-07-03 with reprex v2.0.2