Conditional data manipulation in Mathematica
I am trying to put together the best tools for efficient data analysis in Mathematica.
I have approximately 300 columns and 100,000 rows.
What would be the best tricks to "remove", "extract" or simply "consider" parts of the data structure, e.g. for plotting?
One of the trickiest examples I could think of is:
Given a data structure, extract columns 1 to 3, 6 to 9 and the last one for every row where the value in column 2 is equal to x and the value in column 8 is different from y.
I also welcome any general advice on data manipulation.
For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select manually. However, with many columns and many different queries, the chances of messing up the indexes are high. Here is the modified solution from the mentioned post, which provides a friendlier syntax:
What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):
We need to define the column names (they must be strings or symbols without values):
(in practice, names are usually more descriptive, of course). Here is the table:
Here is the select statement you need (I picked x = 4 and y = 2):
Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
and similar.
Of course, if there are specific correlations in your data, you might find a special-purpose algorithm that is faster. The function above can be extended in many ways, to simplify common queries (include "all", etc.), or to auto-compile the generated pure function (if possible).
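To make the idea of a Select predicate generated from named columns concrete, here is a minimal sketch. It is not the code from the linked post: the names selectRows and where, the column symbols c1 to c10, and the sample rows are all hypothetical, and the column symbols are assumed to have no values in the session.

```mathematica
(* Hypothetical helper: builds the Select predicate from a condition
   written in terms of column names. Not the original solution. *)
ClearAll[selectRows, where];
SetAttributes[where, HoldAll];  (* keep the condition unevaluated *)

selectRows[{colNames_List, rows_List}, cols_List, where[condition_]] :=
  Module[{n = Length[colNames], slotRules, pred, pos},
    (* map each column name to the matching slot: c1 -> #1, c2 -> #2, ... *)
    slotRules = Thread[colNames -> Array[Slot, n]];
    (* turn the held condition into a pure function of the row elements *)
    pred = Function @@ (Hold[condition] /. slotRules);
    (* positions of the requested columns *)
    pos = cols /. Thread[colNames -> Range[n]];
    Select[rows, pred @@ # &][[All, pos]]
  ];

(* Hypothetical sample data: 10 named columns, 3 rows *)
colNames = {c1, c2, c3, c4, c5, c6, c7, c8, c9, c10};
rows = {
   {1, 4, 3, 0, 0, 6, 7, 5, 9, 10},
   {1, 4, 3, 0, 0, 6, 7, 2, 9, 20},
   {1, 3, 3, 0, 0, 6, 7, 5, 9, 30}};
x = 4; y = 2;

result = selectRows[{colNames, rows},
   {c1, c2, c3, c6, c7, c8, c9, c10},
   where[c2 == x && c8 != y]]
(* {{1, 4, 3, 6, 7, 5, 9, 10}} *)
```

Because the condition is written with column names rather than positions, a different query only changes the where[...] argument, for example where[c1 > 0 || c10 == 30].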
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) find themselves from time to time writing similar code again and again. Because Mathematica has a concise syntax, it is often very easy to write the code for any particular case. However, as long as one works in some specific domain (for example, data manipulation in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is one possible way out: create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.
Some useful commands for getting pieces of matrices and lists are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices.
Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
Select can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. Its twin brother is Cases, which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument to which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix), I can refer to the columns by using the Part ([[..]]) function.
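As a rough illustration of the commands described above, the following sketch applies Span, Select with a pure function, and Cases with a pattern to the extraction asked for in the question. The matrix data and the values x = 4, y = 2 are hypothetical sample inputs.

```mathematica
(* Hypothetical 4 x 10 sample matrix *)
data = {
   {1, 4, 3, 0, 0, 6, 7, 5, 9, 10},
   {2, 4, 3, 0, 0, 6, 7, 2, 9, 20},
   {3, 3, 3, 0, 0, 6, 7, 5, 9, 30},
   {4, 4, 1, 0, 0, 2, 3, 8, 5, 40}};
x = 4; y = 2;

(* Span with a step: all rows, every second column starting at the first *)
oddCols = data[[All, 1 ;; -1 ;; 2]];

(* Select keeps the rows for which the pure function returns True;
   Part then picks columns 1-3, 6-9 and the last one *)
viaSelect =
  Select[data, #[[2]] == x && #[[8]] != y &][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
(* {{1, 4, 3, 6, 7, 5, 9, 10}, {4, 4, 1, 2, 3, 8, 5, 40}} *)

(* Cases does the same row selection with a pattern: second element x,
   eighth element anything except y, any remaining elements after that *)
viaCases =
  Cases[data, {_, x, _, _, _, _, _, Except[y], ___}][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
```

Note that the Cases pattern compares structurally rather than numerically, so the integer 4 would not match the real number 4.; Select with == does not have that restriction.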
To pull out columns (or rows) you can do it by part indexing:
The last line is just there to display the result nicely.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.
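A small sketch of the indexing described above, using a hypothetical 3 x 10 matrix named data: a single contiguous range is taken with Span, and the disjoint ranges 1 to 3, 6 to 9 and the last column are combined with Flatten and Range.

```mathematica
(* Hypothetical sample matrix: rows 1..10, 11..20, 21..30 *)
data = Partition[Range[30], 10];

(* a single contiguous range of columns via Span *)
middle = data[[All, 6 ;; 9]];

(* several disjoint ranges: build the index list instead of typing it *)
cols = Flatten[{Range[1, 3], Range[6, 9], -1}]
(* {1, 2, 3, 6, 7, 8, 9, -1} *)

picked = data[[All, cols]];

(* display only; MatrixForm wraps the result for viewing *)
picked // MatrixForm
```

Keeping MatrixForm out of the assignment means picked stays an ordinary list of lists that can be used in further computations.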
I read:
as meaning that we want:
AND
[[2]] == x && [[8]] != y.
This is what I hacked together:
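The row condition [[2]] == x && [[8]] != y can also be expressed as a Boolean mask over the rows and applied with Pick. The following is only a sketch of one way to do that, not necessarily the approach this answer took; the matrix data and the values of x and y are hypothetical.

```mathematica
(* Hypothetical 4 x 10 sample matrix *)
data = {
   {1, 4, 3, 0, 0, 6, 7, 5, 9, 10},
   {2, 4, 3, 0, 0, 6, 7, 2, 9, 20},
   {3, 3, 3, 0, 0, 6, 7, 5, 9, 30},
   {4, 4, 1, 0, 0, 2, 3, 8, 5, 40}};
x = 4; y = 2;

(* columns 1-3, 6-9 and the last one *)
cols = {1, 2, 3, 6, 7, 8, 9, -1};

(* one True/False entry per row, computed from columns 2 and 8 only *)
mask = MapThread[#1 == x && #2 != y &, {data[[All, 2]], data[[All, 8]]}];

(* Pick keeps the rows of the column-reduced matrix where mask is True *)
picked = Pick[data[[All, cols]], mask]
(* {{1, 4, 3, 6, 7, 5, 9, 10}, {4, 4, 1, 2, 3, 8, 5, 40}} *)
```

Building the mask from two extracted columns avoids touching the other columns of each row, which may matter for a table with around 300 columns.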