Conditional data manipulation in Mathematica

Posted on 2024-11-09 10:55:40

I am trying to put together the best tools for efficient data analysis in Mathematica.
I have approximately 300 columns and 100,000 rows.

What would be the best tricks to:

"Remove", "Extract" or simply "Consider" parts of the data structure, e.g. for plotting?

One of the trickiest examples I could think of is:

Given a data structure,

Extract columns 1 to 3, 6 to 9 as well as the last one for every row where the value in column 2 is equal to x and the value in column 8 is different from y

I also welcome any general advice on data manipulation.

Comments (4)

半枫 2024-11-16 10:55:40

For generic manipulation of data in a table with named columns, I refer you to this solution of mine to a similar question. For any particular case, it might be easier to write the function for Select by hand. However, with many columns and many different queries, the chance of mixing up indices is high. Here is a modified version of the solution from that post, which provides a friendlier syntax:

Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];

ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules =  Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
    indexRules  =  Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
     With[{selF = Apply[Function, Hold[condition] /. colRules]},
       Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];

What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):

rows = Array[#1 #2 &, {5, 15}];

We need to define the column names (must be strings or symbols without values):

In[425]:= 
colnames = "c" <> ToString[#] & /@ Range[15]

Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", 
"c13", "c14", "c15"}

(In practice, of course, the names are usually more descriptive.) Here is the table:

table = Prepend[rows, colnames];

Here is the select statement you need (I picked x = 4 and y = 2):

select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
    where["c2" == 4 && "c8" != 2]]

{{2, 4, 6, 12, 14, 16, 18, 30}}
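
To make the mechanism more concrete, the following sketch reproduces by hand the rule substitution that select performs internally. The three column names and the sample condition are assumptions chosen only for illustration; the construction itself is the one used in the definition of select above.

```mathematica
(* Illustrative column names (an assumption, not part of the original table) *)
colNames = {"c1", "c2", "c3"};

(* Map each column name to the slot with the same position: "c1" -> #1, "c2" -> #2, ... *)
colRules = Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]];

(* Replace the names inside the held condition, then turn Hold[...] into Function[...] *)
selF = Apply[Function, Hold["c2" == 4 && "c3" != 2] /. colRules]
(* the result is the pure function  #2 == 4 && #3 != 2 &  *)

(* Applying it to the elements of a row tests that row *)
test1 = selF @@ {2, 4, 6};
test2 = selF @@ {1, 2, 3};
```

The Hold wrapper (and the HoldAll attribute on where) matters because a comparison such as "c2" == 4 would otherwise evaluate before the column names are replaced by slots.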

Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as

In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]

Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}

and similar.

Of course, if there are specific correlations in your data, you might find a special-purpose algorithm that is faster. The function above can be extended in many ways, to simplify common queries (including "all", etc.), or to auto-compile the generated pure function (if possible).
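
As one possible illustration of such an extension, the sketch below adds a rule so that All can be given in place of the list of column names. This extra definition is hypothetical and not part of the original solution; the select definition is repeated from above only so that the block is self-contained, and the small three-column table is an assumption made for the example.

```mathematica
ClearAll[select, where, from];
SetAttributes[where, HoldAll];

(* Original definition, repeated unchanged *)
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
  With[{colRules = Dispatch[Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
    indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
   With[{selF = Apply[Function, Hold[condition] /. colRules]},
    Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];

(* Hypothetical extension: All stands for every column name in the header row.
   The where[...] expression is passed through as a whole, so its condition stays held. *)
select[All, from[table : {colNames_List, __List}], w_where] :=
  select[colNames, from[table], w];

(* Small illustrative table: a header row followed by data rows *)
table = {{"c1", "c2", "c3"}, {1, 2, 3}, {2, 4, 6}, {3, 6, 9}};

allCols = select[All, from[table], where["c2" == 4]]
someCols = select[{"c1", "c3"}, from[table], where["c2" > 2]]
```

Under these assumptions the first query returns the single matching row with all its columns, and the second returns columns c1 and c3 of the two rows whose c2 exceeds 2.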

EDIT

On a philosophical note, I am sure that many Mathematica users (myself included) have found themselves writing similar code again and again. Because Mathematica has a concise syntax, it is often very easy to write such code for any particular case. However, as long as one works in some specific domain (for example, data manipulation in a table), the cost of repeating yourself will be high for many operations. What my example illustrates, in a very simple setting, is one possible way out: create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it and write a compiler from it to Mathematica (to generate Mathematica code automatically). The example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.

时光礼记 2024-11-16 10:55:40

data = RandomInteger[{1, 20}, {40, 20}]

x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]

==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}

Some useful commands for getting pieces of matrices and lists are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices.

Part ([[...]]) in combination with ;; can be quite powerful. For instance, a[[All, 1;;-1;;2]] means take all rows and all odd-numbered columns (-1 having the usual meaning of counting from the end).
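
A minimal sketch of the stepped Span on a small deterministic matrix follows; the matrix a and the variable names are assumptions chosen for illustration.

```mathematica
(* 3 x 6 matrix whose entry in row i, column j is i*j *)
a = Array[#1 #2 &, {3, 6}];

(* All rows, columns 1, 3, 5: start at 1, go to the last column, step 2 *)
oddCols = a[[All, 1 ;; -1 ;; 2]]
(* {{1, 3, 5}, {2, 6, 10}, {3, 9, 15}} *)

(* Starting at 2 instead gives the even-numbered columns *)
evenCols = a[[All, 2 ;; -1 ;; 2]]
(* {{2, 4, 6}, {4, 8, 12}, {6, 12, 18}} *)
```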

Select can be used to pick elements from a list (remember that a matrix is a list of lists) based on a logical function. Its twin brother is Cases, which selects based on a pattern. The function I used here is a 'pure' function, where # refers to the argument to which the function is applied (here, the elements of the list). Since the elements are themselves lists (the rows of the matrix), I can refer to the columns by using the Part ([[..]]) function.
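
To illustrate the relation between the two, here is a sketch of the same selection written once with Select and once with Cases. It uses a deterministic matrix instead of random data so that the result is reproducible; the values x = 4 and y = 2 and the variable names are assumptions for the example.

```mathematica
data = Array[#1 #2 &, {5, 15}];   (* entry in row i, column j is i*j *)
x = 4; y = 2;
cols = {1, 2, 3, 6, 7, 8, 9, -1};

(* Select: keep the rows passing the test, then take the columns with Part *)
viaSelect = Select[data, #[[2]] == x && #[[8]] != y &][[All, cols]];

(* Cases: a conditioned pattern picks the rows, the rule's right-hand side takes the columns *)
viaCases = Cases[data, (row_ /; row[[2]] == x && row[[8]] != y) :> row[[cols]]];

viaSelect === viaCases
```

Both forms return {{2, 4, 6, 12, 14, 16, 18, 30}} here, since only the second row has 4 in column 2.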

ぇ气 2024-11-16 10:55:40

To pull out columns (or rows), you can use Part indexing:

data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten@{Range@3, Range @@ {6, 9}, -1}]]

MatrixForm@%

The last line just displays the result in a nicely formatted way.

As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.
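
As a small sketch of the two cases just described, a single contiguous range can be taken directly with Span, while disjoint ranges need an explicit index list. Join is shown here as an assumed alternative to Flatten; it is not used in the original answer.

```mathematica
data = Array[#1 #2 &, {5, 15}];

(* A single contiguous range: Span is enough *)
middle = data[[All, 6 ;; 9]];

(* Disjoint ranges: build one index list; Join is an alternative to Flatten here *)
idxFlatten = Flatten@{Range@3, Range @@ {6, 9}, -1};
idxJoin = Join[Range[3], Range[6, 9], {-1}];

picked = data[[All, idxJoin]];
```

Both index lists evaluate to {1, 2, 3, 6, 7, 8, 9, -1}, so either can be passed to Part.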

谜泪 2024-11-16 10:55:40

I read:

Extract columns 1 to 3, 6 to 9 as well as the last one for every row where the value in column 2 is equal to x and the value in column 8 is different from y

as meaning that we want:

  • elements 1-3 and 6-9 from each row

AND

  • the last element from rows wherein [[2]] == x && [[8]] != y.

This is what I hacked together:

a = RandomInteger[5, {20, 10}];          (*define the array*)
x = 4; y = 0;                            (*define the test values*)

Join @@ Range @@@ {1 ;; 3, 6 ;; 9};      (*define the column ranges*)

#2 == x && #8 != y & @@@ a;              (*test the rows*)

Append[%%, #] & /@ % /. {True -> -1, False :> Sequence[]};  (*complete the ranges according to the test*)

MapThread[Part, {a, %}] // TableForm     (*extract and display*)
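
The code above relies on % and %%, so each line has to be evaluated as a separate input, in order. A sketch of the same steps with named intermediate results is given below; the variable names, the use of If in place of the replacement rules, and the deterministic matrix (substituted for the random one so that the result is reproducible) are assumptions made for illustration.

```mathematica
a = Array[#1 #2 &, {5, 10}];      (* entry in row i, column j is i*j *)
x = 4; y = 0;                     (* test values *)

cols = Join[Range[1, 3], Range[6, 9]];        (* columns taken from every row *)

tests = #2 == x && #8 != y & @@@ a;           (* True/False for each row *)

(* Add the last column (-1) only for the rows that pass the test *)
parts = If[#, Append[cols, -1], cols] & /@ tests;

result = MapThread[Part, {a, parts}];         (* extract, row by row *)
result // TableForm
```

With this matrix only the second row has 4 in column 2, so it is the only row that also keeps its last element; the result is therefore ragged, as in the original version.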