Finding similar, but not identical, elements in Mathematica
I have a list of numbers. I want to extract from the list runs of numbers that fall inside some band and have some minimum length. For instance, suppose I want to operate on this list:

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}

with band = 1 and runLength = 3. I would like to have

{{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}}
as the result. Right now I'm using
Cases[
Partition[thisList,runLength,1],
x_ /; Abs[Max[x] - Min[x]] < band
]
The main problem is that where runs overlap, I get many copies of the run. For instance, using
thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}
gives me
{{-0.6, -0.8, -0.1}, {-0.8, -0.1, -0.5}, {-0.1, -0.5, -0.3}, {-0.5, -0.3, -0.1}, {-0.3, -0.1, -0.7}}
where I would rather have
{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}
without doing some hokey reduction of the overlapping result. What's the proper way? It'd be nice if it didn't involve exploding the data using Partition.
EDIT

Apparently, my first solution has at least two serious flaws: it is dead slow and completely impractical for lists larger than 100 elements, and it contains some bug(s) which I haven't been able to identify yet - it sometimes misses some bands. So, I will provide two (hopefully correct) and much more efficient alternatives, and I give the flawed one below for anyone interested.
A solution based on linked lists
Here is a solution based on linked lists. It allows us to still use patterns, but avoids the inefficiencies caused by patterns containing __ or ___ (when repeatedly applied):
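Here is a sketch of one such implementation, following the description below (the names toLinkedList, accumF and findBandsLL are illustrative, and details may well differ):

ClearAll[toLinkedList, accumF, findBandsLL];

(* convert a flat list into a nested linked list: {1, 2, 3} -> {1, {2, {3, {}}}} *)
toLinkedList[x_List] := Fold[{#2, #1} &, {}, Reverse[x]];

(* accumF[start, acc, rest, ctr, mx, mn, band, rLen]:
   start - linked list beginning at the current candidate run,
   acc - elements accumulated so far (as a linked list),
   rest - the not yet consumed part of the input,
   ctr - number of accumulated elements,
   mx, mn - running extrema of the accumulated elements *)

(* the next element fits into the band: extend the current run *)
accumF[start_, acc_, {h_, t_List}, ctr_, mx_, mn_, band_, rLen_] /;
   Max[mx, h] - Min[mn, h] < band :=
  accumF[start, {acc, h}, t, ctr + 1, Max[mx, h], Min[mn, h], band, rLen];

(* input exhausted: sow the final run if it is long enough *)
accumF[_, acc_, {}, ctr_, _, _, _, rLen_] := If[ctr >= rLen, Sow[acc]];

(* band broken and the run is long enough: sow it and restart right after it;
   Sow sits inside the argument list to keep the call tail-recursive *)
accumF[_, acc_, ll : {h_, _List}, ctr_, mx_, mn_, band_, rLen_] /;
   ctr >= rLen && Max[mx, h] - Min[mn, h] >= band :=
  accumF[ll, (Sow[acc]; {}), ll, 0, -Infinity, Infinity, band, rLen];

(* band broken and the run is too short: roll back to start minus its first element *)
accumF[{_, t_List}, _, {h_, _List}, ctr_, mx_, mn_, band_, rLen_] /;
   ctr < rLen && Max[mx, h] - Min[mn, h] >= band :=
  accumF[t, {}, t, 0, -Infinity, Infinity, band, rLen];

findBandsLL[lst_List, band_?Positive, rLen_Integer] :=
  Block[{$IterationLimit = Infinity},
   With[{ll = toLinkedList[lst]},
    Map[Flatten,
     Flatten[Last@Reap[accumF[ll, {}, ll, 0, -Infinity, Infinity, band, rLen]], 1]]]];

Here are examples of use:

findBandsLL[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} *)

findBandsLL[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}} *)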
The main idea of the function accumF is to traverse the number list (converted to a linked list prior to that), and accumulate a band in another linked list, which is passed to it as a second argument. Once the band condition fails, the accumulated band is memorized using Sow (if it was long enough), and the process starts over with the remaining part of the linked list. The ctr parameter may not be needed if we chose to use Depth[acc] instead.

There are a few non-obvious things present in the above code. One subtle point is that an attempt to join the two middle rules for accumF into a single rule (they look very similar) and use CompoundExpression on the r.h.s. (something like If[ctr >= rLen, Sow[acc]; accumF[...]]) would lead to a non-tail-recursive accumF (see this answer for a more detailed discussion of this issue). This is also why I put the (Sow[acc]; {}) piece inside a function call - to avoid a top-level CompoundExpression on the r.h.s. Another subtle point is that I have to maintain a copy of the linked list containing the remaining elements right after the last successful match was found, since in the case of an unsuccessful sequence I need to roll back to that list minus its first element, and start over. This linked list is stored in the first argument of accumF.

Note that passing large linked lists around does not cost much, since what is copied is only the first element (head) and a pointer to the rest (tail). This is the main reason why using linked lists vastly improves performance, as compared to patterns like {___, x__, right___} - in the latter case, the full sequences x or right are copied. With linked lists, we effectively copy only a few references, and therefore our algorithm behaves roughly as we expect (linearly in the length of the data list here). In this answer, I also mentioned the use of linked lists in such cases as one of the techniques to optimize code (section 3.4).

Compile-based solution
Here is a straightforward but not too elegant function based on Compile, which finds a list of starting and ending band positions in the list:
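Here is a sketch of such a compiled scanner (bandPositions and findBandsC are illustrative names; the start/end pairs are returned flattened into a single integer list, which keeps the compiled return type simple):

bandPositions =
 Compile[{{lst, _Real, 1}, {band, _Real}, {rLen, _Integer}},
  Module[{n = Length[lst], start = 1, i = 1, mx = 0., mn = 0., ct = 0,
    res = Table[0, {2 Length[lst]}]},
   While[start <= n - rLen + 1,
    mx = lst[[start]]; mn = lst[[start]]; i = start + 1;
    (* extend the run while the band condition holds *)
    While[i <= n && Max[mx, lst[[i]]] - Min[mn, lst[[i]]] < band,
     mx = Max[mx, lst[[i]]]; mn = Min[mn, lst[[i]]]; i++];
    If[i - start >= rLen,
     (* long enough: record it and restart right after the run *)
     ct++; res[[ct]] = start; ct++; res[[ct]] = i - 1; start = i,
     (* too short: restart one position further *)
     start++]];
   Take[res, ct]]];

This is used in the final function:

ClearAll[findBandsC];
findBandsC[lst_List, band_?Positive, rLen_Integer] :=
 Map[Take[lst, #] &, Partition[bandPositions[N[lst], band, rLen], 2]];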
Examples of use:
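With the sketches above, both of the original test lists give the expected runs:

findBandsC[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} *)

findBandsC[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}} *)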
Benchmarks

Obviously, if one wants performance, Compile wins hands down. My observations for larger lists are that both solutions have approximately linear complexity in the size of the number list (as they should), with the compiled one roughly 150 times faster on my machine than the one based on linked lists.

Remarks
In fact, both methods encode the same algorithm, although this may not be obvious. The one with recursion and patterns is arguably somewhat more understandable, but that is a matter of opinion.
A simple, but slow and buggy version
Here is the original code that I wrote first to solve this problem. It is based on a rather straightforward use of patterns and repeated rule application. As mentioned, one disadvantage of this method is its very bad performance. This is actually another case against using constructs like {___, x__, y___} in conjunction with repeated rule application for anything longer than a few dozen elements. In the mentioned recommendations on code optimization techniques, this corresponds to section 4.1. Anyway, here is the code:
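A sketch of what such a version can look like (findBandsPat is an illustrative name; this reproduces the technique rather than the exact original rule):

ClearAll[findBandsPat];
findBandsPat[lst_List, band_?Positive, rLen_Integer] :=
 Module[{res = {}},
  (* repeatedly pull the leftmost longest band out of the list *)
  lst //. {pre___, Longest[x__], post___} /;
     Length[{x}] >= rLen && Max[{x}] - Min[{x}] < band :>
    (AppendTo[res, {x}]; {pre, post});
  res];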
It works correctly for both of the original small test lists. It also looks generally correct, but for larger lists it often misses some bands, which can be seen by comparison with the other two methods. So far I have not been able to localize the bug, since the code seems pretty transparent.
I'd try this instead:
where Repeated[_, {3, Infinity}] guarantees that you get at least 3 terms, and Longest ensures it gives you the longest run possible. As a function:
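(again a sketch; findRuns is an illustrative name):

ClearAll[findRuns];
findRuns[list_List, band_, runLength_Integer] :=
 SequenceCases[list,
  x : Longest[{Repeated[_, {runLength, Infinity}]}] /; Max[x] - Min[x] < band]

findRuns[thisList, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} for the first thisList *)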
Very complex answers given. :-) I think I have a simpler approach for you. Define for yourself what similarity means to you, and use GatherBy[] to collect all similar elements, or SplitBy[] to collect "runs" of similar elements (then remove "runs" of minimal unaccepted length, say 1 or 2, via DeleteCases[]).
The harder question is establishing similarity. By your method, 1.2, 0.9, 1.9, 0.8 would group the first three elements but not the last three; yet 0.9 and 0.8 are just as similar, and 1.9 would kick it out of your band. And what about .4, .5, .6, .7, .8, .9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5 - where does similarity end?
Also look into ClusteringComponents[] and FindClusters[].
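For example (a sketch; the grouping returned depends on the method and distance function chosen):

FindClusters[thisList]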