Finding similar, but not identical, elements in Mathematica
I have a list of numbers. I want to extract from the list runs of numbers that fall inside some band and have some minimum length. For instance, suppose I want to operate on this list:

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}

with band = 1 and runLength = 3. I would like to have

{{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}}
as the result. Right now I'm using
Cases[
Partition[thisList,runLength,1],
x_ /; Abs[Max[x] - Min[x]] < band
]
The main problem is that where runs overlap, I get many copies of the run. For instance, using
thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}
gives me
{{-0.6, -0.8, -0.1}, {-0.8, -0.1, -0.5}, {-0.1, -0.5, -0.3}, {-0.5, -0.3, -0.1}, {-0.3, -0.1, -0.7}}
where I would rather have
{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}
without doing some hokey reduction of the overlapping result. What's the proper way? It'd be nice if it didn't involve exploding the data using Partition.
EDIT

Apparently, my first solution has at least two serious flaws: it is dead slow and completely impractical for lists larger than 100 elements, and it contains some bug(s) which I haven't been able to identify yet - it sometimes misses some bands. So, I will provide two (hopefully correct) and much more efficient alternatives, and I give the flawed one below for anyone interested.
A solution based on linked lists
Here is a solution based on linked lists. It allows us to still use patterns, but avoids the inefficiencies caused by patterns containing __ or ___ (when repeatedly applied):
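Here is a sketch of one such implementation, following the description below (the names toLinkedList, accumF and findBandsLL are illustrative, and details may well differ):

ClearAll[toLinkedList, accumF, findBandsLL];

(* convert a flat list into a nested linked list: {1, 2, 3} -> {1, {2, {3, {}}}} *)
toLinkedList[x_List] := Fold[{#2, #1} &, {}, Reverse[x]];

(* accumF[start, acc, rest, ctr, mx, mn, band, rLen]:
   start - linked list beginning at the current candidate run,
   acc - elements accumulated so far (as a linked list),
   rest - the not yet consumed part of the input,
   ctr - number of accumulated elements,
   mx, mn - running extrema of the accumulated elements *)

(* the next element fits into the band: extend the current run *)
accumF[start_, acc_, {h_, t_List}, ctr_, mx_, mn_, band_, rLen_] /;
   Max[mx, h] - Min[mn, h] < band :=
  accumF[start, {acc, h}, t, ctr + 1, Max[mx, h], Min[mn, h], band, rLen];

(* input exhausted: sow the final run if it is long enough *)
accumF[_, acc_, {}, ctr_, _, _, _, rLen_] := If[ctr >= rLen, Sow[acc]];

(* band broken and the run is long enough: sow it and restart right after it;
   Sow sits inside the argument list to keep the call tail-recursive *)
accumF[_, acc_, ll : {h_, _List}, ctr_, mx_, mn_, band_, rLen_] /;
   ctr >= rLen && Max[mx, h] - Min[mn, h] >= band :=
  accumF[ll, (Sow[acc]; {}), ll, 0, -Infinity, Infinity, band, rLen];

(* band broken and the run is too short: roll back to start minus its first element *)
accumF[{_, t_List}, _, {h_, _List}, ctr_, mx_, mn_, band_, rLen_] /;
   ctr < rLen && Max[mx, h] - Min[mn, h] >= band :=
  accumF[t, {}, t, 0, -Infinity, Infinity, band, rLen];

findBandsLL[lst_List, band_?Positive, rLen_Integer] :=
  Block[{$IterationLimit = Infinity},
   With[{ll = toLinkedList[lst]},
    Map[Flatten,
     Flatten[Last@Reap[accumF[ll, {}, ll, 0, -Infinity, Infinity, band, rLen]], 1]]]];

Here are examples of use:

findBandsLL[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} *)

findBandsLL[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}} *)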
The main idea of the function accumF is to traverse the number list (converted to a linked list prior to that), and accumulate a band in another linked list, which is passed to it as a second argument. Once the band condition fails, the accumulated band is memorized using Sow (if it was long enough), and the process starts over with the remaining part of the linked list. The ctr parameter may not be needed if we chose to use Depth[acc] instead.

There are a few non-obvious things present in the above code. One subtle point is that an attempt to join the two middle rules for accumF into a single rule (they look very similar) and use CompoundExpression on the r.h.s. (something like If[ctr >= rLen, Sow[acc]; accumF[...]]) would lead to a non-tail-recursive accumF (see this answer for a more detailed discussion of this issue). This is also why I put the (Sow[acc]; {}) piece inside a function call - to avoid a top-level CompoundExpression on the r.h.s. Another subtle point is that I have to maintain a copy of the linked list containing the remaining elements right after the last successful match was found, since in the case of an unsuccessful sequence I need to roll back to that list minus its first element, and start over. This linked list is stored in the first argument of accumF.

Note that passing large linked lists around does not cost much, since what is copied is only the first element (head) and a pointer to the rest (tail). This is the main reason why using linked lists vastly improves performance, as compared to patterns like {___, x__, right___} - in the latter case, the full sequences x or right are copied. With linked lists, we effectively copy only a few references, and therefore our algorithm behaves roughly as we expect (linearly in the length of the data list here). In this answer, I also mentioned the use of linked lists in such cases as one of the techniques to optimize code (section 3.4).

Compile-based solution
Here is a straightforward but not too elegant function based on Compile, which finds a list of starting and ending band positions in the list:
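Here is a sketch of such a compiled scanner (bandPositions and findBandsC are illustrative names; the start/end pairs are returned flattened into a single integer list, which keeps the compiled return type simple):

bandPositions =
 Compile[{{lst, _Real, 1}, {band, _Real}, {rLen, _Integer}},
  Module[{n = Length[lst], start = 1, i = 1, mx = 0., mn = 0., ct = 0,
    res = Table[0, {2 Length[lst]}]},
   While[start <= n - rLen + 1,
    mx = lst[[start]]; mn = lst[[start]]; i = start + 1;
    (* extend the run while the band condition holds *)
    While[i <= n && Max[mx, lst[[i]]] - Min[mn, lst[[i]]] < band,
     mx = Max[mx, lst[[i]]]; mn = Min[mn, lst[[i]]]; i++];
    If[i - start >= rLen,
     (* long enough: record it and restart right after the run *)
     ct++; res[[ct]] = start; ct++; res[[ct]] = i - 1; start = i,
     (* too short: restart one position further *)
     start++]];
   Take[res, ct]]];

This is used in the final function:

ClearAll[findBandsC];
findBandsC[lst_List, band_?Positive, rLen_Integer] :=
 Map[Take[lst, #] &, Partition[bandPositions[N[lst], band, rLen], 2]];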
Examples of use:
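With the sketches above, both of the original test lists give the expected runs:

findBandsC[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} *)

findBandsC[{-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}, 1, 3]
(* {{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}} *)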
Benchmarks

Obviously, if one wants performance, Compile wins hands down. My observations for larger lists are that both solutions have approximately linear complexity in the size of the number list (as they should), with the compiled one roughly 150 times faster on my machine than the one based on linked lists.

Remarks
In fact, both methods encode the same algorithm, although this may not be obvious. The one with recursion and patterns is arguably somewhat more understandable, but that is a matter of opinion.
A simple, but slow and buggy version
Here is the original code that I wrote first to solve this problem. It is based on a rather straightforward use of patterns and repeated rule application. As mentioned, one disadvantage of this method is its very bad performance. This is actually another case against using constructs like {___, x__, y___} in conjunction with repeated rule application for anything longer than a few dozen elements. In the mentioned recommendations on code optimization techniques, this corresponds to section 4.1. Anyway, here is the code:
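A sketch of what such a version can look like (findBandsPat is an illustrative name; this reproduces the technique rather than the exact original rule):

ClearAll[findBandsPat];
findBandsPat[lst_List, band_?Positive, rLen_Integer] :=
 Module[{res = {}},
  (* repeatedly pull the leftmost longest band out of the list *)
  lst //. {pre___, Longest[x__], post___} /;
     Length[{x}] >= rLen && Max[{x}] - Min[{x}] < band :>
    (AppendTo[res, {x}]; {pre, post});
  res];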
It works correctly for both of the original small test lists. It also looks generally correct, but for larger lists it often misses some bands, which can be seen by comparison with the other two methods. So far I have not been able to localize the bug, since the code seems pretty transparent.
I'd try this instead:
where Repeated[_, {3, Infinity}] guarantees that you get at least 3 terms, and Longest ensures it gives you the longest run possible. As a function:
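(again a sketch; findRuns is an illustrative name):

ClearAll[findRuns];
findRuns[list_List, band_, runLength_Integer] :=
 SequenceCases[list,
  x : Longest[{Repeated[_, {runLength, Infinity}]}] /; Max[x] - Min[x] < band]

findRuns[thisList, 1, 3]
(* {{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}} for the first thisList *)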
Very complex answers given. :-) I think I have a simpler approach for you. Define for yourself what similarity means to you, and use GatherBy[] to collect all similar elements, or SplitBy[] to collect "runs" of similar elements (then remove "runs" of minimal unaccepted length, say 1 or 2, via DeleteCases[]).
The harder question is establishing similarity. By your method, 1.2, 0.9, 1.9, 0.8 would group the first three elements but not the last three; yet 0.9 and 0.8 are just as similar, and 1.9 would kick it out of your band. And what about .4, .5, .6, .7, .8, .9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5 - where does similarity end?
Also look into ClusteringComponents[] and FindClusters[].
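For example (a sketch; the grouping returned depends on the method and distance function chosen):

FindClusters[thisList]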