交错稀疏排序数组
我有一组事件列表。事件总是按给定的顺序发生,但并非每个事件总是发生。下面是一个输入示例:
[[ do, re, fa, ti ],
[ do, re, mi ],
[ do, la, ti, za ],
[ mi, fa ],
[ re, so, za ]]
输入值没有任何固有顺序。它们实际上是诸如“创建符号链接”和“重新索引搜索”之类的消息。它们在单独的列表中排序,但无法仅查看第一个列表中的“fa”和第二个列表中的“mi”并确定哪个在另一个之前。
我希望能够接受该输入并生成所有事件的排序列表:
[ do, re, mi, fa, so, la, ti, za ]
或者更好的是,有关每个事件的一些信息,例如计数:
[ [do, 3], [re, 3], [mi, 2],
[fa, 2], [so, 1], [la, 1],
[ti, 1], [za, 2] ]
是否有我的名称正在做?有公认的算法吗?我用 Perl 写这个,如果这很重要的话,但伪代码就可以了。
我知道,鉴于我的示例输入,我可能无法保证“正确”的顺序。但我的真实输入有吨更多的数据点,而且我相信,通过一些技巧,它的正确率将达到 95%(这确实是我所需要的)。如果没有必要,我只是不想重新发明轮子。
I've got a set of lists of events. The events always happen in a given order, but not every event always happens. Here's an example input:
[[ do, re, fa, ti ],
[ do, re, mi ],
[ do, la, ti, za ],
[ mi, fa ],
[ re, so, za ]]
The input values don't have any inherent order. They're actually messages like "creating symlinks" and "reindexing search". They're sorted in the individual list, but there's no way to look at only 'fa' in the first list and 'mi' in the second and determine which comes before the other.
I'd like to be able to take that input and generate a sorted list of all events:
[ do, re, mi, fa, so, la, ti, za ]
or better yet, some information about each event, like a count:
[ [do, 3], [re, 3], [mi, 2],
[fa, 2], [so, 1], [la, 1],
[ti, 1], [za, 2] ]
Is there a name for what I'm doing? Are there accepted algorithms? I'm writing this in Perl, if that matters, but pseudocode will do.
I know that given my example input, I probably can't be guaranteed of the "right" order. But my real input has tons more datapoints, and I feel confident that with some cleverness it'll be 95% right (which is really all I need). I just don't want to re-invent the wheel if I don't have to.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
您可以使用
tsort
推断合理的(尽管不一定是唯一的)排序顺序(称为拓扑顺序)根据您观察到的顺序。您可能有兴趣阅读tsort
的原始用途,其结构与您的问题类似。请注意,
tsort
需要非循环图。就您的示例而言,这意味着您无法在一个序列中看到 do 后跟 re,而在另一个序列中看不到 re 后跟 do。由于您将数据描述为稀疏数据,因此上面的代码为
tsort
提供了有关事件邻接矩阵的尽可能多的信息。有了这些信息,计算直方图并对其组件进行排序就很简单了:
对于您提供的问题中的输入,输出是
这看起来很有趣,因为我们知道 solfège 的顺序,但是 re 和 la 在 偏序由
$events
定义:我们只知道它们必须都在do之后。You can use
tsort
to infer a reasonable—although not necessarily unique—sort order (known as a topological order) from the ordering you've observed. You may be interested in readingtsort
's original use, which is similar in structure to your problem.Note that
tsort
requires an acyclic graph. In terms of your example, this means you couldn't see do followed by re in one sequence and re followed by do in another.Because you described the data as sparse, the code above provides
tsort
with as much information as possible about the events' adjacency matrix.Having that information, computing a histogram and sorting its components is straightforward:
For the input in your question you provided, the output is
This looks funny because we know the order of solfège, but re and la are incomparable in the partial order defined by
$events
: we know only that they must both come after do.从理论上讲,我建议采用以下算法:
附注
这仅假设所有事件都按特定顺序发生(始终!)。如果情况并非如此,则问题变为 NP 完全问题。
聚苯硫醚
只是为了让你有一些有用的东西: Sort::Topological (不要知道它是否真的有效,但看起来是正确的)
Theoretically speaking, let me suggest the following algorithm:
PS
This is only assuming that all events occur in a specific order (always!). If that's not the case, the problem becomes NP-Complete.
PPS
And just so that you have something useful: Sort::Topological (don't know if it actually works but it seems right)
如果您不喜欢编写太多代码,则可以使用 unix 命令行实用程序
tsort
:它是示例输入中所有对的列表。这会产生输出:
这基本上就是你想要的。
If you're not into writing to much code, you could use the unix command-line utility
tsort
:Which is a list of all pairs in your sample input. This produces as output:
which is basically what you want.
使用哈希值进行聚合。
如果您想要的话,%out散列
可以轻松转换为列表。
Use a hash to aggregate.
Yields
The %out hash is easily converted into a list if that is what you want.
我能想到的最快的捷径。无论哪种方式,你都必须至少迭代一次......
Quickiest shortcut I can think of. Either way, you have to iterate through things at least once...
这是合并排序的完美候选者。请访问此处的维基百科页面,了解该算法的良好表示 http://en.wikipedia.org/ wiki/Merge_sort
您所描述的实际上是合并排序的子集/小调整。您不是从未排序的数组开始,而是想要将一组已排序的数组合并在一起。只需按照维基百科页面中所述对数组和合并函数的结果调用“合并”函数,直到获得单个数组(将被排序)。
要将输出调整为您想要的方式,您需要定义一个比较函数,该函数可以在一个事件小于、等于或大于另一个事件时返回。然后,当您的合并函数发现两个相等的事件时,您可以将它们折叠成一个事件并保留该事件的计数。
This is a perfect candidate for a Merge Sort. Go to the wikipedia page here for a pretty good representation of the algorithm http://en.wikipedia.org/wiki/Merge_sort
What you have described is actually a subset/small tweak of the merge sort. Instead of starting with an unsorted array, you have a set of sorted arrays that you want to merge together. Just call the "merge" function as described in the wikipedia page on pairs of your arrays and the results of the merge function until you have a single array (which will be sorted).
To tweak the output to the way you want, you'll need to define a comparison function that can return if one event is less than, equal to, or greater than a different event. Then, when your merge function finds two events that are equal, you can collapse them into a single event and keep a count for that event.
粗略地说,我给它起的名字是“散列”。您将事物放入名称值对中。如果你想保持某种表面上的顺序,你必须用一个保持顺序的数组来补充哈希。这个命令对我来说就是“遭遇命令”。
还有其他类似的答案,但我的包含了这种巧妙的自动生存技巧。
Roughly, the name I would give it is "hashing". You are putting things into name value pairs. If you want to keep some semblance of order, you have to supplement the hash with an array that keeps order. That order is "encounter order" for me.
There are other answers like this one, but mine contains this neat autovivification trick.
我也不太确定这会被称为什么,但我找到了一种方法来查找给定数组数组作为输入的顺序。本质上,伪代码是:
10 在所有数组中查找最早的项目
20 将其推入列表
30 从所有数组中删除该项目
40 如果还有剩余项目,则转到 10
这是一个工作原型:
输出:
do,re,mi,fa,la,ti,so,za
I'm not really sure what this would be called either, but I figured out a way to find the order given the array of arrays as an input. Essentially the pseudo-code is:
10 Find earliest item in all arrays
20 Push that onto a list
30 Remove that item from all arrays
40 Goto 10 if there are any items left
Here's a working prototype:
Output:
do,re,mi,fa,la,ti,so,za
刚刚意识到你的问题说他们没有预定的顺序,所以这可能不相关。
Perl 代码:
输出:
Just realized your question said their is no predetermined order, so this may not be relevent.
Perl code:
output:
解决方案:
这解决了提问者修改之前的原始问题。
如何检索信息:
基本上,我们在哈希中创建一个唯一的元素列表。对于每个元素,我们都有一个“属性”哈希,其中包含一个标量计数和一个位置数组。数组中元素的数量应根据元素在原始元素中出现的次数而变化。
标量属性并不是真正必要的,因为您始终可以采用
position
数组的标量来检索相同的数字。注意:如果您从数组中添加/删除元素,count
和position
的含义将不相关。print scalar @{$new{za}{position}};
将为您提供与print $new{za}{count};
相同的结果Solution:
This solves the original question before it was modified by the asker.
How to Retrieve Info:
Basically, we create a unique list of elements in the hash. For each of those elements we have a "property" hash, that contains a scalar
count
and an array for theposition
. The number of elements in the array should vary, based on how many occurrences of the element were in the original.The scalar property isn't really necessary since you could always take the scalar of the
position
array to retrieve the same number. Note: if you ever add/remove elements from the arraycount
andposition
will not be correlate in their meaning.print scalar @{$new{za}{position}};
will give you the same asprint $new{za}{count};