合并共享公共元素的列表
我的输入是一个列表列表。其中一些具有共同的元素,例如。
L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
我需要合并共享公共元素的所有列表,并重复此过程,只要没有更多列表具有相同的项目。我考虑过使用布尔运算和 while 循环,但无法想出一个好的解决方案。
最终结果应该是:
L = [['a','b','c','d','e','f','g','o','p'],['k']]
My input is a list of lists. Some of them share common elements, eg.
L = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
I need to merge all lists, that share a common element, and repeat this procedure as long as there are no more lists with the same item. I thought about using boolean operations and a while loop, but couldn't come up with a good solution.
The final result should be:
L = [['a','b','c','d','e','f','g','o','p'],['k']]
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(15)
您可以将列表视为图表的符号,即
['a','b','c']
是一个包含 3 个节点相互连接的图表。您试图解决的问题是找到此图中的连接组件。您可以使用 NetworkX 来实现此目的,它的优点是几乎可以保证它是正确的:
要有效地解决这个问题无论如何,你自己都必须将列表转换成图形化的东西,所以你最好从一开始就使用networkX。
You can see your list as a notation for a Graph, ie
['a','b','c']
is a graph with 3 nodes connected to each other. The problem you are trying to solve is finding connected components in this graph.You can use NetworkX for this, which has the advantage that it's pretty much guaranteed to be correct:
To solve this efficiently yourself you have to convert the list into something graph-ish anyways, so you might as well use networkX from the start.
算法:
因此您可能想使用集合而不是列表。下面的程序应该可以做到这一点。
Algorithm:
So you might want to use sets instead of list. The following program should do it.
我需要对相当大的列表执行 OP 描述的聚类技术数百万次,因此想要确定上面建议的哪种方法最准确且性能最高。
我对上述每种方法的大小从 2^1 到 2^10 的输入列表进行了 10 次试验,每种方法使用相同的输入列表,并测量了上面提出的每种算法的平均运行时间(以毫秒为单位)。以下是结果:
这些结果帮助我发现,在始终返回正确结果的方法中,@jochen 的方法是最快的。在那些不能始终返回正确结果的方法中,mak 的解决方案通常不包含所有输入元素(即缺少列表成员列表),并且 braaksma、cmangla 和 asterisk 的解决方案不能保证最大程度地合并。
有趣的是,迄今为止,两种最快、正确的算法获得了最多的赞成票,并且按正确的排名顺序排列。
这是用于运行测试的代码:
以及用于绘图的代码:
I needed to perform the clustering technique described by the OP millions of times for rather large lists, and therefore wanted to determine which of the methods suggested above is both most accurate and most performant.
I ran 10 trials for input lists sized from 2^1 through 2^10 for each method above, using the same input list for each method, and measured the average runtime for each algorithm proposed above in milliseconds. Here are the results:
These results helped me see that of the methods that consistently return correct results, @jochen's is the fastest. Among those methods that don't consistently return correct results, mak's solution often does not include all of the input elements (i.e. list of list members are missing), and the solutions of braaksma, cmangla, and asterisk are not guaranteed to be maximally merged.
It's interesting that the two fastest, correct algorithms have the top two amount of upvotes to date, in properly ranked order.
Here's the code used to run the tests:
And for plotting:
我遇到了尝试合并具有共同值的列表的相同问题。这个例子可能就是您正在寻找的。
它只循环列表一次并更新结果集。
#
I came across the same issue of trying to merge down lists with common values. This example may be what you are looking for.
It only loops over lists once and updates resultset as it goes.
#
我发现 itertools 是合并列表的快速选项,它为我解决了这个问题:
对于大型集合,按频率从最常见的元素到最少的元素对 LL 进行排序可以加快速度
I have found itertools a fast option for merging lists and it solved this problem for me:
For large sets sorting LL by frequency from the most common elements to the least can speed things up a bit
我认为这可以通过将问题建模为图表来解决。每个子列表都是一个节点,并且仅当两个子列表具有某些共同元素时才与另一个节点共享边。因此,合并的子列表基本上是图中的连接组件。合并所有组件只需找到所有连接的组件并列出它们即可。
这可以通过对图的简单遍历来完成。 BFS 和 DFS 可以使用,但我在这里使用 DFS,因为它对我来说有点短。
I think this can be solved by modelling the problem as a graph. Each sublist is a node and shares an edge with another node only if the two sublists have some element in common. Thus, a merged sublist is basically a connected component in the graph. Merging all of them is simply a matter of finding all connected components and listing them.
This can be done by a simple traversal over the graph. Both BFS and DFS can be used, but I'm using DFS here since it is somewhat shorter for me.
正如 Jochen Ritzel 指出的,您正在寻找图中的连通分量。以下是在不使用图形库的情况下实现它的方法:
As Jochen Ritzel pointed out you are looking for connected components in a graph. Here is how you could implement it without using a graph library:
您可以使用networkx库,因为它是图论和连接组件问题:
输出:
You can use networkx library because is a graph theory and connected components problem:
Output:
我怀念一个非古怪的版本。我在 2018 年(7 年后)发布了它
简单且不稳定方法:
1)使笛卡尔积(交叉连接)合并两个共同的 if 元素
2)删除重复项
I miss a non quirurgic version. I post it on 2018 (7 years later)
An easy and understable approach:
1) make cartesian product ( cross join ) merging both if elements in common
2) remove dups
我的尝试。具有实用的外观。
My attempt. Has functional look to it.
这是一个相当快的解决方案,没有依赖性。它的工作原理如下:
为每个 subsists 分配一个唯一的参考号(在本例中为子列表的初始索引)
为每个子列表以及每个子列表中的每个项目创建引用元素的字典。
重复以下过程,直到没有任何变化:
3a。浏览每个子列表中的每个项目。如果该项目的当前引用编号与其子列表的引用编号不同,则该元素必须是两个列表的一部分。合并两个列表(从引用中删除当前子列表)并将当前子列表中所有项目的引用编号设置为新子列表的引用编号。
当此过程没有引起任何变化时,这是因为所有元素都属于一个列表。由于工作集的大小在每次迭代中都会减小,因此算法必然终止。
以下是对此代码的一组测试:
请注意,返回值是一个集合列表。
This is a fairly fast solution with no dependencies. It works as follows:
Assign a unique reference number to each of your subsists (in this case, the initial index of the sublist)
Create a dictionary of the reference elements for each sublist, and for each item in each sublist.
Repeat the following procedure until it causes no changes:
3a. Go through each item in each sublist. If that item's current reference number is different from the reference number of its sublist, then the element must be part of two lists. Merge the two lists (removing the current sublist from the reference) and set the reference number of all items in the current sublist to be the reference number of the new sublist.
When this procedure causes no changes, it is because all elements are part of exactly one list. Since working set is decreasing in size on every iteration, the algorithm necessarily terminates.
Here are a set of tests for this code:
Note that the return value is a list of sets.
在不太了解你想要什么的情况下,我决定猜测你的意思:我想只找到每个元素一次。
输出看起来像:
Without knowing quite what you want, I decided to just guess you meant: I want to find every element just once.
Output looks like:
这可能是一个更简单/更快的算法,并且似乎运行良好 -
This is perhaps a simpler/faster algorithm and seems to work well -
简单来说,你可以使用快速查找。
关键是要使用两个临时列表。
第一个称为元素,它存储所有组中存在的所有元素。
第二个名为标签。我从sklearn的kmeans算法中得到了灵感。 “labels”存储元素的标签或质心。这里我简单地让簇中的第一个元素作为质心。最初,这些值从 0 到 length-1,按升序排列。
对于每个组,我在“元素”中获取他们的“索引”。
然后,我根据索引获取组标签。
我计算标签的最小值,这将是它们的新标签。
我将组标签中带有标签的所有元素替换为新标签。
或者说,对于每次迭代,
我尝试合并两个或多个现有小组。
如果该组的标签为 0 和 2
我找到了新标签 0,这是两者中的最小值。
我将它们替换为 0。
我打印出了您问题的详细结果:
请有关详细信息,请参阅我的github jupyter笔记本
Simply speaking, you can use quick-find.
The key is to use two temporary list.
The first is called elements, which stores all elements existing in all groups.
The second is named labels. I got the inspiration from sklearn's kmeans algorithm. 'labels' stores the labels or the centroids of the elements. Here I simply let the first element in the cluster be the centroid. Initially, the values are from 0 to length-1, ascendingly.
For each group, I get their 'indices' in 'elements'.
I then get the labels for group according to indices.
And I calculate the min of the labels, that will be the new label for them.
I replace all elements with labels in labels for group with the new label.
Or to say, for each iteration,
I try to combine two or more existing groups.
If the group has labels of 0 and 2
I find out the new label 0, that is the min of the two.
I than replace them with 0.
I printed out the detailed result for your question:
Please refer to my github jupyter notebook for details
这是我的答案。
Here's my answer.