How can I efficiently find the unique cell arrays within a set of cell arrays in MATLAB?

Posted on 2024-10-22 11:36:57

I need to find only unique cell arrays within a set of cell arrays. For example, if this is my input:

I = {{'a' 'b' 'c' 'd' 'e'} ...
     {'a' 'b' 'c'} ...
     {'d' 'e'} ...
     {'a' 'b' 'c' 'd' 'e'} ...
     {'a' 'b' 'c' 'd' 'e'} ...
     {'a' 'c' 'e'}};

Then I would want my output to look like this:

I_unique = {{'a' 'b' 'c' 'd' 'e'} ...
            {'a' 'b' 'c'} ...
            {'d' 'e'} ...
            {'a' 'c' 'e'}};

Do you have any idea how to do this? The order of elements in the output doesn't matter, but efficiency does since the cell array I could be very large.

断舍离 2024-10-29 11:36:57

If your cells contain only sorted single characters then you can retain just the unique sequences using:

>> I = {{'a' 'b' 'c' 'd' 'e'} {'a' 'b' 'c'} {'d' 'e'} {'a' 'b' 'c' 'd' 'e'} {'a' 'b' 'c' 'd' 'e'} {'a' 'c' 'e'}};
>> I_unique = cellfun(@char, I, 'uniformoutput', 0);
>> I_unique = cellfun(@transpose, I_unique, 'uniformoutput', 0);
>> I_unique = unique(I_unique)

I_unique = 

    'abc'    'abcde'    'ace'    'de'

You can then split the resulting cells into single characters again:

>> I_unique = cellfun(@transpose, I_unique, 'uniformoutput', 0);
>> I_unique = cellfun(@cellstr, I_unique, 'uniformoutput', 0);
>> I_unique = cellfun(@transpose, I_unique, 'uniformoutput', 0);
>> I_unique{:}

ans = 

    'a'    'b'    'c'


ans = 

    'a'    'b'    'c'    'd'    'e'


ans = 

    'a'    'c'    'e'


ans = 

    'd'    'e'
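The conversion above relies on every string being a single character. The following is a minimal sketch of the same key-based idea for strings of any length, not part of the original answer. It assumes that no string contains the separator char(31), and that strjoin is available (it is in recent MATLAB releases). The names sep, keys and keep are illustrative.

```matlab
% Sketch: build one joined key per set, then deduplicate the keys.
% Assumption: no string in any set contains the separator char(31).
I = {{'a' 'b' 'c' 'd' 'e'}, {'a' 'b' 'c'}, {'d' 'e'}, ...
     {'a' 'b' 'c' 'd' 'e'}, {'a' 'b' 'c' 'd' 'e'}, {'a' 'c' 'e'}};

sep  = char(31);                                  % unit-separator character
keys = cellfun(@(c) strjoin(c(:).', sep), I, ...  % one char row vector per set
               'UniformOutput', false);
[~, keep] = unique(keys);                         % one index per distinct key
I_unique  = I(sort(keep));                        % one representative set per key
```

Indexing the original I with keep avoids having to split the keys back into cell arrays afterwards. Whether this is faster than explicit loops depends on the data, so it is worth timing both on a realistic input.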

江湖正好 2024-10-29 11:36:57

EDIT: Updated to use a more efficient algorithm.

If efficiency is paramount because I contains a large number of sets, then your best option is probably to write your own optimized loops. This problem is similar to a previous question about how to efficiently remove sets that are subsets of, or equal to, another set. The difference here is that you only care about removing duplicates, not subsets, so the code in my answer to that question can be modified to further reduce the number of comparisons.

First we can recognize that there's no point in comparing sets that have different numbers of elements, since they can't possibly match in that case. So, the first step is to count the number of strings in each set, then loop over each group of sets that have the same number of strings.
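As a small illustration of this grouping step on the sample data, here is a sketch that collects the indices of the sets sharing each string count. The variable names sizes, group and members are illustrative and do not appear in the answer's code.

```matlab
% Sketch: group the sets in I by how many strings each one contains.
I = {{'a' 'b' 'c' 'd' 'e'}, {'a' 'b' 'c'}, {'d' 'e'}, ...
     {'a' 'b' 'c' 'd' 'e'}, {'a' 'b' 'c' 'd' 'e'}, {'a' 'c' 'e'}};

nStrings = cellfun(@numel, I);          % strings per set: [5 3 2 5 5 3]
[sizes, ~, group] = unique(nStrings);   % distinct counts: [2 3 5]
group = group(:).';                     % force a row vector of group labels

members = cell(1, numel(sizes));
for g = 1:numel(sizes)
  members{g} = find(group == g);        % indices into I with sizes(g) strings
end
% members{1} is 3, members{2} is [2 6], members{3} is [1 4 5]
```

Only sets listed in the same entry of members ever need to be compared with each other, which is where the saving in comparisons comes from.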

For each of these groups, we will have two nested loops: an outer loop over each set, starting at the end of the group, and an inner loop over every set preceding that one. As soon as a match is found, we can mark the current set as "not unique" and break out of the inner loop to avoid extra comparisons. Because each set is compared only with the sets before it, the first occurrence of a duplicated set is always the one that is kept, so the sets in I_unique maintain their original order of appearance in I.

And here is the resulting code:

I = {{'a' 'b' 'c' 'd' 'e'} ...  %# The sample cell array of cell arrays of
     {'a' 'b' 'c'} ...          %#   strings from the question
     {'d' 'e'} ...
     {'a' 'b' 'c' 'd' 'e'} ...
     {'a' 'b' 'c' 'd' 'e'} ...
     {'a' 'c' 'e'}};
nSets = numel(I);                    %# The number of sets
nStrings = cellfun('prodofsize',I);  %# The number of strings per set
uniqueIndex = true(1,nSets);         %# A logical index of unique elements

for currentSize = unique(nStrings(:)).'  %# Loop over each unique number of
                                         %#   strings (forced to a row vector
                                         %#   so a column-shaped I also works)

  subIndex = find(nStrings == currentSize);  %# Get the subset of I with the
  subSet = I(subIndex);                      %#   given number of strings

  for currentIndex = numel(subSet):-1:2      %# Outer loop
    for compareIndex = 1:currentIndex-1      %# Inner loop
      if isequal(subSet{currentIndex},subSet{compareIndex})  %# Check equality
        uniqueIndex(subIndex(currentIndex)) = false;  %# Mark as "not unique"
        break                                %# Break the inner loop
      end
    end
  end

end

I_unique = I(uniqueIndex);  %# Get the unique values
