在 Smalltalk 中使用其他集合选择对象的最快方法

发布于 2024-11-27 07:36:59 字数 2930 浏览 0 评论 0原文

给定两个集合:

srcCollection := #('Lorem' 'ipsum' 'dolor' 'sit' 'amet,' 'consectetur' 'adipisicing' 'elit,' 'sed' 'do' 'eiusmod' 'tempor' 'incididunt' 'ut' 'labore' 'et' 'dolore' 'magna' 'aliqua.' 'Ut' 'enim' 'ad' 'minim' 'veniam,' 'quis' 'nostrud' 'exercitation' 'ullamco' 'laboris' 'nisi' 'ut' 'aliquip' 'ex' 'ea' 'commodo' 'consequat.' 'Duis' 'aute' 'irure' 'dolor' 'in' 'reprehenderit' 'in' 'voluptate' 'velit' 'esse' 'cillum' 'dolore' 'eu' 'fugiat' 'nulla' 'pariatur.' 'Excepteur' 'sint' 'occaecat' 'cupidatat' 'non' 'proident,' 'sunt' 'in' 'culpa' 'qui' 'officia' 'deserunt' 'mollit' 'anim' 'id' 'est' 'laborum').      
objCollection := #('Lorem' 'numquam' 'eius' 'modi' 'tempora' 'incidunt' 'ut' 'labore' 'et' 'dolore' 'magnam' 'aliquam' 'ipsum' 'dolor' 'ex' 'ea' 'commodo' 'consequat.' 'Duis' 'aute' 'irure' 'dolor' 'in' 'reprehenderit' 'in' 'voluptate' 'velit' 'esse' 'cillum' 'dolore' 'eu' 'fugiat' 'nulla' 'pariatur.' 'Excepteur' 'sint' 'occaecat' 'cupidatat' 'non' 'proident,' 'sunt' 'in' 'culpa' 'qui' 'officia' 'deserunt' 'mollit' 'anim' 'id' 'est' 'laborum' 'Sed' 'ut' 'perspiciatis' 'unde' 'omnis' 'iste' 'natus' 'error' 'sit' 'voluptatem' 'accusantium' 'doloremque' 'laudantium,' 'totam' 'rem' 'aperiam,' 'eaque' 'ipsa' 'quae' 'ab' 'illo' 'inventore' 'veritatis' 'et' 'quasi' 'architecto' 'sit' 'amet,' 'consectetur' 'adipisicing' 'elit,' 'sed' 'do' 'eiusmod' 'tempor' 'incididunt' 'ut' 'labore' 'et' 'dolore' 'magna' 'aliqua.' 'Ut' 'enim' 'ad' 'minim' 'veniam,' 'quis' 'nostrud' 'exercitation' 'ullamco' 'laboris' 'nisi' 'ut' 'aliquip' 'beatae' 'vitae' 'dicta' 'sunt' 'explicabo.' 'Nemo' 'enim' 'ipsam' 'voluptatem' 'quia' 'voluptas' 'sit' 'aspernatur' 'aut' 'odit' 'aut' 'fugit,' 'sed' 'quia' 'consequuntur' 'magni' 'dolores' 'eos' 'qui' 'ratione' 'voluptatem' 'sequi' 'nesciunt.' 'Neque' 'porro' 'quisquam' 'est,' 'qui' 'dolorem' 'ipsum' 'quia' 'dolor' 'sit' 'amet,' 'consectetur,' 'adipisci' 'velit,' 'sed' 'quia' 'non' 'quaerat' 'voluptatem.' 'Ut' 'enim' 'ad' 'minima' 'veniam,' 'quis' 'nostrum' 'exercitationem' 'ullam' 'corporis' 'suscipit' 'laboriosam,' 'nisi' 'ut' 'aliquid' 'ex' 'ea' 'commodi' 'consequatur?' 'Quis' 'autem' 'vel' 'eum' 'iure' 'reprehenderit' 'qui' 'in' 'ea' 'voluptate' 'velit' 'esse' 'quam' 'nihil' 'molestiae' 'consequatur,' 'vel' 'illum' 'qui' 'dolorem' 'eum' 'fugiat' 'quo' 'voluptas' 'nulla' 'pariatur?').

其中 objCollection 保证包含 srcCollection 中的所有元素。注意:在我的应用程序中,objCollection 实际上是复杂的对象,包含这些字符串作为标识符,没有重复项。

我一直在测量并尝试优化选择 objCollection 中也在 srcCollection 中的所有对象。以下时间以毫秒为单位,在具有 Stack VM 的 Pharo 1.2 和具有 2Gb 内存的 Windows XP 中使用 [ 1000 timesRepeat: [ ... ] ] timeToRun 。这些是我的尝试:

objCollection intersection: srcCollection
7537
7507

objCollection select: [: str | srcCollection includes: str ]
7471
7507

srcCollection collect: [: str | objCollection detect: [: obj | obj = str ] ]
4227
4323

有没有更快的方法?

Given two collections:

srcCollection := #('Lorem' 'ipsum' 'dolor' 'sit' 'amet,' 'consectetur' 'adipisicing' 'elit,' 'sed' 'do' 'eiusmod' 'tempor' 'incididunt' 'ut' 'labore' 'et' 'dolore' 'magna' 'aliqua.' 'Ut' 'enim' 'ad' 'minim' 'veniam,' 'quis' 'nostrud' 'exercitation' 'ullamco' 'laboris' 'nisi' 'ut' 'aliquip' 'ex' 'ea' 'commodo' 'consequat.' 'Duis' 'aute' 'irure' 'dolor' 'in' 'reprehenderit' 'in' 'voluptate' 'velit' 'esse' 'cillum' 'dolore' 'eu' 'fugiat' 'nulla' 'pariatur.' 'Excepteur' 'sint' 'occaecat' 'cupidatat' 'non' 'proident,' 'sunt' 'in' 'culpa' 'qui' 'officia' 'deserunt' 'mollit' 'anim' 'id' 'est' 'laborum').      
objCollection := #('Lorem' 'numquam' 'eius' 'modi' 'tempora' 'incidunt' 'ut' 'labore' 'et' 'dolore' 'magnam' 'aliquam' 'ipsum' 'dolor' 'ex' 'ea' 'commodo' 'consequat.' 'Duis' 'aute' 'irure' 'dolor' 'in' 'reprehenderit' 'in' 'voluptate' 'velit' 'esse' 'cillum' 'dolore' 'eu' 'fugiat' 'nulla' 'pariatur.' 'Excepteur' 'sint' 'occaecat' 'cupidatat' 'non' 'proident,' 'sunt' 'in' 'culpa' 'qui' 'officia' 'deserunt' 'mollit' 'anim' 'id' 'est' 'laborum' 'Sed' 'ut' 'perspiciatis' 'unde' 'omnis' 'iste' 'natus' 'error' 'sit' 'voluptatem' 'accusantium' 'doloremque' 'laudantium,' 'totam' 'rem' 'aperiam,' 'eaque' 'ipsa' 'quae' 'ab' 'illo' 'inventore' 'veritatis' 'et' 'quasi' 'architecto' 'sit' 'amet,' 'consectetur' 'adipisicing' 'elit,' 'sed' 'do' 'eiusmod' 'tempor' 'incididunt' 'ut' 'labore' 'et' 'dolore' 'magna' 'aliqua.' 'Ut' 'enim' 'ad' 'minim' 'veniam,' 'quis' 'nostrud' 'exercitation' 'ullamco' 'laboris' 'nisi' 'ut' 'aliquip' 'beatae' 'vitae' 'dicta' 'sunt' 'explicabo.' 'Nemo' 'enim' 'ipsam' 'voluptatem' 'quia' 'voluptas' 'sit' 'aspernatur' 'aut' 'odit' 'aut' 'fugit,' 'sed' 'quia' 'consequuntur' 'magni' 'dolores' 'eos' 'qui' 'ratione' 'voluptatem' 'sequi' 'nesciunt.' 'Neque' 'porro' 'quisquam' 'est,' 'qui' 'dolorem' 'ipsum' 'quia' 'dolor' 'sit' 'amet,' 'consectetur,' 'adipisci' 'velit,' 'sed' 'quia' 'non' 'quaerat' 'voluptatem.' 'Ut' 'enim' 'ad' 'minima' 'veniam,' 'quis' 'nostrum' 'exercitationem' 'ullam' 'corporis' 'suscipit' 'laboriosam,' 'nisi' 'ut' 'aliquid' 'ex' 'ea' 'commodi' 'consequatur?' 'Quis' 'autem' 'vel' 'eum' 'iure' 'reprehenderit' 'qui' 'in' 'ea' 'voluptate' 'velit' 'esse' 'quam' 'nihil' 'molestiae' 'consequatur,' 'vel' 'illum' 'qui' 'dolorem' 'eum' 'fugiat' 'quo' 'voluptas' 'nulla' 'pariatur?').

where objCollection is guaranteed to contain all elements in srcCollection. Note: In my application objCollection are actually complex objects containing these strings as identifiers with no duplicates.

I've been measuring and trying to optimize selecting all the objects in objCollection that are also in srcCollection. The times below are in milliseconds using [ 1000 timesRepeat: [ ... ] ] timeToRun in Pharo 1.2 with Stack VM and Windows XP with 2Gb of memory. These are my attempts:

objCollection intersection: srcCollection
7537
7507

objCollection select: [: str | srcCollection includes: str ]
7471
7507

srcCollection collect: [: str | objCollection detect: [: obj | obj = str ] ]
4227
4323

Is there a faster way?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

秋风の叶未落 2024-12-04 07:36:59

前两个做同样的事情:Collection>>> #intersection: 的实现是self select: [:each | aCollection 包括:每个]

收藏>> #intersection: 最终使用 self anySatisfy: [:x | x = mySearchObj] 来完成其工作,即使用 #do: 迭代集合。 #detect: 最终会做同样的事情。

我怀疑您所看到的差异并不是因为三者中的任何一个比另一个更有效,而是垃圾收集等事物的产物。

鉴于此,我会选择 #intersection: 因为它的语义清晰。它说的是你想要什么,而不是其他两个,你只能看到如何获得你想要的东西,并且必须推断/推断出意图。

The first two do the same thing: Collection >> #intersection:'s implementation is self select: [:each | aCollection includes: each].

Collection >> #intersection: ultimately uses self anySatisfy: [:x | x = mySearchObj ] to do its job, which iterates through the collection using #do:. #detect: ultimately does the same thing.

I suspect the difference you're seeing's not because any of the three is more efficient than the other, but rather a product of things like garbage collection.

Given that, I'd choose #intersection: because of its semantic clarity. It says what you want, rather than the other two, where you only see how you get what you want and have to infer/deduce the intent.

怀里藏娇 2024-12-04 07:36:59

如果您有能力将 asSet 添加到这两个集合中,您可能会做得更快。我从 1480 毫秒变成了 197 毫秒。

If you can afford to add asSet to the two collections, you might do faster. I went from 1480 to 197 ms.

木落 2024-12-04 07:36:59

两个集合都有重复项。

srcCollection size ~= srcCollection asSet size.
objCollection size ~= objCollection asSet size.

如果您想处理重复项,假设您有一个可用的总订单 <<另一种可能性是使用这种方法(成本 n1*log(n1) + n2*log(n2) + n1 + n2 而不是 n1*n2 进行朴素交集)

Collection>>sortedIntersection: aCollection
    "Answer the intersection of two collections, sorted by < and accounting duplicates."
    | intersection obj objStream src srcStream |
    srcStream := self sorted readStream.
    objStream := aCollection sorted readStream.
    intersection := (Array new: self size) writeStream.
    [srcStream atEnd | objStream atEnd] whileFalse:
        [src := srcStream next.
        obj := objStream next.
        [src = obj] whileFalse:
            [[src < obj] whileTrue: [srcStream atEnd ifTrue: [^intersection contents]. src := srcStream next].
            [obj < src] whileTrue: [objStream atEnd ifTrue: [^intersection contents]. obj := objStream next]].
        intersection nextPut: src].
    ^intersection contents

这也可以使用堆,并删除第一个元素,但速度很慢。

heapSortedIntersection: aCollection
    "Answer the intersection of two collections, sorted by < and accounting duplicates."

    | intersection obj src objHeap srcHeap |
    srcHeap := Heap withAll: self.
    objHeap := Heap withAll: aCollection.
    intersection := (Array new: self size) writeStream.
    [srcHeap isEmpty | objHeap isEmpty] whileFalse:
        [src := srcHeap removeFirst.
        obj := objHeap removeFirst.
        [src = obj] whileFalse:
            [[src < obj] whileTrue: [srcHeap isEmpty ifTrue: [^intersection contents]. src := srcHeap removeFirst].
            [obj < src] whileTrue: [objHeap isEmpty ifTrue: [^intersection contents]. obj := objHeap removeFirst]].
        intersection nextPut: src].
    ^intersection contents

最后,如果您没有可用的总订单,您可以简单地使用 Bag,但仍会考虑重复项

bagIntersection: aCollection
    "Answer the intersection of two collections, accounting duplicates."

    | objBag absentTag |
    objBag := Bag withAll: aCollection.
    absentTag := Object new.
    ^self reject: [:each | (objBag remove: each ifAbsent: [absentTag]) == absentTag]

Both collections have duplicates.

srcCollection size ~= srcCollection asSet size.
objCollection size ~= objCollection asSet size.

If you want to handle duplicates, assuming you have a total order available via < another possibility is to use this method (cost n1*log(n1) + n2*log(n2) + n1 + n2 instead of n1*n2 for naive intersection)

Collection>>sortedIntersection: aCollection
    "Answer the intersection of two collections, sorted by < and accounting duplicates."
    | intersection obj objStream src srcStream |
    srcStream := self sorted readStream.
    objStream := aCollection sorted readStream.
    intersection := (Array new: self size) writeStream.
    [srcStream atEnd | objStream atEnd] whileFalse:
        [src := srcStream next.
        obj := objStream next.
        [src = obj] whileFalse:
            [[src < obj] whileTrue: [srcStream atEnd ifTrue: [^intersection contents]. src := srcStream next].
            [obj < src] whileTrue: [objStream atEnd ifTrue: [^intersection contents]. obj := objStream next]].
        intersection nextPut: src].
    ^intersection contents

That would also work using a Heap, and removing first element, but slowly.

heapSortedIntersection: aCollection
    "Answer the intersection of two collections, sorted by < and accounting duplicates."

    | intersection obj src objHeap srcHeap |
    srcHeap := Heap withAll: self.
    objHeap := Heap withAll: aCollection.
    intersection := (Array new: self size) writeStream.
    [srcHeap isEmpty | objHeap isEmpty] whileFalse:
        [src := srcHeap removeFirst.
        obj := objHeap removeFirst.
        [src = obj] whileFalse:
            [[src < obj] whileTrue: [srcHeap isEmpty ifTrue: [^intersection contents]. src := srcHeap removeFirst].
            [obj < src] whileTrue: [objHeap isEmpty ifTrue: [^intersection contents]. obj := objHeap removeFirst]].
        intersection nextPut: src].
    ^intersection contents

Finally, if you haven't a total order available, you can simply use a Bag, still accounting for duplicates

bagIntersection: aCollection
    "Answer the intersection of two collections, accounting duplicates."

    | objBag absentTag |
    objBag := Bag withAll: aCollection.
    absentTag := Object new.
    ^self reject: [:each | (objBag remove: each ifAbsent: [absentTag]) == absentTag]
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文