对可迭代的大型笛卡尔积进行随机采样

发布于 2025-01-11 06:25:37 字数 1092 浏览 3 评论 0原文

我有多个可迭代对象，我需要创建这些可迭代对象的笛卡尔积，然后从生成的元组池中随机采样。问题是这些可迭代的组合总数约为 1e19 左右，所以我不可能将所有这些加载到内存中。

我的想法是使用 itertools.product 与随机数生成器结合来跳过随机数量的项目，然后一旦我到达随机选择的项目，我就会执行计算并继续，直到用完发电机。计划是做类似的事情：

from itertools import product
from random import randint

iterables = () # tuple of 18 iterables
versions = product(iterables)

def do_stuff():
    # do stuff

STEP_SIZE = int(1e6)

# start both counts from 0. 
# First value to be taken is start + step
# after that increment start to be equal to count and repeat
start = 0
count = 0

while True:
    try:
        step = randint(1, 100) * STEP_SIZE

        for v in versions:
            # if the count is less than required skip values while incrementing count
            if count < start + step:
                versions.next()
                count += 1
            else:
                do_stuff(*v)
                start = count             
    except StopIteration:
        break

不幸的是，itertools.product对象没有next()方法，所以这不起作用。还有什么其他方法可以遍历如此大量的元组并随机采样或直接对值进行计算？

原文

I have multiple iterables and I need to create the Cartesian product of those iterables and then randomly sample from the resulting pool of tuples. The problem is that the total number of combinations of these iterables is somewhere around 1e19, so I can't possibly load all of this into memory.

What I thought was using itertools.product in combination with a random number generator to skip random number of items, then once I get to the randomly selected item, I perform my calculations and continue until I run out of the generator. The plan was to do something like:

from itertools import product
from random import randint

iterables = () # tuple of 18 iterables
versions = product(iterables)

def do_stuff():
    # do stuff

STEP_SIZE = int(1e6)

# start both counts from 0. 
# First value to be taken is start + step
# after that increment start to be equal to count and repeat
start = 0
count = 0

while True:
    try:
        step = randint(1, 100) * STEP_SIZE

        for v in versions:
            # if the count is less than required skip values while incrementing count
            if count < start + step:
                versions.next()
                count += 1
            else:
                do_stuff(*v)
                start = count             
    except StopIteration:
        break

Unfortunately, itertools.product objects don't have the next() method, so this doesn't work. What other way is there to go through this large number of tuples and either take a random sample or directly run calculations on the values?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

烟燃烟灭 2025-01-18 06:25:37

不要尝试生成笛卡尔积。使用random.choice()一次从一个迭代中采样以生成结果。所有可迭代对象的元素数量都很小，因此您可以将所有元素直接存储在内存中。

下面是一个使用 18 个迭代器的示例，每个迭代器有 10 个元素（如注释中指定）：

import random

iterables = [list(range(i, i + 10)) for i in range(0, 180, 10)]
result = [random.choice(iterable) for iterable in iterables]
print(result)

Don't try to generate the Cartesian product. Sample from one iterable at a time to generate your result using random.choice(). The number of elements across all iterables is small, so you can store all the elements in memory directly.

Here's an example using 18 iterables with 10 elements each (as specified in the comment):

import random

iterables = [list(range(i, i + 10)) for i in range(0, 180, 10)]
result = [random.choice(iterable) for iterable in iterables]
print(result)

回复收藏 0 原文

睡美人的小仙女 2025-01-18 06:25:37

您使用的是哪个版本的 Python？在此过程中，.next() 方法被弃用，取而代之的是新的 next() 内置函数。这对所有迭代器都适用。例如，在当前发布的 3.10.1 下：

>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
>>> next(itp)
(0, 0, 0, 2, 0, 4)

除此之外，您没有向我们展示代码中最重要的部分：如何构建产品。

如果没有看到这一点，我只能猜测，从传递给 product() 的第一个序列中进行随机选择，然后从第二个序列中进行随机选择，效率会远，等等。一次从一个组件构建产品的随机元素。

有效地选择随机乘积元组

也许有些过分了，但是这个类展示了一种特别有效的方法来做到这一点。 .index() 方法将整数 i 映射到乘积中的第 i 个元组（从 0 开始）。然后从产品中选择一个随机元组只需将 .index() 应用于范围（产品中元素的总数） 中的随机整数。

from math import prod
from random import randrange

class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))

    def index(self, i):
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))

    def pickran(self):
        return self.index(randrange(self.n))

进而

>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)

Which version of Python are you using? Somewhere along the way .next() methods were deprecated in favor a new next() built-in function. That works fine with all iterators. Here, for example, under the current released 3.10.1:

>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
>>> next(itp)
(0, 0, 0, 2, 0, 4)

Beyond that, you didn't show us the most important part of your code: how you build your product.

Without seeing that, I can only guess that it would be far more efficient to make a random choice from the first sequence passed to product(), then another from the second, and so on. Build a random element of the product from one component at a time.

Picking a random product tuple efficiently

Perhaps overkill, but this class shows an especially efficient way to do this. The .index() method maps an integer i to the i'th tuple (0-based) in the product. Then picking a random tuple from the product is simply applying .index() to a random integer in range(total number of elements in the product).

from math import prod
from random import randrange

class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))

    def index(self, i):
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))

    def pickran(self):
        return self.index(randrange(self.n))

and then

>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)

回复收藏 0 原文

~没有更多了~