Better way to shuffle two numpy arrays in unison
I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
    import numpy

    def shuffle_in_unison(a, b):
        assert len(a) == len(b)
        # Allocate output arrays with the same shape and dtype as the inputs.
        shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
        shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
        permutation = numpy.random.permutation(len(a))
        # Move each element of both arrays to the same new position.
        for old_index, new_index in enumerate(permutation):
            shuffled_a[new_index] = a[old_index]
            shuffled_b[new_index] = b[old_index]
        return shuffled_a, shuffled_b
For example:
    >>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
    >>> b = numpy.asarray([1, 2, 3])
    >>> shuffle_in_unison(a, b)
    (array([[2, 2],
           [1, 1],
           [3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
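    def shuffle_in_unison_scary(a, b):
        # Save the RNG state, shuffle a, then restore the state so that
        # shuffling b draws the same random numbers.
        rng_state = numpy.random.get_state()
        numpy.random.shuffle(a)
        numpy.random.set_state(rng_state)
        numpy.random.shuffle(b)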
This works... but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.
Comments (18)
In my opinion, the shortest and easiest way is to use a seed:
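A minimal sketch of the seed idea, assuming NumPy's legacy global functions `numpy.random.seed` and `numpy.random.shuffle`; the function name `shuffle_with_seed` and the seed value are made up for illustration and are not from the original post:

```python
import numpy as np

def shuffle_with_seed(a, b, seed):
    """Shuffle a and b in place, re-seeding so both see the same permutation."""
    assert len(a) == len(b)
    np.random.seed(seed)   # fix the global RNG state
    np.random.shuffle(a)   # shuffles along the first axis, in place
    np.random.seed(seed)   # rewind to the same state
    np.random.shuffle(b)   # same length, so the same random draws are used

a = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
b = np.array([1, 2, 3, 4])
shuffle_with_seed(a, b, seed=42)
```

Like the "scary" version in the question, this relies on `shuffle` consuming the same random numbers for any two arrays of equal length, and it overwrites the global seed as a side effect.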
With an example, this is what I'm doing:
I extended python's random.shuffle() to take a second arg:
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
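One way such a two-argument shuffle might look, assuming plain Python sequences such as lists; the name `shuffle_together` is hypothetical, and this is a sketch rather than the poster's actual function:

```python
import random

def shuffle_together(x, y):
    """Fisher-Yates shuffle applied to two sequences at once, in place."""
    assert len(x) == len(y)
    for i in reversed(range(1, len(x))):
        j = random.randrange(i + 1)   # random index in [0, i]
        x[i], x[j] = x[j], x[i]       # apply the same swap to both sequences
        y[i], y[j] = y[j], y[i]

x = ["a", "b", "c", "d"]
y = [1, 2, 3, 4]
shuffle_together(x, y)
```

The tuple swap is safe for lists and 1-D arrays. For multi-dimensional NumPy arrays, `x[i]` is a view rather than a copy, so this swap idiom can silently overwrite rows there.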
Just use numpy... First merge the two input arrays (the 1D array is the labels y and the 2D array is the data x), then shuffle them with NumPy's shuffle method. Finally, split them and return.
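The merge, shuffle, and split steps could be sketched as follows. The helper name `shuffle_merged` is an assumption, and `np.column_stack` is just one of several ways to do the merge:

```python
import numpy as np

def shuffle_merged(x, y):
    """Return shuffled copies of x (2-D data) and y (1-D labels)."""
    assert len(x) == len(y)
    merged = np.column_stack((x, y))   # shape (n, n_features + 1); copies the data
    np.random.shuffle(merged)          # shuffles whole rows in place
    return merged[:, :-1], merged[:, -1]

x = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([1, 2, 3])
x_shuffled, y_shuffled = shuffle_merged(x, y)
```

Note that the merge makes a copy and casts both inputs to a common dtype, so this variant is not in-place with respect to the original arrays.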
You can use NumPy's array indexing:
This will result in creation of separate unison-shuffled arrays.
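A short sketch of the indexing approach, assuming a single permutation drawn with `numpy.random.permutation`; the function name is illustrative:

```python
import numpy as np

def unison_shuffled_copies(a, b):
    """Return copies of a and b reordered by the same random permutation."""
    assert len(a) == len(b)
    p = np.random.permutation(len(a))   # random ordering of 0..n-1
    return a[p], b[p]                   # integer-array indexing returns copies

a = np.array([[1, 1], [2, 2], [3, 3]])
b = np.array([1, 2, 3])
a_new, b_new = unison_shuffled_copies(a, b)
```

This is fast because the reordering happens inside NumPy, but it does allocate new arrays, so it does not meet the in-place goal from the question.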
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume two arrays a and b with the same leading dimension. We can construct a single array c containing all the data, and then create views a2 and b2 into c that simulate the original a and b. The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
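The single-array idea can be sketched like this. The concrete shapes and values of `a` and `b` are assumptions chosen for illustration, as is the helper variable `row_a`:

```python
import numpy as np

# Example inputs (assumed): same leading dimension, different shapes.
a = np.arange(18.0).reshape(3, 2, 3)
b = np.arange(6.0).reshape(3, 2)

# One row of c holds the flattened a[i] followed by the flattened b[i].
row_a = a.size // len(a)   # number of values per element of a
c = np.hstack((a.reshape(len(a), -1), b.reshape(len(b), -1)))

# Views into c that have the original shapes of a and b.
a2 = c[:, :row_a].reshape(a.shape)
b2 = c[:, row_a:].reshape(b.shape)

# Shuffling the rows of c reorders a2 and b2 together, in place.
np.random.shuffle(c)
```

Because `a2` and `b2` share memory with `c`, no extra copies are made during the shuffle itself; the only copy is the one that builds `c` from `a` and `b`.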
Very simple solution:
The two arrays x and y are now both randomly shuffled in the same way.
In 2015, James wrote a helpful sklearn solution, but he added a random state variable, which is not needed. In the code below, numpy's random state is used automatically.
Shuffle any number of arrays together, in-place, using only NumPy.
And it can be used like this
A few things to note: the arrays are shuffled along their first dimension. After the shuffle, the data can be split using np.split or referenced using slices, depending on the application.
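A sketch of how an in-place shuffle over any number of arrays might be written. It assumes a fresh `numpy.random.RandomState` with the same seed for each array; the function name `shuffle_arrays` and its `seed` parameter are hypothetical:

```python
import numpy as np

def shuffle_arrays(arrays, seed=None):
    """Shuffle every array in place along its first dimension, identically."""
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    if seed is None:
        # Draw a seed once so that every array sees the same sequence.
        seed = np.random.randint(0, 2**31 - 1)
    for arr in arrays:
        # A new RandomState per array restarts the same random sequence.
        np.random.RandomState(seed).shuffle(arr)

a = np.array([[1, 1], [2, 2], [3, 3]])
b = np.array([1, 2, 3])
c = np.array([[[1.0]], [[2.0]], [[3.0]]])
shuffle_arrays([a, b, c])
```

Using a private `RandomState` avoids touching NumPy's global random state, and passing an explicit `seed` makes the shuffle repeatable.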
You can make an index array like:
then shuffle it:
Now use this s as the index into your arrays. The same shuffled indices return the same shuffled vectors.
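The steps above might look like this in practice. The example arrays are assumptions; `s` is the index array the answer refers to:

```python
import numpy as np

a = np.array([[1, 1], [2, 2], [3, 3]])
b = np.array([1, 2, 3])

s = np.arange(len(a))   # index array 0, 1, ..., n-1
np.random.shuffle(s)    # shuffle the indices in place

a_shuffled = a[s]       # indexing with s returns reordered copies
b_shuffled = b[s]
```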
There is a well-known function that can handle this:
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
This seems like a very simple solution:
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT: don't use np.random.seed(); use np.random.RandomState instead.
When calling it just pass in any seed to feed the random state:
Output:
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
We can first obtain row indices by permuting the first dimension.
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
This is equivalent to
Most solutions above work; however, if you have column vectors you have to transpose them first. Here is an example:
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array.
This implements the Knuth-Fisher-Yates shuffle algorithm.
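A minimal sketch of such an in-place swap loop for NumPy arrays. The function name `shuffle_in_place` is hypothetical, and the integer-array swap is a choice made here so that multi-dimensional arrays are handled correctly:

```python
import numpy as np

def shuffle_in_place(a, b):
    """Fisher-Yates shuffle of a and b along their first axis, in place."""
    assert len(a) == len(b)
    for i in range(len(a) - 1, 0, -1):
        j = np.random.randint(i + 1)   # random index in [0, i]
        # Integer-array indexing on the right-hand side copies the two rows
        # first, so the swap is safe even where a[i] would be a view.
        a[[i, j]] = a[[j, i]]
        b[[i, j]] = b[[j, i]]

a = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
b = np.array([1, 2, 3, 4])
shuffle_in_place(a, b)
```

This keeps the extra memory down to a couple of rows per swap, but the loop runs in Python, so for very large arrays it is likely to be slower than the vectorized approaches.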