Randomly selecting rows from a list column so that every element in the lists is represented

Posted 2025-01-11 07:01:35

Say I have a pandas DataFrame with a list column 'event_ids':

code    canceled  event_ids
xxx     [1.0]     [107385, 128281, 133015]
xxS     [0.0]     [108664, 110515, 113556]
ssD     [1.0]     [134798, 133499, 125396, 114298, 133915]
cvS     [0.0]     [107611]
eeS     [5.0]     [113472, 115236, 108586, 128043, 114106, 10796...
544W    [44.0]    [107650, 128014, 127763, 118036, 116247, 12802...

How can I select k rows sufficiently randomly so that every element appearing in 'event_ids' is represented in the sample? That is, the event vocabulary of the sample should be the same as that of the population. By 'sufficiently random' I mean something like importance sampling, where rows are initially drawn at random and then added or rejected according to some condition.
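
To make the "draw at random, then accept or reject" idea concrete, here is a minimal sketch. The helper name `sample_covering_rows`, its parameters, and the toy data are hypothetical and not from the post. It visits rows in random order, keeps a row only if it contributes an event not seen so far, and fills any remaining slots with random rows. It does not try to find the smallest covering set, so it raises an error when a given random pass needs more than k rows.

```python
import numpy as np
import pandas as pd


def sample_covering_rows(df, k, col="event_ids", seed=None):
    """Return k randomly chosen rows whose lists in `col` cover every event.

    Hypothetical helper: a random-order accept/reject pass, not an optimal
    set-cover solver.
    """
    rng = np.random.default_rng(seed)
    vocabulary = set().union(*df[col])        # every event in the population
    order = rng.permutation(len(df))          # random visiting order (row positions)

    covered, chosen = set(), []
    for pos in order:
        events = set(df[col].iloc[pos])
        if not events <= covered:             # accept: row adds at least one new event
            chosen.append(pos)
            covered |= events
        if covered == vocabulary:
            break

    if len(chosen) > k:
        raise ValueError(
            f"this pass needed {len(chosen)} rows to cover all events, but k={k}"
        )

    # Fill the remaining slots with random rows that were not accepted above.
    chosen_set = set(chosen)
    rest = [pos for pos in order if pos not in chosen_set]
    chosen.extend(rest[: k - len(chosen)])
    return df.iloc[chosen]


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e"],
    "event_ids": [[1, 2, 3], [2, 3], [4], [1, 4, 5], [5]],
})
sample = sample_covering_rows(toy, k=4, seed=0)
```

Changing `seed` changes which rows are picked, while the coverage guarantee holds for any draw that returns without raising.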

Comments (1)

情深缘浅 2025-01-18 07:01:36

It is not clear whether you want every individual element inside the lists in event_ids to be represented, or whether each list should be treated as a single unique element.
In the latter case, this could work (I have not checked the performance).

Given this dataset:

import numpy as np
import pandas as pd

# 5000 random integers in [1, 99] and 5000 random letters
x = np.random.randint(1, 100, 5000)
y = np.random.choice(['A', 'B', 'C', 'D', 'E', 'F'], 5000)

df = pd.DataFrame({'x': x, 'y': y})
df.head()

Output:
    x   y
0   42  A
1   88  B
2   80  A
3   69  B
4   72  B

Column 'x' has 99 unique values. The goal is to sample rows so that every unique value of df['x'] appears in the sample.

idxs = []

# For each distinct value of x, pick one random row label among the rows holding it
for i in df['x'].unique():
    idxs.extend(np.random.choice(df.loc[df['x'] == i].index, size=1))

sample = df.loc[idxs]
sample['x'].nunique()

Output:
99

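You can increase size to draw more rows per unique value. Note that np.random.choice samples with replacement by default, so a larger size may return the same row more than once.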
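
The same "one random row per unique value" selection can also be sketched without an explicit loop, assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. The seeds and the use of `np.random.default_rng` are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.integers(1, 100, 5000),          # integers in [1, 99]
    "y": rng.choice(list("ABCDEF"), 5000),
})

# One random row per distinct value of x; raise n to draw more rows per value.
sample = df.groupby("x").sample(n=1, random_state=0)
```

Unlike the loop above, this draws without replacement within each group by default, so asking for more rows than a group contains raises an error unless `replace=True` is passed.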

If you instead want every unique element inside the lists in event_ids to be represented, you can use explode and then apply the same code.

df

Out:

   x    y   z
0   84  D   [14805, 9243, 14838, 10204]
1   70  D   [6901, 1117, 3918, 8607, 1912]
2   7   F   [9853, 12519, 13011, 13279]
3   45  A   [6344, 14646, 9633, 4517, 9432, 11187]
4   41  A   [1104, 10318, 12531, 9443, 8347] 

# One row per list element; drop=True discards the old index so the columns stay x, y, z
df = df.explode('z').reset_index(drop=True)
df.head()
Out:
    x   y   z
0   13  D   1876
1   13  D   2437
2   13  D   2681
3   13  D   1748
4   37  E   10155
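
To get back from sampled elements to rows of the original frame, one option is to leave the index untouched after explode, since each exploded row then still carries the label of the row it came from. The sketch below assumes that setup. The column names follow the example above, while the values and the variable names `exploded`, `picked`, and `rows` are made up for illustration.

```python
import pandas as pd

# Toy frame with a list column z (values are made up)
df = pd.DataFrame({
    "x": [84, 70, 7],
    "y": ["D", "D", "F"],
    "z": [[1, 2, 3], [2, 3], [3, 4]],
})

# explode keeps the original row label in the index of every exploded row
exploded = df.explode("z")

# One random exploded row per distinct element of z
picked = exploded.groupby("z").sample(n=1, random_state=0)

# Several elements may point at the same original row, so de-duplicate the labels
rows = picked.index.unique()
sample = df.loc[rows]
```

The resulting sample contains every element of z at least once, but its number of rows depends on the draw, so a fixed k would still need rows added or the draw repeated.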
