Pandas vectorized lookup without the deprecated lookup()

Posted on 2025-02-10 06:30:45

My problem concerns DataFrame.lookup(), which is deprecated, so I'm looking for an alternative. The documentation suggests using .loc (which does not seem to work with a vectorized approach) or melt() (which seems quite convoluted). It also suggests factorize(), which (I think) does not work for my setup.

Here is the problem:
I have a 2-column DataFrame with x,y-values.

import random
import pandas as pd

k = 20
y = random.choices(range(1,4),k=k)
x = random.choices(range(1,7),k=k)
tuples = list(zip(x,y))
df = pd.DataFrame(tuples, columns=["x", "y"])
df

I also have several DataFrames that share the crosstab layout of df, for example one called Cij:

Concordance table (Cij):
x     1     2     3    4     5     6  RTotal
y                                           
1   16     15    13  NaN     5   NaN     108
2   NaN    12   NaN   15   NaN   NaN      87
3   NaN   NaN     6  NaN    13    14     121

I now want to perform a vectorized lookup in Cij using the (y, x) pairs in df to generate a new column Crc in df. So far this looked like this (plain and simple):

df["Crc"] = Cij.lookup(df["y"],df["x"])

How can I achieve the same thing without lookup()? Or did I just not understand the suggested alternatives?

Thanks in advance!

Addendum: Working code example as requested.

import pandas as pd

data = [[1,1],[1,1],[1,2],[1,2],[1,2],[1,3],[1,3],[1,5],[2,2],[2,4],[2,4],[2,4],[2,4],[2,4],[3,3],[3,3],[3,5],[3,5],[3,5],[3,6],[3,6],[3,6],[3,6],[3,6]]
df = pd.DataFrame(data, columns=["y", "x"])

# crosstab of df
ct_a = pd.crosstab(df["y"], df["x"])
Cij = pd.DataFrame([], index=ct_a.index, columns=ct_a.columns) #one of several dfs in ct_a layout

#row-wise, then column-wise filling of Cij
for i in range(ct_a.shape[0]):           
  for j in range(ct_a.shape[1]):
    if ct_a.iloc[i,j] != 0:
      Cij.iloc[i,j]= ct_a.iloc[i+1:,j+1:].sum().sum()+ct_a.iloc[:i,:j].sum().sum()

#vectorized lookup, to be substituted with future-proof method
df["Crc"] = Cij.lookup(df["y"],df["x"])

Note: In this case loop-based "filling" of Cij is fine, since crosstabs of df are always small. However, df itself can be very large so vectorized lookup is a necessity.

Comments (3)

夕嗳→ 2025-02-17 06:30:45

IIUC, you can stack Cij and then reindex based on a list of tuples created by using zip:

df['Crc'] = Cij.stack().reindex(zip(df['y'], df['x'])).to_numpy()
print(df)

Output:

    y  x   Crc
0   1  1  16.0
1   1  1  16.0
2   1  2  15.0
3   1  2  15.0
4   1  2  15.0
5   1  3  13.0
6   1  3  13.0
7   1  5   5.0
8   2  2    12
9   2  4    15
10  2  4    15
11  2  4    15
12  2  4    15
13  2  4    15
14  3  3   6.0
15  3  3   6.0
16  3  5  13.0
17  3  5  13.0
18  3  5  13.0
19  3  6  14.0
20  3  6  14.0
21  3  6  14.0
22  3  6  14.0
23  3  6  14.0
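
A self-contained sketch of the same stack-then-reindex idea, for anyone who wants to try it without the question's data. The names `table`, `pairs` and `value` and the numbers are made up for illustration, and the keys are built with `pd.MultiIndex.from_arrays` instead of a bare `zip`, which is an assumption on my part about what reads more explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table in crosstab layout (rows: y, columns: x).
table = pd.DataFrame(
    [[16.0, 15.0, np.nan],
     [np.nan, 12.0, 7.0]],
    index=pd.Index([1, 2], name="y"),
    columns=pd.Index([1, 2, 3], name="x"),
)

# Hypothetical (y, x) pairs to look up; the last pair points at a NaN cell.
pairs = pd.DataFrame({"y": [1, 1, 2, 2, 1], "x": [1, 2, 2, 3, 3]})

# stack() turns the table into a Series keyed by (y, x).
stacked = table.stack()

# One key per row of `pairs`; reindex aligns the stacked values to those keys.
# Keys that are missing from `stacked` come back as NaN.
keys = pd.MultiIndex.from_arrays([pairs["y"], pairs["x"]])
pairs["value"] = stacked.reindex(keys).to_numpy()

print(pairs)
```

The `.to_numpy()` at the end matters: without it the assignment would try to align the (y, x)-indexed result with the integer index of `pairs`.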

叹倦 2025-02-17 06:30:45

Using the factorize path in the docs, you can replicate the lookup functionality:

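y_index, y_uniques = pd.factorize(df.y)
x_index, x_uniques = pd.factorize(df.x)

# reindex rows and columns so positions match the factorized codes
arrays = (Cij
          .reindex(index = y_uniques, columns = x_uniques)
          .to_numpy()[y_index, x_index]
         )

df['r'] = arrays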

df
    y  x     r   Crc
0   1  1  16.0  16.0
1   1  1  16.0  16.0
2   1  2  15.0  15.0
3   1  2  15.0  15.0
4   1  2  15.0  15.0
5   1  3  13.0  13.0
6   1  3  13.0  13.0
7   1  5   5.0   5.0
8   2  2    12  12.0
9   2  4    15  15.0
10  2  4    15  15.0
11  2  4    15  15.0
12  2  4    15  15.0
13  2  4    15  15.0
14  3  3   6.0   6.0
15  3  3   6.0   6.0
16  3  5  13.0  13.0
17  3  5  13.0  13.0
18  3  5  13.0  13.0
19  3  6  14.0  14.0
20  3  6  14.0  14.0
21  3  6  14.0  14.0
22  3  6  14.0  14.0
23  3  6  14.0  14.0
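
A closely related variant, not taken from the answer above: instead of factorizing, translate the labels into positions with `Index.get_indexer` and index the underlying NumPy array. This is only a sketch, and the names `table`, `pairs` and `value` and the data are hypothetical. It assumes the table's index and columns are unique:

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table in crosstab layout (rows: y, columns: x).
table = pd.DataFrame(
    [[16.0, 15.0, 13.0],
     [np.nan, 12.0, np.nan]],
    index=pd.Index([1, 2], name="y"),
    columns=pd.Index([1, 2, 3], name="x"),
)

# Hypothetical pairs; note that y does not first appear in table order.
pairs = pd.DataFrame({"y": [2, 1, 1, 2], "x": [2, 3, 1, 1]})

# Map labels to integer positions in the table's index and columns.
rows = table.index.get_indexer(pairs["y"])
cols = table.columns.get_indexer(pairs["x"])

# get_indexer returns -1 for labels that are absent; fail loudly instead of
# silently reading the last row or column.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("some (y, x) pairs are not present in the table")

# Pairwise fancy indexing: element i is table[rows[i], cols[i]].
pairs["value"] = table.to_numpy()[rows, cols]

print(pairs)
```

Because the positions come from the table's own index, the result does not depend on the order in which labels first occur in `pairs`.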

溺ぐ爱和你が 2025-02-17 06:30:45

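If you evaluate Cij.loc[df["y"], df["x"]], you will notice that it returns a square DataFrame with one row per entry of df["y"] and one column per entry of df["x"], rather than a single column. Comparing it with Cij.lookup(df["y"], df["x"]), you will also notice that its main diagonal holds exactly the looked-up values (which makes sense, since the i-th diagonal entry pairs the i-th y with the i-th x). Therefore, you can apply np.diagonal to return what you need. Note that the intermediate result has len(df) × len(df) entries, so this only suits a small df: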

df["Crc"] = np.diagonal(Cij.loc[df["y"], df["x"]])
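
To make the diagonal argument concrete, here is a small sketch with made-up data (the names `table`, `pairs`, `square`, `diag` and `expected` are hypothetical). It shows the shape of the intermediate result and checks the diagonal against a plain element-by-element lookup:

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table in crosstab layout (rows: y, columns: x).
table = pd.DataFrame(
    [[16, 15, 13],
     [11, 12, 10]],
    index=pd.Index([1, 2], name="y"),
    columns=pd.Index([1, 2, 3], name="x"),
)
pairs = pd.DataFrame({"y": [1, 2, 2, 1], "x": [3, 1, 2, 2]})

# Label-based selection with repeated labels: every y is combined with
# every x, so the result has len(pairs) rows and len(pairs) columns.
square = table.loc[pairs["y"], pairs["x"]]
print(square.shape)

# Entry (i, i) pairs the i-th y with the i-th x, which is the pairwise lookup.
diag = np.diagonal(square.to_numpy())

# Reference: slow but unambiguous scalar lookups.
expected = [table.at[y, x] for y, x in zip(pairs["y"], pairs["x"])]
print(diag.tolist() == expected)
```

For 4 pairs the intermediate already holds 16 values; the growth is quadratic in the number of rows being looked up.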