Writing a custom pandas aggregation without making all dtypes object

Posted 2025-02-04 18:15:13 · 2167 words · 2 views · 0 comments

I (think I) need to write a custom aggregation function for the geopandas.GeoDataFrame.dissolve() operation. When merging multiple polygons, I want to keep the information of the polygon with the largest area, that also fulfils other criteria. The operation works fine, but afterwards all attributes of my GeoDataFrame are of dtype object.

The same issue happens with a regular pandas groupby(), so I have simplified the example below. Can someone tell me whether I should write my custom_sort() differently to keep the dtypes intact?

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
    })


def custom_sort(df):
    """Define custom aggregation function with special sorting."""
    df = df.sort_values(by=['bools', 'floats'], ascending=False)
    return df.iloc[0]


print(df)
print(df.dtypes)
print()
grouped = df.groupby(by='group').agg(custom_sort)
print(grouped)
print(grouped.dtypes)  # Issue: All dtypes are object
print()
print(grouped.convert_dtypes().dtypes)  # Possible solution, but not for me

# Please note that I cannot use convert_dtypes(). I actually need this for
# geopandas.GeoDataFrame.dissolve() and I think convert_dtypes() messes up
# the geometry information

Output:

  group  ints  floats strings  bools       test
0     A     1     1.0     foo   True  drop this
1     A     2     2.0     bar   True  keep this
2     B     3     2.2     baz   True  keep this
3     B     4     3.2     qux  False  drop this
group       object
ints         int64
floats     float64
strings     object
bools         bool
test        object
dtype: object

      ints floats strings bools       test
group                                     
A        2    2.0     bar  True  keep this
B        3    2.2     baz  True  keep this
ints       object
floats     object
strings    object
bools      object
test       object
dtype: object

ints         Int64
floats     Float64
strings     string
bools      boolean
test        string
dtype: object

Comments (1)

池木 2025-02-11 18:15:13

The source of the problem is that df.iloc[0] returns a pandas Series. That Series holds one value from each column, and those values have different dtypes, so pandas may fall back to the object dtype for the whole Series. If I recall correctly, the exact behavior depends on the version of pandas you're working with; it has changed over time.
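
As a small illustration of that point, the sketch below uses a made-up two-row frame (not the asker's data) to compare selecting a row as a Series with selecting it as a one-row DataFrame. It only shows where the object dtype comes from; it is not a fix by itself.

```python
import pandas as pd

# Hypothetical toy frame with one column per dtype.
df = pd.DataFrame({
    'ints': [1, 2],
    'floats': [1.0, 2.0],
    'strings': ['foo', 'bar'],
    'bools': [True, False],
})

# A single row as a Series: all values must share one dtype.
row = df.iloc[0]
print(row.dtype)

# The same row as a one-row DataFrame: each column keeps its own dtype.
one_row_frame = df.iloc[[0]]
print(one_row_frame.dtypes)
```

The Series has to store an int, a float, a string and a bool side by side, which is why it ends up as object, while the one-row DataFrame still has one typed column per attribute.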

The solution to your problem heavily depends on the operations you're doing in your custom agg function.

In your toy example, I would suggest manipulating your dataframe beforehand and using the simplest possible aggregation function.

For example, doing the complex logic (the sort) before the groupby leaves a plain 'first' as the aggregation:

(df.sort_values(by=['bools', 'floats'],
                ascending=False)
   .groupby(by='group')
   .agg('first'))
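
For reference, here is a self-contained sketch of the sort-then-aggregate idea applied to the frame from the question, so the resulting dtypes can be inspected directly. The second variant (drop_duplicates) is not part of the original answer; it is an assumption about what may suit cases where a whole row should be kept, because the 'first' aggregation picks the first non-null value per column and can therefore combine values from different rows when NaNs are present.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
})

# Do the custom ordering once, outside of any aggregation function.
ordered = df.sort_values(by=['bools', 'floats'], ascending=False)

# Variant 1: built-in 'first' aggregation (first non-null value per column).
by_first = ordered.groupby('group').first()

# Variant 2 (assumption, not from the answer): keep the entire top row of
# each group by dropping every later row that repeats a group label.
whole_rows = (
    ordered.drop_duplicates(subset='group')
           .set_index('group')
           .sort_index()
)

print(by_first.dtypes)
print(whole_rows.dtypes)
```

If the same idea is carried over to geopandas, sorting the GeoDataFrame first and then calling dissolve() with aggfunc='first' may be enough, since 'first' is a built-in aggregation there as well; that has not been tested here.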

For what it's worth, I'd also suggest you use a more recent pandas version.
