Detecting duplicates in pandas when a column contains lists

Posted on 2025-02-01 23:50:25

Is there a reasonable way to detect duplicates in a pandas DataFrame when a column contains lists or NumPy ndarrays, as in the example below? I know I could convert the lists into strings, but converting back and forth feels wrong. Plus, lists seem more legible and convenient given how I got here (online code) and where I'm going next.

import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Traditional find duplicates
# df[df.duplicated(keep=False)]

# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]

Both methods (the latter from this alternative find duplicates answer) result in

TypeError: unhashable type: 'list'.
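For context, both methods appear to fail for the same reason: each one hashes the cell values, and a Python list is mutable and therefore unhashable. The minimal sketch below uses plain Python only (the variable names are illustrative, not from the original code) to show that a tuple with the same elements hashes fine.

```python
ingredients = ["ingredA", "ingredB", "ingredC"]

# Hashing a list raises TypeError because lists are mutable.
try:
    hash(ingredients)
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print(list_error)

# A tuple holding the same items is immutable and hashable,
# and equal tuples always produce equal hashes.
as_tuple = tuple(ingredients)
same_hash = hash(as_tuple) == hash(tuple(ingredients))
print(same_hash)
```

Both methods above hit this limitation as soon as they reach the "ingredients" column.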

They would work, of course, if the dataframe looked like this:

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)

Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
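To make the two ideas concrete, here is a rough sketch of both, assuming the starting dataframe is the one with the "ingredients" list column. The column names `ingredients_str` and `recipe_id` and the "|" separator are made up for illustration; the separator only works if it never appears inside an ingredient name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Option 1: one delimited string per row (hypothetical column name).
df["ingredients_str"] = df["ingredients"].map("|".join)

# Option 2: integer encoding. Each distinct ingredient list gets the next
# unused integer; repeated lists reuse the code assigned the first time.
codes = {}
df["recipe_id"] = [codes.setdefault(tuple(x), len(codes)) for x in df["ingredients"]]

# Restrict the duplicate check to hashable columns, so the list column
# never has to be hashed.
dupes = df[df.duplicated(subset=["author", "date", "recipe_id"], keep=False)]
print(dupes[["author", "date", "ingredients"]])
```

Either derived column can stand in for the list column in the duplicate check, and the original lists stay untouched for later use.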


无人问我粥可暖 2025-02-08 23:50:25

With map tuple: convert the list column to tuples (which are hashable) only for the duplicate check, then use the resulting mask to index the original dataframe.

out = df[df.assign(ingredients=df['ingredients'].map(tuple)).duplicated(keep=False)]
out
Out[295]: 
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
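The question also mentions NumPy ndarrays, where `tuple` alone is not enough for arrays with more than one dimension (the result would be a tuple of arrays, which are still unhashable). One possible extension, sketched here under assumptions: `to_hashable` is a hypothetical helper, `df_arrays` and its "amounts" column are invented sample data, and the byte-based key is only meaningful for numeric dtypes, not object arrays.

```python
import numpy as np
import pandas as pd


def to_hashable(value):
    """Return a hashable stand-in for lists and numeric ndarrays (hypothetical helper)."""
    if isinstance(value, np.ndarray):
        # Shape and dtype are included so arrays with equal bytes but
        # different layouts or types do not compare as equal.
        return (value.shape, value.dtype.str, value.tobytes())
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(v) for v in value)
    return value


# Invented sample data with an ndarray column.
df_arrays = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex"],
        "amounts": [
            np.array([1.0, 2.0, 3.0]),
            np.array([1.0, 2.0, 3.0]),
            np.array([1.0, 2.0, 4.0]),
        ],
    }
)

# Build a hashable copy only for the check; df_arrays itself is unchanged.
mask = df_arrays.apply(lambda col: col.map(to_hashable)).duplicated(keep=False)
print(df_arrays[mask])
```

As with the `map(tuple)` approach, the conversion is applied to a temporary copy, so the rows that come back still contain the original arrays.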
