Detecting duplicates in pandas when a column contains lists
Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy nd arrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given ~how I got here (online code) and where I'm going after.
import pandas as pd
df = pd.DataFrame(
{
"author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
"date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
"ingredients": [
["ingredA", "ingredB", "ingredC"],
["ingredA", "ingredB", "ingredC"],
["ingredA", "ingredB", "ingredD"],
["ingredA", "ingredB", "ingredD", "ingredE"],
["ingredB", "ingredC", "ingredF"],
],
}
)
# Traditional find duplicates
# df[df.duplicated(keep=False)]
# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
Both methods (the latter from this alternative find duplicates answer) result in
TypeError: unhashable type: 'list'.
They would work, of course, if the dataframe looked like this:
df = pd.DataFrame(
{
"author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
"date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
"recipe": [
"recipeC",
"recipeC",
"recipeD",
"recipeE",
"recipeF",
],
}
)
Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
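For reference, the two alternatives raised above (one string per row, or an integer code per distinct recipe) could be sketched as below. This is only an illustration under assumptions: the `"|"` separator and the `recipe` / `recipe_id` column names are made up here, and the separator is assumed never to occur inside an ingredient name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# One string per row. "|" is an assumed separator, not part of the original data.
df["recipe"] = df["ingredients"].map("|".join)

# Integer encoding: rows with the same joined string get the same code.
codes, uniques = pd.factorize(df["recipe"])
df["recipe_id"] = codes  # hypothetical column name

# Restrict the check to hashable columns so the list column is never hashed.
dups = df[df.duplicated(subset=["author", "date", "recipe"], keep=False)]
print(dups)
```

Both variants keep the original list column untouched, so the lists stay available for whatever comes next; the string or integer column is only used as a comparison key.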
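Comments (1)
With map and tuple: convert each list in the ingredients column to a tuple, which is hashable, and then check for duplicates as usual.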
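The map/tuple suggestion can be sketched roughly as follows, using the dataframe from the question. The names `hashable`, `dup_mask` and `duplicates` are illustrative, and the sketch assumes every cell in the column is a flat list (or a 1-D array).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)

# Build a hashable copy of the list column; df itself keeps its lists.
hashable = df.assign(ingredients=df["ingredients"].map(tuple))

# True for every row that has at least one exact duplicate elsewhere.
dup_mask = hashable.duplicated(keep=False)

# Index the original frame with the mask so the result still holds lists.
duplicates = df[dup_mask]
print(duplicates)
```

Because the tuples live only in a temporary copy, nothing has to be converted back afterwards. One caveat: calling `tuple` on a multi-dimensional NumPy array yields a tuple of arrays, which is still unhashable, so nested arrays would need a different key.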