Complex partial string matching in pandas

Posted on 2025-01-11 17:30:05

Given a DataFrame with the following structure and values in its json_path column:

                                                                   json_path   Reporting Group                Entity/Grouping
0                                         data.attributes.total.children.[0]  Christian Family                 Abraham Family
1                            data.attributes.total.children.[0].children.[0]  Christian Family                      In Estate
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

How would I filter for the rows whose json_path contains children four times? That is, I want to keep the rows at index positions 2 and 3:

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income

I know how to obtain a partial match. However, the integers in the square brackets will be inconsistent, so my instinct is to count the occurrences of children in each path (i.e., children appearing four times) and use that count as the basis for the filter.

Any suggestions or resources on how I can achieve this?

Comments (1)

一向肩并 2025-01-18 17:30:05

As you said, a naive approach is to count the occurrences of .children in each path and compare the count with 4. This gives a boolean mask that can be used to filter the rows:

df[df['json_path'].str.count(r'\.children').eq(4)]
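To make the one-liner above easier to follow, here is a minimal, self-contained sketch that rebuilds the sample frame from the question and splits the filter into its steps. The variable names depth, mask and filtered are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 4,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
    ],
})

# Step 1: count how many times ".children" occurs in each path.
depth = df["json_path"].str.count(r"\.children")

# Step 2: turn the counts into a boolean mask (True where the count is 4).
mask = depth.eq(4)

# Step 3: keep only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

Because the count ignores the bracketed integers entirely, it does not matter that those indices vary from row to row.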

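A more robust approach is to check for four consecutive occurrences of .children.[N], where N is any integer. The group is written as non-capturing, (?:...), because str.contains emits a UserWarning when the pattern contains capturing groups. Note that this also matches paths nested more than four levels deep.

df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]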

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
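
The difference between the two approaches only shows up on paths that do not appear in the question's data. The sketch below uses two hypothetical paths (one where the children segments are interrupted by another key, one nested five levels deep) to illustrate it. The pattern uses a non-capturing group, (?:...), so that pandas does not warn about match groups. Combining both masks is one possible way to require exactly four consecutive segments. It is an assumption about what "four times" should mean, not something the original answer states.

```python
import pandas as pd

paths = pd.Series([
    # Path taken from the question: four consecutive ".children.[N]" segments.
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
    # Hypothetical path: four ".children" segments, interrupted by ".meta".
    "data.children.[0].meta.children.[1].children.[0].children.[2]",
    # Hypothetical path: five consecutive segments.
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0].children.[0]",
])

# Naive approach: ".children" occurs exactly four times anywhere in the path.
by_count = paths.str.count(r"\.children").eq(4)

# Stricter approach: at least four ".children.[N]" segments in a row.
by_run = paths.str.contains(r"(?:\.children\.\[\d+\]){4}")

# Both conditions together: exactly four segments, and all of them consecutive.
both = by_count & by_run

print(pd.DataFrame({"by_count": by_count, "by_run": by_run, "both": both}))
```

On these three paths, the count alone accepts the interrupted path, the run alone accepts the five-level path, and only the combination rejects both.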