Complex partial string matching in pandas

Posted on 2025-01-11 17:30:05

Given a dataframe with the following structure and json_path values:

json_path	Reporting Group	Entity/Grouping
data.attributes.total.children.[0]	Christian Family	Abraham Family
data.attributes.total.children.[0].children.[0]	Christian Family	In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]	Christian Family	Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]	Christian Family	Investment Grade Fixed Income

How would I filter for the rows whose json_path contains children four times? That is, I want to keep the rows at index positions 2-3:

json_path	Reporting Group	Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]	Christian Family	Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]	Christian Family	Investment Grade Fixed Income

I know how to obtain a partial match. However, the integers in the square brackets will be inconsistent, so my instinct is to count the occurrences of children in each path and filter on that count (i.e., keep rows where children appears four times).

Any suggestions or resources on how I can achieve this?

一向肩并 2025-01-18 17:30:05

As you said, a naive approach is to count the occurrences of .children in each path and compare the count with 4. This produces a boolean mask that can be used to filter the rows:

df[df['json_path'].str.count(r'\.children').eq(4)]
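To make the counting approach reproducible, here is a minimal sketch that rebuilds the sample dataframe from the question and applies the same mask. The intermediate names child_counts, mask and filtered are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 4,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
    ],
})

# Count how many times ".children" occurs in each path.
child_counts = df["json_path"].str.count(r"\.children")

# Boolean mask: True where the path contains exactly four occurrences.
mask = child_counts.eq(4)

filtered = df[mask]
print(filtered)
```

Keeping the counts in a separate Series also makes it easy to filter on other depths later, for example child_counts.ge(2).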

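A more robust approach is to check for four consecutive occurrences of the .children.[n] segment. The group is written as non-capturing, (?:...), because str.contains emits a UserWarning when the pattern contains capturing groups:

df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]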

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
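The two approaches can disagree on paths that are not in the question's sample. The sketch below uses made-up paths (the five-level path and the one interrupted by a .meta segment are assumptions for illustration only) to show that the count check accepts four non-consecutive occurrences, while the str.contains check accepts any path with four or more consecutive levels. If exactly four consecutive levels are required, one option is to combine both masks.

```python
import pandas as pd

# Hypothetical paths, invented to show where the two checks differ.
paths = pd.Series([
    # four consecutive levels
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
    # five consecutive levels
    "data.attributes.total.children.[0].children.[1].children.[0].children.[2].children.[0]",
    # four occurrences, but interrupted by another segment
    "data.children.[0].children.[1].meta.children.[0].children.[2]",
    # a single level
    "data.attributes.total.children.[0]",
])

# Approach 1: count occurrences of ".children".
count_mask = paths.str.count(r"\.children").eq(4)

# Approach 2: require four consecutive ".children.[n]" segments.
# The non-capturing group (?:...) avoids the match-groups warning.
consecutive_mask = paths.str.contains(r"(?:\.children\.\[\d+\]){4}")

# Both conditions together: exactly four occurrences, all consecutive.
exact_mask = count_mask & consecutive_mask

print(pd.DataFrame({
    "count == 4": count_mask,
    "4 consecutive": consecutive_mask,
    "both": exact_mask,
}))
```

On the question's own data both checks select the same two rows, so the difference only matters if deeper or irregular paths can appear.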
