How to search backwards with a regex on a multiline string in Python

Posted on 2025-01-18 08:49:48

I'm wondering if there's an efficient way of doing the following:

I have a python script that reads an entire file into a single string. Then, given the location of a token of interest, I'd like to find the string index of the beginning of the line given that token.

import re

file_str = read_file("foo.txt")
token_pos = re.search("token", file_str).start()

# This does not work: str.rfind does not accept a regex, so there is no way to pass re.M.
beginning_of_line = file_str.rfind("^", 0, token_pos)

I could use a greedy regex to find the last beginning of line, but this has to be done many times, so I'm concerned that I don't want to read the whole file on each iteration. Is there a good way to do this?

----------------- EDIT ----------------

I tried to keep the question as simple as possible, but it looks like more details are required. Here's a better example of one of the things I'm trying to do:

file_str = """
{
   blah {  
      {} {{}  "string with unmatched }" }
   }
}"""

I happen to know where the opening and closing positions of blah's braces are. I need to get the lines between the braces (non-inclusive). So, given the position of the closing brace, I need to find the beginning of the line containing it. I'd like to do something akin to a reverse regex to find it. I can, of course, write a special function to do this, but I was thinking there would be a more Pythonic way of going about it. To further complicate things, I have to do this several times per file, and the file string can change between iterations, so pre-indexing doesn't really work either...

°如果伤别离去 2025-01-25 08:49:48

Instead of matching just the keyword, match everything from the start of the line to the keyword. You could use re.finditer() (see the re module docs) to get an iterator that keeps yielding matches as it finds them.

import re

file_str = """Lorem ipsum dolor sit amet, consectetur adipiscing elit amet.
Vestibulum vestibulum mollis enim, eu tristique est rhoncus et.
Curabitur sem nisi, ornare eu pellentesque at, interdum at lectus.
Phasellus molestie, turpis id ornare efficitur, ex tellus aliquet ipsum, vitae ullamcorper tellus diam a velit.
Nulla eget eleifend nisl.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nullam finibus, velit non euismod faucibus, dolor orci maximus lacus, sed mattis nisi erat eget turpis.
Maecenas ut pharetra lorem.
Curabitur nec dui sed velit euismod bibendum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Pellentesque tempor dolor at placerat aliquet.
Duis laoreet, est vitae tempor porta, risus leo ullamcorper risus, quis vestibulum massa orci ut felis.
In finibus purus ac nulla congue mattis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Duis efficitur dui ac nisi lobortis, a bibendum felis volutpat.
Aenean consectetur diam at risus hendrerit, in vestibulum erat porttitor.
Quisque fringilla accumsan neque, sed efficitur nunc tristique maximus.
Maecenas gravida lectus et porttitor ultrices.
Nam lobortis, massa et porta vulputate, nulla turpis maximus sapien, sit amet finibus libero mauris eu sapien.
Donec sollicitudin vulputate neque, in tempor nisi suscipit quis.
"""

keyword = "amet"
for match_obj in re.finditer(f"^.*{keyword}", file_str, re.MULTILINE):
    beginning_of_line = match_obj.start()
    print(beginning_of_line, match_obj)

Which gives:

0 <re.Match object; span=(0, 60), match='Lorem ipsum dolor sit amet, consectetur adipiscin>
331 <re.Match object; span=(331, 357), match='Lorem ipsum dolor sit amet'>
566 <re.Match object; span=(566, 592), match='Lorem ipsum dolor sit amet'>
815 <re.Match object; span=(815, 841), match='Lorem ipsum dolor sit amet'>
1129 <re.Match object; span=(1129, 1206), match='Nam lobortis, massa et porta vulputate, nulla tur>

Note that the first line is matched only once even though it contains two occurrences of amet: the .* is greedy, so it consumes the first amet and the match ends at the last amet on the line.
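To make the greedy behaviour concrete, here is a minimal sketch on a made-up three-line string (the string and the variable names are illustrative, not part of the answer above). It compares `.*` with the lazy `.*?`; in both cases the match starts at the beginning of the line, and only the end of the span differs. Wrapping the keyword in `re.escape` is an extra precaution added here, assuming the keyword may contain regex metacharacters.

```python
import re

# Hypothetical sample text: the first line contains the keyword twice,
# the second line does not contain it at all.
text = "one amet two amet.\nno keyword here\nthree amet\n"
keyword = "amet"

# re.escape keeps characters such as "." or "(" in the keyword literal.
pattern_greedy = re.compile(rf"^.*{re.escape(keyword)}", re.MULTILINE)
pattern_lazy = re.compile(rf"^.*?{re.escape(keyword)}", re.MULTILINE)

# Each span starts at a line start; greedy ends at the last keyword on
# the line, lazy ends at the first one.
greedy_spans = [m.span() for m in pattern_greedy.finditer(text)]
lazy_spans = [m.span() for m in pattern_lazy.finditer(text)]

print(greedy_spans)
print(lazy_spans)
```

Either pattern yields one match per line, because after a match the next `^` can only succeed at the start of a later line.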


ゞ花落谁相伴 2025-01-25 08:49:48

You don't need to use a regex to find the beginning of the lines that contain the token.

This iterates over the file line by line, builds the string foo from the file's contents, and records the starting offset of every line that contains the token in a list named line_pos_with_token.

token = "token"
foo = ''
line_pos_with_token = []

with open("foo.txt", "r") as f:
    for line in f:
        if token in line:
            # len(foo) is the offset in foo at which the current line starts
            line_pos_with_token.append(len(foo))
        foo += line

print(line_pos_with_token)
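When the whole file is already held in one string and a position inside it is known, as in the question, the same idea can be sketched without re-reading the file. This is not the code from the answer above: `line_start` and `lines_between` are hypothetical helper names, and the sketch assumes lines are separated by "\n". It relies on `str.rfind` accepting start and end bounds and returning -1 when nothing is found.

```python
def line_start(text: str, pos: int) -> int:
    """Return the index of the first character of the line containing text[pos]."""
    # rfind searches text[0:pos]; it returns -1 if there is no earlier
    # newline, so adding 1 yields 0 for a position on the first line.
    return text.rfind("\n", 0, pos) + 1


def lines_between(text: str, open_pos: int, close_pos: int) -> str:
    """Return the whole lines strictly between the lines holding the two positions."""
    newline_after_open = text.find("\n", open_pos)
    # Both positions on the same line (or no newline at all): nothing in between.
    if newline_after_open == -1 or newline_after_open >= close_pos:
        return ""
    return text[newline_after_open + 1:line_start(text, close_pos)]


# The example from the question, written as a list of lines for readability.
file_str = "\n".join([
    "{",
    "   blah {",
    '      {} {{}  "string with unmatched }" }',
    "   }",
    "}",
])

# Positions of blah's braces, located here only for the demonstration.
open_pos = file_str.index("{", file_str.index("blah"))
close_pos = file_str.rfind("}", 0, len(file_str) - 1)

print(line_start(file_str, close_pos))
print(repr(lines_between(file_str, open_pos, close_pos)))
```

Because each call only scans backwards from the given position to the previous newline, nothing needs to be pre-indexed, which matters if the string changes between iterations.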
