Navigating a text-file search in Python

Posted on 2025-01-02 20:47:40

Here is a sample of the text file I am working with:

<Opera>

Tristan/NNP
and/CC
Isolde/NNP
and/CC
the/DT
fatalistic/NN
horns/VBZ
The/DT
passionate/JJ
violins/NN
And/CC
ominous/JJ
clarinet/NN
;/:

The capital letters after the forward slashes are tags. I want to be able to search the file for something like "NNP,CC,NNP" and have the program return, for this segment, "Tristan and Isolde": the three consecutive words whose tags match those three tags in order.

The problem I am having is that the search string should be entered by the user, so it will be different every time.
I can read the file and find one matching tag, but I do not know how to count backwards from that point to print the first word, or how to check whether the next tag also matches.

九厘米的零° 2025-01-09 20:47:40

Build a regular expression dynamically from the list of tags you want to search for:

import re

text = ("Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ "
    "The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN")

tags = ["NNP", "CC", "NNP"]
# One capture group per tag; re.escape keeps any regex metacharacters in a tag literal.
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(re.escape(tag)) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"

print(re.findall(tags_pattern, text))
# [('Tristan', 'and', 'Isolde')]
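
One caveat: findall returns non-overlapping matches, so a second match that shares a word with the first one is skipped. A minimal sketch of one possible workaround wraps the same kind of pattern in a lookahead. The helper name find_overlapping and the sample sentence are made up for illustration and are not part of the original answer.

```python
import re

def find_overlapping(tags, text):
    """Return every run of words whose tags match `tags`, including overlapping runs."""
    # One capture group per tag; re.escape keeps tag characters literal.
    body = r"\s+".join(r"(\w+)/{0}".format(re.escape(tag)) for tag in tags)
    # A lookahead consumes no text, so the search can restart inside a previous match.
    pattern = r"(?=\b" + body + r"\b)"
    return re.findall(pattern, text)

# Hypothetical sample: "Isolde" ends one match and starts the next.
sample = "Tristan/NNP and/CC Isolde/NNP and/CC Mark/NNP"
print(find_overlapping(["NNP", "CC", "NNP"], sample))
# [('Tristan', 'and', 'Isolde'), ('Isolde', 'and', 'Mark')]
```

With a single tag, findall returns plain strings rather than tuples, so calling code may need to handle that case separately.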

把人绕傻吧 2025-01-09 20:47:40

>>> import re
>>> s = "Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:"
>>> re.findall(r"(\w+)/NNP (\w+)/CC (\w+)/NNP", s)
[('Tristan', 'and', 'Isolde')]

You can adapt the pattern in the same way for whatever tag sequence you need.

EDIT: A more general version:

>>> import re
>>> pattern = 'NNP,CC,NNP'
>>> tags = pattern.split(",")
>>> # One "word/TAG" line per tag, each followed by a newline
>>> p = "".join(r"(\w+)/" + re.escape(tag) + r"\n" for tag in tags)
>>> with open("yourfile", "r") as f:
...     s = f.read()
...
>>> found = re.findall(p, s)
>>> found  # list of matching word tuples
[('Tristan', 'and', 'Isolde')]
>>> found_str = " ".join(found[0])  # first match, converted to a string
>>> with open("written.txt", "w") as f:
...     f.write(found_str)
...
18
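
For the "count backwards" and "does the next tag match" part of the question, a regex-free alternative is to parse the file into (word, tag) pairs first and then slide a window over them. This is only a sketch under the assumption that the file has one word/TAG token per line. The function names parse_tagged_lines and find_tag_runs are hypothetical, and sample_lines stands in for the lines read from the file.

```python
def parse_tagged_lines(lines):
    """Turn 'word/TAG' lines into (word, tag) pairs, skipping lines without a slash."""
    pairs = []
    for line in lines:
        line = line.strip()
        if "/" not in line:
            continue  # blank lines and markers such as <Opera>
        # Split on the last slash so the tag is always the final part.
        word, _, tag = line.rpartition("/")
        pairs.append((word, tag))
    return pairs

def find_tag_runs(pairs, tags):
    """Return the words of every window of consecutive pairs whose tags equal `tags`."""
    if not tags:
        return []
    n = len(tags)
    runs = []
    for start in range(len(pairs) - n + 1):
        window = pairs[start:start + n]
        if [tag for _, tag in window] == tags:
            runs.append(" ".join(word for word, _ in window))
    return runs

sample_lines = ["<Opera>", "", "Tristan/NNP", "and/CC", "Isolde/NNP", "and/CC", "the/DT"]
query = "NNP,CC,NNP"  # stands in for the string typed by the user
tags = [t.strip() for t in query.split(",")]
print(find_tag_runs(parse_tagged_lines(sample_lines), tags))
# ['Tristan and Isolde']
```

Because the pairs are kept in a list, there is no need to count backwards: each window already holds the first word of a candidate match along with the tags that follow it.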

飞烟轻若梦 2025-01-09 20:47:40

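It looks like your source text was probably generated by the Natural Language Toolkit (nltk).

Using nltk, you can tokenize the text, split each token into a (word, part_of_speech) tuple, and then iterate over n-grams to find the ones that match the pattern: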

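import nltk

pattern = 'NNP,CC,NNP'
pattern = [pat.strip() for pat in pattern.split(',')]
text = '''Tristan/NNP and/CC Isolde/NNP and/CC the/DT fatalistic/NN horns/VBZ
          The/DT passionate/JJ violins/NN And/CC ominous/JJ clarinet/NN ;/:'''
# word_tokenize needs NLTK's tokenizer data to be installed;
# str2tuple splits each "word/TAG" token into a (word, tag) pair.
tagged_token = [nltk.tag.str2tuple(word) for word in nltk.word_tokenize(text)]
# Older NLTK releases called this nltk.ingrams; current releases provide nltk.ngrams.
for ngram in nltk.ngrams(tagged_token, len(pattern)):
    # Keep the n-gram only if every tag lines up with the requested pattern.
    if all(gram[1] == pat for gram, pat in zip(ngram, pattern)):
        print(' '.join(word for word, pos in ngram))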

yields