Finding the positions of nouns and verbs in a sentence in Python

Posted on 2025-01-12 22:41:01

Is there a way to find the position of the words with pos-tag 'NN' and 'VB' in a sentence in Python?

Example sentences in a CSV file:
"Man walks into a bar."
"Cop shoots his gun."
"Kid drives into a ditch"

Comments (2)

雪花飘飘的天空 2025-01-19 22:41:02

You can find the positions of certain PoS tags in a text using one of the existing NLP frameworks, such as Spacy or NLTK. Once you process the text, you can iterate over each token, check whether its PoS tag is the one you are looking for, and then get the start/end position of that token in the text.

Spacy

Using spacy, the code to implement what you want would be something like this:

import spacy

nlp = spacy.load("en_core_web_lg")  # Requires: python -m spacy download en_core_web_lg
doc = nlp("Man walks into a bar.")  # Your text here

words = []
for token in doc:
    if token.pos_ == "NOUN" or token.pos_ == "VERB":
        start = token.idx  # Start position of token
        end = token.idx + len(token)  # End position = start + len(token)
        words.append((token.text, start, end, token.pos_))

print(words)

In short, I build a new document from the string, iterate over all the tokens, and keep only those whose PoS tag is VERB or NOUN. Finally, I add the token info to a list for further processing. I strongly recommend that you read the spacy tutorials for more information.
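Since the question mentions sentences stored in a CSV file, here is a minimal sketch (my addition, not part of the original answer) of feeding each row through such a pipeline using the standard library's csv module. `extract_spans` is a hypothetical stand-in that you would replace with the spacy loop above; here it simply returns whitespace-token spans so the sketch runs without a trained model installed.

```python
import csv
import io
import re

def extract_spans(sentence):
    # Hypothetical stand-in for the spacy loop above: returns
    # (word, start, end) for every whitespace-separated token,
    # with no PoS filtering.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]

# In practice you would use open("sentences.csv"); StringIO keeps the sketch self-contained.
csv_data = io.StringIO(
    "Man walks into a bar.\n"
    "Cop shoots his gun.\n"
    "Kid drives into a ditch\n"
)
for (sentence,) in csv.reader(csv_data):
    print(sentence, extract_spans(sentence))
```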

NLTK

Using NLTK is pretty straightforward too, with the NLTK tokenizer and PoS tagger. The rest is almost analogous to how we do it with spacy.

What I'm not sure about is the most correct way to get the start/end positions of each token. Note that for this I am using the WhitespaceTokenizer().span_tokenize() method, which returns a list of tuples with the start and end positions of each token. Maybe there is a simpler and more NLTK-like way of doing it.

import nltk
from nltk.tokenize import WhitespaceTokenizer

text = "Man walks into a bar."  # Your text here
tokenizer = WhitespaceTokenizer()
tokens_positions = list(tokenizer.span_tokenize(text))  # Start/end spans: [(0, 3), (4, 9), ...]
tokens = tokenizer.tokenize(text)  # Token strings: ["Man", "walks", "into", ...]

tokens = nltk.pos_tag(tokens)  # Run the Part-of-Speech tagger

# Iterate over each token
words = []
for i in range(len(tokens)):
    word, tag = tokens[i]  # Get the token and its tag
    start, end = tokens_positions[i]  # Get the token's start/end positions
    if tag == "NN" or tag == "VBZ":  # "walks" is tagged VBZ; adjust to the tags you need
        words.append((word, start, end, tag))

print(words)

I hope this works for you!
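One detail worth noting (my addition): the question asks for 'NN' and 'VB', but the Penn Treebank tagset that NLTK uses has subtypes (NNS, NNP, VBZ, VBD, ...), so matching by tag prefix is often what you actually want. A small sketch of that filter, applied to an already-tagged list:

```python
def filter_by_tag_prefix(tagged_tokens, prefixes=("NN", "VB")):
    """Keep (word, tag) pairs whose Penn Treebank tag starts with any given prefix."""
    return [(word, tag) for word, tag in tagged_tokens if tag.startswith(prefixes)]

# Example input shaped like nltk.pos_tag output
tagged = [("Man", "NN"), ("walks", "VBZ"), ("into", "IN"), ("a", "DT"), ("bar", "NN"), (".", ".")]
print(filter_by_tag_prefix(tagged))
# [('Man', 'NN'), ('walks', 'VBZ'), ('bar', 'NN')]
```

str.startswith accepts a tuple of prefixes, which keeps the check concise.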

败给现实 2025-01-19 22:41:02

You should take a look at nltk.

From the doc:

import nltk
text = nltk.tokenize.word_tokenize("They refuse to permit us to obtain the refuse permit")


nltk.pos_tag(text)

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
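This shows the tags but not the positions the question asks for. As a sketch of one way to close that gap (my addition, assuming a `(word, tag)` list shaped like the nltk.pos_tag output above), you can walk the original string with str.find, tracking a cursor so repeated words like "refuse" resolve to the right occurrence:

```python
def positions_of_tags(text, tagged, wanted=("NN", "VB")):
    """Given a text and its (word, tag) pairs, return (word, start, end, tag)
    for tokens whose tag starts with any prefix in `wanted`."""
    results, cursor = [], 0
    for word, tag in tagged:
        start = text.find(word, cursor)  # locate this occurrence, not an earlier one
        cursor = start + len(word)
        if tag.startswith(wanted):
            results.append((word, start, cursor, tag))
    return results

text = "They refuse to permit us to obtain the refuse permit"
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]
print(positions_of_tags(text, tagged))
# [('refuse', 5, 11, 'VBP'), ('permit', 15, 21, 'VB'), ('obtain', 28, 34, 'VB'),
#  ('refuse', 39, 45, 'NN'), ('permit', 46, 52, 'NN')]
```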