Find the position of nouns and verbs in a sentence in Python
Is there a way to find the positions of the words with POS tags 'NN' and 'VB' in a sentence in Python?
Example sentences from a CSV file:
"Man walks into a bar."
"Cop shoots his gun."
"Kid drives into a ditch"
Comments (2)
You can find the positions of words with certain POS tags using an existing NLP framework such as spaCy or NLTK. Once the text is processed, you can iterate over the tokens, check whether each token's POS tag is one you are looking for, and then get the start/end position of that token in your text.
spaCy
Using spaCy, the code to implement what you want would be something like this:
In short, I build a new document from the string, iterate over all the tokens, and keep only those whose POS tag is VERB or NOUN. Finally, I add the token info to a list for further processing. I strongly recommend reading the spaCy tutorial for more information.
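The original snippet is not preserved here, so the following is only a minimal sketch of the steps just described, not the answerer's exact code. The names `WANTED_POS`, `noun_verb_positions`, and `analyze` are made up for illustration, and the model name `en_core_web_sm` is an assumption (any installed English pipeline with a tagger should do). It relies on the spaCy token attributes `text`, `pos_`, and `idx` (the character offset of the token in the original string).

```python
from typing import Iterable, List, Tuple

# Coarse-grained POS labels to keep; spaCy exposes them as token.pos_.
WANTED_POS = {"NOUN", "VERB"}


def noun_verb_positions(tokens: Iterable) -> List[Tuple[str, str, int, int]]:
    """Return (text, pos, start_char, end_char) for each noun or verb.

    `tokens` can be a spaCy Doc, or any iterable of objects exposing
    `.text`, `.pos_` and `.idx`.
    """
    found = []
    for token in tokens:
        if token.pos_ in WANTED_POS:
            start = token.idx  # character offset of the token's first letter
            found.append((token.text, token.pos_, start, start + len(token.text)))
    return found


def analyze(sentence: str, model: str = "en_core_web_sm"):
    """Hypothetical helper: tag `sentence` with spaCy, then filter it."""
    import spacy  # imported lazily so the filter above works without spaCy

    nlp = spacy.load(model)  # assumes this model has been downloaded
    return noun_verb_positions(nlp(sentence))
```

Calling `analyze("Cop shoots his gun.")` should return one tuple per noun or verb, though the exact tags depend on the model. If you want the token index rather than the character offset, `token.i` gives it. Proper nouns are tagged `PROPN`, so add that label to `WANTED_POS` if you need them.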
NLTK
With NLTK it is pretty straightforward too, using the NLTK tokenizer and POS tagger. The rest is almost analogous to how we do it using spaCy.
What I'm not sure about is the most correct way to get the start and end positions of each token. Note that for this I am using a tokenization helper created by the
WhitespaceTokenizer().tokenize()
method, which returns a list of tuples with the start and end positions of each token. Maybe there is a simpler and more NLTK-like way of doing it. I hope this works for you!
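As a rough sketch of the NLTK variant (again not the original code, which is missing here): the hypothetical helper `locate_tagged` takes the `(word, tag)` pairs that `nltk.pos_tag` produces and recovers character offsets by scanning the text from left to right. This is a swapped-in technique rather than the `WhitespaceTokenizer` helper mentioned above. As far as I know, `WhitespaceTokenizer().span_tokenize(text)` is the NLTK call that yields `(start, end)` offsets directly, while `tokenize()` returns plain strings. `tag_sentence` assumes the NLTK tokenizer and tagger data have already been downloaded.

```python
from typing import Iterable, List, Sequence, Tuple


def locate_tagged(
    text: str,
    tagged: Iterable[Tuple[str, str]],
    prefixes: Sequence[str] = ("NN", "VB"),
) -> List[Tuple[str, str, int, int]]:
    """Return (word, tag, start, end) for tokens whose tag starts with a prefix.

    `tagged` is a list of (word, tag) pairs in sentence order, for example
    the output of nltk.pos_tag. Offsets are found by searching the text
    from the end of the previous match.
    """
    wanted = tuple(prefixes)  # str.startswith needs a tuple, not a list
    positions = []
    cursor = 0
    for word, tag in tagged:
        start = text.find(word, cursor)
        if start == -1:
            # The tokenizer rewrote this token (e.g. quote marks); skip it.
            continue
        end = start + len(word)
        cursor = end
        if tag.startswith(wanted):  # matches NN, NNS, VB, VBZ, VBD, ...
            positions.append((word, tag, start, end))
    return positions


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: tokenize and POS-tag `sentence` with NLTK."""
    import nltk  # imported lazily so locate_tagged works without NLTK

    return nltk.pos_tag(nltk.word_tokenize(sentence))
```

Typical use would be `locate_tagged(s, tag_sentence(s))` for each sentence `s` read from the CSV file. Matching on the prefixes `NN` and `VB` picks up all the Penn Treebank noun and verb variants, which is probably what the question is after.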
You should take a look at NLTK.
From the docs: