Get the word count of a sentence without punctuation using NLTK in Python
I am trying to get the word count of each sentence using NLTK in Python.
This is the code I wrote:
import nltk
data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."
for i in nltk.sent_tokenize(data):
    print(nltk.word_tokenize(i))
This was the output:
['Sample', 'sentence', ',', 'for', 'checking', '.']
['Here', 'is', 'an', 'exclamation', 'mark', '!']
['Here', 'is', 'a', 'question', '?']
['This', 'is', "n't", 'an', 'easy-task', '.']
Is there any way to remove the punctuation marks, prevent isn't from being split into two tokens, and split easy-task into two words?
The answer I need is something like this:
['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy', 'task']
I can partly handle the punctuation marks by filtering them out with a stopword list:
import nltk
data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."
stopwords = [',', '.', '?', '!']
for i in nltk.sent_tokenize(data):
    for j in nltk.word_tokenize(i):
        if j not in stopwords:
            print(j, ', ', end="")
    print('\n')
Output:
Sample , sentence , for , checking ,
Here , is , an , exclamation , mark ,
Here , is , a , question ,
This , is , n't , an , easy-task ,
But this does not fix the isn't and easy-task cases. Is there a way to do this?
Thank you
Comments (1)
You can use a different tokenizer that handles these requirements.
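The answer does not say which tokenizer it has in mind. One possibility, offered here only as a sketch, is NLTK's RegexpTokenizer with a pattern that treats a word as a run of word characters optionally joined by apostrophes. The pattern, the helper name words_per_sentence, and the simplified sentence splitter are assumptions made for illustration and are not taken from the original answer.

```python
from nltk.tokenize import RegexpTokenizer

data = ("Sample sentence, for checking. Here is an exclamation mark! "
        "Here is a question? This isn't an easy-task.")

# A word is a run of word characters, optionally joined by apostrophes.
# "isn't" therefore stays in one piece, "easy-task" splits at the hyphen,
# and punctuation never matches, so it is dropped.
word_tokenizer = RegexpTokenizer(r"\w+(?:'\w+)*")

# Assumption: sentences end only at '.', '!' or '?'. This is far cruder
# than nltk.sent_tokenize (it breaks on abbreviations such as "e.g."),
# but it needs no downloaded model data.
sentence_tokenizer = RegexpTokenizer(r"[^.!?]+")


def words_per_sentence(text):
    """Return one list of word tokens per sentence (hypothetical helper)."""
    return [word_tokenizer.tokenize(sentence)
            for sentence in sentence_tokenizer.tokenize(text)]


for words in words_per_sentence(data):
    # The word count is simply the length of the token list.
    print(len(words), words)
```

On the sample text this prints the counts 4, 5, 4 and 5 next to the token lists requested in the question. If the punkt data is installed, nltk.sent_tokenize(data) from the question can replace sentence_tokenizer.tokenize(text) for more robust sentence splitting; the word pattern works the same either way.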