Get the word count of a sentence without punctuation using NLTK in Python

Posted on 2025-01-12 18:57:38

I am trying to get the word count of each sentence with NLTK in Python.

This is the code I wrote:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

for i in nltk.sent_tokenize(data):
    print(nltk.word_tokenize(i))

This was the output

['Sample', 'sentence', ',', 'for', 'checking', '.']
['Here', 'is', 'an', 'exclamation', 'mark', '!']
['Here', 'is', 'a', 'question', '?']
['This', 'is', "n't", 'an', 'easy-task', '.']

Is there a way to remove the punctuation marks, keep isn't from being split into two tokens, and split easy-task into two words?

The output I need is something like this:

['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy', 'task']

I can more or less handle the punctuation marks by filtering them out like stopwords:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

stopwords = [',', '.', '?', '!']

for i in nltk.sent_tokenize(data):
    for j in nltk.word_tokenize(i):
        if j not in stopwords:
            print(j, ', ', end="")
    print('\n')

output:

Sample , sentence , for , checking , 

Here , is , an , exclamation , mark , 

Here , is , a , question , 

This , is , n't , an , easy-task , 

But this does not fix isn't and easy-task. Is there a way to do this?
Thank you.


Comments (1)

原来分手还会想你 2025-01-19 18:57:38

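You can use a different tokenizer. TweetTokenizer keeps contractions such as isn't as a single token; note that it also leaves hyphenated words such as easy-task unsplit.

import string

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

tokenizer = nltk.TweetTokenizer()
punctuation = set(string.punctuation)

for i in nltk.sent_tokenize(data):
    # Keep every token that is not a single punctuation character.
    print([x for x in tokenizer.tokenize(i) if x not in punctuation])

# Output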
['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy-task']
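
The output above still keeps easy-task as one token, and neither post actually counts the words. A minimal sketch of one way to cover both, assuming hyphens should always act as word separators (as in the output the question asks for): tokenize with TweetTokenizer, split each token on "-", drop pieces that consist only of punctuation, and take len() of the result. The helper name words_in is hypothetical, not part of NLTK, and the sentences are listed literally so the sketch does not depend on the sentence-tokenizer model being downloaded.

```python
import string

from nltk.tokenize import TweetTokenizer

# Assumption: hyphens are treated as word separators, as in the output the
# question asks for ('easy-task' -> 'easy', 'task').
_tokenizer = TweetTokenizer()
_punctuation = set(string.punctuation)


def words_in(sentence):
    """Return the words of one sentence, without punctuation tokens."""
    words = []
    for token in _tokenizer.tokenize(sentence):
        # Split hyphenated tokens; a bare '-' yields only empty strings.
        for part in token.split("-"):
            # Drop empty pieces and pieces made up only of punctuation.
            if part and not all(ch in _punctuation for ch in part):
                words.append(part)
    return words


# The four sentences from the thread, already split. In the thread they come
# from nltk.sent_tokenize(data).
sentences = [
    "Sample sentence, for checking.",
    "Here is an exclamation mark!",
    "Here is a question?",
    "This isn't an easy-task.",
]

for sentence in sentences:
    words = words_in(sentence)
    # Print the word count followed by the words themselves.
    print(len(words), words)
```

For the sample data this should give counts of 4, 5, 4 and 5, with the last sentence tokenized as ['This', "isn't", 'an', 'easy', 'task']. If some hyphenated words need to stay intact, the split on "-" is the part to change.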