How do I count the number of sentences, words, and characters in a file?
I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me find and print the number of sentences, words, and characters in the file? I am using NLTK in Python for this.
>>> import nltk
>>> with open('samp.txt') as f:
...     raw = f.read()
...
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     print(each_sentence)  # prints each tokenized sentence from samp.txt
...
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)  # prints each tokenized word from samp.txt
...
Comments (8)
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
Hope this helps.
With NLTK, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).
And in your case:
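A minimal sketch of how a frequency distribution yields these counts, not the answerer's original code. In NLTK 3, FreqDist subclasses collections.Counter, so the standard-library Counter is used here as a stand-in; the helper name count_tokens and the whitespace split (in place of nltk.word_tokenize) are assumptions made for illustration.

```python
from collections import Counter

def count_tokens(tokens):
    """Return (total tokens, distinct tokens, characters inside tokens)."""
    freq = Counter(tokens)        # token -> number of occurrences
    total = sum(freq.values())    # comparable to FreqDist.N() in NLTK
    distinct = len(freq)          # comparable to FreqDist.B() in NLTK
    chars = sum(len(tok) * n for tok, n in freq.items())
    return total, distinct, chars

# Whitespace splitting stands in for nltk.word_tokenize here.
tokens = "the cat sat on the mat".split()
print(count_tokens(tokens))
```

With NLTK installed, the same totals should be available from nltk.FreqDist(tokens) through its N() and B() methods, though the tokens themselves will differ because word_tokenize separates punctuation.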
For what it's worth, if someone comes along here: I think this addresses everything the OP's question asked. If one uses the textstat package, counting sentences and characters is very easy. Note that the punctuation at the end of each sentence matters.
I believe this to be the right solution because it properly counts things like "..." and "??" as a single sentence.
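The point about "..." and "??" can be illustrated without any package. The sketch below is not textstat's implementation; count_sentences and count_characters are hypothetical helpers that treat a run of terminal punctuation as one sentence boundary and optionally skip whitespace when counting characters.

```python
import re

def count_sentences(text):
    """Count sentences, treating a run such as '...' or '??' as one terminator."""
    # Split on one or more terminal punctuation marks, then drop empty pieces.
    pieces = re.split(r"[.!?]+", text)
    return sum(1 for piece in pieces if piece.strip())

def count_characters(text, ignore_spaces=True):
    """Count characters, optionally skipping whitespace."""
    if ignore_spaces:
        return sum(1 for ch in text if not ch.isspace())
    return len(text)

sample = "Wait... Really?? Yes."
print(count_sentences(sample))
print(count_characters(sample))
```

This rule relies entirely on end-of-sentence punctuation, so a final sentence without a terminator is still counted, but abbreviations such as "Dr." would be miscounted as boundaries.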
For words and sentences, you will probably need to state clearly what you mean by a sentence and by a word, and then program for that definition.
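To see why the definition matters, here is a small sketch that applies three plausible definitions of a "word" to the same string. The sample text and the regular expressions are illustrative assumptions, not rules taken from the thread.

```python
import re

text = "It's a state-of-the-art test."

# Definition 1: a word is a run of non-whitespace characters.
whitespace_words = text.split()

# Definition 2: a word is a run of letters, digits, or underscores.
alnum_words = re.findall(r"\w+", text)

# Definition 3: letters, allowing internal apostrophes and hyphens.
joined_words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(len(whitespace_words), whitespace_words)
print(len(alnum_words), alnum_words)
print(len(joined_words), joined_words)
```

The first and third definitions agree on the count here but not on the tokens (the first keeps the trailing period on "test."), while the second splits contractions and hyphenated compounds into many more words.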
Not 100% correct, but I just gave it a try. I have not taken all of @wilhelmtell's points into consideration. I will try them once I have time...
Here 1.txt is the file name.
The only way you can solve this is by creating an AI program that uses Natural Language Processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Alan Turing invented the Turing Machine. It solved a problem that has a .1% chance of being solved."
Check out OpenNLP
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
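The difficulty with that input can be shown in a few lines. The sketch below compares counting every period with a slightly smarter heuristic; the ABBREVIATIONS set and both function names are hypothetical and far from a real NLP sentence detector such as the one in OpenNLP.

```python
import re

TEXT = ("This is a paragraph about the Turing machine. "
        "Dr. Alan Turing invented the Turing Machine. "
        "It solved a problem that has a .1% chance of being solved.")

# Illustrative list only; real text needs a much larger, language-aware set.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Ms", "Prof"}

def naive_count(text):
    """Treat every period as a sentence end."""
    return text.count(".")

def heuristic_count(text):
    """Count periods followed by whitespace or end of text, skipping abbreviations."""
    count = 0
    for match in re.finditer(r"(\S*)\.(?=\s|$)", text):
        if match.group(1) not in ABBREVIATIONS:
            count += 1
    return count

print(naive_count(TEXT))
print(heuristic_count(TEXT))
```

The naive count is inflated by "Dr." and ".1%", while the heuristic only gets the right answer because "Dr" happens to be in the list, which is why a trained sentence detector is usually preferred.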
There's already a program to count words and characters: wc.
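For comparison, here is a rough Python sketch of what wc reports. It assumes the usual wc conventions (lines are counted as newline characters, words as whitespace-delimited runs); the function name wc is chosen only to mirror the tool.

```python
def wc(text):
    """Return (lines, words, characters), roughly what `wc -l -w -m` reports."""
    lines = text.count("\n")    # wc counts newline characters, not sentences
    words = len(text.split())   # whitespace-delimited words
    chars = len(text)           # characters rather than bytes
    return lines, words, chars

sample = "Hello world.\nThis is wc.\n"
print(wc(sample))
```

Note that wc has no notion of a sentence, so it covers only the word and character parts of the question.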