How to count the number of sentences, words, and characters in a file?

Published 2024-10-18 07:03:37


I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words and characters in the file? I have used NLTK in python for this.

>>> import nltk
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     print(each_sentence)   # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)       # prints tokenized words from samp.txt


Comments (8)

差↓一点笑了 2024-10-25 07:03:37


Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk

# dirpath is the directory containing your text file(s)
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))

Hope this helps

愿与i 2024-10-25 07:03:37


With nltk, you can also use FreqDist (see the O'Reilly book, Ch. 3.1)

And in your case:

import nltk

# Read the file as UTF-8 text
raw = open('samp.txt', encoding='utf-8').read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())   # total number of word tokens
梦在夏天 2024-10-25 07:03:37

For what it's worth, if someone comes along here: I think this addresses everything the OP asked. If one uses the textstat package, counting sentences and characters is very easy. Punctuation at the end of each sentence matters for this to work.

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
屌丝范 2024-10-25 07:03:37


I believe this to be the right solution, because it properly counts things like "..." and "??" as a single sentence terminator:

import re

len(re.findall(r"[^?!.][?!.]", paragraph))
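A quick check of that regex on a hypothetical paragraph string (the variable name and sample text are illustrative, not from the original): each match is a non-terminator character followed by a `?`, `!`, or `.`, so a run like `...` or `??` can only match at its first character.

```python
import re

def count_sentences(paragraph):
    # A sentence ends at ?, !, or . that is not preceded by another
    # terminator, so "..." or "??" is counted only once.
    return len(re.findall(r"[^?!.][?!.]", paragraph))

paragraph = "Wait... Really?? Yes! It works."
print(count_sentences(paragraph))  # -> 4
```

One caveat: a paragraph that starts with a terminator, or ends mid-sentence with no punctuation, is not handled by this pattern.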
青丝拂面 2024-10-25 07:03:37

  • Characters are easy to count.
  • Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one newline or none following it.
  • Sentences are tricky. You might settle for a period, exclamation mark, or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character will be a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (though that is arguable, as in this case).
  • Words are tricky too. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word and sometimes not. That is the case with a hyphen, for example.

For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
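As a rough sketch of what "programming for that" could look like, under deliberately simple (and debatable) definitions chosen for this example: a word is a maximal run of non-whitespace, a sentence ends at `.`, `!`, or `?` followed by whitespace or end of text, and paragraphs are separated by blank lines.

```python
import re

def count_all(text):
    """Count characters, words, sentences, and paragraphs under
    deliberately simple definitions (see the caveats above)."""
    chars = len(text)
    words = len(text.split())                         # whitespace-delimited
    sentences = len(re.findall(r"[.!?](?=\s|$)", text))
    paragraphs = len([p for p in re.split(r"\n\s*\n", text) if p.strip()])
    return chars, words, sentences, paragraphs

sample = "First sentence. Second one!\n\nNew paragraph?"
print(count_all(sample))  # -> (43, 6, 3, 2)
```

Abbreviations like "e.g. this" or "Dr. Turing" would inflate the sentence count here, which is exactly the kind of case the definitions above would need to spell out.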

风吹短裙飘 2024-10-25 07:03:37


Not 100% correct, but I just gave it a try. I have not taken all the points made by @wilhelmtell into consideration. I will try them once I have time...

if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum(len(word) for word in words)
            prevIsSentence = True
        else:
            if prevIsSentence:
                s = s + 1
            prevIsSentence = False

    if not prevIsSentence:
        s = s - 1
    print("%d:%d:%d" % (c, w, s))

Here 1.txt is the file name.

终难愈 2024-10-25 07:03:37


The only way you can solve this is by creating an AI program that uses Natural Language Processing, which is not very easy to do.

Input:

"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."

Check out OpenNLP:

https://sourceforge.net/projects/opennlp/

http://opennlp.apache.org/

最初的梦 2024-10-25 07:03:37


There's already a program to count words and characters: wc.
