How do I count the number of sentences, words, and characters in a file?

Posted on 2024-10-18 07:03:37

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me find and print the number of sentences, words, and characters in the file? I am using NLTK in Python for this.

>>> import nltk
>>> with open('samp.txt') as f:
...     raw = f.read()
...
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     print(each_sentence)   # prints each tokenized sentence from samp.txt
...
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)       # prints each tokenized word from samp.txt

Comments (8)

差↓一点笑了 2024-10-25 07:03:37

Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk

# dirpath is the directory that holds your text file; set it yourself
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
# characters inside tokens only; whitespace between tokens is not counted
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))

Hope this helps

愿与i 2024-10-25 07:03:37

With NLTK, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).

And in your case:

import nltk

# read the file as UTF-8 text
with open('samp.txt', encoding='utf-8') as f:
    raw = f.read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())   # total number of tokens

梦在夏天 2024-10-25 07:03:37

For what it's worth, in case someone comes along here: I think this addresses everything the OP asked. With the textstat package, counting sentences and characters is very easy. Note that the punctuation at the end of each sentence matters for the sentence count.

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))

屌丝范 2024-10-25 07:03:37

I believe this to be the right solution because it properly counts things like "..." and "??" as a single sentence ending:

import re

len(re.findall(r"[^?!.][?!.]", paragraph))
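As a quick check of the pattern above, here is a minimal sketch that runs it on two made-up strings. The helper name count_sentences and both sample strings are my own, not part of the answer. The second string shows a limitation: a period after an abbreviation or inside a number is still counted.

```python
import re

def count_sentences(paragraph):
    """Count sentence endings using the pattern from the answer."""
    # A match is one non-terminator character followed by a terminator,
    # so only the first mark of a run such as "..." or "??" is counted.
    return len(re.findall(r"[^?!.][?!.]", paragraph))

print(count_sentences("Wait... What?? Yes."))           # 3
print(count_sentences("Dr. Smith paid 3.50 dollars."))  # 3, although it is one sentence
```

So the pattern handles repeated punctuation well, but it would probably need extra rules for abbreviations and decimals.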

青丝拂面 2024-10-25 07:03:37

  • Characters are easy to count.
  • Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph break. You might say that an enumeration or an unordered list is a single paragraph, even though its entries can be separated by two newlines each. A heading or a title can also be followed by two newlines, even though it is clearly not a paragraph. Also consider the case of a single paragraph in a file, with one or no newline following it.
  • Sentences are tricky. You might settle for a period, exclamation mark, or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually, when it does, the next non-whitespace character is a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (but that is arguable, as in this case).
  • Words are tricky too. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.

For words and sentences, you will probably need to state your definition of a sentence and a word clearly, and program for that.
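One way to make such definitions explicit is to write them down as code. The sketch below uses one possible set of rules, chosen only for illustration: a paragraph is a block of text separated by blank lines, a sentence ends at a run of '.', '!' or '?' followed by whitespace or end of text, and a word is anything delimited by whitespace. The function name count_text and the sample string are made up.

```python
import re

def count_text(text):
    """Return (paragraphs, sentences, words, characters) under simple rules."""
    # Paragraph: a non-empty block separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentence end: a run of . ! or ? followed by whitespace or end of text.
    sentences = re.findall(r"[.!?]+(?=\s|$)", text)
    # Word: any run of non-whitespace characters.
    words = text.split()
    # Characters: every character, including whitespace and newlines.
    return len(paragraphs), len(sentences), len(words), len(text)

sample = "Is it 3.5 or 4? Nobody knows...\n\nA second paragraph."
print(count_text(sample))
```

With these rules the decimal point in "3.5" is not treated as a sentence end, but colons, abbreviations, and hyphenated words are not handled; each of those would need its own rule.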

风吹短裙飘 2024-10-25 07:03:37

Not 100% correct, but I just gave it a try. I have not taken all of @wilhelmtell's points into consideration. I will try them once I have time...

if __name__ == "__main__":
    c = w = 0              # character and word counts
    s = 1                  # counts blocks of text separated by blank lines
    prevIsSentence = False
    with open("1.txt") as f:
        for x in f:
            x = x.strip()
            if x != "":
                words = x.split()
                w = w + len(words)
                # characters inside words only; whitespace is not counted
                c = c + sum(len(word) for word in words)
                prevIsSentence = True
            else:
                if prevIsSentence:
                    s = s + 1
                prevIsSentence = False

    if not prevIsSentence:
        s = s - 1
    print("%d:%d:%d" % (c, w, s))

Here 1.txt is the file name.

终难愈 2024-10-25 07:03:37

The only way you can solve this is by creating an AI program that uses Natural Language Processing, which is not very easy to do.

Input:

"This is a paragraph about the Turing machine. Dr. Alan Turing invented the Turing Machine. It solved a problem that has a .1% chance of being solved."

Check out OpenNLP

https://sourceforge.net/projects/opennlp/

http://opennlp.apache.org/
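To see why the sample input in this answer is hard for simple rules, here is a small sketch. The rule it uses is an assumption made for illustration (a sentence ends at '.', '!' or '?' followed by whitespace or end of text); it is not taken from OpenNLP or any other library.

```python
import re

text = ("This is a paragraph about the Turing machine. "
        "Dr. Alan Turing invented the Turing Machine. "
        "It solved a problem that has a .1% chance of being solved.")

# Naive rule (an assumption for this sketch): a sentence ends at a run of
# . ! or ? that is followed by whitespace or the end of the text.
naive_ends = re.findall(r"[.!?]+(?=\s|$)", text)

print(len(naive_ends))  # 4, although a human reader sees 3 sentences
```

The rule is not fooled by ".1%", because the period is followed by a digit, but it does treat "Dr." as the end of a sentence. Handling abbreviations reliably is the kind of problem that trained sentence detectors are meant to address.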

最初的梦 2024-10-25 07:03:37

There's already a program to count words and characters: wc.
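For reference, running wc -l -w -m samp.txt prints the line, word, and character counts of the file, in that order. If you would rather stay in Python, the sketch below approximates the same three numbers. The function name wc_counts is made up, and it rests on my understanding that wc counts newline characters for lines and whitespace-delimited tokens for words.

```python
def wc_counts(text):
    """Approximate the line, word and character counts that wc reports."""
    lines = text.count("\n")      # number of newline characters
    words = len(text.split())     # whitespace-delimited tokens
    chars = len(text)             # every character, including whitespace
    return lines, words, chars

print(wc_counts("one two\nthree\n"))  # (2, 3, 14)
```

To apply it to a file, read the contents first, for example with open("samp.txt", encoding="utf-8"), and pass the resulting string to wc_counts. Note that wc has no notion of sentences, so it only covers part of the question.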
