Counting words in Python using regular expressions

Posted on 2024-11-07 12:19:46

What is the correct way to count English words in a document using a regular expression?

I tried:

import re
words = re.findall(r'\w+', open('text.txt').read().lower())
len(words)

but it seems I am missing a few words (compared to the word count in gedit).
Am I doing it right?

Thanks a lot!

坏尐絯℡ 2024-11-14 12:19:46

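Using \w+ won't correctly count words that contain an apostrophe or a hyphen; for example, "can't" will be counted as 2 words. It will also count numbers (strings of digits): "12,345" and "6.7" will each count as 2 words ("12" and "345", "6" and "7").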
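To make the difference concrete, here is a minimal sketch of a pattern that keeps apostrophes and hyphens inside a word and skips digit strings altogether. The names WORD_RE and count_words are made up for illustration, and ignoring numbers is an assumption; whether gedit counts numbers as words is not stated in this thread, so the pattern may need adjusting to match its total.

```python
import re

# Hypothetical pattern: runs of ASCII letters, optionally joined by a single
# apostrophe (straight or curly) or hyphen. Digits are deliberately excluded.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def count_words(text):
    """Return the number of matches of WORD_RE in text."""
    return len(WORD_RE.findall(text))

sample = "I can't count well-known words: 12,345 and 6.7 are numbers."

naive = len(re.findall(r"\w+", sample))   # splits "can't", "well-known", "12,345", "6.7"
adjusted = count_words(sample)            # keeps "can't" and "well-known" whole

print(naive, adjusted)
```

On this sample the plain \w+ pattern yields 14 tokens, while the adjusted pattern yields 8.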

赏烟花じ飞满天 2024-11-14 12:19:46

This seems to work as expected.

>>> import re
>>> words=re.findall(r'\w+', open('/usr/share/dict/words').read().lower())
>>> len(words)
234936
>>> 
bash-3.2$ wc /usr/share/dict/words
  234936  234936 2486813 /usr/share/dict/words

Why are you lowercasing the text? That should have nothing to do with the count.

I'd suggest that the following would be more efficient, since it skips the lowercasing step:

words = re.findall(r'\w+', open('/usr/share/dict/words').read())
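If memory use matters as well, one possible variation is to count matches line by line instead of building the whole list. This is only a sketch: the function name count_words_in_file and the UTF-8 default are assumptions, not something from the answer above. Since \w+ never matches a newline, counting per line gives the same total as matching against the whole file.

```python
import re

WORD_RE = re.compile(r"\w+")

def count_words_in_file(path, encoding="utf-8"):
    """Count \\w+ matches line by line, without lowercasing or building a list."""
    total = 0
    # The with-statement closes the file even if reading fails.
    with open(path, encoding=encoding) as handle:
        for line in handle:
            total += sum(1 for _ in WORD_RE.finditer(line))
    return total
```

Calling count_words_in_file('text.txt') should return the same number as len(re.findall(r'\w+', open('text.txt').read())).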

蹲墙角沉默 2024-11-14 12:19:46

Once you have a list of words, whether from _words_list = words.split() or from a regex or some other processing, you can easily get the count of each word as follows:

import pandas as pd

# _words_list is the list of words produced by the step above
_counted_words = pd.Series(_words_list).value_counts()
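If pandas is not already a dependency, a similar per-word tally is available from the standard library. The sketch below swaps in collections.Counter rather than using the pandas approach above, and word_frequencies is a made-up helper name. Note that lowercasing does matter here, because "The" and "the" would otherwise be tallied separately.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased \\w+ token to its frequency."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

freqs = word_frequencies("The cat saw the other cat")

# Total number of words, and the per-word counts.
print(sum(freqs.values()))
print(freqs.most_common())
```

The sum of the Counter's values is the overall word count, so the same object answers both questions.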
