Python - is dictionary lookup slow for finding the frequency of each character?
I am trying to find the frequency of each symbol in any given text using an algorithm of O(n) complexity. My algorithm looks like:
s = len(text)
P = 1.0/s
freqs = {}
for char in text:
    try:
        freqs[char] += P
    except:
        freqs[char] = P
but I doubt that this dictionary-based method is fast enough, because it depends on the underlying implementation of the dictionary methods. Is this the fastest method?
UPDATE: there is no increase in speed if collections and integers are used (a sketch of that variant follows the timings below). That is because the algorithm already has O(n) complexity, so no essential speedup is possible.
For example, results for 1MB text:
without collections:
real 0m0.695s
with collections:
real 0m0.625s
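For reference, a sketch of what the "with collections" variant from the update above might look like, assuming collections.Counter with integer counts that are normalized only once at the end (the exact code that was timed is an assumption here):

from collections import Counter

def char_freqs(text):
    counts = Counter(text)                  # integer count per character
    total = float(len(text))
    return dict((ch, c / total) for ch, c in counts.items())

print(char_freqs("hello world"))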
Performance comparison
Note: the times in the table don't include the time it takes to load the files.
The files used: '/usr/share/dict/american-english' and 'big.txt'.
The script that compares the 'Counter', 'setdefault', 'list', 'try/except', 'defaultdict', 'numpy', 'cython'-based, and @S.Mark's solutions is at http://gist.github.com/347000
The fastest solution is a Python extension written in Cython.
Previous comparison:
python (Counter): 0.5 seconds
perl: 0.5 seconds
python (numpy): 0.07 seconds (based on Ants Aasma's answer, modified to support unicode)
c++: 0.05 seconds
c (ascii): 0.0079 seconds
How about avoiding float operations inside the loop and doing it after everything is done?
That way, you could just do +1 every time, and it should be faster.
Better yet, use collections.defaultdict as S.Lott advised (a sketch follows below).
Or you may want to try collections.Counter in Python 2.7+.
Or you may try psyco, which does just-in-time compilation for Python.
You have loops, so I think you would get some performance gain with psyco.
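A minimal sketch of the two suggestions combined: integer +1 counts via collections.defaultdict inside the loop, with the float division done only once at the end.

from collections import defaultdict

def char_freqs(text):
    counts = defaultdict(int)      # missing keys start at 0, so no try/except is needed
    for ch in text:
        counts[ch] += 1            # plain integer increment inside the loop
    total = float(len(text))
    return dict((ch, c / total) for ch, c in counts.items())

print(char_freqs("hello world"))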
Edit 1:
I did some benchmarks based on big.txt (~6.5 MB), which is used in the spelling corrector by Peter Norvig.
CPU Specs: 1.6GHz Intel Mobile Atom CPU
According to that, dict.get is the slowest and collections.defaultdict is the fastest; try/except is also fast.
Edit 2:
Added collections.Counter benchmarks; it's slower than dict.get and took 15 s on my laptop.
I've written a Char Counter C Extension for Python; it looks like it is about 300x faster than collections.Counter and 150x faster than collections.defaultdict(int).
Here is the Char Counter C Extension code, where char_counter and char_list are malloc'ed at module level, so there is no need to malloc on every function call.
It returns a list of tuples; to convert it to dict format, just dict() will do. (PS: the benchmark includes the time for converting to dict.)
CharCounter accepts only a Python Unicode string (u""); if the text is utf8, you need to call .decode("utf8") in advance.
Input supports Unicode up to the Basic Multilingual Plane (BMP), 0x0000 to 0xFFFF.
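For illustration only, a rough pure-Python model of the behaviour described above (a flat BMP-sized counter array, a result returned as a list of tuples, and dict() for the conversion); the real extension does this in C, which is where the quoted speedups come from:

# Pure-Python model of the described interface (not the actual C extension):
# count code points up to the BMP in a flat list and return (char, count) tuples.
def char_count(text):
    counts = [0] * 0x10000                  # one slot per BMP code point (0x0000-0xFFFF)
    for ch in text:
        counts[ord(ch)] += 1                # direct indexing instead of hashing
    return [(chr(i), c) for i, c in enumerate(counts) if c]

freqs = dict(char_count(u"hello world"))    # dict() converts the list of tuples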
No, it's not the fastest, because you know that the characters have a limited range: you could use a list and direct indexing, using the numeric representation of the character, to store the frequencies.
It is very, very hard to beat dict. It is very highly tuned, since almost everything in Python is dict-based.
I'm not familiar with Python, but for finding frequencies, unless you know the range of the characters (in which case you can use an array), a dictionary is the way to go.
If you know your characters are in a Unicode, ASCII, etc. range, you can define an array with the correct number of values.
However, this will change the space complexity from O(n) to O(size of the character range), but you will earn a time-complexity improvement from O(n * (dictionary lookup/insertion time)) to O(n).
If you are really concerned about speed, you might consider first counting characters with integers and then obtaining frequencies through (float) division.
Here are the numbers:
Well, you can do it in the old-fashioned style... since we know that there are not too many different characters and they are contiguous, we can use a plain array (or a list here) and use the character's ordinal number for indexing, for example as sketched below.
This will probably be faster, but only by a constant factor; both methods should have the same complexity.
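A minimal sketch of that old-fashioned approach, assuming plain ASCII text so that 256 slots are enough:

def char_freqs(text):
    counts = [0] * 256                      # one counter per possible ASCII/byte value
    for ch in text:
        counts[ord(ch)] += 1                # the character's ordinal is the index
    total = float(len(text))
    return dict((chr(i), c / total) for i, c in enumerate(counts) if c)

print(char_freqs("hello world"))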
Using this code on Alice in Wonderland (163793 chars) and "The Bible, Douay-Rheims Version" (5649295 chars) from Project Gutenberg, I get: the ratio between the number of chars for both books is 0.029 and the ratio between the times is 0.030, so the algorithm is O(n) with a very small constant factor. Fast enough for most (all?) purposes, I should think.
If the data is in a single-byte encoding you can use numpy to accelerate the counting process, along the lines of the sketch below.
Some quick benchmarking shows that this is about 10x faster than aggregating into a defaultdict(int).
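A rough sketch of the numpy idea, assuming the text is available as single-byte data (e.g. ASCII/latin-1 bytes); this is an illustration, not necessarily the answer's original code:

import numpy as np

def char_freqs(data):
    arr = np.frombuffer(data, dtype=np.uint8)     # view the raw bytes as an array
    counts = np.bincount(arr, minlength=256)      # histogram over all 256 byte values
    total = float(len(arr))
    return dict((chr(i), c / total) for i, c in enumerate(counts) if c)

print(char_freqs(b"hello world"))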
To avoid the try/except overhead you can use a defaultdict.
A small speed up will come from using the dict.setdefault method; that way you will not pay a rather big price for every newly encountered character (see the sketch below).
As a sidenote: having a bare except: is considered very bad practice.
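A sketch of the setdefault variant (an assumed shape, not necessarily the answer's original code): setdefault inserts the default on the first occurrence of a character, so no try/except is needed.

def char_freqs(text):
    p = 1.0 / len(text)
    freqs = {}
    for ch in text:
        freqs[ch] = freqs.setdefault(ch, 0.0) + p   # default 0.0 on first sight, then add
    return freqs

print(char_freqs("hello world"))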