如何在没有循环的情况下计算APL或J中元素的频率
假设我有两个列表,一个是文本 t
,一个是字符 c
列表。我想计算每个字符在文本中出现的次数。
这可以通过以下 APL 代码轻松完成。
+⌿t∘.=c
然而它很慢。它采用外积,然后对每一列求和。
它是一个 O(nm) 算法,其中 n 和 m 是 t 和 c 的大小。
当然,我可以用 APL 编写一个过程程序,逐个字符地读取 t
并在 O(n+m) 内解决这个问题(假设完美的散列)。
有没有办法在没有循环(或条件)的情况下在 APL 中更快地完成此操作?我也接受 J 中的解决方案。
编辑: 实际上,我这样做的地方是文本比字符列表短得多(字符是非 ASCII 字符)。我正在考虑文本长度为 20 且字符列表长度为数千的情况。
有一个简单的优化给定 n 小于 m。
w ← (∪t)∩c
f ← +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r
w 仅包含 t 中的字符,因此表大小仅取决于 t 而不是 c。该算法的运行时间为 O(n^2+m log m)。其中m log m是进行交集运算的时间。
然而,次二次算法仍然是首选,以防有人提供巨大的文本文件。
Assume I have two lists, one is the text t
, one is a list of characters c
. I want to count how many times each character appears in the text.
This can be done easily with the following APL code.
+⌿t∘.=c
However it is slow. It take the outer product, then sum each column.
It is a O(nm) algorithm where n and m are the size of t
and c
.
Of course I can write a procedural program in APL that read t
character by character and solve this problem in O(n+m) (assume perfect hashing).
Are there ways to do this faster in APL without loops(or conditional)? I also accept solutions in J.
Edit:
Practically speaking, I'm doing this where the text is much shorter than the list of characters(the characters are non-ascii). I'm considering where text have length of 20 and character list have length in the thousands.
There is a simple optimization given n is smaller than m.
w ← (∪t)∩c
f ← +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r
w contains only the characters in t, therefore the table size only depend on t and not c. This algorithm runs in O(n^2+m log m). Where m log m is the time for doing the intersection operation.
However, a sub-quadratic algorithm is still preferred just in case someone gave a huge text file.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
注意。使用“关键”(/.) 副词和计数 (#) 动词
很重要。计数的项目是字符串的核心。
注意。因此,如果我们将目标与字符串
NB 一起计数。我们为每个目标项目额外获得一个。
注意。这会从 xs 的每个计数中减去 1 (<:)。
注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.43914 (10) 6!:2 '''1'' countKey2 1e8 '1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.466857注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.432938NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.466857注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.43964 '1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.466857注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.432938NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
'1234567890''' 0.466857注意。默认版本
NB。乍一看似乎有点快
NB。但重复计时 10 次表明它们是相同的。
NB. Using "key" (/.) adverb w/tally (#) verb counts
NB. the items counted are the nub of the string.
NB. So, if we count the target along with the string
NB. We get an extra one for each of the target items.
NB. This subtracts 1 (<:) from each count of the xs.
NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.43914 (10) 6!:2 '''1'' countKey2 1e8 '1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.466857NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.432938NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.466857NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.43964 '1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.466857NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.432938NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0451088 6!:2 '''1'' countKey2 1e7NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.0441849 6!:2 '''1'' countKey2 1e8NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
'1234567890''' 0.466857NB. A tacit version
NB. appears to be a little faster at first
NB. But repeating the timing 10 times shows they are the same.
Dyalog v14 引入了键运算符 (
⌸
):操作数函数将字母作为
⍺
并将该字母(索引向量)的出现次数作为⍵.
Dyalog v14 introduced the key operator (
⌸
):The operand function takes a letter as
⍺
and the occurrences of that letter (vector of indices) as⍵
.我觉得这个用J写的例子很符合你的要求。字符列表比文本长(但为了开发过程中的方便,两者都保持较短。)我没有检查时间,但我的直觉是它会很快。仅参考文本中实际出现的字符进行计数,并且仅查找长字符集以关联文本中出现的字符。
I think this example, written in J, fits your request. The character list is longer than the text (but both are kept short for convenience during development.) I have not examined timing but my intuition is that it will be fast. The tallying is done only with reference to characters that actually occur in the text, and the long character set is looked across only to correlate characters that occur in the text.
正如其他答案中所述,关键操作员直接执行此操作。然而解决这个问题的经典 APL 方法仍然值得了解。
经典的解决方案是“排序、移位和比较”:
对于最终答案:
这段代码是我的想法,还没有准备好用于生产。必须寻找空的情况 - 布尔移位可能并不适合所有情况......
As noted in other answers, the key operator does this directly. However the classic APL way of solving this problem is still worth knowing.
The classic solution is "sort, shift, and compare":
And for the final answer:
This code is off the top of my head, not ready for production. Have to look for empty cases - the boolean shift is probably not right for all cases....
J 中的“暴力”:
用法:
不确定它是如何在内部实现的,但这里是不同输入大小的时间:
我们不会在“自分类”之前过滤输入日期,因此:
'1234567890''' 0.673034 6!:2 '''1234567890'' count 10000000 'abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 0.00803909 6!:2 '''abcdefg'' count 1000000我们不会在“自分类”之前过滤输入日期,因此:
'abdaaaerbfqeiurbouebjkvwek''' 0.0845451 6!:2 '''abcdefg'' count 10000000我们不会在“自分类”之前过滤输入日期,因此:
'abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7 0.862423我们不会在“自分类”之前过滤输入日期,因此:
'1234567890''' 0.673864 'abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 0.00803909 6!:2 '''abcdefg'' count 1000000我们不会在“自分类”之前过滤输入日期,因此:
'abdaaaerbfqeiurbouebjkvwek''' 0.0845451 6!:2 '''abcdefg'' count 10000000我们不会在“自分类”之前过滤输入日期,因此:
'abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7 0.862423我们不会在“自分类”之前过滤输入日期,因此:
"Brute force" in J:
Usage:
Not sure how it's implemented internally, but here are the timings for different input sizes:
We don't filter input date prior to 'self-classify' so:
'1234567890''' 0.673034 6!:2 '''1234567890'' count 10000000 'abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 0.00803909 6!:2 '''abcdefg'' count 1000000We don't filter input date prior to 'self-classify' so:
'abdaaaerbfqeiurbouebjkvwek''' 0.0845451 6!:2 '''abcdefg'' count 10000000We don't filter input date prior to 'self-classify' so:
'abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7 0.862423We don't filter input date prior to 'self-classify' so:
'1234567890''' 0.673864 'abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 0.00803909 6!:2 '''abcdefg'' count 1000000We don't filter input date prior to 'self-classify' so:
'abdaaaerbfqeiurbouebjkvwek''' 0.0845451 6!:2 '''abcdefg'' count 10000000We don't filter input date prior to 'self-classify' so:
'abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7 0.862423We don't filter input date prior to 'self-classify' so:
我在 APL (NARS2000) 中的实现:
示例:
注意:仅显示 c 中存在于 t 中的那些字符
My implementation in APL (NARS2000):
Example:
Note: showing only those characters in c that exist in t
我最初的想法是,这是 Find 运算符的情况:
使用的字符是:
它们各自的频率是:
我只运行过玩具案例,所以我不知道性能如何,但我的直觉告诉我它应该比外部产品便宜。
有什么想法吗?
My initial thought was that this was a case for the Find operator:
The used characters are:
Their respective frequencies are:
I've only run toy cases so I have no idea what the performance is, but my intuition tells me it should be cheaper that the outer product.
Any thoughts?