How to count the frequency of elements in APL or J without loops

Posted on 2024-11-28 23:24:13

Assume I have two lists, one is the text t, one is a list of characters c. I want to count how many times each character appears in the text.

This can be done easily with the following APL code.

+⌿t∘.=c

However, it is slow: it takes the outer product, then sums each column.

It is an O(nm) algorithm, where n and m are the sizes of t and c.
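For readers who do not read APL, here is a minimal Python sketch of what the outer-product expression computes. The function name `count_outer` is made up for illustration; this is a transliteration of the idea, not a suggested replacement for the APL.

```python
def count_outer(t, c):
    """Rough equivalent of +⌿t∘.=c: build an n-by-m equality table, then sum each column."""
    # t∘.=c : one row per text character, one column per listed character.
    table = [[int(x == y) for y in c] for x in t]
    # +⌿ : sum down the rows, leaving one total per column (per character of c).
    return [sum(row[j] for row in table) for j in range(len(c))]


print(count_outer("abdaaa", "abc"))  # [4, 1, 0]
```

The table holds n×m cells, which is where the O(nm) cost comes from.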

Of course I can write a procedural program in APL that reads t character by character and solves this problem in O(n+m) (assuming perfect hashing).
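The procedural approach mentioned here might look roughly like the following. It is written in Python rather than APL purely for illustration, the name `count_hashed` is hypothetical, and the O(n+m) bound assumes constant-time dictionary operations, which is the "perfect hashing" assumption above.

```python
def count_hashed(t, c):
    """Count how often each character of c occurs in t using a hash table."""
    # One pass over the text builds the table of counts: O(n) expected.
    counts = {}
    for ch in t:
        counts[ch] = counts.get(ch, 0) + 1
    # One pass over the character list looks each character up: O(m) expected.
    return [counts.get(ch, 0) for ch in c]


print(count_hashed("abdaaa", "abc"))  # [4, 1, 0]
```

Characters of c that never occur in t simply get a count of 0.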

Are there ways to do this faster in APL without loops (or conditionals)? I also accept solutions in J.

Edit:
Practically speaking, I'm doing this where the text is much shorter than the list of characters (the characters are non-ASCII). I'm considering cases where the text has length 20 and the character list has a length in the thousands.

There is a simple optimization given that n is smaller than m.

w  ← (∪t)∩c
f ←  +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r

w contains only the characters of t that also appear in c, so the table size depends only on t and not on c. This algorithm runs in O(n^2 + m log m), where m log m is the time for the intersection operation.
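To make the five APL lines easier to follow, here is a line-by-line Python sketch of the same idea. The function name `count_small_text` is invented for this example, and the dictionary lookup stands in for the APL intersection and index-of, so the constant factors and the m log m term will not match an actual APL interpreter.

```python
def count_small_text(t, c):
    """Count characters of c in t, keeping the comparison table small when t is short."""
    # First position of every listed character, like c⍳w (first match wins).
    first_pos = {}
    for i, ch in enumerate(c):
        first_pos.setdefault(ch, i)
    # w ← (∪t)∩c : distinct characters of t that also occur in c.
    w = [ch for ch in dict.fromkeys(t) if ch in first_pos]
    # f ← +⌿t∘.=w : count each character of w in t (the table is at most n by n).
    f = [sum(1 for x in t if x == ch) for ch in w]
    # r ← (⍴c)⍴0 followed by r[c⍳w] ← f : scatter the counts into a zero vector.
    r = [0] * len(c)
    for ch, n in zip(w, f):
        r[first_pos[ch]] = n
    return r


print(count_small_text("abdaaa", "xaybz"))  # [0, 4, 0, 1, 0]
```

As in the APL version, if c contained duplicates only the first occurrence of a character would receive its count.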

However, a sub-quadratic algorithm is still preferred, in case someone supplies a huge text file.

万劫不复 2024-12-05 23:24:13

NB. Using "key" (/.) adverb w/tally (#) verb counts

   #/.~ 'abdaaa'
4 1 1

NB. the items counted are the nub of the string.

   ~. 'abdaaa'
abd

NB. So, if we count the target along with the string

   #/.~ 'abc','abdaaa'
5 2 1 1

NB. We get an extra one for each of the target items.

   countKey2=: 4 : '<:(#x){.#/.~ x,y'

NB. This subtracts 1 (<:) from each count of the xs.
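The trick behind countKey2 (prepend the targets, count every group, keep the first few counts, subtract one) can be sketched outside J as well. The Python below is only an illustration with a hypothetical name, `count_key`; it relies on Python dictionaries preserving insertion order, which plays the role of the nub order that key uses, and like the J verb it assumes the targets x contain no duplicates.

```python
def count_key(x, y):
    """Python analogue of <:(#x){.#/.~ x,y for strings x (targets) and y (text)."""
    # #/.~ x,y : count each distinct item of the joined string, in order of first appearance.
    counts = {}
    for ch in x + y:
        counts[ch] = counts.get(ch, 0) + 1
    # (#x){. : the first len(x) groups are exactly the targets, since they come first.
    first = list(counts.values())[:len(x)]
    # <: : remove the extra occurrence contributed by prepending the targets.
    return [n - 1 for n in first]


print(count_key("abc", "abdaaa"))  # [4, 1, 0]
```

Prepending the targets guarantees that every target has a group, so characters missing from the text come out as 0 rather than being dropped.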

   6!:2 '''1'' countKey2 10000000$''1234567890'''
0.0451088
   6!:2 '''1'' countKey2 1e7$''1234567890'''
0.0441849
   6!:2 '''1'' countKey2 1e8$''1234567890'''
0.466857

NB. A tacit version

   countKey=. [: <: ([: # [) {. [: #/.~ ,

NB. appears to be a little faster at first

   6!:2 '''1'' countKey 1e8$''1234567890'''
0.432938

NB. But repeating the timing 10 times shows they are the same.

   (10) 6!:2 '''1'' countKey 1e8$''1234567890'''
0.43914
   (10) 6!:2 '''1'' countKey2 1e8$''1234567890'''
0.43964