Comparing line-counting speed between wc and Smalltalk

Posted on 2024-12-13 21:31:42

I am comparing the performance of counting how many lines a file contains.

I did it first using the wc command line tool:

$ time wc -l bigFile.csv
1673820 bigFile.csv

real    0m0.157s
user    0m0.124s
sys     0m0.062s

and then in a clean image of the latest Pharo Core Smalltalk (1.3):

| file lineCount |
Smalltalk garbageCollect.
( Duration milliSeconds: [ file := FileStream readOnlyFileNamed: 'bigFile.csv'.
lineCount := 0.
[ file atEnd ] whileFalse: [
    file nextLine.
    lineCount := lineCount + 1 ].
file close.
lineCount. ] timeToRun ) asSeconds. 
15

How can I speed up the Smalltalk code so that it is faster than wc, or at least closer to it?

随遇而安 2024-12-20 21:31:42

[ (PipeableOSProcess waitForCommand: 'wc -l /path/to/bigfile2.csv') output ] timeToRun.

The above reports ~207 milliseconds, whereas the shell's time reported:

real    0m0.160s
user    0m0.131s
sys     0m0.029s

I'm kidding, but also serious. No need to reinvent the wheel. FFI, OSProcess, Zinc, etc. provide ample opportunity to utilize things like UNIX utilities that have been battle-tested over decades.

If your question was really more about Smalltalk itself, a start would be:

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | | count |
        count := 0.
        file binary.
        file contents do: [ :c | c = 10 ifTrue: [ count := count + 1 ] ].
        count ]
] timeToRun.

That will get you down to 2.5 seconds:

making the stream binary saved ~10 seconds
readOnlyFileNamed:do: saved ~1 second
finding the line endings manually instead of using #nextLine saved ~4 seconds

A cleaner version, though about half a second slower, would be:

file contents occurrencesOf: 10.

Of course, if better performance is needed, and you don't want to use FFI/OSProcess, you would then write a plugin.
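
Short of a plugin, one variation on the snippets above is to avoid pulling the whole file into memory with `file contents` and to count in fixed-size chunks instead. What follows is only a sketch under assumptions: the 64 KB chunk size is an arbitrary choice of mine, and I have not timed it, so it may or may not beat the 2.5 seconds quoted above. It relies on `next:` returning a shorter ByteArray for the final chunk when the stream is in binary mode.

```smalltalk
| count |
count := 0.
FileStream
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file |
        file binary.
        [ file atEnd ] whileFalse: [
            "Read up to 64 KB at a time (chunk size is an assumption);
             the last chunk may be shorter. 10 is the byte value of LF."
            count := count + ((file next: 65536) occurrencesOf: 10) ] ].
count
```

Like `wc -l`, this counts LF bytes only, so a final line without a trailing newline is not counted.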

兰花执着 2024-12-20 21:31:42

If you can afford to read the whole file into memory, then the simplest code is

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | file contents lineCount ]
] timeToRun.

This will handle the zoo of LF (Linux), CR (Old Mac), CR-LF (you name it).
The code from Sean only handles LF, for approximately the same cost.
I'd say a factor of 10 for Smalltalk vs C is expected for such basic operations, so I doubt you will get much more efficiency without adding your own primitives.
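
To illustrate what handling all three conventions involves, here is a minimal workspace sketch that counts line breaks in a String, treating LF, CR and CR-LF each as a single break. It is not the implementation of #lineCount, just an illustration; the variable names and the sample text are made up. It counts terminators the way `wc -l` does, so a trailing line with no terminator is not counted.

```smalltalk
| text breaks i c |
"Sample text with one LF, one lone CR, and one CR-LF pair."
text := 'a' , (String with: Character lf) ,
        'b' , (String with: Character cr) ,
        'c' , (String with: Character cr with: Character lf).
breaks := 0.
i := 1.
[ i <= text size ] whileTrue: [
    c := text at: i.
    c = Character lf ifTrue: [ breaks := breaks + 1 ].
    c = Character cr ifTrue: [
        breaks := breaks + 1.
        "A CR immediately followed by LF is one CR-LF break: skip the LF."
        (i < text size and: [ (text at: i + 1) = Character lf ])
            ifTrue: [ i := i + 1 ] ].
    i := i + 1 ].
breaks "evaluates to 3"
```

The extra lookahead on every CR is the price of supporting the mixed conventions; the LF-only loop needs just one comparison per byte.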
