Lua 挑战:你能提高 mandelbrot 实现的性能吗?
状态:到目前为止,最佳答案的程序执行时间是原始程序的 33%! 但可能还有其他方法可以优化它。
Lua 是目前最快的脚本语言,但是 Lua 在一些针对 C/C++ 的基准测试中得分非常差。
其中之一是 mandelbrot 测试(生成 Mandelbrot 设置便携式位图文件 N=16,000),其得分为可怕的 1:109(多核)或 1:28(单核),
因为速度增量相当大,这是一个很好的优化候选者。 另外,我确信那些知道 Mike Pall 的人可能会认为不可能进一步优化这一点,但这是明显错误的。 任何做过优化的人都知道,总是可以做得更好。 此外,我确实通过一些调整获得了一些额外的性能,所以我知道它是可能的:)
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall
local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
write("P4\n", width, " ", height, "\n")
for y=0,height-1 do
local Ci = 2*y / height - 1
for xb=0,width-1,8 do
local bits = 0
local xbb = xb+7
for x=xb,xbb < width and xbb or width-1 do
bits = bits + bits
local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
local Cr = x * wscale - 1.5
for i=1,m do
local Zri = Zr*Zi
Zr = Zrq - Ziq + Cr
Zi = Zri + Zri + Ci
Zrq = Zr*Zr
Ziq = Zi*Zi
if Zrq + Ziq > limit2 then
bits = bits + 1
break
end
end
end
if xbb >= width then
for x=width,xbb do bits = bits + bits + 1 end
end
write(char(255-bits))
end
end
那么如何优化它(当然,与任何优化一样,你必须测量你的实现以确保它更快)。 并且你不能为此改变 Lua 的 C 核心,或者使用 LuaJit,它是为了寻找方法来优化 Lua 的弱点之一。
编辑:为此设置赏金以使挑战更有趣。
Status: So far the best answer's program executes in 33% of the time of the original program! But there is probably still other ways to optimize it.
Lua is currently the fastest scripting language out there, however Lua scores really bad in a few benchmarks against C/C++.
One of those is the mandelbrot test (Generate Mandelbrot set portable bitmap file N=16,000), where it scores a horrible 1:109(Multi Core) or 1:28(Single Core)
Since the Delta in speed is quite large, this is a good candidate for optimizations. Also I'm sure some that those who know who Mike Pall is might believe its not possible to optimize this any further, but that's blatantly wrong. Anyone who has done optimizations knows it is always possible to do better. Besides I did manage to get some extra performance with a few tweaks, so I know its possible :)
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall
local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
write("P4\n", width, " ", height, "\n")
for y=0,height-1 do
local Ci = 2*y / height - 1
for xb=0,width-1,8 do
local bits = 0
local xbb = xb+7
for x=xb,xbb < width and xbb or width-1 do
bits = bits + bits
local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
local Cr = x * wscale - 1.5
for i=1,m do
local Zri = Zr*Zi
Zr = Zrq - Ziq + Cr
Zi = Zri + Zri + Ci
Zrq = Zr*Zr
Ziq = Zi*Zi
if Zrq + Ziq > limit2 then
bits = bits + 1
break
end
end
end
if xbb >= width then
for x=width,xbb do bits = bits + bits + 1 end
end
write(char(255-bits))
end
end
So how could this be optimized (of course as with any optimization you have to measure your implementation to be sure its faster). And you aren't allowed to alter the C-core of Lua for this, or use LuaJit, its about finding ways to optimizing one of Lua's weak weak points.
Edit: Putting a Bounty on this as to make the challenge more fun.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
我不知道 Lua 是否适合生成工作代码,但您应该能够通过使用一些数学技巧来大幅提高 Mandelbrot 性能。 有人建议使用对称性来加速它,使用此优化可以完成另一个重大改进:
使用使用 Mandelbrot 部分的矩形坐标的递归函数。 然后,它计算矩形边界线和中间分割的两条线处的值。 之后,有 4 个子矩形。 如果其中一个边框像素颜色全部相同,则可以简单地用该颜色填充它,如果不是,则递归调用该部分的函数。
我搜索了该算法的另一种解释,并在此处找到了 -以及一个漂亮的可视化。 旧的 DOS 程序 FRACTINT 将此优化称为“Tesseral 模式”。
I don't know Lua that good to produce working code, but you should be able to heavily increase Mandelbrot performance by using some math tricks. There was a suggestion about using symmetry to speed it up, another big improvement could be done using this optimization:
Use a recursive function that uses rectangle coordinates of a Mandelbrot portion. It then calculates the values at the rectangles border lines and the two lines that split in the middle. After this, there are 4 sub-rectangles. If one of it has all the same border pixel colors, you can simply fill it with this color, if not, you recursively call the function on this portion.
I searched for another explanation of this algorithm and found one here - along with a nice visualization. The old DOS program FRACTINT calls this optimization "Tesseral mode".
在基准测试游戏中,经过的时间为 通过生成更多进程来使用可用核心,从 674 秒减少到 211 秒。
On the benchmarks game the elapsed time has been reduced from 674s to 211s by spawning more processes to use the available cores.
为什么要使用局部变量Zri? 可以通过重新排序接下来的两个语句来避免使用它:
Zi = 2*Zr*Zi + Ci
Zr = Zrq - Ziq + Cr
也可以使用简单的周期性检查,但加速取决于 m。 “m”越大,周期性检查获得的加速效果越好。
Why to use local variable Zri? It is possible to avoid its use by reordering next two statements:
Zi = 2*Zr*Zi + Ci
Zr = Zrq - Ziq + Cr
There's also possible to use simple periodicity checking, but speedup depends on m. The larger "m" is, the better is speedup gained from periodicity checking.
罗伯特·古尔德 其中之一是 mandelbrot 测试(生成 Mandelbrot 设置便携式位图文件 N=16,000),其得分为可怕的 1:109
当您引用基准测试游戏中的数字时,请说明这些数字的来源,以便读者有一定的背景。
在这种情况下,您似乎已经在四核机器上测量了数据,其中最快的程序已被重写以利用多个核心。 而不是查看经过的时间 排序依据CPU 时间,您会看到比率下降到 1:28。
或者查看中位数和四分位数以获得更好的印象 C++ 测量集与 Lua 测量集的比较。
或者有一整套测量程序被迫仅使用一个核心 - Lua 与 C++ 的比较 - 如果你看一下 那些 Lua pi-digits 程序 你会看到它们使用 C 语言 GNU GMP图书馆。
Robert Gould > One of those is the mandelbrot test (Generate Mandelbrot set portable bitmap file N=16,000), where it scores a horrible 1:109
When you quote numbers from the benchmarks game please show where those numbers come from so readers have some context.
In this case you seem to have taken numbers measured on the quadcore machine where the fastest programs have been re-written to exploit multiple cores. Instead of looking at elapsed time sort by CPU time and you'll see the ratio drop to 1:28.
Or look at the median and quartiles to get a better impression of how the set of C++ measurements compares to the set of Lua measurements.
Or there's a whole set of measurements where programs are forced to use just one core - Lua compared with C++ - and if you take a look at those Lua pi-digits programs you'll see that they use the C language GNU GMP library.
接下来我做的就是把一遍又一遍计算出来的东西缓存起来,并将bit+bit替换为bit*2,这些简单的优化功能相当强大……
这个优化让程序的运行时间是原来的34%,但 Markus Q 的优化仍然击败了我的;)
Next step I did was cache the stuff that was calculated over and over, and replace bit+bit to bit*2, These simple optimizations are quite powerful...
This optimization makes the program run in 34% of the time of the original, but Markus Q's optimization still beat mine ;)
这是另一种尝试,但结果比本地访问变量要慢(我想象使用干净的环境会更快地找到变量,但事实并非如此,本地的虚拟寄存器稍微快一些)将运行时间降低至 41%。
This was another attempt, but it turned out to be slower than local access of variables (I imagined using a clean environment would have made it faster to find the variables, but it wasn't the case, local's virtual registers is slightly faster) This brought the runtime down to 41%.
通过 2,(在我的机器上)比之前的好大约 30%。 主要的节省来自展开内循环以分摊开销。
还包括(已注释掉)的是当您卡在中央心形线时,通过提前退出(并将像素设置为黑色)来节省时间的中止尝试。 它有效,但无论我如何调整它,它都会变慢。
我得走了,但我会留下一个分手建议。 通过对结果进行游程编码可能会进行一些优化(因此,您可以保存一个列表(白点的数量、黑点的数量、白点的数量等),而不是保存一堆位扭曲的字符。 )。 这将:
不知道它是否可以编码得足够紧密以便能够飞行,但如果我有更多时间,这就是我下一步会尝试的地方。
Pass 2, about 30% better (on my machine) than my previous. The main saving came from unrolling the inner loop to amortize the overhead.
Also included (commented out) is an aborted attempt to save time by exiting early (& set the pixel black) when you are stuck in the central cardioid. It works, but it's slower no matter how I jiggered it.
I've got to run, but I'll leave a parting suggestion. There may be some optimization possible by run-length encoding the results (so instead of saving a bunch of bit-twiddled chars you'd save a list (number of white dots, number of black dots, number of white dots, etc.)). This would:
No idea if it could be coded tight enough to fly, but that is where I would try next if I had more time.
所以这里是 ~40% 的开始:
--MarkusQ
So here's ~40% for a start:
-- MarkusQ
现在至少有一个答案比我的解决方案更快,我将发布我的答案。
技巧是在将值发送到输出之前将值存储到表中。
像这样简单的事情将运行时间减少到 58%。
Now that there is at least one answer faster than my solution I'll post my answer
The trick is storing values to a table before sending them to the output.
Something as simple as this reduced the run time to 58%.