How to fix GCC compilation error when compiling >2 GB of code?

Posted 2024-11-15 05:38:45 · 8249 characters · 6 views · 0 comments

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around it, scientific computing ...)

When I try to link them, I get (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additionally linked libraries that I have control of are compiled with the -fpic flag.

Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

And system libraries I link against:

-lgfortran -lm -lrt -lpthread

Any clues where to look for the problem?

EDIT:

First of all, thank you for the discussion...

To clarify a bit, I have hundreds of functions (each approx 1 MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum = 0.0, prefactor, expr;

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc. while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds, each about 1 MB in size) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...
    
       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

That's about it. The final step then just consists in calling all those func[i] and summing the result up.

Concerning the fact that this is a rather special and unusual case: Yes, it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2:

I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10⁵ to 10⁶ times...

EDIT3:

Thank you for the suggestions and the discussion so far... I'll try to roll the loops up upon code generation somehow. I'm not sure how to do this exactly, to be honest, but this is the best bet.

BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize".
It's just that the basis for this code is something that comes out of a "black box" where I have no real access to and, moreover, the whole thing worked great with simple examples, and I mainly feel overwhelmed with what happens in a real world application...

EDIT4:

So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I now also see some way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB) and I hope this idea works.

Now related to your answers:

I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, sorting the expressions by the variables like x12, x13,..., parsing the cscs with Python and generating tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.

However, I'd like to also give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works on the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS:

Just some remarks to some other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with is pretty complicated mathematical objects and bringing them to a numerically computable form generates these expressions. The problem lies actually in the underlying physical theory. Complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff to something physically measurable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz which is called "perturbation theory") We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT:

Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing it down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.


你在看孤独的风景 2024-11-22 05:38:45

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
create an array of such offsets
write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or with a memory mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base address of s and csc and also a vector-like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.
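The offset-table idea above might look roughly like the following sketch. It assumes a cut-down, hypothetical ProcessVars with only four double members, uses std::unordered_map where the question uses std::tr1::unordered_map, and invents the names Term, fieldAt, evaluate and firstTermsOfFunc1 purely for illustration. Only the first four inner terms of func1 are encoded; the leading -5/243 factor and prefactor would still be applied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical, cut-down stand-in for the question's ProcessVars.
struct ProcessVars {
    double x14, x15, ds8, ds10;
};

// One term of a sum: coeff * (product of some fields of s) * csc[cscKey].
struct Term {
    double coeff;
    std::vector<std::size_t> offsets;  // byte offsets produced by offsetof()
    int cscKey;
};

// Read the double stored `offset` bytes past the start of `s`.
inline double fieldAt(const ProcessVars& s, std::size_t offset) {
    assert(offset + sizeof(double) <= sizeof(ProcessVars));
    double value;
    std::memcpy(&value, reinterpret_cast<const char*>(&s) + offset, sizeof value);
    return value;
}

// The only real code: a loop that interprets the table of terms.
double evaluate(const std::vector<Term>& terms, const ProcessVars& s,
                const std::unordered_map<int, double>& csc) {
    double sum = 0.0;
    for (const Term& t : terms) {
        double product = t.coeff;
        for (std::size_t off : t.offsets) product *= fieldAt(s, off);
        sum += product * csc.at(t.cscKey);
    }
    return sum;
}

// Data a generator could emit instead of source text (first terms of func1).
std::vector<Term> firstTermsOfFunc1() {
    const std::size_t offX14 = offsetof(ProcessVars, x14);
    const std::size_t offX15 = offsetof(ProcessVars, x15);
    return {
        {1.0,        {offX14, offX15}, 49300},
        {9.0 / 10.,  {offX14, offX15}, 49301},
        {1.0 / 10.,  {offX14, offX15}, 49302},
        {-3.0 / 5.,  {offX14, offX15}, 49303},
    };
}
```

Calling evaluate(firstTermsOfFunc1(), s, csc) then stands in for the corresponding part of the generated expression. The table could equally be loaded from a file, so the executable no longer grows with the number of terms.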


╰つ倒转 2024-11-22 05:38:45

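The x86-64 ABI used by Linux defines a "large model" specifically to avoid such size limitations, which includes 64-bit relocation types for the GOT and PLT. (See the table in section 4.4.2, and the instruction sequences in 3.5.5 which show how they are used.)

Since your functions occupy 2.8 GB, you are out of luck, because gcc doesn't support the large model. What you can do is reorganize your code so that it can be split into shared libraries which you link dynamically.

If that is not possible, then, as someone suggested, instead of putting your data into code (compiling and linking it), since it is huge, you can load it at run time (either as a normal file, or you can mmap it).

EDIT

It seems that the large model is supported by gcc 4.6 (see this page). You can try that, but the above still applies about reorganizing your code.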


站稳脚跟 2024-11-22 05:38:45

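With a program of that size, cache misses for code are very likely to exceed the cost of looping at runtime. I would recommend that you go back to your code generator and have it generate some compact representation of what it wants evaluated (i.e., one likely to fit in the D-cache), then execute that with an interpreter in your program. You could also see if you can factor out smaller kernels that still have a significant number of operations, then use those as "instructions" in the interpreted code.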
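As a rough illustration of "compact representation plus interpreter", here is a minimal stack-machine sketch. The instruction set (PushConst, LoadVar, Mul, Add), the Instr layout and the run function are all assumptions made up for this example; a real generator would choose its own encoding and probably coarser-grained instructions, as the answer suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical four-instruction set; each instruction is 4 bytes of data.
enum class Op : std::uint8_t { PushConst, LoadVar, Mul, Add };

struct Instr {
    Op op;
    std::uint16_t arg;  // index into constants (PushConst) or vars (LoadVar)
};

// Interpret `program` on a small value stack and return the final result.
double run(const std::vector<Instr>& program,
           const std::vector<double>& constants,
           const std::vector<double>& vars) {
    std::vector<double> stack;
    for (const Instr& in : program) {
        switch (in.op) {
        case Op::PushConst:
            stack.push_back(constants.at(in.arg));
            break;
        case Op::LoadVar:
            stack.push_back(vars.at(in.arg));
            break;
        case Op::Mul:
        case Op::Add: {
            assert(stack.size() >= 2);
            const double rhs = stack.back(); stack.pop_back();
            const double lhs = stack.back(); stack.pop_back();
            stack.push_back(in.op == Op::Mul ? lhs * rhs : lhs + rhs);
            break;
        }
        }
    }
    assert(stack.size() == 1);
    return stack.back();
}
```

With constants {32, 64} and vars {x12, x15}, the program PushConst 0, LoadVar 0, Mul, LoadVar 1, Mul, PushConst 1, LoadVar 0, Mul, Add evaluates 32*x12*x15 + 64*x12. The interpreter loop stays small enough to remain in the instruction cache, while the expressions themselves become plain data.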


冷心人i 2024-11-22 05:38:45

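The error occurs because you have too much CODE, not data! This is indicated, for example, by __libc_csu_fini (which is a function) being referenced from _start, with the relocation truncated to fit. This means that _start (the program's true entry point) is trying to call that function via a SIGNED 32-bit offset, which only has a range of 2 GB. Since the total amount of your object code is about 2.8 GB, the numbers are consistent with that.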
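The arithmetic behind this can be checked in a few lines. The sketch below treats "2.8 GB" as 2.8 × 10⁹ bytes, which is an assumption, and the helper fitsInSigned32 is a made-up name. Whether a relocation stores a sign-extended absolute address (R_X86_64_32S) or a PC-relative displacement (R_X86_64_PC32), the field is 32 bits wide and signed, so anything at or beyond 2³¹ bytes cannot be encoded.

```cpp
#include <cstdint>
#include <limits>

// True if a signed 32-bit relocation field can hold `value`.
constexpr bool fitsInSigned32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}

constexpr std::int64_t kTwoGiB = std::int64_t{1} << 31;  // 2,147,483,648 bytes
constexpr std::int64_t kObjectCode = 2800000000LL;       // ~2.8 GB (assumed decimal)

static_assert(fitsInSigned32(kTwoGiB - 1), "largest encodable positive value");
static_assert(!fitsInSigned32(kTwoGiB), "2 GiB itself is already out of range");
static_assert(!fitsInSigned32(kObjectCode), "a 2.8 GB span cannot be encoded");
```

Not every reference in a 2.8 GB image has to span more than 2 GiB, but once the image is that large some of them inevitably do, and those are the ones the linker reports.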

If you could redesign your data structures, much of your code could be "compressed" by rewriting the huge expressions as simple loops.

Also, you could compute csc[] in a different program, store the results in a file, and just load them when necessary.
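A minimal sketch of that store-and-load step, assuming the csc values for a given parameter set can be computed ahead of time. The record format (a 32-bit key followed by a double, in the host's native byte order) and the names CscMap, saveCsc and loadCsc are invented for this example, and std::unordered_map stands in for the question's std::tr1::unordered_map.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

using CscMap = std::unordered_map<int, double>;

// Producer side: write each (key, value) pair as a fixed-size binary record.
bool saveCsc(const CscMap& csc, const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (const auto& kv : csc) {
        const std::int32_t key = kv.first;
        out.write(reinterpret_cast<const char*>(&key), sizeof key);
        out.write(reinterpret_cast<const char*>(&kv.second), sizeof kv.second);
    }
    out.flush();
    return out.good();
}

// Consumer side: read records until the end of the file is reached.
bool loadCsc(const std::string& path, CscMap& csc) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::int32_t key;
    double value;
    while (in.read(reinterpret_cast<char*>(&key), sizeof key) &&
           in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        csc[key] = value;
    }
    return in.eof();  // stopped because the input ran out, not on an I/O error
}
```

The file is only portable between machines with the same endianness and floating-point format, which is usually acceptable for intermediate results on one cluster.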


萌酱 2024-11-22 05:38:45

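I think everybody agrees there should be a different way to do what you want to do. Compiling hundreds of megabytes (gigabytes?) of code, linking it into a multi-gigabyte executable and running it just sounds very inefficient.

If I understand your problem correctly, you use some sort of code generator, G, to generate a bunch of functions func1...N which take a bunch of maps csc1...M as input. What you want to do is calculate csc1...M, then run a loop 1,000,000 times for different inputs, each time finding s = func1 + func2 + ... + funcN. You didn't specify how func1...N are related to csc1...M, though.

If all that is true, it seems that you should be able to turn the problem on its head in a way that is potentially much more manageable and possibly even faster (i.e. letting your machine's cache actually do its job).

Besides the practical problem of the object file sizes, your current program will not be efficient, since it doesn't localize access to the data (too many huge maps) and has no localized code execution (too many very long functions).

How about breaking your program into 3 phases: Phase 1 builds csc1...M and stores them. Phase 2 builds one func at a time, runs it 1,000,000 times, once per input, and stores the results. Phase 3 sums the stored func1...N results for each of the 1,000,000 runs. The good part about this solution is that it can easily be made parallel across several independent machines.

Edit: @bbtrb, could you make one func and one csc available somewhere? They seem to be highly regular and compressible. For instance, func1 seems to be just a sum of expressions, each consisting of 1 coefficient, 2 indexes to variables in s and 1 index into csc. So it can be reduced to a nice loop. If you make complete examples available, I'm sure ways can be found to compress them into loops rather than long expressions.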
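Phases 2 and 3 of the split described above could be organized along these lines. This is only a sketch: Input, Func, runOneFunc and sumPerInput are hypothetical names, the per-func results are kept in memory where the answer has them stored between phases, and Phase 1 (building and storing the csc maps) is left out.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Input = std::vector<double>;                 // hypothetical: one set of x12, x13, ...
using Func = std::function<double(const Input&)>;  // stands in for one generated func

// Phase 2: evaluate ONE func for every input, so only that func's code
// has to be resident while it runs.
std::vector<double> runOneFunc(const Func& f, const std::vector<Input>& inputs) {
    std::vector<double> results;
    results.reserve(inputs.size());
    for (const Input& in : inputs) results.push_back(f(in));
    return results;
}

// Phase 3: add up the stored per-func results, input by input.
std::vector<double> sumPerInput(const std::vector<std::vector<double>>& perFunc,
                                std::size_t inputCount) {
    std::vector<double> totals(inputCount, 0.0);
    for (const std::vector<double>& results : perFunc) {
        for (std::size_t i = 0; i < inputCount && i < results.size(); ++i) {
            totals[i] += results[i];
        }
    }
    return totals;
}
```

Because each runOneFunc call is independent of the others, the calls can be spread over separate processes or machines, and only the small result vectors need to be brought together for the final sum.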


薆情海 2024-11-22 05:38:45

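If I read your errors correctly, what pushes you over the limit is the initialized data section (if it were the code, you would have far more errors, IMHO). Do you have big arrays of global data? If so, I'd restructure the program so that they are allocated dynamically. If the data is initialized, I'd read it from a configuration file.

BTW, seeing this:

(.text+0x20): undefined reference to `main'

I think you have another problem.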


属性 2024-11-22 05:38:45

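It looks to me like the code is doing numerical integration using some kind of adaptive depth method. Unfortunately, the code generator (or rather the author of the code generator) is so stupid as to generate one function per patch rather than one per type of patch. As a result, it has produced too much code to be compiled, and even if it could be compiled, its execution would be painful because nothing is ever shared anywhere. (Can you imagine the pain of having to load each page of object code from disk because nothing is shared, so every page is always a candidate for the OS to evict? To say nothing of instruction caches, which are going to be useless.)

The fix is to stop unrolling everything; for this sort of code, you want to maximize sharing, as the overhead of the extra instructions needed to access data in more complex patterns will be absorbed by the cost of dealing with the (presumably) large underlying dataset anyway. It's also possible that the code generator does this by default, and that the scientist saw some options for unrolling (with the note that these sometimes improve speed), turned them all on at once, and is now insisting that the resulting mess be accepted by the computer, rather than accepting the machine's real restrictions and using the numerically correct version that is generated by default. But if the code generator won't do it, get one that will (or hack the existing code).

The bottom line: compiling and linking 2.8 GB of code doesn't work and shouldn't be forced to work. Find another way.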