Date Wed, 26 Feb 2003 09:22:15 -0800 Subject Re: Invalid compilation without -fno-strict-aliasing From Jean Tourrilhes <>
On Wed, Feb 26, 2003 at 04:38:10PM +0100, Horst von Brand wrote:
Jean Tourrilhes <> said:
It looks like a compiler bug to me... Some users have complained that when the following code is compiled without -fno-strict-aliasing, the order of the write and memcpy is inverted (which means a bogus len is mem-copied into the stream). Code (from linux/include/net/iw_handler.h):
static inline char *
iwe_stream_add_event(char *stream,          /* Stream of events */
                     char *ends,            /* End of stream */
                     struct iw_event *iwe,  /* Payload */
                     int event_len)         /* Real size of payload */
{
    /* Check if it's possible */
    if ((stream + event_len) < ends) {
        iwe->len = event_len;
        memcpy(stream, (char *) iwe, event_len);
        stream += event_len;
    }
    return stream;
}
IMHO, the compiler should have enough context to know that the reordering is dangerous. Any suggestion to make this simple code more bulletproof is welcome.
The compiler is free to assume char *stream and struct iw_event *iwe point to separate areas of memory, due to strict aliasing.
Which is true and which is not the problem I'm complaining about.
(Note with hindsight: this code is fine, but Linux's implementation of memcpy was a macro that cast to long * to copy in larger chunks. With a correctly-defined memcpy, gcc -fstrict-aliasing isn't allowed to break this code. But it means you need inline asm or __attribute__((aligned(1),may_alias)) (e.g. in a typedef) to define a kernel memcpy if your compiler doesn't know how to turn a byte-copy loop into efficient asm, which was the case for gcc before gcc7.)
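As a rough illustration of the typedef approach mentioned in the note, here is a minimal sketch of a word-at-a-time copy. The names word_alias_t and copy_words are hypothetical and this is not the kernel's memcpy; it assumes GCC or Clang, since may_alias and aligned are compiler extensions.

```c
#include <stddef.h>

/* Hypothetical typedef: an unsigned long that is allowed to alias any
   other type and carries no alignment requirement (GCC/Clang extension). */
typedef unsigned long __attribute__((may_alias, aligned(1))) word_alias_t;

/* Illustrative helper, not the kernel's memcpy: copy n bytes from src to
   dst, one word at a time where possible. Regions must not overlap. */
void *copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Bulk copy through the may_alias type, so the compiler must assume
       these accesses can touch objects of any type. */
    while (n >= sizeof(word_alias_t)) {
        *(word_alias_t *)d = *(const word_alias_t *)s;
        d += sizeof(word_alias_t);
        s += sizeof(word_alias_t);
        n -= sizeof(word_alias_t);
    }

    /* Copy any remaining tail bytes one at a time. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Without may_alias, the casts to a plain unsigned long * would be exactly the kind of access that -fstrict-aliasing lets the compiler reorder against stores made through other types.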
Why do you think the kernel uses "-fno-strict-aliasing"?
The gcc people are more interested in trying to find out what can be allowed by the c99 specs than about making things actually work. The aliasing code in particular is not even worth enabling, it's just not possible to sanely tell gcc when some things can alias.
Some users have complained that when the following code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which means a bogus len is mem-copied into the stream).
The "problem" is that we inline the memcpy(), at which point gcc won't care about the fact that it can alias, so they'll just re-order everything and claim it's our own fault. Even though there is no sane way for us to even tell gcc about it.
I tried to get a sane way a few years ago, and the gcc developers really didn't care about the real world in this area. I'd be surprised if that had changed, judging by the replies I have already seen.
Type-based aliasing is stupid. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense.
...
I know for a fact that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that
unsigned long a;
a = 5;
*(unsigned short *)&a = 4;
could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.
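For comparison, a partial overwrite like the one in the quoted example can be written with memcpy, which the standard defines as a byte copy, so the compiler cannot treat the two stores as unrelated. This is only a sketch; the function name overwrite_leading_short is made up for illustration.

```c
#include <string.h>

/* Hypothetical helper: overwrite the first sizeof(unsigned short) bytes
   of the object representation of 'a' with the bytes of 'v'.
   Which bytes of the value change depends on the platform's endianness. */
unsigned long overwrite_leading_short(unsigned long a, unsigned short v)
{
    /* memcpy accesses the objects as bytes, so this store is ordered
       after any earlier assignment to 'a' and cannot be elided. */
    memcpy(&a, &v, sizeof v);
    return a;
}
```

Calling overwrite_leading_short(5, 4) corresponds to the a = 5 followed by the short store in the quote; compilers typically turn a fixed-size memcpy like this into a single store.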
gcc, aliasing, and 2-D variable-length arrays: The following sample code copies a 2x2 matrix:
#include <stdio.h>

static void copy(int n, int a[][n], int b[][n]) {
    int i, j;
    for (i = 0; i < 2; i++)      // 'n' not used in this example
        for (j = 0; j < 2; j++)  // 'n' hard-coded to 2 for simplicity
            b[i][j] = a[i][j];
}

int main(int argc, char *argv[]) {
    int a[2][2] = {{1, 2},{3, 4}};
    int b[2][2];
    copy(2, a, b);
    printf("%d %d %d %d\n", b[0][0], b[0][1], b[1][0], b[1][1]);
    return 0;
}
I don't know whether this is generally known, and I don't know whether this is a bug or a feature. I can't duplicate the problem with gcc 4.3.4 on Cygwin, so it may have been fixed. Some work-arounds:
Use __attribute__((noinline)) for copy().
Use the gcc switch -fno-strict-aliasing.
Change the third parameter of copy() from b[][n] to b[][2].
Don't use -O2 or -O3.
Further notes:
This is an answer, after a year and a day, to my own question (and I'm a bit surprised there are only two other answers).
I lost several hours with this on my actual code, a Kalman filter. Seemingly small changes would have drastic effects, perhaps because of changing gcc's automatic inlining (this is a guess; I'm still uncertain). But it probably doesn't qualify as a horror story.
Yes, I know you wouldn't write copy() like this. (And, as an aside, I was slightly surprised to see gcc did not unroll the double-loop.)
No gcc warning switches, including -Wstrict-aliasing=, did anything here.
1-D variable-length arrays seem to be OK.
Update: The above does not really answer the OP's question, since he (i.e. I) was asking about cases where strict aliasing 'legitimately' broke your code, whereas the above just seems to be a garden-variety compiler bug.
I reported it to GCC Bugzilla, but they weren't interested in the old 4.1.2, even though (I believe) it is the key to the $1-billion RHEL5. It doesn't occur in 4.2.4 up.
And I have a slightly simpler example of a similar bug, with only one matrix. The code:
#include <stdio.h>

static void zero(int n, int a[][n]) {
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = 0;
}

int main(void) {
    int a[2][2] = {{1, 2},{3, 4}};
    zero(2, a);
    printf("%d\n", a[1][1]);
    return 0;
}
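The first work-around listed for copy(), keeping the function out of line, might look like the sketch below. The name copy_noinline is hypothetical, the attribute is a GCC/Clang extension, and whether it avoids the miscompilation on an affected gcc 4.1.2 has not been checked here.

```c
/* Sketch of the noinline work-around: the attribute asks the compiler
   not to inline this function into its callers. C99 is assumed for the
   variable-length array parameters. */
__attribute__((noinline))
void copy_noinline(int n, int a[][n], int b[][n])
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            b[i][j] = a[i][j];
}
```

It is called the same way as the original, for example copy_noinline(2, a, b) with two int[2][2] arrays.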
No horror story of my own, but here are some quotes from Linus Torvalds (sorry if these are already in one of the linked references in the question):
http://lkml.org/lkml/2003/2/26/158:
http://www.mail-archive.com/[email protected]/msg01647.html:
SWIG generates code that depends on strict aliasing being off, which can cause all sorts of problems.
With gcc 4.1.2 on CentOS, the 2-D variable-length-array copy example above shows the problem. It seems it is the combination of -fstrict-aliasing with -finline which causes the bug.
here is mine:
http://forum.openscad.org/CGAL-3-6-1-causing-errors-but-CGAL-3-6-0-OK-tt2050.html
It caused certain shapes in a CAD program to be drawn incorrectly. Thank goodness for the project leaders' work on creating a regression test suite.
The bug only manifested itself on certain platforms, with older versions of GCC and older versions of certain libraries, and then only with -O2 turned on. -fno-strict-aliasing solved it.
The Common Initial Sequence rule of C used to be interpreted as making it
possible to write a function which could work on the leading portion of a
wide variety of structure types, provided they start with elements of matching
types. Under C99, the rule was changed so that it only applied if the structure
types involved were members of the same union whose complete declaration was visible at the point of use.
The authors of gcc insist that the language in question is only applicable if
the accesses are performed through the union type, notwithstanding the facts
that:
There would be no reason to specify that the complete declaration must be visible if accesses had to be performed through the union type.
Although the CIS rule was described in terms of unions, its primary
usefulness lay in what it implied about the way in which structs were
laid out and accessed. If S1 and S2 were structures that shared a CIS,
there would be no way that a function that accepted a pointer to an S1
and an S2 from an outside source could comply with C89's CIS rules
without allowing the same behavior to be useful with pointers to
structures that weren't actually inside a union object; specifying CIS
support for structures would thus have been redundant given that it was
already specified for unions.
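To make the disagreement concrete, here is a minimal sketch of the one form that is accepted under the narrower reading: inspecting the common initial sequence through the union type itself. The types s1, s2, u and the function get_tag are hypothetical names chosen for illustration.

```c
/* Two structure types sharing a common initial sequence (the 'tag' member). */
struct s1 { int tag; int x; };
struct s2 { int tag; double y; };

/* The complete union declaration is visible where the access happens. */
union u { struct s1 a; struct s2 b; };

/* Reads the shared leading member through the union type, regardless of
   which member was stored last. */
int get_tag(const union u *p)
{
    return p->a.tag;
}
```

The contested case is a function taking a plain struct s1 * that is actually handed the address of a struct s2; that is the pattern the older interpretation allowed and that gcc with -fstrict-aliasing does not promise to support.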
The following code returns 10, under gcc 4.4.4. Is anything wrong with the union method or gcc 4.4.4?