We are processing Japanese IBM Enterprise COBOL source code.
The rules for exactly what is allowed in G-type literals,
and what is allowed in identifiers, are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended below text for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would be a regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they
are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{
I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00...X'FF for either byte".
How can DBCS-characters be anything but pairs of 8-bit character codes?
If you examine the existing RE, you will see that it matches 3 types of character pairs.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that means the RE would only reject
literal strings containing single <squote>s. I don't believe that's
the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed
of DBCS characters. What is allowed for an identifier, exactly?
Again a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is that the DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
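To make the pair-counting concern concrete, here is a minimal sketch in Python (chosen only for illustration; the lexer itself is not Python). It assumes shift-out and shift-in are the bytes 0x0E and 0x0F and that the quote is 0x27. Those values, the sample characters, and the name `G_LITERAL` are assumptions for the example, not something taken from the customer's files or the IBM manual. It shows that decoding collapses each byte pair into one character, and that a pattern applied to the undecoded bytes can still step through the body two bytes at a time.

```python
import re

# Assumed byte values: shift-out 0x0E, shift-in 0x0F, apostrophe 0x27.
SO, SI = b"\x0e", b"\x0f"

# U+65E5 U+672C encoded as Shift-JIS: two characters, four bytes.
dbcs = "\u65e5\u672c".encode("shift_jis")
decoded = dbcs.decode("shift_jis")
print(len(dbcs), len(decoded))  # byte count vs. character count after decoding

# Matching on the undecoded bytes keeps the pairs countable: the body is a
# run of two-byte units whose first byte is not shift-in.
G_LITERAL = re.compile(rb"G'\x0e(?:[^\x0f].)*\x0f'", re.DOTALL)

literal = b"G'" + SO + dbcs + SI + b"'"
# A pair whose second byte is 0x0F, followed by a pair starting with 0x27:
# a byte-at-a-time search for shift-in + quote would stop too early here.
tricky = b"G'" + SO + b"\x81\x0f\x27\x41" + SI + b"'"

print(G_LITERAL.fullmatch(literal) is not None)
print(G_LITERAL.fullmatch(tricky) is not None)
print(tricky.find(SI + b"'"))  # offset where a naive search would end the body
```

If this assumption about the byte layout holds, the literal would have to be recognized before the reader converts the text to Unicode, or the reader would have to pass the bytes between shift-out and shift-in through untouched.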
Try adding a single quote to your rule and see whether the literals pass with that change.
If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.
EDIT: I thought you had all the other DBCS literals working and were only having issues with G strings, so I just pointed out the difference between N and G. Now that I have taken a closer look at your RE, it has problems. In the COBOL I used, you can mix ASCII with Japanese, for example,
Your RE assumes DBCS only. I would loosen this restriction and try again.
I don't think it's possible to handle G literals entirely with a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatched tokens manually.
You could also be facing encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, and treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
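As a rough idea of what handling it manually could look like, here is a small hand-written scanner in Python. It is only a sketch under assumptions: it works on raw bytes, takes shift-out, shift-in and the apostrophe to be 0x0E, 0x0F and 0x27, treats the body as strict two-byte pairs, and does no line-boundary checking. The function name `scan_g_literal` and the reason strings are made up for the example.

```python
from typing import Optional, Tuple

# Assumed byte values for shift-out, shift-in and the apostrophe.
SO, SI, QUOTE = 0x0E, 0x0F, 0x27


def scan_g_literal(buf: bytes, pos: int = 0) -> Tuple[Optional[int], str]:
    """Scan a G literal starting at buf[pos].

    Returns (end_offset, "ok") on success, or (None, reason) on failure.
    """
    if buf[pos:pos + 3] != bytes([ord("G"), QUOTE, SO]):
        return None, "does not start with G, quote, shift-out"
    i = pos + 3
    # Step through the body two bytes at a time, so a byte in the second
    # half of a pair that merely looks like SI or a quote is never tested.
    while i < len(buf):
        if buf[i] == SI:
            if i + 1 < len(buf) and buf[i + 1] == QUOTE:
                return i + 2, "ok"
            return None, "shift-in not followed by closing quote"
        i += 2
    return None, "no shift-in found on a pair boundary"


# Usage: the second pair starts with 0x27 and the first ends with 0x0F,
# yet the scan only ends at the real shift-in + quote.
sample = b"G'\x0e\x81\x0f\x27\x41\x0f'"
print(scan_g_literal(sample))
print(scan_g_literal(b"G'\x0e\x93\xfa\x0fX"))
```

The reason string is the point of doing it by hand: when a literal is rejected, you learn which expectation failed without having to see the customer's code.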
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.