Where can I find a list of escaped characters in MSIL string constants?

Posted on 2025-01-02 00:26:25

I've written a program (in C#) that reads and manipulates MSIL programs that have been generated from C# programs. I had mistakenly assumed that the syntax rules for MSIL string constants are the same as for C#, but then I ran into the following situation:

This C# statement

string s = "Do you wish to send anyway?";

gets compiled into (among other MSIL statements) this

IL_0128:  ldstr      "Do you wish to send anyway\?"

I wasn't expecting the backslash that is used to escape the question mark. Now I can obviously take this backslash into account as part of my processing, but mostly out of curiosity I'd like to know if there is a list somewhere of which characters get escaped when the C# compiler converts C# constant strings to MSIL constant strings.

Thanks.

Comments (1)

撧情箌佬 2025-01-09 00:26:25

Update

Based on experimentation using the C# compiler + ildasm.exe: perhaps the reason there is no list of escaped characters is that there are so few: precisely six.

Going from the IL generated by ildasm, from C# programs compiled by Visual Studio 2010:

  • IL is strictly ASCII.
  • Three traditional whitespace characters are escaped
    • \t : 0x09 : (tab)
    • \n : 0x0A : (newline)
    • \r : 0x0D : (carriage return)
  • Three punctuation characters are escaped:
    • \" : 0x22 : (double quote)
    • \? : 0x3F : (question mark)
    • \\ : 0x5C : (backslash)
  • Only the remaining printable ASCII characters, 0x20 - 0x7E minus the three escaped punctuation characters, are included intact in literal strings.
  • All other characters, including the ASCII control characters below 0x20 and everything from 0x7F on up, are converted to byte arrays. Or rather, any string containing any character other than the 92 literal and 6 escaped characters above is converted in its entirety to a byte array, where the bytes are the little-endian bytes of the UTF-16 string.
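The rules above can be sketched as a small function. This is only an illustration of the observed ildasm behaviour, not a documented specification; it is written in Python for brevity, and the names ESCAPES and format_ldstr_operand are made up for this example.

```python
# Sketch of the ildasm behaviour observed above (an assumption drawn from
# experiments, not a documented rule). All names here are hypothetical.
ESCAPES = {"\t": "\\t", "\n": "\\n", "\r": "\\r",
           '"': '\\"', "?": "\\?", "\\": "\\\\"}


def format_ldstr_operand(s: str) -> str:
    """Render s as either a quoted literal or a UTF-16LE bytearray."""
    def is_representable(ch: str) -> bool:
        # Printable ASCII stays literal; the six special characters are escaped.
        return 0x20 <= ord(ch) <= 0x7E or ch in ESCAPES

    if all(is_representable(ch) for ch in s):
        body = "".join(ESCAPES.get(ch, ch) for ch in s)
        return '"' + body + '"'

    # Any other character forces the whole string into a byte array.
    raw = s.encode("utf-16-le")
    return "bytearray (" + " ".join(f"{b:02X}" for b in raw) + " )"
```

For instance, format_ldstr_operand("Do you wish to send anyway?") yields the escaped literal from the question, while a string containing é falls through to the bytearray branch.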

Example 1: Non-ASCII character above 0x7E: a simple accented é (U+00E9)

C#: Either "é" or "\u00E9" becomes (E9 byte comes first)

ldstr      bytearray (E9 00 )

Example 2: UTF-16: Summation symbol ∑ (U+2211)

C#: Either "∑" or "\u2211" becomes (11 byte comes first)

ldstr      bytearray (11 22 )

Example 3: UTF-32: Double-struck mathematical 𝔸 (U+1D538)

C#: Either "𝔸" or UTF-16 surrogate pair "\uD835\uDD38" becomes (bytes within char reversed, but double-byte chars in overall order)

ldstr      bytearray (35 D8 38 DD )

Example 4: Byte array conversion applies to the entire string when it contains a non-ASCII character

C#: "In the last decade, the German word \"über\" has come to be used frequently in colloquial English." becomes

ldstr      bytearray (49 00 6E 00 20 00 74 00 68 00 65 00 20 00 6C 00  
                      61 00 73 00 74 00 20 00 64 00 65 00 63 00 61 00  
                      64 00 65 00 2C 00 20 00 74 00 68 00 65 00 20 00  
                      47 00 65 00 72 00 6D 00 61 00 6E 00 20 00 77 00  
                      6F 00 72 00 64 00 20 00 22 00 FC 00 62 00 65 00  
                      72 00 22 00 20 00 68 00 61 00 73 00 20 00 63 00  
                      6F 00 6D 00 65 00 20 00 74 00 6F 00 20 00 62 00  
                      65 00 20 00 75 00 73 00 65 00 64 00 20 00 66 00  
                      72 00 65 00 71 00 75 00 65 00 6E 00 74 00 6C 00  
                      79 00 20 00 69 00 6E 00 20 00 63 00 6F 00 6C 00  
                      6C 00 6F 00 71 00 75 00 69 00 61 00 6C 00 20 00  
                      45 00 6E 00 67 00 6C 00 69 00 73 00 68 00 2E 00 )
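Going the other way, a tool that reads IL text has to turn such a byte array back into a string. A minimal sketch, assuming the operand looks like the examples above (hex byte pairs inside "bytearray ( ... )", possibly spread over several lines); the function name is hypothetical, and any // comments are stripped in case the disassembler appends them:

```python
import re


def decode_ldstr_bytearray(operand: str) -> str:
    """Decode text such as 'bytearray (E9 00 )' into the string it encodes."""
    # Drop trailing // comments on each line before looking for the bytes.
    operand = re.sub(r"//.*", "", operand)
    match = re.search(r"bytearray\s*\(([^)]*)\)", operand)
    if match is None:
        raise ValueError("not a bytearray operand")
    hex_bytes = re.findall(r"\b[0-9A-Fa-f]{2}\b", match.group(1))
    # The bytes are the little-endian UTF-16 code units of the string.
    return bytes(int(h, 16) for h in hex_bytes).decode("utf-16-le")
```

Decoding as UTF-16LE also reassembles surrogate pairs, so the four bytes from Example 3 come back as a single character.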

Directly, "you can't" (find a list of MSIL string escapes), but here are some helpful tidbits...

ECMA-335, which contains the strict definition of CIL, does not specify which characters must be escaped in QSTRING literals, only that they may be escaped using the backslash \ character. The most important notes are:

  • Unicode literals are presented as octals, not hexadecimal (i.e. \042, not \u0022).
  • Strings can be spread across multiple lines using the \ character (see below).

The only explicitly mentioned escapes are tab \t, linefeed \n, and octal numeric escapes. This is a bit annoying for your purposes since C# does not have an octal literal -- you'll have to do your own extraction and conversion, such as by using the Convert.ToInt32([string], 8) method.
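As an illustration of that extraction and conversion, here is a sketch of an unescaper for the text between the quotes. It is in Python, where int(digits, 8) plays the role of Convert.ToInt32(digits, 8). The names are hypothetical, and two points are assumptions rather than anything the spec states: octal escapes are taken to be one to three digits, and any other escaped character (such as \" \? \\) simply stands for itself.

```python
import re

# Escapes that map to a different character; everything else maps to itself.
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r"}


def unescape_qstring(body: str) -> str:
    """Unescape the text found between the quotes of an IL string literal."""
    def replace(match: re.Match) -> str:
        esc = match.group(1)
        if esc[0] in "01234567":
            # Octal escape, e.g. \042 is the double quote.
            return chr(int(esc, 8))
        return SIMPLE_ESCAPES.get(esc, esc)

    # Scanning left to right keeps "\\" paired, so r"\\n" stays backslash + n.
    return re.sub(r"\\([0-7]{1,3}|.)", replace, body)
```

Line continuations (a backslash at the end of a line) are deliberately not handled here; they need to be removed before this step.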

Beyond that the choice of escapes is "implementation-specific" to the "hypothetical IL assembler" described in the spec. So your question rightly asks about the rules for MSIL, which is Microsoft's strict implementation of CIL. As far as I can tell, MS has not documented their choice of escapes. It could be helpful at least to ask the Mono folks what they use. Beyond that, it may be a matter of generating the list yourself -- make a program that declares a string literal for every character \u0000 - whatever, and see what the compiled ldstr statements are. If I get to it first, I'll be sure to post my results.

Additional notes:

To properly parse *IL string literals -- known as QSTRINGS or SQSTRINGS -- you will have to account for more than just character escapes. Take in-code string concatenation, for example (and this is verbatim from Partition II::5.2):

The "+" operator can be used to concatenate string literals. This way, a long string can be broken across multiple lines by using "+" and a new string on each line. An alternative is to use "\" as the last character in a line, in which case, that character and the line break following it are not entered into the generated string. Any white space characters (space, line-feed, carriage-return, and tab) between the "\" and the first non-white space character on the next line are ignored. [Note: To include a double quote character in a QSTRING, use an octal escape sequence. end note]

Example: The following result in strings that are equivalent to "Hello World from CIL!":

ldstr "Hello " + "World " + "from CIL!"

ldstr "Hello World\
       \040from CIL!"
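The two forms quoted from the spec can be handled before any escape decoding. The following is a rough Python sketch under stated assumptions: the input starts at the first double quote of the operand, only double-quoted QSTRINGs are covered (not SQSTRINGs), and the function and pattern names are invented for this example. It joins "+"-concatenated pieces and drops line continuations, leaving other escapes such as \040 untouched for a later pass.

```python
import re

# One quoted literal: escaped pairs or ordinary characters between quotes.
QSTRING = re.compile(r'"((?:\\.|[^"\\])*)"', re.DOTALL)
# A "+" between two literals, with optional surrounding white space.
PLUS = re.compile(r"\s*\+\s*")
# Any backslash pair; a pair that starts a line break is a continuation.
PAIR = re.compile(r"\\(\r?\n[ \t\r\n]*|.)", re.DOTALL)


def join_qstrings(operand: str) -> str:
    """Join concatenated literals and remove line continuations."""
    operand = operand.lstrip()
    pieces = []
    pos = 0
    while True:
        literal = QSTRING.match(operand, pos)
        if literal is None:
            raise ValueError(f"expected a quoted string at offset {pos}")
        # Drop backslash + line break + leading white space; keep other pairs.
        pieces.append(PAIR.sub(
            lambda m: "" if m.group(1)[0] in "\r\n" else m.group(0),
            literal.group(1)))
        plus = PLUS.match(operand, literal.end())
        if plus is None:
            break
        pos = plus.end()
    return "".join(pieces)
```

Applied to the first spec example this returns Hello World from CIL! directly; the second example comes back with its \040 escape still in place, to be decoded in a separate step.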