You need to understand how Unicode works to build a parser in an international language, and yes, you do need to be a CS major or have the ability to teach yourself compiler design.
Study Unicode -- learn to use ICU -- or pick a language with GOOD Unicode support.
Decide on and build a VM (or use an existing one).
Check out "Principles of Compiler Design".
You use a character encoding capable of representing extended characters, such as UTF-8. In UTF-16, code points are written as 16-bit (two-byte) code units, and code points above U+FFFF take two such units (a surrogate pair); in UTF-32, every code point takes four bytes. The problem that arises with these multi-byte units is byte order (endianness), which should not be confused with bidi, the handling of bidirectional text: machines with different endianness store the bytes of each unit in different orders. One solution is a byte order mark (BOM, U+FEFF) placed at the start of the data, before any characters, so the reader can tell which order was used. The other is to state the byte order in the name of the encoding itself. UTF-16BE, for big endian, puts the most significant byte of each code unit first and needs no BOM. The opposite is UTF-16LE, or little endian.
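To make the byte-order point concrete, here is a minimal Python sketch. The sample string and the helper name `decode_utf16` are made up for illustration, and falling back to big endian when no BOM is present is an assumption of this sketch, not something a particular format requires.

```python
import codecs

text = "A\u00e9"  # 'A' (U+0041) followed by 'é' (U+00E9)

# Same characters, same 16-bit code units, different byte order on the wire.
be = text.encode("utf-16-be")  # big endian, no BOM written
le = text.encode("utf-16-le")  # little endian, no BOM written

print(be.hex())  # 004100e9
print(le.hex())  # 4100e900


def decode_utf16(data: bytes) -> str:
    """Hypothetical helper: pick the byte order from a leading BOM if present."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: this sketch assumes big endian.
    return data.decode("utf-16-be")


print(decode_utf16(codecs.BOM_UTF16_LE + le) == text)  # True
print(decode_utf16(codecs.BOM_UTF16_BE + be) == text)  # True
```

If a format mandates UTF-16BE or UTF-16LE outright, the reader already knows the order and the BOM check is unnecessary.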
There is also the UCS, the Universal Character Set, defined by ISO/IEC 10646. The term is still used, but the older UCS-2 encoding form is obsolete: it is a fixed two-byte form and cannot represent code points above U+FFFF, which is the problem mentioned above about characters whose encoding takes more than one unit. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode
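As a rough illustration of why a fixed two-byte form is not enough, the Python sketch below lists the UTF-16 code units for two characters. The helper name `utf16_code_units` and the sample characters are chosen here for illustration only.

```python
from typing import List


def utf16_code_units(ch: str) -> List[int]:
    """Hypothetical helper: return the 16-bit UTF-16 code units of a string."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+00E9 fits in a single 16-bit unit, so a two-byte form can hold it.
print([hex(u) for u in utf16_code_units("\u00e9")])      # ['0xe9']

# U+1F600 lies above U+FFFF and needs a surrogate pair, i.e. two units.
print([hex(u) for u in utf16_code_units("\U0001F600")])  # ['0xd83d', '0xde00']
```

A parser that assumes one 16-bit unit per character will split the second character in half, so it is safer to iterate over code points than over code units.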
Some examples are the following:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF16BE encoding