Is 2C10 a valid UTF-8 character?

Posted on 2024-11-06 06:02:40

I'm running some XML through a SAX parser and have noticed the parser is not functioning correctly with certain characters as data content.
The XML is supposed to be in UTF-8 encoding and the SAX parser is set to process that encoding.

Narrowing down the problematic strings and looking at the XML file in a hex editor, I can see, for example, that the bytes 2C 10 cause a problem; if I change them to C2 A2 (an example character given on Wikipedia), the SAX parser works. So is 2C10 not a valid UTF-8 character?

Comments (2)

苏别ゝ 2024-11-13 06:02:40

U+2C10 is GLAGOLITIC CAPITAL LETTER NASHI. Here are its properties:

U+2C10 ‹Ⱀ› \N{GLAGOLITIC CAPITAL LETTER NASHI}
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InGlagolitic Glagolitic
   Is_Glagolitic Cased Cased_Letter LC Changes_When_Casefolded
   CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased
   CWL Changes_When_NFKC_Casefolded CWKCF Lu L Glag Gr_Base
   Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS
   Letter L_ Uppercase_Letter Print Upper Uppercase Word
   XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
   X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Age=4.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
   Block=Glagolitic Canonical_Combining_Class=0
   Canonical_Combining_Class=Not_Reordered CCC=NR
   Canonical_Combining_Class=NR General_Category=Cased_Letter
   Decomposition_Type=None DT=None East_Asian_Width=Neutral
   GC=LC General_Category=L General_Category=Letter
   General_Category=L_ General_Category=LC GC=L
   General_Category=Lu General_Category=Uppercase_Letter GC=Lu
   Script=Glagolitic Grapheme_Cluster_Break=Other GCB=XX
   Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
   Hangul_Syllable_Type=Not_Applicable HST=NA
   Joining_Group=No_Joining_Group JG=NoJoiningGroup
   Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL
   Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
   Numeric_Value=NaN NV=NaN Present_In=4.1 IN=4.1
   Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2
   IN=5.2 Present_In=6.0 IN=6.0 Script=Glag SC=Glag
   Sentence_Break=UP Sentence_Break=Upper SB=UP
   Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
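
A code point number and the bytes in a file are different things, which may be the source of the confusion here. The sketch below uses only Python's standard library, and the variable names are illustrative. It looks up U+2C10 and shows the bytes it occupies once encoded as UTF-8:

```python
import unicodedata

# The code point U+2C10, written as a Python string escape.
ch = "\u2c10"

# Look up the character's official Unicode name.
name = unicodedata.name(ch)

# Encode the single code point as UTF-8; it takes three bytes.
encoded = ch.encode("utf-8")

print(name)                    # GLAGOLITIC CAPITAL LETTER NASHI
print(encoded.hex().upper())   # E2B090
```

So a hex editor would show this letter as E2 B0 90, not as 2C 10.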

溺ぐ爱和你が 2024-11-13 06:02:40

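The UTF-8 page on Wikipedia implies that the bytes 2C 10 are to be interpreted as "," (a comma, 0x2C) followed by the control code DLE (data link escape, 0x10). So the byte sequence itself is valid UTF-8, but a control character like DLE appearing in XML content would seem inappropriate (XML 1.0 does not allow U+0010 anywhere in a document, even inside a CDATA block)!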
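
This can be checked with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` (backed by expat) as a stand-in for the asker's unnamed SAX parser, and the helper `is_well_formed` and the `<a>` wrapper element are made up for illustration. It decodes the two bytes and then tests whether a tiny document containing them parses:

```python
import xml.etree.ElementTree as ET

# The two bytes seen in the hex editor.
raw = bytes.fromhex("2C10")

# Decoding succeeds: both bytes are below 0x80, so each is one ASCII character.
text = raw.decode("utf-8")
code_points = [f"U+{ord(c):04X}" for c in text]

# The Wikipedia example bytes C2 A2 decode to a single character, the cent sign.
cent = bytes.fromhex("C2A2").decode("utf-8")


def is_well_formed(doc: bytes) -> bool:
    """Return True if the XML parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Wrap each byte sequence in a minimal element and try to parse it.
with_dle = b"<a>" + raw + b"</a>"
with_cent = b"<a>" + bytes.fromhex("C2A2") + b"</a>"

print(code_points)                # ['U+002C', 'U+0010']
print(is_well_formed(with_dle))   # False: the parser rejects the DLE control character
print(is_well_formed(with_cent))  # True
```

The decode step never fails, which suggests the problem lies in XML's rules about allowed characters rather than in UTF-8 validity.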
