Do information separators constitute line breaks in Unicode?

Posted on 2025-01-21 13:22:21

This Wikipedia article, which lists all Unicode whitespace characters, names seven of them as line or paragraph separators (LF, VT, FF, CR, NEL, LS, PS). It says nothing about the ASCII 'information separator' characters (FS, GS, RS, US). Surprisingly, though, FS, GS, and RS have 'paragraph separator (B)' as their bidirectional class. This is confusing.

Now, when I encounter one of these 'information separator' characters in a text, should I treat it as a line break or not? In other words, if I am writing a function that splits text at line breaks, should I split at these three characters? (The string.splitlines() function in Python does treat them as line breaks. I don't know about other implementations.)
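To make the Python behaviour mentioned above concrete, here is a small check. It only uses the built-in `str.splitlines()`; the sample string is made up for illustration.

```python
# FS (U+001C), GS (U+001D), RS (U+001E) and US (U+001F) placed between letters.
sample = "a\x1cb\x1dc\x1ed\x1fe"

# splitlines() splits at FS, GS and RS, but leaves US in place.
lines = sample.splitlines()
print(lines)  # ['a', 'b', 'c', 'd\x1fe']
```

So Python splits at exactly the three separators whose bidirectional class is B, and not at US.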

For example:

  1. Both in the linked Wikipedia table and in the Unicode bidi class database, LF is treated as a line break. So I can break the line when I encounter that character.

  2. Both in the linked Wikipedia table and in the Unicode bidi class database, SP is not treated as a line break. So I can't break the line when I encounter that character (assuming no word wrap).

  3. The linked Wikipedia table does not list GS as a line break, but the Unicode bidi class database does. I'm confused: what should I do in this case? What does the bidi class refer to here?
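The bidi classes in the three cases above can be looked up with Python's standard `unicodedata` module. This is only a quick inspection sketch; the dictionary names `chars` and `bidi` are made up for the example. Note that Bidi_Class is the property used by the Unicode Bidirectional Algorithm (UAX #9), so a value of B says how the character is treated when resolving text direction, which is a separate matter from the line breaking property.

```python
import unicodedata

# Characters discussed in the question, keyed by their usual abbreviations.
chars = {"LF": "\n", "SP": " ", "FS": "\x1c", "GS": "\x1d", "RS": "\x1e", "US": "\x1f"}

# Bidi_Class of each character as recorded in the Unicode Character Database.
bidi = {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

for name, cls in bidi.items():
    print(name, cls)
# LF B
# SP WS
# FS B
# GS B
# RS B
# US S
```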

Here I'm only asking about the Unicode standard. But if you know, you can also mention line breaks in the ASCII standard.

PS: I'm not sure whether the table in the linked Wikipedia page is correct. But I wasn't able to find any other good resource that lists all whitespace characters.

Comments (1)

皓月长歌 2025-01-28 13:22:21

FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt.

UAX #14 (Unicode Line Breaking Algorithm) describes class CM as follows:

Combining character sequences are treated as units for the purpose of
line breaking. The line breaking behavior of the sequence is that of
the base character.

In other words: Class CM characters prohibit line breaks before them – they essentially “glue” themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM.

*There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn’t be relevant for your purposes.
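As a rough illustration of what this means for a line-splitting function, here is a minimal sketch. It assumes you only want to split at the mandatory break classes of UAX #14 (BK, CR, LF, NL) and leave class CM characters such as FS, GS, RS, and US alone. The name `split_mandatory_breaks` is made up for this example, and the character list is hard-coded from LineBreak.txt rather than read from the database.

```python
import re

# Mandatory breaks per UAX #14, hard-coded (an assumption of this sketch):
#   CR LF pair, LF (U+000A), CR (U+000D), NL (U+0085),
#   BK: VT (U+000B), FF (U+000C), LS (U+2028), PS (U+2029).
# CR LF is listed first so the pair counts as a single break.
_MANDATORY_BREAK = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def split_mandatory_breaks(text):
    """Split text at mandatory breaks only; FS/GS/RS/US are not split on."""
    parts = _MANDATORY_BREAK.split(text)
    # Drop the empty trailing piece after a final break, like str.splitlines().
    if parts and parts[-1] == "":
        parts.pop()
    return parts


result = split_mandatory_breaks("a\r\nb\x1cc\x1dd\u2028e\n")
print(result)  # ['a', 'b\x1cc\x1dd', 'e']
```

Compared with `str.splitlines()`, the only difference in this sketch is that FS, GS, and RS stay inside the line they occur in.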
