This Wikipedia article, which lists all Unicode whitespace characters, mentions 7 of them as line/paragraph separating characters (LF, VT, FF, CR, NEL, LS, PS). It says nothing about the ASCII 'information separator' characters (FS, GS, RS, US). But surprisingly, FS, GS, and RS have 'paragraph separator (B)' as their bidirectional class. This is confusing.
Now, when I encounter one of these 'information separator' characters in a text, should I consider them line breaks or not? In other words, if I am writing a function which splits at line breaks, should I split at these three characters? (The string.splitlines() function in Python does consider them line breaks. I don't know about other implementations.)
For example:
- Both in the linked Wikipedia table and in the Unicode bidi class database, LF is considered a line break. So I can break the line when I encounter that character.
- Both in the linked Wikipedia table and in the Unicode bidi class database, SP is not considered a line break. So I can't break a line when I encounter that character (assuming no word wrap).
- The linked Wikipedia table does not mention GS as a line break. But the Unicode bidi class database does mention it as a line break. I'm confused: what should I do in this case? What does bidi class refer to in this case?
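The mismatch described in the list above can be checked directly in Python. This is a minimal sketch using only the standard `unicodedata` module. The helper names `bidi_class` and `splits_here` are made up for illustration and are not part of any library. Note that the bidirectional class is a property used for bidirectional text layout, which is a separate property from the line break class.

```python
import unicodedata

# Characters discussed in the question, keyed by their usual abbreviations.
SAMPLES = {
    "LF": "\n",
    "SP": " ",
    "FS": "\x1c",
    "GS": "\x1d",
    "RS": "\x1e",
    "US": "\x1f",
}

def bidi_class(ch):
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

def splits_here(ch):
    """Report whether str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2

if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(name, bidi_class(ch), splits_here(ch))
```

Running it prints, for each character, its bidi class and whether `splitlines()` splits on it, which makes it easy to compare the two notions side by side.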
Here I'm only asking about the Unicode standard. But if you know, you can also mention how line breaks are handled in the ASCII standard.
PS: I'm not sure whether the table in the linked Wikipedia page is correct. But I wasn't able to find any other good resource which lists all whitespace characters.
Comments (1)
FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt. UAX #14 (Unicode Line Breaking Algorithm) describes how class CM is handled. In short: class CM characters prohibit line breaks before them; they essentially "glue" themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM.

*There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn't be relevant for your purposes.
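For the function the question asks about, one way to follow this is to split only at characters that force a break and leave FS, GS, RS, and US alone. The sketch below is an assumption-laden illustration, not a full UAX #14 implementation: it hardcodes the mandatory-break characters (the same seven the question lists: LF, VT, FF, CR, NEL, LS, PS) because Python's standard library does not expose the line break property, and the name `split_at_mandatory_breaks` is hypothetical.

```python
import re

# Hardcoded assumption: the characters that force a line break
# (LF, VT, FF, CR, NEL, LS, PS). CR LF is matched first so that the
# pair counts as a single break.
_MANDATORY_BREAK = re.compile("\r\n|[\n\v\f\r\x85\u2028\u2029]")

def split_at_mandatory_breaks(text):
    """Split text at forced line breaks only.

    FS, GS, RS, and US (U+001C..U+001F) are deliberately not treated
    as breaks, unlike str.splitlines(), which splits on FS, GS, and RS.
    Unlike str.splitlines(), a trailing break yields a final empty string.
    """
    return _MANDATORY_BREAK.split(text)

if __name__ == "__main__":
    sample = "a\x1db\nc\r\nd\u2028e"
    print(split_at_mandatory_breaks(sample))
    print(sample.splitlines())
```

The two printed lists differ only in how GS is handled: `splitlines()` breaks between "a" and "b", while the sketch keeps them on one line.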