What are the whitespace characters in Japanese?

Posted on 2024-10-05 02:27:24

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character sets are supersets of US-ASCII.)

So the set of characters I need to use to split my string includes normal ASCII space and tab.

But, in Japanese, there is another space character, commonly called a 'full-width space'. According to my Mac's Character Viewer utility, this is U+3000 "IDEOGRAPHIC SPACE". This is (usually) what results when a user presses the space bar while typing in Japanese input mode.

Are there any other characters that I need to consider?

I am processing textual data submitted by users who have been told to "separate entries with spaces". However, the users are using a wide variety of computer and mobile phone operating systems to submit these texts. We've already seen that users may not be aware of whether they are in Japanese or English input mode when entering this data.

Furthermore, the behavior of the space key differs across platforms and applications even in Japanese mode (e.g., Windows 7 will insert an ideographic space but iOS will insert an ASCII space).

So what I want is basically "the set of all characters that visually look like a space and might be generated when the user presses the space key, or the tab key since many users do not know the difference between a space and a tab, in Japanese and/or English".

Is there any authoritative answer to such a question?

Comments (2)

天煞孤星 2024-10-12 02:27:24

You need the ASCII tab, space and non-breaking space (U+00A0), and the full-width space, which you've correctly identified as U+3000. You might also want newlines and vertical space characters. If your input is in Unicode (not Shift-JIS, etc.) then that's all you'll need. There are other (control) characters such as \0 NULL which are sometimes used as information delimiters, but they won't be rendered as a space in East Asian text, i.e., they won't appear as whitespace.
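As a rough illustration of the list above, here is a minimal Python 3 sketch that splits on exactly those characters. The name `split_entries` and the exact separator set (including newlines) are assumptions made for this example, not something from the question; adjust the character class to whatever your input actually contains.

```python
import re

# Assumed separator set, taken from the list above: tab, ASCII space,
# no-break space (U+00A0), ideographic space (U+3000), plus CR and LF.
SEPARATORS = re.compile('[\t \u00a0\u3000\r\n]+')


def split_entries(text):
    """Split text on runs of separator characters, dropping empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]


print(split_entries('りんご\u3000みかん apple\tbanana\u00a0cherry'))
# ['りんご', 'みかん', 'apple', 'banana', 'cherry']
```

Spelling the characters out as escapes keeps the set visible in the source, which helps because U+3000 and U+00A0 are hard to tell apart from an ASCII space in most editors.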

edit: Matt Ball has a good point in his comment, but, as his example illustrates, many regex implementations don't deal well with full-width East Asian punctuation. In this connection, it's worth mentioning that Python's string.whitespace won't cut the mustard either.
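The point about `string.whitespace` can be checked directly. The snippet below is a small sketch assuming Python 3, where `str` is Unicode; the variable names are made up for the example.

```python
import string

IDEOGRAPHIC_SPACE = '\u3000'

# string.whitespace only lists the six ASCII whitespace characters.
in_ascii_list = IDEOGRAPHIC_SPACE in string.whitespace

# str.isspace() and str.split() with no arguments are Unicode-aware.
is_unicode_space = IDEOGRAPHIC_SPACE.isspace()
parts = 'あ\u3000い'.split()

print(in_ascii_list, is_unicode_space, parts)
# False True ['あ', 'い']
```

So in Python 3 a plain `text.split()` already treats U+3000 as a separator, even though `string.whitespace` does not mention it.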

吐个泡泡 2024-10-12 02:27:24

I just found your posting. This is a great explanation of normalizing Unicode characters.

http://en.wikipedia.org/wiki/Unicode_equivalence

I found that many programming languages, like Python, have modules that implement the normalization rules of the Unicode standard. For my purposes, I found the following Python code works very well. It converts the Unicode compatibility variants of whitespace to the ASCII range. After the normalization, a regex command can convert all whitespace to the ASCII space (\x20, decimal 32):

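import unicodedata
# import re

ucode = '大変、 よろしくお願い申し上げます。'

# NFKC folds compatibility characters such as U+3000 (IDEOGRAPHIC SPACE)
# into their ASCII equivalents.
normalized = unicodedata.normalize('NFKC', ucode)

# old code
# utf8text = re.sub(r'\s+', ' ', normalized).encode('utf-8')

# new code (Python 3): split on whitespace while still a str, then encode
utf8text = ' '.join(normalized.split()).encode('utf-8')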

Since first writing this, I learned that Python's regex (re) module does not identify these whitespace characters properly and can cause a crash if it encounters them. It turns out that using the .split() function is a faster, more reliable method.
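For reference, the normalize-then-split approach can be wrapped in a small helper. This is only a sketch assuming Python 3; the function name `split_normalized` is hypothetical.

```python
import unicodedata


def split_normalized(text):
    """Apply NFKC normalization, then split on any run of whitespace."""
    normalized = unicodedata.normalize('NFKC', text)
    return normalized.split()


print(split_normalized('大変\u3000よろしく\u00a0お願いします'))
# ['大変', 'よろしく', 'お願いします']

# Side effect: NFKC also rewrites other compatibility characters,
# for example full-width Latin letters become plain ASCII letters.
print(split_normalized('ＡＢＣ\u3000def'))
# ['ABC', 'def']
```

One thing to keep in mind is that NFKC changes more than spaces, as the second call shows. If the entries must be stored exactly as typed, calling `text.split()` without normalizing first may be the safer choice, since it leaves the non-space characters untouched.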
