Testing for Japanese/Chinese characters in a string
I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically so that I can analyze them differently.
I have read that I can test each character's Unicode code point to find out whether it is in the range of CJK characters. This is helpful, but I would like to separate the two if possible so that I can process the text against different dictionaries. Is there a way to test whether a character is Japanese or Chinese?
Comments (6)
You won't be able to test a single character and tell with certainty whether it is Japanese or Chinese, because of the way the Unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. There are also a number of conventions you can use to test whether a block of text is in one language or the other.
The problem arises from the sheer number of characters and words the two languages have in common. However, if I needed a quick and dirty solution, I would check the entire block of text for kana: if the text contains kana, then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. And if you need to distinguish which type of Chinese it is, testing for simplified forms would be the best approach.
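The quick and dirty check described above could be sketched roughly as follows in Python. The function names (`script_of`, `script_counts`, `guess_language`) and the exact code point ranges are assumptions made for illustration: only the kana letters, the precomposed Hangul syllables, and the main CJK Unified Ideographs block plus Extension A are covered, and punctuation-like marks such as the katakana middle dot are deliberately left out.

```python
from collections import Counter

def script_of(ch):
    """Return a coarse script label for one character (illustrative ranges only)."""
    cp = ord(ch)
    # Hiragana letters and katakana letters, without punctuation-like marks.
    if 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA:
        return "kana"
    # Precomposed Hangul syllables.
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"
    # CJK Unified Ideographs and Extension A.
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "han"
    return "other"

def script_counts(text):
    """Count how many characters of each coarse script appear in the text."""
    return Counter(script_of(ch) for ch in text)

def guess_language(text):
    """Guess the language of a whole block of text, not of a single character."""
    counts = script_counts(text)
    if counts["kana"]:
        return "japanese"
    if counts["hangul"]:
        return "korean"
    if counts["han"]:
        return "chinese"
    return "other"
```

Note that this only works on blocks of text: a short kanji-only string such as `日本語` contains no kana, so `guess_language` reports it as `"chinese"`. The counts from `script_counts` could be used instead of the simple presence test if a ratio-based decision is preferred.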
The process of developing Unicode included Han unification. This is because many Japanese characters are derived from, or identical to, Chinese characters; the same goes for Korean. Some characters commonly used in Japanese (katakana and hiragana; see chapter 12 of the Unicode Standard v5.1.0) would indicate that the text is Japanese rather than Chinese, but I believe it would be a statistical test rather than a definitive one.
Check out the O'Reilly book CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book Unicode Explained, which may be of some help, though probably not for this question (I don't recall it discussing how to identify Japanese and Chinese text).
You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.
However, I'm just learning Chinese, so I'm not an expert.
Testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. If you are looking at legal documents or other more official fare, it might be slightly more difficult, as there will be a much greater preponderance of complex Chinese characters, but it should still be pretty reliable.
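A workaround is to check the text's original encoding before it is converted to Unicode: a legacy encoding such as Shift_JIS or EUC-JP points to Japanese, while GB2312 or Big5 points to Chinese.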
There are many characters that are only (commonly) used in Japanese or only used in Chinese.
Japan and China both simplified many characters, but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either, then you probably have Traditional Chinese.
Of course, if you're dealing with Unicode text you may find occasional rare characters or mixed languages, which could throw off a heuristic, so you're better off counting the types of characters to make a judgement.
A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can easily find mappings from each of them to Unicode on the internet.
I used to have some code I wrote that did a binary search by code point, and it was extremely fast even in JavaScript. I may have lost it in my travels though (-:
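One way to try the legacy-encoding comparison without downloading mapping tables is to lean on the codecs that ship with Python, as in the sketch below. This is not the binary-search code mentioned above, only an illustration of the idea. The labels in `LEGACY_CODECS`, the choice of `shift_jis`, `gb2312` and `big5` as representatives, and the helper names are all assumptions. The check is restricted to Han ideographs because the legacy character sets also contain kana, Latin letters and symbols.

```python
from collections import Counter

# Hypothetical labels mapped to legacy codecs bundled with Python.
LEGACY_CODECS = {
    "japanese": "shift_jis",
    "simplified_chinese": "gb2312",
    "traditional_chinese": "big5",
}

def is_han(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def legacy_coverage(ch):
    """Return the set of labels whose legacy codec can encode this character."""
    covered = set()
    for label, codec in LEGACY_CODECS.items():
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        covered.add(label)
    return covered

def count_exclusive(text):
    """Count Han characters that exactly one of the legacy codecs can encode."""
    counts = Counter()
    for ch in text:
        if not is_han(ch):
            continue
        covered = legacy_coverage(ch)
        if len(covered) == 1:
            counts[covered.pop()] += 1
    return counts
```

For example, `legacy_coverage("们")` should include `"simplified_chinese"` but not `"japanese"`, and `legacy_coverage("込")` the other way round. Characters shared by several encodings say nothing either way, so `count_exclusive` ignores them and the remaining counts can feed the kind of character-type tally suggested above. For speed, the per-character results could be precomputed into sorted code point tables and searched, rather than encoding on every call.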