Using JavaScript, how can I count a mix of Asian characters and English words?

Posted 2024-08-22 08:24:25

I need to take a string of mixed Asian characters (for now, assume only Chinese hanzi or Japanese kanji/hiragana/katakana) and "alphanumeric" text (e.g., English, French), and count it in the following way:

1) count each Asian CHARACTER as 1;
2) count each Alphanumeric WORD as 1;

a few examples:

株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars

my only idea so far is to use:

var wordArray=val.split(/\w+/);

and then check each element to see whether its contents are alphanumeric (so count it as 1) or not (so take its length). But I don't feel that's very clever, and the text being counted might be up to 10,000 words, so it wouldn't be very quick.

Ideas?
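
For reference, the idea described above can be sketched with `match` and `replace` instead of `split`. This is only one rough reading of it: the name `countMixedNaive` is hypothetical, `\w` covers ASCII only, and everything that is not whitespace or part of a `\w+` run (punctuation included) is counted as one character per UTF-16 code unit.

```javascript
// Rough sketch of the "split on \w+" idea. Assumption: every non-whitespace
// character outside a \w+ run counts as one Asian character.
function countMixedNaive(val) {
  // Each run of ASCII word characters is one word.
  var words = val.match(/\w+/g) || [];
  // Blank out the words, then drop whitespace; what remains is counted per UTF-16 unit.
  var rest = val.replace(/\w+/g, " ").replace(/\s+/g, "");
  return words.length + rest.length;
}
```

A single pass of each regular expression is linear in the input length, so a text of 10,000 words is unlikely to be a performance concern.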


迷离° 2024-08-29 08:24:25

Unfortunately JavaScript's RegExp has no support for Unicode character classes; \w only applies to ASCII characters (modulo some browser bugs).

You can use Unicode characters in groups, though, so you can do it if you can isolate each set of characters you are interested in as a range. eg.:

var r= new RegExp(
    '[A-Za-z0-9_]+|'+                              // Runs of ASCII word characters (no accents)
    '[\u3040-\u309F]+|'+                           // Hiragana
    '[\u30A0-\u30FF]+|'+                           // Katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',   // Single CJK ideographs
'g');

// match() returns null when nothing matches, so fall back to an empty array.
var nwords= (str.match(r) || []).length;

(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)

Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the basic multilingual plane, for one!
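
In engines that support ES2018 Unicode property escapes (an assumption about the runtime, not something the answer above relies on), the ranges can be named by script instead of listed by hand. The sketch below is a swapped-in variant rather than the answer's method: it follows the question's rule of counting every Han, hiragana, or katakana character as 1, instead of counting kana runs as words. The names `mixedTokenRegEx` and `countMixed` are hypothetical.

```javascript
// Assumes an ES2018+ engine: the `u` flag enables \p{...} property escapes
// and makes the regex match by code point rather than by UTF-16 code unit.
var mixedTokenRegEx =
  /[A-Za-z0-9_]+|\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}/gu;

function countMixed(str) {
  // Each match is either one ASCII word or one single CJK/kana character.
  var tokens = str.match(mixedTokenRegEx);
  return tokens ? tokens.length : 0;
}
```

Because the `u` flag matches whole code points, ideographs outside the Basic Multilingual Plane are counted once each. Characters whose script is Common, such as the prolonged sound mark ー, are not matched by these three script classes.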


舟遥客 2024-08-29 08:24:25

You can iterate over each character in the text, examining each one to look for word breaks. The following example does this, counting each Chinese/Japanese/Korean (CJK) ideograph as a single word, and treating all alphanumeric strings as single words.

Some notes on my implementation:

It probably doesn't handle accented characters correctly. They will probably trigger word breaks. You can modify the wordBreakRegEx to fix this.
cjkRegEx doesn't include some of the more esoteric code point ranges, since they require 5 hex digits to reference and JavaScript's regex engine doesn't seem to let you do that. But you probably don't need to worry about these, since I don't even think most fonts include them.
I deliberately left Japanese hiragana and katakana out of cjkRegEx, since I'm not sure how you want to handle them. Depending on the type of text you're dealing with, it might make more sense to treat strings of them as single words. In that case, you'd need to add logic to distinguish being in a "kana word" from being in an "alphanumeric word". If you don't care, then you just need to add their code point ranges to cjkRegEx. Of course, you could try to recognize word breaks within kana strings, but that quickly becomes Very Hard.

Example implementation:

function getWordCount(text) {
  // This matches all CJK ideographs.
  var cjkRegEx = /[\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

  // This matches all characters that "break up" words.
  var wordBreakRegEx = /\W/;

  var wordCount = 0;
  var inWord = false;
  var length = text.length;
  for (var i = 0; i < length; i++) {
    var curChar = text.charAt(i);
    if (cjkRegEx.test(curChar)) {
      // Character is a CJK ideograph.
      // Count it as a word, plus the alphanumeric word it ends, if any.
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (wordBreakRegEx.test(curChar)) {
      // Character is a "word-breaking" character.
      // If a word was started, increment the word count.
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      // All other characters are "word" characters.
      // Indicate that a word has begun.
      inWord = true;
    }
  }

  // If the text ended while in a word, make sure to count it.
  if (inWord) {
    wordCount += 1;
  }

  return wordCount;
}
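
The notes above can be folded into a variant of the function, sketched here under several assumptions: the runtime supports the ES2015 `u` flag and `for...of`, kana are simply counted one per character (the "if you don't care" option), and accented Latin letters in U+00C0 to U+024F belong to words. The name `getWordCountWide` and the exact ranges chosen are illustrative, not part of the original answer.

```javascript
function getWordCountWide(text) {
  // Hiragana, katakana, BMP ideographs, and CJK Extension B (needs the `u` flag).
  var cjkOrKana =
    /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\u{20000}-\u{2a6df}]/u;
  // ASCII word characters plus accented Latin letters (skipping × and ÷).
  var wordChar = /[A-Za-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u024f]/;

  var count = 0;
  var inWord = false;
  for (var ch of text) { // for...of walks code points, not UTF-16 units
    if (cjkOrKana.test(ch)) {
      // One for the character, plus one for any word it just ended.
      count += inWord ? 2 : 1;
      inWord = false;
    } else if (wordChar.test(ch)) {
      inWord = true;
    } else {
      // Anything else (spaces, punctuation) ends the current word.
      if (inWord) count += 1;
      inWord = false;
    }
  }
  // Count a word that runs to the end of the text.
  return inWord ? count + 1 : count;
}
```

Testing for word characters rather than for `\W` is what lets accented letters stay inside a word, and iterating with `for...of` keeps a surrogate pair together so that a 5-hex-digit code point is tested as one character.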

The Unihan Database is very helpful for learning about CJK in Unicode. The Unicode home page also has loads of information.
