Counting the number of words in an NSString

Published 2024-10-19 22:54:28


I'm trying to implement a word count function for my app that uses UITextView.

There's a space between two words in English, so it's really easy to count the number of words in an English sentence.
The problem occurs with Chinese and Japanese word counting because there is usually no space in the entire sentence.

I checked three different text editors on the iPad that have a word count feature and compared them with MS Word.

For example, here's a series of Japanese characters meaning "the world's idea": 世界 (the world) の ('s) アイデア (idea)

世界のアイデア

1) Pages for iPad and MS Word count each character as one word, so the string contains 7 words.

2) iPad text editor P*** counts the entire string as one word --> it simply uses spaces to separate words.

3) iPad text editor i*** counts them as three words --> I believe it uses CFStringTokenizer with kCFStringTokenizerUnitWord, because I could get the same result.

I've researched on the Internet, and Pages' and MS Word's word counting seems to be correct, because each Chinese character has a meaning.

I couldn't find any class that counts words the way Pages or MS Word does, and it would be very hard to implement from scratch because, besides Japanese and Chinese, the iPad supports many other languages.

I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option, though.

Is there a way to count words in an NSString the way Pages and MS Word do?

Thank you


掐死时间 2024-10-26 22:54:28


I recommend you keep using CFStringTokenizer. Because it's a platform feature, it will improve with platform upgrades, and many people at Apple work hard to reflect real cultural differences that are hard for regular developers to know.

This is hard because it is not essentially a programming problem; it is a human cultural-linguistic problem. You need a human-language specialist for each culture. For Japanese, you need a Japanese culture specialist. However, I don't think Japanese people seriously need a word count feature, because as far as I've heard, the concept of a word itself is not so important in Japanese culture. You should define the concept of a word first.

And I can't understand why you want to force the concept of word count onto character count. Take the Kanji word you used as an example: this is like counting "universe" as 2 words by splitting it into "uni" + "verse" by meaning. There isn't even a logic to it. Splitting a word by its meaning is sometimes completely wrong and useless by the definition of a word, because the definition of a word itself differs between cultures. In my language, Korean, a word is just a formal unit, not a unit of meaning. The idea that each word matches one meaning is right only in Roman-character cultures.

Just offer another feature like character counting for users in East Asia if you think they need it. Counting the characters in a Unicode string is easy with the -[NSString length] method.

I'm a Korean speaker (so maybe my case doesn't apply to yours :) and in many cases we count characters instead of words. In fact, I have never seen anyone count words in my whole life. I laughed at the word counting feature in MS Word because I guessed nobody would use it. (However, now I know it's important in Roman-character cultures.) I have used the word counting feature only once, to check that it really works :) I believe this is similar for Chinese and Japanese. Maybe Japanese users use word counting because their basic alphabet is similar to Roman characters, which have no concept of composition. However, they use Kanji heavily, which is a completely compositional, character-centric system.

If you make a word counting feature work on those languages (whose speakers do not even feel any need to split sentences into smaller formal units!), it's hard to imagine anyone using it. And without a linguistic specialist, the feature cannot be correct.

非要怀念 2024-10-26 22:54:28


This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One approach I know, derived from attempting to solve anagrams, is this:

At the start of the string you start with one character. Is it a word? It could be a word like "A" but it could also be a part of a word like "AN" or "ANALOG". So the decision about what is a word has to be made considering all of the string. You would consider the next characters to see if you can make another word starting with the first character following the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG" then you will soon find that there are no more words to be found. When you start finding words in the dictionary (see below) then you know you are making the right choices about where to break the words. When you stop finding words you know you have made a wrong choice and you need to backtrack.

A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06 or SOWPODS or other scrabble dictionaries, containing many obscure words. You need a lot of memory to do this because if you check the words against a simple array containing all of the possible words your program will run incredibly slow. If you parse your dictionary, persist it as a plist and recreate the dictionary your checking will be quick enough but it will require a lot more space on disk and more space in memory. One of these big scrabble dictionaries can expand to about 10MB with the actual words as keys and a simple NSNumber as a placeholder for value - you don't care what the value is, just that the key exists in the dictionary, which tells you that the word is recognised as valid.

If you maintain an array as you count you get to do [array count] in a triumphal manner as you add the last word containing the last characters to it, but you also have an easy way of backtracking. If at some point you stop finding valid words you can pop the lastObject off the array and replace it at the start of the string, then start looking for alternative words. If that fails to get you back on the right track pop another word.

I would proceed by experimentation, looking for a potential three words ahead as you parse the string - when you have identified three potential words, take the first away, store it in the array and look for another word. If you find it is too slow to do it this way and you are getting OK results considering only two words ahead, drop it to two. If you find you are running up too many dead ends with your word division strategy then increase the number of words ahead you consider.

Another way would be to employ natural language rules - for example "A" and "NALOG" might look OK because a consonant follows "A", but "A" and "ARDVARK" would be ruled out because it would be correct for a word beginning in a vowel to follow "AN", not "A". This can get as complicated as you like to make it - I don't know if this gets simpler in Japanese or not but there are certainly common verb endings like "ma su".

(edit: started a bounty, I'd like to know the very best way to do this if my way isn't it.)
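As a rough sketch of the backtracking idea above in Objective-C (the function name, the longest-match-first order, and the lexicon shape are my own assumptions, not a tested implementation):

```objc
#import <Foundation/Foundation.h>

// Backtracking segmenter: tries to split text[start..] into lexicon words,
// preferring longer candidates first and undoing choices at dead ends.
static BOOL segmentText(NSString *text, NSUInteger start,
                        NSSet *lexicon, NSMutableArray *words) {
    if (start == text.length) return YES;            // consumed everything: success
    for (NSUInteger len = text.length - start; len >= 1; len--) {
        NSString *candidate = [text substringWithRange:NSMakeRange(start, len)];
        if ([lexicon containsObject:candidate]) {
            [words addObject:candidate];
            if (segmentText(text, start + len, lexicon, words)) return YES;
            [words removeLastObject];                // dead end: backtrack
        }
    }
    return NO;                                       // no valid segmentation
}
```

With a lexicon containing @"A", @"AN", and @"ANALOG", segmenting @"ANALOG" should yield the single longer word rather than @"A" + @"NALOG", since the longer candidate is tried first and the shorter split dead-ends; [words count] then gives the word count the answer describes.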

Bonjour°[大白 2024-10-26 22:54:28


If you are using iOS 4, you can do something like

NSRange range = NSMakeRange(0, [string length]);  // count over the whole string
__block NSUInteger count = 0;
[string enumerateSubstringsInRange:range
                           options:NSStringEnumerationByWords
                        usingBlock:^(NSString *word,
                                     NSRange wordRange,
                                     NSRange enclosingRange,
                                     BOOL *stop)
    {
        count++;
    }
];

More information in the NSString class reference.

There is also WWDC 2010 session, number 110, about advanced text handling, that explains this, around minute 10 or so.

满天都是小星星 2024-10-26 22:54:28


I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.

That's right, you have to iterate through the text and simply count the number of word tokens encountered along the way.
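A minimal sketch of that loop using the CFStringTokenizer C API (untested outline; the countWords name is mine, and the current locale is an assumed choice):

```objc
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Count word tokens in a string using the current locale's word-break rules.
static CFIndex countWords(NSString *string) {
    CFStringRef str = (__bridge CFStringRef)string;
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault, str,
                                CFRangeMake(0, CFStringGetLength(str)),
                                kCFStringTokenizerUnitWord, locale);
    CFIndex count = 0;
    // Each successful advance positions the tokenizer on the next word token;
    // whitespace and punctuation are not returned as word tokens.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer)
           != kCFStringTokenizerTokenNone) {
        count++;
    }
    CFRelease(tokenizer);
    CFRelease(locale);
    return count;
}
```

For the question's example 世界のアイデア, this kind of tokenization is what the asker reports editor i*** doing (three words).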

心是晴朗的。 2024-10-26 22:54:28


Not a native Chinese/Japanese speaker, but here are my 2 cents.

Each Chinese character does have a meaning, but the concept of a word is a combination of letters/characters representing an idea, isn't it?

In that sense, there are probably 3 words in "sekai no aidia" (or 2 if you don't count particles like NO/GA/DE/WA, etc.). Same as English: "world's idea" is two words, while "idea of world" is 3, and let's forget about the required 'the', hehe.

That given, counting words is not as useful in non-Roman languages, in my opinion, similar to what Eonil mentioned. It's probably better to count characters for those languages. Check with native Chinese/Japanese speakers and see what they think.

If I were to do it, I would tokenize the string on spaces and particles (at least for Japanese and Korean) and count the tokens. Not sure about Chinese.

谜兔 2024-10-26 22:54:28


With Japanese you can create a grammar parser, and I think it is the same with Chinese. However, that is easier said than done, because natural languages tend to have many exceptions, but it is not impossible.

Please note it won't really be efficient, since you have to parse each sentence before being able to count the words.

I would recommend using a parser compiler rather than building one yourself, so that at least you can concentrate on the grammar instead of creating the parser. It's not efficient, but it should get the job done.

Also have a fallback algorithm in case your grammar doesn't parse the input correctly (perhaps the input really didn't make sense to begin with); you can use the length of the string to make things easier on yourself.

If you build it, there could be a market opportunity for you to use it as a natural language Domain Specific Language for Japanese/Chinese business rules as well.

拔了角的鹿 2024-10-26 22:54:28


Just use the length method:

[@"世界のアイデア" length];  // is 7

That being said, as a Japanese speaker, I think 3 is the right answer.
