NSString tokenization in Objective-C
What is the best way to tokenize/split an NSString in Objective-C?
Found answer here:
Everyone has mentioned componentsSeparatedByString:, but you can also use CFStringTokenizer (remember that an NSString and a CFString are interchangeable), which will tokenize natural languages too (like Chinese/Japanese, which don't split words on spaces).
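As a rough illustration of the CFStringTokenizer approach, here is a minimal sketch. It assumes ARC and the Foundation framework; the helper name WordTokens is made up for this example and is not part of any Apple API.

```objc
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Hypothetical helper (not an Apple API): collects the word tokens of `text`.
static NSArray<NSString *> *WordTokens(NSString *text) {
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    // NSString and CFStringRef are toll-free bridged, so a cast is enough.
    CFStringRef cfText = (__bridge CFStringRef)text;
    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText,
        CFRangeMake(0, CFStringGetLength(cfText)),
        kCFStringTokenizerUnitWord, locale);

    // Advance until the tokenizer reports that no token is left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        [tokens addObject:[text substringWithRange:NSMakeRange(range.location, range.length)]];
    }

    CFRelease(tokenizer);
    CFRelease(locale);
    return tokens;
}
```

Calling WordTokens(@"hello world") should give back the individual words; for Chinese or Japanese input the word boundaries come from the tokenizer's own linguistic rules rather than from spaces.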
If you just want to split a string, use -[NSString componentsSeparatedByString:]. For more complex tokenization, use the NSScanner class.
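To make the two options concrete, here is a small sketch, assuming ARC. SplitOnSeparator and ScanTokens are hypothetical helper names chosen for illustration; only the NSString and NSScanner calls inside them are real API.

```objc
#import <Foundation/Foundation.h>

// Simple case: split on one fixed separator string.
static NSArray<NSString *> *SplitOnSeparator(NSString *text, NSString *separator) {
    return [text componentsSeparatedByString:separator];
}

// Hypothetical helper: uses NSScanner to collect the runs of characters
// that lie between runs of delimiter characters.
static NSArray<NSString *> *ScanTokens(NSString *text, NSCharacterSet *delimiters) {
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:text];
    scanner.charactersToBeSkipped = nil;  // delimiters are handled explicitly below

    while (!scanner.isAtEnd) {
        NSString *token = nil;
        // Skip any run of delimiters, then read up to the next delimiter.
        [scanner scanCharactersFromSet:delimiters intoString:NULL];
        if ([scanner scanUpToCharactersFromSet:delimiters intoString:&token]) {
            [tokens addObject:token];
        }
    }
    return tokens;
}
```

The scanner version does a little more work per token, but it makes it easy to add further rules later (numbers, quoted sections, and so on), which is where plain splitting runs out of steam.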
If your tokenization needs are more complex, check out my open source Cocoa String tokenizing/parsing toolkit, ParseKit:
http://parsekit.com
For simple splitting of strings using a delimiter char (like ':'), ParseKit would definitely be overkill. But again, for complex tokenization needs, ParseKit is extremely powerful/flexible.
Also see the ParseKit Tokenization documentation.
If you want to tokenize on multiple characters, you can use NSString's componentsSeparatedByCharactersInSet:. NSCharacterSet has some handy pre-made sets like the whitespaceCharacterSet and the illegalCharacterSet, and it has initializers for Unicode ranges.
You can also combine character sets and use them to tokenize, like this:
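The answer's own snippet is not preserved here, so what follows is only a minimal sketch of the idea, assuming ARC. The helper name TokensSplitOnWhitespaceAndPunctuation is made up for this example; it combines two stock character sets and drops the empty strings that adjacent separators produce.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits on whitespace, newlines and punctuation.
static NSArray<NSString *> *TokensSplitOnWhitespaceAndPunctuation(NSString *text) {
    // Start from one stock set and merge a second one into it.
    NSMutableCharacterSet *separators =
        [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
    [separators formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];

    NSArray<NSString *> *pieces = [text componentsSeparatedByCharactersInSet:separators];

    // Two separators in a row yield an empty string between them; skip those.
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    for (NSString *piece in pieces) {
        if (piece.length > 0) {
            [tokens addObject:piece];
        }
    }
    return tokens;
}
```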
Be aware that componentsSeparatedByCharactersInSet: will produce empty strings if it encounters more than one member of the character set in a row, so you might want to test for lengths less than 1.
If you're looking to tokenise a string into search terms while preserving "quoted phrases", here's an NSString category that respects various types of quote pairs: "", '', ‘’, “”
Usage:
Code:
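The usage and code listings from this answer are not preserved here. The block below is not the author's category, only a minimal sketch of the same idea built on NSScanner, assuming ARC. The category name SearchTermsSketch and the method sketch_searchTerms are hypothetical.

```objc
#import <Foundation/Foundation.h>

@interface NSString (SearchTermsSketch)
// Hypothetical method: splits on whitespace but keeps quoted phrases whole.
- (NSArray<NSString *> *)sketch_searchTerms;
@end

@implementation NSString (SearchTermsSketch)

- (NSArray<NSString *> *)sketch_searchTerms {
    // Opening quote mapped to its matching closing quote.
    NSDictionary<NSString *, NSString *> *pairs =
        @{ @"\"": @"\"", @"'": @"'", @"‘": @"’", @"“": @"”" };
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray<NSString *> *terms = [NSMutableArray array];

    NSScanner *scanner = [NSScanner scannerWithString:self];
    scanner.charactersToBeSkipped = nil;  // whitespace is consumed explicitly
    scanner.caseSensitive = YES;

    while (YES) {
        [scanner scanCharactersFromSet:whitespace intoString:NULL];
        if (scanner.isAtEnd) break;

        // Check whether the next term starts with one of the opening quotes.
        NSString *closing = nil;
        for (NSString *opening in pairs) {
            if ([scanner scanString:opening intoString:NULL]) {
                closing = pairs[opening];
                break;
            }
        }

        NSString *term = nil;
        if (closing != nil) {
            // Quoted phrase: read up to the closing quote, then consume it.
            [scanner scanUpToString:closing intoString:&term];
            [scanner scanString:closing intoString:NULL];
        } else {
            // Plain word: read up to the next whitespace.
            [scanner scanUpToCharactersFromSet:whitespace intoString:&term];
        }
        if (term.length > 0) {
            [terms addObject:term];
        }
    }
    return terms;
}

@end
```

A quote is only treated as an opener at the start of a term, so an apostrophe inside a word such as don't stays part of that word. Nested or escaped quotes are not handled in this sketch.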
If you are looking to split a string by linguistic features (words, paragraphs, characters, sentences and lines), use string enumeration:
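The original listing is missing, so here is a minimal sketch of string enumeration, assuming ARC. The helper name Substrings is made up for illustration; the enumeration method and the NSStringEnumerationBy... options are standard NSString API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the pieces of `text` for the given unit,
// e.g. NSStringEnumerationByWords or NSStringEnumerationBySentences.
static NSArray<NSString *> *Substrings(NSString *text, NSStringEnumerationOptions unit) {
    NSMutableArray<NSString *> *parts = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:unit
                          usingBlock:^(NSString *substring, NSRange substringRange,
                                       NSRange enclosingRange, BOOL *stop) {
        // substring can be nil when NSStringEnumerationSubstringNotRequired is set.
        if (substring != nil) {
            [parts addObject:substring];
        }
    }];
    return parts;
}
```

For example, Substrings(text, NSStringEnumerationByWords) gives the words, and passing NSStringEnumerationByComposedCharacterSequences gives one entry per user-visible character.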
This API works with other languages where spaces are not always the delimiter (e.g. Japanese). Also, using NSStringEnumerationByComposedCharacterSequences is the proper way to enumerate over characters, since many characters (for example emoji or letters with combining accents) take more than one UTF-16 code unit.
I had a case where I had to split the console output after an LDAP query with ldapsearch. First, set up and execute the NSTask (I found a good code sample here: Execute a terminal command from a Cocoa app). Then I had to split and parse the output to extract only the print-server names from the LDAP query output. Unfortunately, this is rather tedious string manipulation, which would be no problem at all if we were manipulating C strings/arrays with simple C array operations. So here is my code using Cocoa objects. If you have better suggestions, let me know.
I have myself come across cases where it was not enough to just separate a string into components, with tasks such as:
1) Categorizing tokens into types
2) Adding new tokens
3) Separating strings between custom enclosing delimiters, like all words between "{" and "}"
For any such requirements I found ParseKit a life saver.
I used it to parse .PGN (Portable Game Notation) files successfully. It's very fast and lightweight.
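For readers who only need the third task and do not want a full toolkit, here is a minimal sketch that uses plain NSScanner rather than ParseKit, assuming ARC. The helper name BracedSections is hypothetical, and nested braces are not handled.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text found between each "{" and the next "}".
static NSArray<NSString *> *BracedSections(NSString *text) {
    NSMutableArray<NSString *> *sections = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:text];
    scanner.charactersToBeSkipped = nil;  // keep whitespace inside the braces

    while (!scanner.isAtEnd) {
        // Jump to the next opening brace; stop if there is none left.
        [scanner scanUpToString:@"{" intoString:NULL];
        if (![scanner scanString:@"{" intoString:NULL]) break;

        // Read the body; stop if the brace is never closed.
        NSString *body = nil;
        [scanner scanUpToString:@"}" intoString:&body];
        if (![scanner scanString:@"}" intoString:NULL]) break;

        [sections addObject:(body != nil ? body : @"")];
    }
    return sections;
}
```

This covers simple cases such as brace comments in a move list. Categorizing tokens or defining new token types is where a dedicated tokenizer is the better fit.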