Is there a tokenizer for Java that does exactly what I want?
I want to tokenize text, but not split only on whitespace.
There are some things, like proper names, that I want to keep as a single token (e.g.: "Renato Dinhani Conceição"). Another case: percentages ("60 %") should not be split into two tokens.
What I want to know is whether there is a tokenizer in some library that provides high customization. If not, I will try to write my own; is there some interface or practice to follow?
Not everything needs to be universally recognized. Example: I don't need to recognize the Chinese alphabet.
My application is a college application, and it is mainly aimed at Portuguese. Only some things like names, places and the like will come from other languages.
4 Answers
I would try to go about it not from a tokenization perspective, but from a rules perspective. This will be the biggest challenge - creating a comprehensive rule set that will satisfy most of your cases.
Example for rule isName, applied to each whitespace-separated token:

    (eg.:         isName = false
    "Renato       isName = true
    Dinhani       isName = true
    Conceição").  isName = true
    Another       isName = false

Leaving you with:

    (eg.:
    "Renato Dinhani Conceição").
    Another
I think that a tokenizer is going to be too simplistic for what you want. One step up from a tokenizer would be a lexer like JFlex. These split a stream of characters into separate tokens like a tokenizer does, but with much more flexible rules.
Even so, it seems like you're going to need some sort of natural language processing, as teaching a lexer the difference between a proper name and normal words might be tricky. You might be able to get pretty far by teaching it that a run of words starting with upper-case letters belongs together, that numbers may be followed by units, etc. Good luck.
You should try Apache OpenNLP. It includes ready-to-use Sentence Detector and Tokenizer models for Portuguese.
Download Apache OpenNLP and extract it. Download the Portuguese model from http://opennlp.sourceforge.net/models-1.5/ and copy it to the OpenNLP folder.
Using it from command line:
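A sketch of the invocation, assuming OpenNLP 1.5.x run from the extracted directory, with the Portuguese tokenizer model (pt-token.bin from the models page) copied there:

```shell
# Read raw Portuguese text from stdin and write whitespace-separated
# tokens to stdout, using the pre-trained Portuguese model.
bin/opennlp TokenizerME pt-token.bin < input.txt > tokens.txt
```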
Using the API:
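A minimal sketch, assuming OpenNLP 1.5.x is on the classpath and the pt-token.bin model file is in the working directory (the sample sentence is my own):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class OpenNlpTokenizeDemo {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained Portuguese tokenizer model.
        try (InputStream modelIn = new FileInputStream("pt-token.bin")) {
            TokenizerModel model = new TokenizerModel(modelIn);
            TokenizerME tokenizer = new TokenizerME(model);

            String[] tokens = tokenizer.tokenize("O aluno obteve 60 % na prova.");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}
```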
StringTokenizer is a legacy class that is maintained only for backward compatibility. Its use is discouraged in new code.
You should use the String.split() method instead. The split method takes a regular expression as its argument. Additionally, you can go further with the Pattern and Matcher classes: compile your pattern once and then reuse it to match various scenarios.
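For your cases, instead of splitting, you can pull whole tokens out with Matcher.find(). A sketch; the pattern below is my own rough assumption: it glues runs of capitalized words into one token and keeps a number together with a following % sign:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomTokenizer {
    // One alternative per token type, tried in order:
    //  1. a run of two or more capitalized words  -> one token
    //  2. a number, optionally followed by a % sign
    //  3. fallback: any run of non-whitespace characters
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)+"   // capitalized-word run
          + "|\\d+(?:[.,]\\d+)?\\s*%"                  // number (+ optional %)
          + "|\\S+");                                  // anything else

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Renato Dinhani Conceição obteve 60 % na prova"));
        // prints [Renato Dinhani Conceição, obteve, 60 %, na, prova]
    }
}
```

Note that a rule like "capitalized words belong together" will inevitably misfire on sentence-initial words, which is where the NLP approaches from the other answers come in.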