Is there a tokenizer for Java that does exactly what I want?

Posted 2024-11-26 15:06:57

I want to tokenize a text, but not separate it only on whitespace.

There are some things, like proper names, that I want to keep as a single token (eg.: "Renato Dinhani Conceição"). Another case: percentages ("60 %") should not be split into two tokens.

What I want to know is whether there is a tokenizer in some library that provides high customization. If not, I will try to write my own, if there is some interface or set of practices to follow.

Not everything needs universal recognition. Example: I don't need to recognize the Chinese alphabet.

My application is a college project and it is mainly directed at the Portuguese language. Only a few things, like names, places and similar, will come from other languages.

Comments (4)

鱼忆七猫命九 2024-12-03 15:06:57

I would try to approach it not from a tokenization perspective but from a rules perspective. This will be the biggest challenge: creating a comprehensive rule set that satisfies most of your cases.

  • Define in human terms which units should not be split on whitespace. The name example is one.
  • For each of those exceptions to the whitespace split, create a set of rules for identifying it. For the name example: two or more consecutive capitalized words, with or without language-specific non-capitalized name words (like "de") in between.
  • Implement each rule as its own class that can be called as you loop.
  • Split the entire string on whitespace, then loop over it, keeping track of the previous and current token and applying your rule classes to each token.

Example for rule isName:

  • Loop 1: (eg.: isName = false
  • Loop 2: "Renato isName = true
  • Loop 3: Dinhani isName = true
  • Loop 4: Conceição"). isName = true
  • Loop 5: Another isName = false

Leaving you with: (eg.:, "Renato Dinhani Conceição")., Another
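The loop described above could be sketched like this. The class and method names are illustrative, not from any library, and the name rule is a simplification of the one the answer describes (a name run ends at a token that closes with sentence punctuation):

```java
import java.util.ArrayList;
import java.util.List;

public class RuleTokenizer {

    // Hypothetical rule: a token looks like part of a proper name if its
    // first letter (ignoring leading punctuation and quotes) is upper-case.
    static boolean isNamePart(String token) {
        String stripped = token.replaceAll("^[^\\p{L}]+", "");
        return !stripped.isEmpty() && Character.isUpperCase(stripped.charAt(0));
    }

    // A name run ends when a token closes with sentence punctuation.
    static boolean endsName(String token) {
        return token.matches(".*[.!?)]$");
    }

    // Split on whitespace, then merge consecutive name-like tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            int last = out.size() - 1;
            if (last >= 0 && isNamePart(token) && isNamePart(out.get(last))
                    && !endsName(out.get(last))) {
                // Current token continues the name started by the previous one.
                out.set(last, out.get(last) + " " + token);
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(eg.: \"Renato Dinhani Conceição\"). Another"));
    }
}
```

Run on the example above, this merges the three name words into one token while leaving "(eg.:" and "Another" separate.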

瑾兮 2024-12-03 15:06:57

I think that a tokenizer is going to be too simplistic for what you want. One step up from a tokenizer would be a lexer like JFlex. These split a stream of characters into separate tokens, like a tokenizer, but with much more flexible rules.

Even so, it seems like you're going to need some sort of natural language processing, as teaching a lexer the difference between a proper name and normal words might be tricky. You might be able to get pretty far by teaching it that a string of words that start with upper-case letters all belong together, numbers may be followed by units, etc. Good luck.
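One way to emulate such lexer-style rules without JFlex is a single ordered alternation in java.util.regex, where the first matching branch wins, much like lexer rules. The patterns below are illustrative assumptions (a capitalized-run rule and a number-plus-% rule), not a complete rule set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLexer {

    // Ordered alternation, like lexer rules: the first branch that matches wins.
    private static final Pattern RULES = Pattern.compile(
        "\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)+"   // run of 2+ capitalized words -> one token
        + "|\\d+(?:[.,]\\d+)?\\s*%"                // number optionally followed by %
        + "|\\S+"                                  // fallback: any non-space chunk
    );

    static List<String> lex(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("Renato Dinhani Conceição obteve 60 % em São Paulo"));
    }
}
```

On that input the capitalized runs "Renato Dinhani Conceição" and "São Paulo" and the percentage "60 %" each come out as a single token. Like the answer says, this is still not natural language processing: any two adjacent capitalized words get glued together, even across a sentence boundary.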

揽清风入怀 2024-12-03 15:06:57

You should try Apache OpenNLP. It includes ready-to-use Sentence Detector and Tokenizer models for Portuguese.

Download Apache OpenNLP and extract it. Download the Portuguese model from http://opennlp.sourceforge.net/models-1.5/ and copy it to the OpenNLP folder.

Using it from the command line:

bin/opennlp TokenizerME pt-token.bin
Loading Tokenizer model ... done (0,156s)
O José da Silva chegou, está na sua sala.
O José da Silva chegou , está na sua sala .

Using the API:

// load the model
TokenizerModel model = null;
InputStream modelIn = new FileInputStream("pt-token.bin");

try {
  model = new TokenizerModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // ignore failure to close the stream
    }
  }
}

// load the tokenizer
Tokenizer tokenizer = new TokenizerME(model);

// tokenize your sentence
String tokens[] = tokenizer.tokenize("O José da Silva chegou, está na sua sala.");

玩心态 2024-12-03 15:06:57

StringTokenizer is a legacy class that is maintained only for backward compatibility. Its use is discouraged in new code.

You should use the String.split() method instead. The split method takes a regular expression as its argument. Additionally, you can go further with the Pattern and Matcher classes: compile your pattern objects once and then reuse them to match various scenarios.
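A minimal sketch of that suggestion; the pattern strings here are examples, not a recommended tokenization:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitExample {

    // Compile once and reuse; for non-trivial regexes, String.split()
    // compiles the pattern again on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern PERCENT = Pattern.compile("\\d+\\s*%");

    static String[] tokenize(String text) {
        return WHITESPACE.split(text.trim());
    }

    public static void main(String[] args) {
        for (String token : tokenize("O José da Silva chegou")) {
            System.out.println(token);
        }
        // Matcher lets you find units like "60 %" before splitting them apart.
        Matcher m = PERCENT.matcher("desconto de 60 %");
        if (m.find()) {
            System.out.println(m.group());
        }
    }
}
```

Note that plain splitting still separates "60" from "%", which is exactly what the question wants to avoid; the Matcher pass shows how a compiled pattern can pick such units out first.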
