It looks like you're looking for a Named Entity Recogniser.

You have got a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.

GATE (General Architecture for Text Engineering) is an open source suite for language processing. Take a look at the screenshots on the developer page: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do. The video tutorial gives a better overview of what the software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:

Simple text extraction via Web services, e.g. Tagthe.net and Yahoo's Term Extractor.
Part-of-speech (POS) tagging: extracting parts of speech (e.g. verbs, nouns) from the text. There is a post on SO about this: "What is a good Java library for Parts-Of-Speech tagging?"

As for training CRFClassifier, you can find a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you
define the meaning of those columns via a map. One column should be
called "answer" and has the NER class, and existing features know
about names like "word" and "tag". You define the data file, the map,
and what features to generate via a properties file. There is
considerable documentation of what features different properties
generate in the Javadoc of NERFeatureFactory, though ultimately you
have to go to the source code to answer some questions...
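
To make the FAQ description more concrete, here is a minimal sketch of what the two inputs might look like: a tab-separated training file with one token per line, and a properties file that maps the columns. It only builds the file contents as strings and does not call Stanford NER. The class name, the file names, and the sample sentence are hypothetical; the property keys (trainFile, serializeTo, map, useWord) and the "O" label for non-entity tokens are written from memory of the Stanford NER documentation and should be checked against the NERFeatureFactory Javadoc.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NerTrainingSketch {
    // One token per line: column 0 is the word, column 1 is the NER class ("answer").
    static String toTsv(String[][] rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            sb.append(row[0]).append('\t').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    // Renders "key = value" lines in insertion order, as they would appear in a properties file.
    static String toProperties(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Assumed property keys; verify them against the Stanford NER docs before use.
    static Map<String, String> exampleProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("trainFile", "train.tsv");          // hypothetical file name
        props.put("serializeTo", "ner-model.ser.gz"); // hypothetical output model name
        props.put("map", "word=0,answer=1");          // column 0 = word, column 1 = answer
        props.put("useWord", "true");                 // one of many feature switches
        return props;
    }

    public static void main(String[] args) {
        String[][] rows = { {"Jane", "PERSON"}, {"lives", "O"}, {"in", "O"}, {"London", "LOCATION"} };
        System.out.print(toTsv(rows));
        System.out.print(toProperties(exampleProperties()));
    }
}
```

The resulting properties file would then be passed with the -prop option shown in the command-line usage.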

You can also find usage examples in the Javadoc of CRFClassifier:

Typical command-line usage
For running a trained model with a provided serialized classifier on a
text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or
runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

甜心 2024-12-03 04:45:39

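For example, you can use some classes from the standard java.text package, or use StreamTokenizer (which you can customize to your requirements). But as you know, text from Internet sources usually contains many spelling errors, and to get better results you need something like a fuzzy tokenizer. java.text and the other standard utilities are too limited for that.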
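
As a rough illustration of customizing StreamTokenizer, the sketch below resets its default syntax (number parsing, quote and comment characters) and treats letters and digits as word characters, returning every other non-space character as its own token. The class and method names are hypothetical, and the character classes are an assumption about what counts as a word.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();            // make every character "ordinary" first
        st.wordChars('a', 'z');      // then declare what belongs to a word
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);
                } else {
                    // An ordinary character (punctuation etc.) is returned as its own code.
                    tokens.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return tokens;
    }
}
```

With this setup, tokenize("Hello, world 42!") yields Hello, a comma, world, 42 and an exclamation mark as five separate tokens.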

So I suggest using regular expressions (java.util.regex) and building your own kind of tokenizer to fit your needs.
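
A minimal sketch of such a regex-based tokenizer is shown here. The pattern is only one possible choice: it keeps words with internal apostrophes or hyphens together and emits any other non-space character as a separate token. The class name and the pattern are assumptions for illustration, and handling misspellings would need additional logic on top of this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizerSketch {
    // Either a run of letters/digits (optionally joined by ' or -), or a single non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenize("Don't over-think it, ok?") returns the six tokens Don't, over-think, it, a comma, ok and a question mark. Changing the pattern is the main customization point.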

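P.S.
Depending on your needs, you can write a state-machine parser to recognize templated parts of raw text. The picture below shows a simple state-machine recognizer (you can build a more advanced parser that recognizes more complex templates in the text).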

[Image: diagram of a simple state-machine recognizer]
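
Since the original diagram is not available here, the following is a small sketch of the same idea rather than a reconstruction of it. It assumes a hypothetical template, a price written as a dollar sign followed by digits and an optional decimal part, and scans the text one character at a time with an explicit state enum.

```java
import java.util.ArrayList;
import java.util.List;

class PriceStateMachineSketch {
    private enum State { START, DOLLAR, INT, DOT, FRAC }

    // Finds substrings matching: '$' digit+ ( '.' digit+ )?
    static List<String> findPrices(String text) {
        List<String> found = new ArrayList<>();
        State state = State.START;
        int start = -1;      // index of the '$' that opened the current candidate
        int acceptEnd = -1;  // exclusive end of the longest valid match so far
        for (int i = 0; i <= text.length(); i++) {
            // A sentinel after the last character flushes a match at end of input.
            char c = i < text.length() ? text.charAt(i) : '\0';
            boolean digit = c >= '0' && c <= '9';
            boolean accepted = false;
            switch (state) {
                case START:
                    break;
                case DOLLAR:
                    if (digit) { state = State.INT; acceptEnd = i + 1; continue; }
                    break;
                case INT:
                    if (digit) { acceptEnd = i + 1; continue; }
                    if (c == '.') { state = State.DOT; continue; }
                    accepted = true;
                    break;
                case DOT:
                    if (digit) { state = State.FRAC; acceptEnd = i + 1; continue; }
                    accepted = true; // the trailing '.' is not part of the match
                    break;
                case FRAC:
                    if (digit) { acceptEnd = i + 1; continue; }
                    accepted = true;
                    break;
            }
            if (accepted) {
                found.add(text.substring(start, acceptEnd));
            }
            // Restart: the current character may itself open a new candidate.
            if (c == '$') { state = State.DOLLAR; start = i; } else { state = State.START; }
        }
        return found;
    }
}
```

Calling findPrices("Coffee costs $3.50, tea $2. Total: $5.50") returns $3.50, $2 and $5.50. A more advanced parser would add states and transitions for each extra element of the template.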