Need to extract information such as location and course from free text

Posted 2024-12-14 04:42:17 · 302 words · 3 views · 0 comments

I need to write a text parser for the education domain that can extract information such as institute, location and course from free text.

Currently I am doing it with Lucene. The steps are as follows:

  1. Index all the data related to institutes, courses and locations.
  2. Build shingles from the free text, search for each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, a course, etc.

With this approach I am missing a lot of cases; for example, B.tech can be written as btech, b-tech or b.tech.

Is there anything available that can do all of this? I have heard about LingPipe and GATE, but I don't know how efficient they are.

Comments (6)

一笔一画续写前缘 2024-12-21 04:42:17

You definitely need GATE. Among its many features, the two used most often are rules and dictionaries. Dictionaries (gazetteers, in GATE's terms) let you put all the possible variants, such as "B.tech", "btech" and so on, in a single text file and have GATE find and mark them all. Rules (more precisely, JAPE rules) let you define patterns over the text. For example, here is a pattern that catches MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):

{Token.kind == number}(SP){Token.orth == upperInitial}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}

where (SP) and (COMMA) are macros (just to keep the pattern short), {Something} is an annotation, {Token.kind == number} is a "Token" annotation whose "kind" feature equals "number" (i.e. a number in the text), and {Lookup} is an annotation that captures values from a dictionary (GATE already has dictionaries for things such as US cities). This is a fairly simple example, but it should show how easily you can cover even very complicated cases.
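
To make the dictionary idea concrete without relying on GATE itself, here is a minimal plain-Java sketch of a gazetteer-style lookup. It is not GATE's API: the class `MiniGazetteer`, its methods and the sample entries are all hypothetical. It splits the text on whitespace and, at each position, tries the longest phrase first, much like searching shingles against a dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of GATE: a tiny in-memory gazetteer.
class MiniGazetteer {
    // Maps a lower-cased phrase to its type, e.g. "new delhi" -> "city".
    private final Map<String, String> entries = new HashMap<>();
    // Length in words of the longest phrase added so far.
    private int maxWords = 1;

    void add(String phrase, String majorType) {
        String key = phrase.trim().toLowerCase(Locale.ROOT);
        entries.put(key, majorType);
        maxWords = Math.max(maxWords, key.split("\\s+").length);
    }

    // Returns "phrase=type" strings, preferring the longest match at each position.
    List<String> lookup(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate phrase first, then shorter ones.
            for (int n = Math.min(maxWords, tokens.length - i); n >= 1; n--) {
                String candidate = String
                        .join(" ", Arrays.copyOfRange(tokens, i, i + n))
                        .toLowerCase(Locale.ROOT);
                String type = entries.get(candidate);
                if (type != null) {
                    found.add(candidate + "=" + type);
                    matched = n;
                    break;
                }
            }
            // Skip past the match, or advance one token if nothing matched.
            i += Math.max(matched, 1);
        }
        return found;
    }
}
```

This sketch does no punctuation handling, so a token such as "Delhi." would not match; a real gazetteer works on proper tokens rather than whitespace-separated strings.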

两相知 2024-12-21 04:42:17

I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just maintain a link table or similar. In that table I would keep the relation between these different forms.
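
One possible reading of the link-table idea, sketched in plain Java: every known surface form points to a single canonical form. The class `CourseLinkTable` and the choice of "B.Tech" as the canonical spelling are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical link table: maps each known variant to one canonical form.
class CourseLinkTable {
    private final Map<String, String> variantToCanonical = new HashMap<>();

    // Registers the canonical form itself plus any number of variants.
    void link(String canonical, String... variants) {
        variantToCanonical.put(canonical.toLowerCase(Locale.ROOT), canonical);
        for (String variant : variants) {
            variantToCanonical.put(variant.toLowerCase(Locale.ROOT), canonical);
        }
    }

    // Returns the canonical form, or null when the surface form is unknown.
    String canonicalOf(String surfaceForm) {
        return variantToCanonical.get(surfaceForm.toLowerCase(Locale.ROOT));
    }
}
```

The lookup is case-insensitive but otherwise exact, so each new spelling has to be added to the table by hand.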

笙痞 2024-12-21 04:42:17

You may need to write a regular expression to cover each possible form of your vocabulary.

Be careful about your choice of analyzer / tokenizer, because words like B.tech can be easily split into 2 different words (i.e. B and tech).
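
As a rough illustration, a single case-insensitive pattern can cover the B.tech variants from the question if it is applied to the raw text, before any tokenizer has had a chance to split it. The class and method names below are hypothetical, and the pattern only allows one optional separator (dot, hyphen or space) between "b" and "tech".

```java
import java.util.regex.Pattern;

// Hypothetical helper showing one regular expression for the B.tech variants.
class CourseVariants {
    // "b", then an optional dot, hyphen or space, then "tech", as a whole word.
    static final Pattern BTECH =
            Pattern.compile("\\bb[.\\- ]?tech\\b", Pattern.CASE_INSENSITIVE);

    // True when the raw (untokenized) text contains any accepted variant.
    static boolean mentionsBTech(String text) {
        return BTECH.matcher(text).find();
    }
}
```

The word boundaries keep words such as "biotech" or "subtech" from matching; other vocabulary entries would each need their own pattern.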

じ违心 2024-12-21 04:42:17

You may want to check out UIMA. Like LingPipe and GATE, this framework provides text annotation, which is what you are trying to do. Here is a tutorial that will help you write an annotator for UIMA:

http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code

UIMA also has add-ons, in particular one for Lucene integration.

音盲 2024-12-21 04:42:17

You can try http://code.google.com/p/graph-expression/
Example of address-parsing rules:

GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode = mark("Postcode",
        seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
// mark(String, Matcher) creates a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
        Chunkers.regexp("Token", "\\w+"),
        Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
        new GraphExpChunker("Address",
                seq(
                        opt(streetAddress),
                        opt(Postoffice),
                        City,
                        StateLike,
                        Postcode,
                        Country
                )
        ).setDebugString(true)
);

一张白纸 2024-12-21 04:42:17

B.tech can be written as btech, b-tech or b.tech

Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.

That might allow you to match the different cases.
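
To see why edit distance helps here, the following is a plain-Java sketch of the Levenshtein distance. It is a textbook dynamic-programming version written for illustration, not Lucene's implementation, and the class name `EditDistance` is hypothetical. Each B.tech variant is a single edit away from the others, in the same way that roam is one edit away from foam and roams.

```java
// Textbook Levenshtein distance, for illustration only (not Lucene's code).
class EditDistance {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Whether a fuzzy query ever sees the dot or hyphen depends on how the analyzer tokenized the text at index time, so this only helps if the variants survive as single terms.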
