
Extracting information such as location and course from free text

Posted on 2024-12-14 04:42:17

I need to write a text parser for the education domain that can extract information such as institute, location, and course from free text.

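Currently I am doing this with Lucene. The steps are as follows:

Index all the data related to institutes, courses, and locations.
Build shingles of the free text, search for each shingle in the location, course, and institute index directories, and then try to work out which part of the text represents a location, a course, and so on.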
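
To make the shingling step concrete, here is a minimal sketch in plain Java of how word shingles (word n-grams) can be generated from free text. The class name `Shingles`, the method name, and the sample sentence are made up for illustration; this is not the Lucene ShingleFilter, only a picture of what each search term looks like before it is looked up in an index.

```java
import java.util.ArrayList;
import java.util.List;

public class Shingles {
    /**
     * Returns every word n-gram of the text that contains 1..maxSize words,
     * ordered by start position and then by length.
     */
    public static List<String> shingles(String text, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        // Split on whitespace only, so "B.Tech" stays a single word.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder current = new StringBuilder();
            for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                if (len > 1) {
                    current.append(' ');
                }
                current.append(words[start + len - 1]);
                out.add(current.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input text, used only to show the shape of the output.
        System.out.println(shingles("B.Tech from IIT Delhi", 2));
    }
}
```

Each returned string would then be searched separately in the institute, course, and location indexes.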

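With this approach I am missing a lot of cases, because the same term can be written in several ways. For example, B.tech can be written as btech, b-tech, or b.tech.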

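I would like to know whether there is anything available that does all of this. I have heard about LingPipe and GATE, but I don't know how effective they are.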


一笔一画续写前缘 2024-12-21 04:42:17

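You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers, in GATE's terms) let you put all possible variants such as "B.tech", "btech", and so on in a single text file and have GATE find and mark them all. Rules (more precisely, JAPE rules) let you define patterns in text. For example, here is a pattern that catches MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):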

{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}

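Here (SP) and (COMMA) are macros (just to keep the pattern short), and {Something} is an annotation. {Token.kind == number} is a "Token" annotation whose "kind" feature equals "number" (that is, just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (by the way, GATE already has dictionaries for things such as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.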


两相知 2024-12-21 04:42:17

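I haven't used Lucene, but in your case I would keep the different forms of the same keyword and hold them in a link table or something similar. In that table I would store the relationships between these different forms.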
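
A minimal sketch of such a link table, assuming an in-memory map is enough. The class `VariantTable` and its methods are hypothetical names, not part of Lucene or any other library; every known spelling simply points to one canonical form.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class VariantTable {
    // Maps a lower-cased surface form to its canonical keyword.
    private final Map<String, String> canonicalByVariant = new HashMap<>();

    /** Registers a canonical keyword together with its alternative spellings. */
    public void add(String canonical, String... variants) {
        canonicalByVariant.put(key(canonical), canonical);
        for (String variant : variants) {
            canonicalByVariant.put(key(variant), canonical);
        }
    }

    /** Returns the canonical keyword for a surface form, or null if it is unknown. */
    public String lookup(String surfaceForm) {
        return canonicalByVariant.get(key(surfaceForm));
    }

    // Case and surrounding whitespace are ignored when comparing forms.
    private static String key(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        VariantTable table = new VariantTable();
        table.add("B.Tech", "btech", "b-tech", "b tech");
        System.out.println(table.lookup("BTECH"));
    }
}
```

The same table could just as well live in a database; the point is only that the relationship between forms is stored once rather than rediscovered at search time.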


笙痞 2024-12-21 04:42:17

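You may need to write regular expressions that cover every possible form of each term in your vocabulary.

Be careful when choosing an analyzer/tokenizer, because a word like B.tech can easily be split into 2 separate words (i.e. B and tech).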
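
As a rough illustration of both points, here is a sketch using only java.util.regex. The class `DegreePattern` and its methods are made-up names, and the pattern covers only the B.Tech spellings mentioned in the question; a real vocabulary would need one such pattern per term.

```java
import java.util.regex.Pattern;

public class DegreePattern {
    // "b", then at most one separator (dot, hyphen, or whitespace), then "tech".
    // The word boundaries stop it from matching inside longer words such as "biotech".
    private static final Pattern BTECH =
            Pattern.compile("\\bb[.\\-\\s]?tech\\b", Pattern.CASE_INSENSITIVE);

    /** Returns true if the text contains any of the covered spellings of B.Tech. */
    public static boolean mentionsBTech(String text) {
        return BTECH.matcher(text).find();
    }

    /** A deliberately naive tokenizer that splits on every non-word character. */
    public static String[] naiveTokens(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(mentionsBTech("She holds a b-tech degree"));
        // The naive tokenizer breaks "B.tech" apart, which is the pitfall described above.
        System.out.println(String.join("|", naiveTokens("B.tech")));
    }
}
```

If the tokenizer has already split the term, a pattern like this has to run on the raw text, or the tokenizer has to be configured to keep such terms together.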


じ违心 2024-12-21 04:42:17

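You may want to take a look at UIMA. Like LingPipe and GATE, this framework provides text annotation, which is exactly what you are trying to do. The following tutorial will help you write a UIMA annotator:

http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code

UIMA has add-ons, in particular one for Lucene integration.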


音盲 2024-12-21 04:42:17

You can try http://code.google.com/p/graph-expression/
Example of address parsing rules:

    GraphRegExp.Matcher Token = match("Token");
    GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
    GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
    GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
    GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
    GraphRegExp.Matcher Postcode =
            mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
    // mark(String, Matcher) creates a chunk over the sub-matcher
    GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
    // a street address must not span new lines
    streetAddress = regexpNot("\n", streetAddress);
    GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

    Chunker chunker = Chunkers.pipeline(
            Chunkers.regexp("Token", "\\w+"),
            Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
            new GraphExpChunker("Address",
                    seq(
                            opt(streetAddress),
                            opt(Postoffice),
                            City,
                            StateLike,
                            Postcode,
                            Country
                    )
            ).setDebugString(true)
    );
