Natural language parsing of appointments?

Posted 2024-08-27 12:05:29

I'm looking for a Java library to help parse user-entered text that represents an "appointment" for a calendar application. For instance:

Lunch with Mike at 11:30 on Tuesday

or

5pm Happy hour on Friday

I've found some promising leads like https://github.com/samtingleff/jchronic and http://www.datejs.com/ which can parse dates - but I also need to be able to extract the title of the event like "Lunch with Mike".

If such an API doesn't exist, I'm also interested in any thoughts on how best to approach the problem from a coding perspective.

Comments (3)

娇俏 2024-09-03 12:05:29

Extending JChronic may be your best bet. I think, given the responses to this question, it's unlikely that a pre-built library for this exists (though it seems like such a thing could be useful... I'm guessing that the major use-cases for parsing natural language dates would be even more useful if they had the ability to extract additional data from user-supplied strings).

Implementation-wise, probably the most straightforward thing to do is to extend JChronic, since it supports a significant part of your use case; moreover, as you can see from the unit tests, extraneous information should already be ignored by the framework.
Fortunately, too, if you look at the main class, it should not be too hard to extend, modify, or wrap the parse() method to support a custom scanner for an event title. (My own preference would be to wrap the framework rather than fork and modify it, as that lets you benefit from any improvements to the underlying code more easily.)

Ultimately, the most straightforward approach may prove to be writing a regex-based parser that ignores most of what JChronic tries to capture (which would mean becoming deeply familiar with the JChronic source code).
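As a rough illustration of the regex idea, here is a minimal sketch that strips anything looking like a time or weekday and treats what is left as the title. It does not use JChronic at all; the class name TitleExtractor, the method extractTitle, and the two patterns are assumptions made up for this example, and they only cover the handful of phrasings discussed in this thread.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: remove recognised date/time phrases, keep the rest as the title. */
final class TitleExtractor {
    // Times such as "11:30", "5pm" or "8 pm", optionally introduced by "at".
    // A bare number is deliberately NOT treated as a time.
    private static final String TIME =
        "(?:at\\s+)?\\d{1,2}(?::\\d{2}\\s*(?:am|pm)?|\\s*(?:am|pm))";

    // Full weekday names, optionally introduced by "on".
    private static final String DAY =
        "(?:on\\s+)?(?:mon|tues|wednes|thurs|fri|satur|sun)day";

    private static final Pattern WHEN =
        Pattern.compile("\\b(?:" + TIME + "|" + DAY + ")\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the input with every date/time phrase removed and whitespace normalised. */
    static String extractTitle(String input) {
        String stripped = WHEN.matcher(input).replaceAll(" ");
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("Lunch with Mike at 11:30 on Tuesday"));
        System.out.println(extractTitle("5pm Happy hour on Friday"));
    }
}
```

Because "at" and "on" are only swallowed when a time or weekday follows immediately, an "at" that introduces a place is left in the title. A real implementation would need far more patterns (month names, "tomorrow", "next week", and so on), which is exactly the ground JChronic already covers.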

The key to successfully implementing this, as with any NLP-type project, is to have as many examples as you can possibly get, preferably as automated unit tests (ultimately, even if the test cases exercise the same functionality many times, it is better to have more examples than fewer). Fortunately, since we're talking about natural language, such test cases should be particularly easy to get, since even non-programmer friends, family, etc. should be able to provide you with "event descriptions" (or whatever you want to call them). You'll also want to focus especially on edge cases where the date-parsing bit might interfere with the location / title parsing bit (for example, in "sigur rós at 8pm" the "at" is clearly part of the time, whereas in "party at phoebe's saturday" it clearly isn't).
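One way to keep such examples in an automated, easy-to-grow form is a simple table of inputs and expected titles that any candidate extractor can be run against. The sketch below is only an assumption about how that might look: TitleExamples, examples and failures are hypothetical names, and the expected title for the "phoebe's" case is my own reading of that example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical table-driven check: input text mapped to the title we expect back. */
final class TitleExamples {

    /** Examples taken from this thread; add more as people send them to you. */
    static Map<String, String> examples() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Lunch with Mike at 11:30 on Tuesday", "Lunch with Mike");
        m.put("5pm Happy hour on Friday", "Happy hour");
        m.put("sigur rós at 8pm", "sigur rós");                    // "at" belongs to the time
        m.put("party at phoebe's saturday", "party at phoebe's");  // "at" belongs to the place (assumed)
        return m;
    }

    /** Runs every example through the extractor and describes each mismatch. */
    static List<String> failures(Function<String, String> extractor) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> e : examples().entrySet()) {
            String actual = extractor.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                failed.add("\"" + e.getKey() + "\": expected \"" + e.getValue()
                        + "\" but got \"" + actual + "\"");
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // A do-nothing extractor, just to show what the failure report looks like.
        failures(s -> s).forEach(System.out::println);
    }
}
```

Passing the extractor in as a Function keeps the examples independent of whichever approach you end up with (a JChronic wrapper, regexes, or an NER pipeline), so the same table can be reused if you change strategy.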

I realize I've said quite a bit about JChronic, but I feel it's a natural choice for your problem: it already covers much of the "hard part" of parsing natural-language "appointments" (the fuzziness of the language we use about time), and it is already implemented in the language you are targeting.

私藏温柔 2024-09-03 12:05:29

There are two relatively straightforward ways of trying to extract the appointment names.

Use a Sequence Labeling Package

If you have a labeled data set, you could train a sequence model, using packages like CRF++ or Yamcha, to pull out appointment titles like "Lunch with Mike".

Use Named Entities and Rules

If you don't have a labeled dataset, you could probably get some mileage out of using a named entity recognizer to tag all the people, locations, and organizations in the appointment text. As a bonus, this will also give you times and dates, so you won't need to write your own code to pull those out.

With the named entities all labeled, it should be pretty straightforward to write some rules to extract or construct titles for each appointment.
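To make the "rules on top of labels" step concrete, here is a minimal sketch that assumes the tagging has already happened. No NER library is called: the Token class, the label strings ("PERSON", "TIME", "DATE", "O") and the single rule are all assumptions for illustration, and a real tagger's output format will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical rule layer that builds a title from already-labelled tokens. */
final class NerTitleRules {

    /** A word plus the label a tagger might have assigned; "O" means "no entity". */
    static final class Token {
        final String text;
        final String label;
        Token(String text, String label) {
            this.text = text;
            this.label = label;
        }
    }

    private static boolean isTemporal(Token t) {
        return t.label.equals("TIME") || t.label.equals("DATE");
    }

    /** Rule: drop temporal tokens, and drop "at"/"on" when it directly introduces one. */
    static String title(List<Token> tokens) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (isTemporal(t)) {
                continue;
            }
            boolean isPreposition = t.text.equalsIgnoreCase("at") || t.text.equalsIgnoreCase("on");
            boolean nextIsTemporal = i + 1 < tokens.size() && isTemporal(tokens.get(i + 1));
            if (isPreposition && nextIsTemporal) {
                continue;
            }
            kept.add(t.text);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        List<Token> tagged = Arrays.asList(
            new Token("Lunch", "O"), new Token("with", "O"), new Token("Mike", "PERSON"),
            new Token("at", "O"), new Token("11:30", "TIME"),
            new Token("on", "O"), new Token("Tuesday", "DATE"));
        System.out.println(title(tagged));
    }
}
```

The PERSON label is not used by this particular rule, but keeping it on the token leaves room for later rules, such as preferring titles that contain a person or location.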

If you're looking for a Java-based NER tagger, you could use the one released by Stanford or the one distributed with OpenNLP.

煮茶煮酒煮时光 2024-09-03 12:05:29

I can't think of anything off the top of my head that would do that to your specifications. You could try the Stanford NLP Java package or OpenNLP. However, that might be a sledgehammer solution for what you're trying to do.

Alternatively, you can try parsing it yourself. If you want to handle more varied input, use JFlex to scan and tokenize the input and CUP to create a grammar.
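For a sense of what the scanning stage would produce, here is a hand-written stand-in for a generated scanner. It is not JFlex output and uses no JFlex or CUP API; the token kinds (TIME, DAY, PREP, WORD), the class name AppointmentScanner and the patterns are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical hand-rolled scanner: classifies whitespace-separated words into token kinds. */
final class AppointmentScanner {

    enum Kind { TIME, DAY, PREP, WORD }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
        @Override public String toString() {
            return kind + "(" + text + ")";
        }
    }

    // "11:30", "11:30am", "5pm"; a bare number is not treated as a time.
    private static final Pattern TIME =
        Pattern.compile("\\d{1,2}(:\\d{2}(am|pm)?|am|pm)", Pattern.CASE_INSENSITIVE);
    private static final Pattern DAY =
        Pattern.compile("(mon|tues|wednes|thurs|fri|satur|sun)day", Pattern.CASE_INSENSITIVE);
    private static final Pattern PREP =
        Pattern.compile("at|on", Pattern.CASE_INSENSITIVE);

    /** Splits on whitespace and assigns the first matching kind to each word. */
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String word : trimmed.split("\\s+")) {
            Kind kind;
            if (TIME.matcher(word).matches()) {
                kind = Kind.TIME;
            } else if (DAY.matcher(word).matches()) {
                kind = Kind.DAY;
            } else if (PREP.matcher(word).matches()) {
                kind = Kind.PREP;
            } else {
                kind = Kind.WORD;
            }
            tokens.add(new Token(kind, word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("5pm Happy hour on Friday"));
    }
}
```

A grammar over these kinds could then say, roughly, that a title is a run of WORD tokens plus any PREP not followed by a TIME or DAY; with a generator, that rule would live in the grammar file rather than in Java code.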
