Natural language parsing of an appointment?

Posted 2024-08-27 12:05:29

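I'm looking for a Java library to help parse user-entered text that represents an "appointment" for a calendar application. For example:

Tuesday 11:30

or

5 pm lunch with Mike, Friday happy hour

I have found some promising leads, such as https://github.com/samtingleff/jchronic and http://www.datejs.com/, which can parse dates, but I also need to be able to extract the title of the event, such as "Lunch with Mike".

If no such API exists, I'm also interested in ideas on how best to approach the problem from a coding perspective.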

娇俏 2024-09-03 12:05:29

Extending JChronic may be your best bet. Given the responses to this question, I think it's unlikely that a pre-built library for this exists (though such a thing seems like it could be useful: I'd guess that the major use cases for parsing natural-language dates would benefit from being able to extract additional data from user-supplied strings).

Implementation-wise, probably the most straightforward thing to do is to extend JChronic, since it supports a significant part of your use case; moreover, as you can see from the unit test, extraneous information should already be ignored by the framework.
Fortunately, if you look at the main class, it should not be too hard to extend, modify, or wrap the parse() method to support a custom scanner for an event title. (My own preference would be to wrap the framework rather than fork and modify it, as that lets you benefit more easily from any improvements to the underlying code.)
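To make the "wrap rather than fork" idea concrete, here is a minimal sketch. It does not call JChronic directly: the date parser is passed in as a plain function, so a thin adapter around JChronic's parse() could be plugged in without touching its source. The names `AppointmentParser`, `Appointment`, `dateParser`, and `titleScanner` are my own placeholders, not part of JChronic.

```java
import java.util.Optional;
import java.util.function.Function;

/**
 * Wraps an existing date parser instead of forking it.
 * T is whatever the underlying parser returns (hypothetical here;
 * with JChronic it would be the result type of its parse() method).
 */
public class AppointmentParser<T> {

    /** Parsed appointment: free-text title plus the optional date result. */
    public static final class Appointment<T> {
        public final String title;
        public final Optional<T> when;

        Appointment(String title, Optional<T> when) {
            this.title = title;
            this.when = when;
        }
    }

    // Delegate for dates, e.g. an adapter around the wrapped library.
    private final Function<String, Optional<T>> dateParser;
    // Custom scanner that pulls the event title out of the raw input.
    private final Function<String, String> titleScanner;

    public AppointmentParser(Function<String, Optional<T>> dateParser,
                             Function<String, String> titleScanner) {
        this.dateParser = dateParser;
        this.titleScanner = titleScanner;
    }

    /** Runs both passes over the same raw input and combines the results. */
    public Appointment<T> parse(String input) {
        Optional<T> when = dateParser.apply(input);
        String title = titleScanner.apply(input).trim();
        return new Appointment<>(title, when);
    }
}
```

Because the wrapper only depends on two functions, upgrading the underlying date library should only ever require changing the adapter passed in as `dateParser`.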

Ultimately, what may prove the most straightforward way of doing this is to write a regex parser that ignores most of what JChronic tries to capture (which would mean becoming deeply familiar with the JChronic source code).
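As a rough illustration of the regex idea, the sketch below strips a few common time and weekday phrases and treats whatever is left as the title. The two patterns are illustrative guesses of mine and cover far less than JChronic does; a real version would have to mirror the expressions JChronic actually recognizes. `TitleExtractor` and `extractTitle` are hypothetical names.

```java
import java.util.regex.Pattern;

public class TitleExtractor {
    // Clock times such as "11:30", "11:30 pm" or "5pm", optionally led by "at".
    private static final Pattern TIME = Pattern.compile(
        "\\b(?:at\\s+)?(?:\\d{1,2}:\\d{2}(?:\\s*[ap]m)?|\\d{1,2}\\s*[ap]m)\\b",
        Pattern.CASE_INSENSITIVE);

    // Weekday names, optionally led by "on".
    private static final Pattern DAY = Pattern.compile(
        "\\b(?:on\\s+)?(?:mon|tues|wednes|thurs|fri|satur|sun)day\\b",
        Pattern.CASE_INSENSITIVE);

    /** Removes time and weekday phrases; the remaining text is the title. */
    public static String extractTitle(String input) {
        String withoutTime = TIME.matcher(input).replaceAll(" ");
        String withoutDay = DAY.matcher(withoutTime).replaceAll(" ");
        // Collapse the gaps left behind by the removed phrases.
        return withoutDay.replaceAll("\\s+", " ").trim();
    }
}
```

Note that "at" is only consumed when a clock time follows it, so "party at phoebe's saturday" keeps its "at phoebe's" while "at 8pm" is removed as a whole.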

The key to successfully implementing this, as with any NLP-type project, is to have as many examples as you can possibly get, preferably as automated unit tests (ultimately, even if the test cases duplicate the same functionality many times, it is better to have more examples than fewer). Fortunately, since we're talking about natural language, such test cases should be particularly easy to get, since even non-programmer friends, family, etc. should be able to provide you with "event descriptions" (or whatever you want to call them). You'll also want to focus especially on edge cases where the date-parsing bit might interfere with the location/title-parsing bit (for example, in "sigur rós at 8pm" the "at" is clearly part of the time, whereas in "party at phoebe's saturday" it clearly isn't).
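One way to keep a growing pile of examples manageable is to store them as data and run them all through whatever title extractor you end up with. This is only a sketch: `ExampleSuite` is a hypothetical helper of mine, and in practice you would probably express the same table as a parameterized test in your test framework of choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ExampleSuite {
    // Raw user input -> title we expect to get back, in insertion order.
    private final Map<String, String> expectedTitles = new LinkedHashMap<>();

    /** Registers one collected example; returns this for chaining. */
    public ExampleSuite add(String input, String expectedTitle) {
        expectedTitles.put(input, expectedTitle);
        return this;
    }

    /** Runs every example through the extractor and reports mismatches. */
    public List<String> failures(Function<String, String> extractor) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, String> e : expectedTitles.entrySet()) {
            String actual = extractor.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                failures.add("\"" + e.getKey() + "\": expected \""
                    + e.getValue() + "\" but got \"" + actual + "\"");
            }
        }
        return failures;
    }
}
```

A naive extractor that cuts everything from " at " onward would pass "sigur rós at 8pm" but fail "party at phoebe's saturday", which is exactly the kind of interference the edge cases are meant to expose.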

I realize I said quite a bit about JChronic, but I feel that it's a natural choice for your problem, as it already covers much of the "hard part" of parsing natural-language "appointments" (i.e., the fuzziness of the language we use about time) and is already implemented in the language you are targeting.
