Text search in similar clauses containing misspelled words
We need to extract some information from a free-text clause. Say we have a clause about a ship leaving one port and going to another. The same meaning can be expressed in several ways, like this:
The Ship A departed from the Port X on Monday, to reach Port Y.
The ship A left the Port X on Monday to reach Port Y.
The Ship A arrived to Port Y, it left Port X on Monday.
Port Y will be visited by Ship A which left Port X on Monday.
The author might also misspell words:
departed -> deported, dearted, depared, departeed, deparded
reach -> reaach, rech, rreach, reac
arrived -> arived, arivved, arrivd
So what is the best way to extract "Ship A", "Port X", "Port Y" and "Monday" from those clauses?
The programming language is Java.
Should we use regular expressions, Lucene fuzzy search, Elasticsearch, etc.?
Or some combination of them?
Thank you
Comments (1)
This program finds the information you need in the sample strings and lists it in the right order. It needs further work to cope with misspellings. We could also expand the day regex to accept dates.
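The program itself is not reproduced on this page. As a rough illustration of the regex approach described, here is a minimal sketch, not the original code. The class name `ClauseExtractor`, the cue words (`from`, `left`, `reach`, `to`, `will be visited`) and the map keys are assumptions chosen to fit the four sample sentences; real clauses would probably need more cues.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls ship, origin port, destination port and day
// out of a clause using cue words that precede or follow each entity.
class ClauseExtractor {
    private static final int CI = Pattern.CASE_INSENSITIVE;

    // "Ship A"
    private static final Pattern SHIP = Pattern.compile("\\bship\\s+(\\w+)", CI);
    // "from the Port X", "left Port X"
    private static final Pattern ORIGIN =
        Pattern.compile("\\b(?:from|left)\\s+(?:the\\s+)?port\\s+(\\w+)", CI);
    // "reach Port Y", "to Port Y", or "Port Y will be visited"
    private static final Pattern DEST = Pattern.compile(
        "\\b(?:reach|to)\\s+(?:the\\s+)?port\\s+(\\w+)"
        + "|\\bport\\s+(\\w+)\\s+will\\s+be\\s+visited", CI);
    // Day names only; dates are not handled in this sketch.
    private static final Pattern DAY = Pattern.compile(
        "\\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\\b", CI);

    // Returns the first non-null capture group of the first match, or null.
    static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return null;
        }
        for (int g = 1; g <= m.groupCount(); g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    // Prepends a label such as "Ship " when a value was found.
    static String label(String prefix, String value) {
        return value == null ? null : prefix + value;
    }

    // Extracts the four fields in a fixed order; missing fields map to null.
    static Map<String, String> extract(String clause) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("ship", label("Ship ", first(SHIP, clause)));
        out.put("from", label("Port ", first(ORIGIN, clause)));
        out.put("to", label("Port ", first(DEST, clause)));
        out.put("day", first(DAY, clause));
        return out;
    }

    public static void main(String[] args) {
        String[] samples = {
            "The Ship A departed from the Port X on Monday, to reach Port Y.",
            "The ship A left the Port X on Monday to reach Port Y.",
            "The Ship A arrived to Port Y, it left Port X on Monday.",
            "Port Y will be visited by Ship A which left Port X on Monday."
        };
        for (String s : samples) {
            System.out.println(extract(s));
        }
    }
}
```

Keying on the words around each port, rather than on the verb order, is what lets the same patterns cover all four phrasings. It does nothing for misspelled cue words, which is the further work mentioned above.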
output
Tested at https://www.tutorialspoint.com/compile_java_online.php
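For the misspellings, one possible direction is to map each word to the closest known keyword by edit distance before any pattern matching. The sketch below is an assumption-laden illustration, not part of the tested program: the class name `KeywordNormalizer`, the keyword list, the limit of two edits and the minimum word length of four are all arbitrary choices for this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces misspelled cue words with the nearest keyword.
class KeywordNormalizer {
    // Assumed vocabulary, taken from the verbs in the sample clauses.
    private static final String[] KEYWORDS =
        {"departed", "left", "reach", "arrived", "visited"};
    private static final int MAX_EDITS = 2;   // assumed tolerance
    private static final int MIN_LENGTH = 4;  // skip short words like "the", "to"
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Classic Levenshtein distance (insert, delete, substitute), two-row version.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest keyword within MAX_EDITS, else the word unchanged.
    static String normalize(String word) {
        if (word.length() < MIN_LENGTH) {
            return word;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDist = MAX_EDITS + 1;
        for (String kw : KEYWORDS) {
            int d = distance(lower, kw);
            if (d < bestDist) {
                bestDist = d;
                best = kw;
            }
        }
        return best == null ? word : best;
    }

    // Normalizes every alphabetic token in a clause, leaving the rest intact.
    static String normalizeClause(String clause) {
        Matcher m = WORD.matcher(clause);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(normalize(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeClause("The Ship A dearted from the Port X to rech Port Y."));
    }
}
```

A plain distance threshold also rewrites legitimate words that happen to be close to a keyword (for example "each" is one edit from "reach"), so a real solution would likely need a dictionary check or context on top of this.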