Temporal extraction (i.e. extracting date/time entities from free-form text) - how?
Has anyone found a simple but effective way to extract date references from text? I've done a fair amount of searching for temporal extraction tools, but there isn't a lot out there. There are a few white papers, but the topic seems to be a subset of the whole semantic web thingy that hasn't been given much attention.
I'm just looking for something that is 80% effective. There is no need to capture things like "the month after Jan 2009", but basic, common date entities would be nice.
I'm open to all suggestions, even fancy regexes.
Fire away!
(and thanks - Henry)
Comments (3)
If the temporal expressions in your data come in only a limited set of formats, use regular expressions and refine your patterns iteratively.
Otherwise, use SUTime from the Stanford NLP toolkit, which might be overkill but will definitely meet your needs.
One way I have done this is to look for any run of four digits and convert it to a number. If the number falls within the range of years you are interested in, you probably have a year you can use. If you also want months and days, check the adjacent words to see whether they are a month name or a number between 1 and 31. I am confident this would satisfy your 80% requirement.
Regex for years: [0-9]{4} - convert the match to a number and check that it is within the range of years you consider valid.
Regex for months: jan|january|feb|february ... etc. for each month
Regex for days of the month: [0-9]{1,2} - convert the match to a number and check that it is between 1 and 31.
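A rough Python sketch of that heuristic, in case it helps. The function name `extract_dates`, the default 1900-2100 year range, and the choice to inspect only the two tokens before the year are all assumptions made for illustration; the answer above does not specify them. Tokenizing first sidesteps the problem of `[0-9]{4}` matching inside longer digit runs such as phone numbers.

```python
import re

# Month names and abbreviations mapped to month numbers (English only).
MONTHS = {
    "jan": 1, "january": 1, "feb": 2, "february": 2, "mar": 3, "march": 3,
    "apr": 4, "april": 4, "may": 5, "jun": 6, "june": 6, "jul": 7, "july": 7,
    "aug": 8, "august": 8, "sep": 9, "sept": 9, "september": 9,
    "oct": 10, "october": 10, "nov": 11, "november": 11,
    "dec": 12, "december": 12,
}

# Split text into runs of letters or runs of digits; punctuation is dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+")


def extract_dates(text, min_year=1900, max_year=2100):
    """Return a list of (year, month, day) tuples found in text.

    month and day are None when no plausible neighbor is found.
    """
    tokens = TOKEN_RE.findall(text)
    results = []
    for i, tok in enumerate(tokens):
        # A year candidate is a token of exactly four digits in range.
        if not (tok.isdigit() and len(tok) == 4):
            continue
        year = int(tok)
        if not (min_year <= year <= max_year):
            continue
        month = day = None
        # Assumption: only the two tokens before the year are inspected.
        for neighbor in tokens[max(0, i - 2):i]:
            lowered = neighbor.lower()
            if lowered in MONTHS:
                month = MONTHS[lowered]
            elif neighbor.isdigit() and len(neighbor) <= 2 and 1 <= int(neighbor) <= 31:
                day = int(neighbor)
        if month is None:
            day = None  # a stray number next to a year is not trusted as a day
        results.append((year, month, day))
    return results


print(extract_dates("Meeting on Jan 5, 2009 and again in March 2010."))
```

This handles "Jan 5, 2009" and "5 January 2009" style references but not numeric forms such as 01/05/2009, and it will misread "may" used as a verb right before a year, so treat it as a starting point to refine against real data.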
I'm drawing a blank on how to find what to feed it, but this library will parse a wide range of dates and could be used as the "is this a real date" function. (Full disclosure, I'm the author of that lib)
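The library referred to above is not named in this text, so as a stand-in, here is a minimal sketch of an "is this a real date" check using only Python's standard `datetime` module. The function name `is_real_date` and the list of formats are assumptions chosen for illustration, and the month-name formats assume an English locale.

```python
from datetime import datetime

# Hypothetical list of formats to try; extend it to match your data.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",   # 2009-01-05
    "%d %B %Y",   # 5 January 2009
    "%B %d, %Y",  # January 5, 2009
    "%b %d, %Y",  # Jan 5, 2009
    "%m/%d/%Y",   # 01/05/2009
)


def is_real_date(candidate):
    """Return True if candidate parses as a valid calendar date."""
    cleaned = candidate.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            # strptime raises ValueError on a format mismatch and on
            # impossible dates such as February 30.
            datetime.strptime(cleaned, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_real_date("Jan 5, 2009"))
print(is_real_date("2009-02-30"))
```

A check like this works as a filter on candidate strings produced by a regex pass: the regex over-matches, and the parser rejects anything that is not an actual calendar date.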