How to parse multiple dates from a block of text in Python (or another language)
I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.
Unfortunately, if a string has multiple date values in it, dateutil throws an error:
>>> from dateutil.parser import parse
>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
Any thoughts on how to parse all dates from a long string? Ideally, a list would be created, but I can handle that myself if I need to.
I'm using Python, but at this point, other languages are probably OK, if they get the job done.
PS - I guess I could recursively split the input file in the middle and try, try again until it works, but it's a hell of a hack.
Looking at it, the least hacky way would be to modify dateutil parser to have a fuzzy-multiple option.
parser._parse takes your string, tokenizes it with _timelex and then compares the tokens with data defined in parserinfo.
Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.
What I suggest is that you allow non-matches while you don't have any processed time tokens; then, when you hit a non-match, process the data parsed up to that point and start looking for time tokens again.
Shouldn't take too much effort.
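To illustrate the grouping idea outside of dateutil's internals, here is a minimal standard-library sketch. It is not the patch described above: the names date_like_runs, is_date_token, MONTHS and CONNECTORS are made up for this example, and the token rules are deliberately crude (the word "may" counts as a month, and any lone number counts as a date token). Each run it yields could then be handed to dateutil's parse on its own.

```python
import re

# Hypothetical helper data for this sketch; not part of dateutil.
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept",
    "oct", "nov", "dec",
}
# Tokens that may sit between two date tokens without ending the run.
CONNECTORS = {"of", ",", "-", "/", "."}

TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:st|nd|rd|th)?|[^\sA-Za-z\d]")
NUMBER_RE = re.compile(r"\d+(?:st|nd|rd|th)?")


def is_date_token(token):
    """True for month names and numbers such as '2011', '04' or '29th'."""
    return token.lower() in MONTHS or NUMBER_RE.fullmatch(token) is not None


def date_like_runs(text):
    """Yield substrings made of consecutive date-like tokens."""
    start = end = None
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if is_date_token(token):
            if start is None:
                start = match.start()
            end = match.end()
        elif start is not None and token.lower() in CONNECTORS:
            continue  # keep the run open; trailing connectors get trimmed
        elif start is not None:
            # A non-matching token ends the current run.
            yield text[start:end]
            start = end = None
    if start is not None:
        yield text[start:end]


s = ("I like peas on 2011-04-23, and I also like them on easter "
     "and my birthday, the 29th of July, 1928")
print(list(date_like_runs(s)))
# ['2011-04-23', '29th of July, 1928']
```

Slicing the original string by character offsets, rather than re-joining tokens, keeps each candidate exactly as it was written, which matters for forms like 2011-04-23.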
Update
While you're waiting for your patch to get rolled in...
This is a little hacky, uses non-public functions in the library, but doesn't require modifying the library and is not trial-and-error. You might have false positives if you have any lone tokens that can be turned into floats. You might need to filter the results some more.
Yields:
Update for Dieter
Dateutil 2.1 appears to be written for compatibility with python3 and uses a "compatibility" library called six. Something isn't right with it and it's not treating str objects as text.
This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:
If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. E.g:
While I was offline, I was bothered by the answer I posted here yesterday. Yes it did the job, but it was unnecessarily complicated and extremely inefficient.
Here's the back-of-the-envelope edition that should do a much better job!
Example usage:
It's probably worth noting that its behaviour deviates slightly from dateutil.parser.parse when dealing with empty/unknown strings. Dateutil will return the current day, while parse_multiple returns an empty list which, IMHO, is what one would expect.
P.S. Just spotted MattH's updated answer which does something very similar.
I think if you put the "words" in an array, it should do the trick. With that you can verify whether each one is a date or not, and put it in a variable.
Once you have the date, you should use the datetime library.
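A rough sketch of that word-by-word idea, using only the standard library. The function names and the FORMATS tuple are assumptions made for this example; extend the formats to whatever your text actually contains. Note that this only catches dates written as a single word, so "the 29th of July, 1928" is missed.

```python
from datetime import datetime

# Assumed set of single-word date formats; adjust to your data.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def parse_word(word):
    """Return a date if the word matches one of FORMATS, else None."""
    cleaned = word.strip(",.;:!?()")  # drop surrounding punctuation
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def dates_in_words(text):
    """Split text on whitespace and keep the words that parse as dates."""
    return [d for d in map(parse_word, text.split()) if d is not None]


s = ("I like peas on 2011-04-23, and I also like them on easter "
     "and my birthday, the 29th of July, 1928")
print(dates_in_words(s))
# [datetime.date(2011, 4, 23)]
```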
Why not write a regex pattern covering all the possible forms in which a date can appear, and then run the regex over the text? I presume that there are not dozens and dozens of ways to express a date in a string.
The only problem is collecting as many date expressions as possible.
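As a small illustration, here is a sketch that covers just two forms, ISO dates like 2011-04-23 and long forms like "29th of July, 1928". The pattern names and the find_dates helper are invented for this example, and a real pattern would need many more alternatives.

```python
import re
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november",
               "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Two alternatives only: YYYY-MM-DD and "<day>[th] [of] <month>[,] <year>".
ISO_RE = r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"
LONG_RE = (r"(?P<d2>\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?"
           r"(?P<mon>[A-Za-z]+),?\s+(?P<y2>\d{4})")
DATE_RE = re.compile(r"\b(?:" + ISO_RE + "|" + LONG_RE + r")\b", re.IGNORECASE)


def find_dates(text):
    """Return a list of date objects for every match of DATE_RE in text."""
    found = []
    for match in DATE_RE.finditer(text):
        try:
            if match.group("y"):
                found.append(date(int(match.group("y")),
                                  int(match.group("m")),
                                  int(match.group("d"))))
            else:
                month = MONTHS.get(match.group("mon").lower())
                if month is not None:
                    found.append(date(int(match.group("y2")), month,
                                      int(match.group("d2"))))
        except ValueError:
            pass  # looks like a date but is not valid, e.g. 2011-13-45
    return found


s = ("I like peas on 2011-04-23, and I also like them on easter "
     "and my birthday, the 29th of July, 1928")
print(find_dates(s))
# [datetime.date(2011, 4, 23), datetime.date(1928, 7, 29)]
```

Relative or named dates such as "easter" are out of reach for this approach unless you add explicit rules for them.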
I see that there are some good answers already, but I'm adding this one because it worked better in a use case of mine where the above answers didn't.
Using this library: https://datefinder.readthedocs.io/en/latest/index.html#module-datefinder
Output: