How to parse multiple dates from a block of text in Python (or another language)

Posted 2024-11-29 06:14:52

I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.

Unfortunately, if a string has multiple date values in it, dateutil throws an error:

>>> from dateutil.parser import parse
>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Any thoughts on how to parse all dates from a long string? Ideally, a list would be created, but I can handle that myself if I need to.

I'm using Python, but at this point, other languages are probably OK, if they get the job done.

PS - I guess I could recursively split the input file in the middle and try, try again until it works, but it's a hell of a hack.

Comments (5)

安静 2024-12-06 06:14:52

Looking at it, the least hacky way would be to modify dateutil parser to have a fuzzy-multiple option.

parser._parse takes your string, tokenizes it with _timelex and then compares the tokens with data defined in parserinfo.

Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.

What I suggest is that you allow non-matches while you don't have any processed time tokens; then, when you hit a non-match, process the data parsed up to that point and start looking for time tokens again.

Shouldn't take too much effort.


Update

While you're waiting for your patch to get rolled in...

This is a little hacky: it uses non-public functions in the library (such as _timelex), but it doesn't require modifying the library and isn't trial-and-error. You may get false positives from any lone token that can be converted to a float, so you might need to filter the results further.

from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

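def timetoken(token):
  # Any token that converts to a float is treated as a possible date part.
  try:
    float(token)
    return True
  except ValueError:
    pass
  # jump, pertain and utczone return booleans.
  if info.jump(token) or info.pertain(token) or info.utczone(token):
    return True
  # These lookups return an index or offset, which can be 0 (e.g. "Monday",
  # "am"), so compare against None instead of relying on truthiness.
  lookups = (info.weekday, info.month, info.hms, info.ampm, info.tzoffset)
  return any(f(token) is not None for f in lookups)

def timesplit(input_string):
  # Collect consecutive time-like tokens into a batch; a non-time token
  # ends the current batch, which is then yielded as one candidate string.
  batch = []
  for token in _timelex(input_string):
    if timetoken(token):
      if info.jump(token):
        continue  # separators such as " ", ",", "of" are dropped
      batch.append(token)
    else:
      if batch:
        yield " ".join(batch)
        batch = []
  # Flush a date that ends the input string.
  if batch:
    yield " ".join(batch)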

for item in timesplit(a):
  print "Found:", item
  print "Parsed:", p.parse(item)

Yields:

Found: 2011 04 23
Parsed: 2011-04-23 00:00:00
Found: 29 July 1928
Parsed: 1928-07-29 00:00:00

Update for Dieter

Dateutil 2.1 appears to be written for compatibility with Python 3 and uses a "compatibility" library called six. Something isn't right with it, and it's not treating str objects as text.

This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:

from cStringIO import StringIO
for item in timesplit(StringIO(a)):
  print "Found:", item
  print "Parsed:", p.parse(StringIO(item))

If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. For example:

from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)
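
As a rough illustration of what that option changes, the sketch below parses one ambiguous date with a default parser and with a dayfirst one. It assumes Python 3 and a reasonably recent dateutil; the variable names are hypothetical and only there for the comparison.

```python
from dateutil.parser import parser, parserinfo

# Two parser objects: one with default settings, one that reads the
# first number of an ambiguous date as the day.
default_parser = parser()
dayfirst_parser = parser(parserinfo(dayfirst=True))

ambiguous = "01/02/2011"

month_first = default_parser.parse(ambiguous)  # read as January 2nd
day_first = dayfirst_parser.parse(ambiguous)   # read as February 1st

print(month_first.date(), day_first.date())
```

The same parser object can then be used in place of p in the timesplit loop, so every candidate string is interpreted with the same day/month ordering.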

她说她爱他 2024-12-06 06:14:52

While I was offline, I was bothered by the answer I posted here yesterday. Yes, it did the job, but it was unnecessarily complicated and extremely inefficient.

Here's the back-of-the-envelope edition that should do a much better job!

import itertools
from dateutil import parser

jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))

def parse_multiple(s):
    def is_valid_kw(s):
        try:  # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords

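    def _split(s):
        kw_found = False
        start = 0
        # _timelex.split() keeps separators (spaces, punctuation) as tokens,
        # so joining a slice of tokens gives back a parseable substring.
        tokens = parser._timelex.split(s)
        # enumerate() replaces xrange(), so this runs on Python 2 and 3.
        for i, token in enumerate(tokens):
            if token in jumpwords:
                continue  # jump words neither start nor end a candidate date
            if not kw_found and is_valid_kw(token):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(token):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])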

    return [parser.parse(x) for x in _split(s)]

Example usage:

>>> parse_multiple("I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928")
[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]

It's probably worth noting that its behaviour deviates slightly from dateutil.parser.parse when dealing with empty/unknown strings. In the dateutil version used here, parse returns the current day, while parse_multiple returns an empty list, which, IMHO, is what one would expect. (Newer dateutil releases may raise an error on an empty string instead.)

>>> from dateutil import parser
>>> parser.parse("")
datetime.datetime(2011, 8, 12, 0, 0)
>>> parse_multiple("")
[]

P.S. Just spotted MattH's updated answer which does something very similar.

故事还在继续 2024-12-06 06:14:52

I think if you put the "words" in an array, it should do the trick. With that, you can check whether each one is a date or not, and store it in a variable.

Once you have the date, you should use the datetime library.
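
A minimal sketch of that word-by-word idea, using only the standard library, might look like this. The FORMATS tuple and the dates_from_words name are assumptions made for the example, not part of the answer above.

```python
from datetime import datetime

# Assumed single-word formats to try; extend as needed.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def dates_from_words(text):
    """Try each whitespace-separated word against the known formats."""
    dates = []
    for word in text.split():
        candidate = word.strip(".,;:!?()")  # drop surrounding punctuation
        for fmt in FORMATS:
            try:
                dates.append(datetime.strptime(candidate, fmt))
                break  # first matching format wins
            except ValueError:
                continue
    return dates


s = ("I like peas on 2011-04-23, and I also like them on easter "
     "and my birthday, the 29th of July, 1928")
print(dates_from_words(s))
```

On the question's string this finds only 2011-04-23: a date written across several words, such as "the 29th of July, 1928", is never seen as a single word, so this approach would need to group neighbouring words to catch it.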

朮生 2024-12-06 06:14:52

Why not write a regex pattern covering all the forms in which a date can appear, and then run it over the text? I presume there aren't dozens and dozens of ways to express a date in a string.

The only problem is collecting as many date expressions as possible.

·深蓝 2024-12-06 06:14:52

I see that there are some good answers already, but I'm adding this one because it worked better in one of my use cases where the answers above didn't.

Using this library: https://datefinder.readthedocs.io/en/latest/index.html#module-datefinder


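import datefinder

s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

def DatesToList(x):
    # find_dates() yields datetime objects lazily; collect them into a list
    return list(datefinder.find_dates(x))


dates = DatesToList(s)
print(dates)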


Output:

[datetime.datetime(2011, 4, 23, 0, 0), datetime.datetime(1928, 7, 29, 0, 0)]
