Parsing dates in Python without using a default

Posted on 2024-12-20 04:08:13

I'm using Python's dateutil.parser tool to parse some dates I'm getting from a third-party feed. It allows specifying a default date, which itself defaults to today, for filling in missing elements of the parsed date. While this is generally helpful, there is no sane default for my use case, and I would prefer to treat partial dates as if I had not gotten a date at all (since it almost always means I got garbled data). I've written the following workaround:

from dateutil import parser
import datetime

def parse_no_default(dt_str):
  dt = parser.parse(dt_str, default=datetime.datetime(1900, 1, 1)).date()
  dt2 = parser.parse(dt_str, default=datetime.datetime(1901, 2, 2)).date()
  if dt == dt2:
    return dt
  else:
    return None

(This snippet only looks at the date, as that's all I care about for my application, but similar logic could be extended to include the time component.)

I'm wondering (hoping) there's a better way of doing this. Parsing the same string twice just to see if it fills in different defaults seems like a gross waste of resources, to say the least.

Here's the set of tests (using nosetest generators) for the expected behavior:

import datetime
import nose.tools
import lib.tools.date

def check_parse_no_default(sample, expected):
  actual = lib.tools.date.parse_no_default(sample)
  nose.tools.eq_(actual, expected)

def test_parse_no_default():
  cases = ( 
      ('2011-10-12', datetime.date(2011, 10, 12)),
      ('2011-10', None),
      ('2011', None),
      ('10-12', None),
      ('2011-10-12T11:45:30', datetime.date(2011, 10, 12)),
      ('10-12 11:45', None),
      ('', None),
      )   
  for sample, expected in cases:
    yield check_parse_no_default, sample, expected

Comments (4)

清风挽心 2024-12-27 04:08:13

Depending on your domain, the following solution might work:

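import datetime
from dateutil import parser

DEFAULT_DATE = datetime.datetime(datetime.MINYEAR, 1, 1)

def parse_no_default(dt_str):
    dt = parser.parse(dt_str, default=DEFAULT_DATE).date()
    # dt is a date, so compare it with the date part of the default.
    if dt != DEFAULT_DATE.date():
        return dt
    else:
        return None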

Another approach would be to monkey-patch the parser class (this is very hackish, so I wouldn't recommend it if you have other options):

import dateutil.parser as parser
def parse(self, timestr, default=None,
          ignoretz=False, tzinfos=None,
          **kwargs):
    return self._parse(timestr, **kwargs)
parser.parser.parse = parse

You can use it as follows:

>>> ddd = parser.parser().parse('2011-01-02', None)
>>> ddd
_result(year=2011, month=1, day=2)
>>> ddd = parser.parser().parse('2011', None)
>>> ddd
_result(year=2011)

By checking which members are available in the result (ddd), you can determine when to return None.
When all fields are available, you can convert ddd into a datetime object:

# ddd might have following fields:
# "year", "month", "day", "weekday",
# "hour", "minute", "second", "microsecond",
# "tzname", "tzoffset"
datetime.datetime(ddd.year, ddd.month, ddd.day)
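
The field check described above can be sketched without touching dateutil at all. This is a minimal sketch, assuming the parsed result exposes year, month and day attributes that are None when the field was absent from the input. The helper name result_to_date and the SimpleNamespace stand-ins are hypothetical and not part of dateutil.

```python
import datetime
from types import SimpleNamespace

REQUIRED_FIELDS = ("year", "month", "day")

def result_to_date(res):
    """Return a datetime.date if res carries year, month and day, else None."""
    # Assumption: a field that was not parsed is missing or set to None.
    values = [getattr(res, name, None) for name in REQUIRED_FIELDS]
    if any(value is None for value in values):
        return None
    return datetime.date(*values)

# Stand-ins for parser results; real code would pass the object from _parse.
full = SimpleNamespace(year=2011, month=1, day=2)
partial = SimpleNamespace(year=2011, month=None, day=None)

print(result_to_date(full))     # 2011-01-02
print(result_to_date(partial))  # None
```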
马蹄踏│碎落叶 2024-12-27 04:08:13

This is probably a "hack", but it looks like dateutil looks at very few attributes out of the default you pass in. You could provide a 'fake' datetime that explodes in the desired way.

>>> import datetime
>>> import dateutil.parser
>>> class NoDefaultDate(object):
...     def replace(self, **fields):
...         if any(f not in fields for f in ('year', 'month', 'day')):
...             return None
...         return datetime.datetime(2000, 1, 1).replace(**fields)
>>> def wrap_parse(v):
...     _actual = dateutil.parser.parse(v, default=NoDefaultDate())
...     return _actual.date() if _actual is not None else None
>>> cases = (
...   ('2011-10-12', datetime.date(2011, 10, 12)),
...   ('2011-10', None),
...   ('2011', None),
...   ('10-12', None),
...   ('2011-10-12T11:45:30', datetime.date(2011, 10, 12)),
...   ('10-12 11:45', None),
...   ('', None),
...   )
>>> all(wrap_parse(test) == expected for test, expected in cases)
True
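
To see the mechanism in isolation: the trick relies on the parser calling replace() on the default with only the fields it actually found. The sketch below exercises that call directly, with no dateutil involved. The keyword sets passed to replace() are assumptions about what a parser would supply, and dateutil's internals may differ between releases, so it is worth re-running the test cases above against the installed version.

```python
import datetime

class NoDefaultDate(object):
    """Fake default whose replace() refuses incomplete dates."""

    def replace(self, **fields):
        # Return None unless year, month and day were all supplied.
        if any(f not in fields for f in ("year", "month", "day")):
            return None
        return datetime.datetime(2000, 1, 1).replace(**fields)

fake = NoDefaultDate()

# Complete date: behaves like a normal datetime.replace().
print(fake.replace(year=2011, month=10, day=12))  # 2011-10-12 00:00:00

# Only year and month supplied: the fake default yields None.
print(fake.replace(year=2011, month=10))          # None
```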
幸福不弃 2024-12-27 04:08:13

I ran into the exact same problem with dateutil, so I wrote this function and figured I would post it for posterity's sake. It basically uses the underlying _parse method, as @ILYA Khlopotov suggests:

import datetime
from dateutil.parser import parser

_CURRENT_YEAR = datetime.datetime.now().year

def is_good_date(date):
    try:
        # _parse is a private dateutil method; it accepts the raw string.
        parsed_date = parser()._parse(date)
    except Exception:
        return None
    # Some dateutil versions return a (result, skipped_tokens) tuple here.
    if isinstance(parsed_date, tuple):
        parsed_date = parsed_date[0]
    if not parsed_date: return None
    if not parsed_date.year: return None
    if parsed_date.year < 1890 or parsed_date.year > _CURRENT_YEAR: return None
    if not parsed_date.month: return None
    if parsed_date.month < 1 or parsed_date.month > 12: return None
    if not parsed_date.day: return None
    if parsed_date.day < 1 or parsed_date.day > 31: return None
    return parsed_date

The returned object isn't a datetime instance, but it has the .year, .month, and .day attributes, which was good enough for my needs. I suppose you could easily convert it to a datetime instance.
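
One possible conversion, as a minimal sketch: let datetime.date do the month and day validation, since a check such as day <= 31 would still accept 31 February. The helper name to_checked_date and its min_year and max_year parameters are hypothetical; the 1890 lower bound simply mirrors the function above.

```python
import datetime

def to_checked_date(year, month, day, min_year=1890, max_year=None):
    """Build a datetime.date from parsed fields, or return None if invalid."""
    if max_year is None:
        max_year = datetime.date.today().year
    # Missing fields (None) mean the input was only a partial date.
    if year is None or month is None or day is None:
        return None
    if not min_year <= year <= max_year:
        return None
    try:
        # datetime.date rejects impossible dates such as 2011-02-31.
        return datetime.date(year, month, day)
    except ValueError:
        return None

print(to_checked_date(2011, 10, 12))  # 2011-10-12
print(to_checked_date(2011, 2, 31))   # None
print(to_checked_date(1700, 1, 1))    # None
```

It could be called with the fields of the parsed object, for example to_checked_date(parsed_date.year, parsed_date.month, parsed_date.day).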

自控 2024-12-27 04:08:13

simple-date does this for you (it does try multiple formats internally, but not as many as you might think, because the patterns it uses extend Python's date patterns with optional parts, like regexps).

See https://github.com/andrewcooke/simple-date, but note that it only supports Python 3.2 and up (sorry).

By default it's more lenient than what you want:

>>> from simpledate import SimpleDate
>>> for date in ('2011-10-12', '2011-10', '2011', '10-12', '2011-10-12T11:45:30', '10-12 11:45', ''):
...   print(date)
...   try: print(SimpleDate(date).naive.datetime)
...   except: print('nope')
... 
2011-10-12
2011-10-12 00:00:00
2011-10
2011-10-01 00:00:00
2011
2011-01-01 00:00:00
10-12
nope
2011-10-12T11:45:30
2011-10-12 11:45:30
10-12 11:45
nope

nope

But you could specify your own format. For example:

>>> from simpledate import SimpleDateParser, invert
>>> parser = SimpleDateParser(invert('Y-m-d(%T| )?(H:M(:S)?)?'))
>>> for date in ('2011-10-12', '2011-10', '2011', '10-12', '2011-10-12T11:45:30', '10-12 11:45', ''):
...   print(date)
...   try: print(SimpleDate(date, date_parser=parser).naive.datetime)
...   except: print('nope')
... 
2011-10-12
2011-10-12 00:00:00
2011-10
nope
2011
nope
10-12
nope
2011-10-12T11:45:30
2011-10-12 11:45:30
10-12 11:45
nope

nope

PS: invert() just toggles the presence of %, which otherwise becomes a real mess when specifying complex date patterns. So here only the literal T character needs a % prefix (in standard Python date formatting it would be the only alphanumeric character without a prefix).
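
The same explicit-format idea can be sketched with only the standard library, assuming the feed uses a small, known set of layouts. The ACCEPTED_FORMATS tuple and the parse_strict name are hypothetical and only loosely mirror the pattern above; datetime.strptime raises ValueError on partial input rather than filling in defaults.

```python
import datetime

# Assumed layouts; adjust to whatever the feed actually sends.
ACCEPTED_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
)

def parse_strict(dt_str):
    """Return a datetime.date if dt_str matches an accepted format, else None."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.datetime.strptime(dt_str, fmt).date()
        except ValueError:
            continue  # Wrong layout or partial date; try the next format.
    return None

for sample in ("2011-10-12", "2011-10", "2011", "10-12",
               "2011-10-12T11:45:30", "10-12 11:45", ""):
    print(repr(sample), parse_strict(sample))
```

Each string is parsed at most once per format, and nothing is ever filled in from a default, which is the behavior the question's test cases ask for.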
