Recognizing dates in a string

Posted on 2024-10-20 10:17:54

I want a class something like this:

public interface IDateRecognizer
{
    DateTime[] Recognize(string s);
}

The dates might appear anywhere in the string and might be in any format. For now, I could limit this to U.S. culture formats. The dates would not be delimited in any way. They might have arbitrary amounts of whitespace between the parts of the date. The ideas I have are:

ANTLR
Regex
Hand rolled

I have never used ANTLR, so I would be learning it from scratch. I wonder if there are libraries or code samples out there that do something similar and could give me a jump start. Is ANTLR too heavy for such a narrow use?

I have used Regex a lot before, but I hate it for all the reasons that most people hate it.
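For comparison, here is a minimal sketch of what the Regex option might look like. It is only an illustration under stated assumptions: the class name `RegexDateSketch`, the pattern, and the two-digit-year pivot of 2029 are all hypothetical choices, and the pattern covers just the numeric (11/3/63, 11-2-1963) and month-name (Nov 3, 1963) shapes. Matches are validated afterwards, because the pattern alone will also accept things like "foo 12, 34".

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexDateSketch
{
    // Alternative 1: numeric month/day/year separated by '/' or '-'.
    // Alternative 2: a word, a day, then a comma and/or whitespace, then a year.
    private static readonly Regex Candidate = new Regex(
        @"\b(?<m>[0-9]{1,2})[/-](?<d>[0-9]{1,2})[/-](?<y>[0-9]{4}|[0-9]{2})\b" +
        @"|\b(?<name>[A-Za-z]{3,9})\s+(?<d>[0-9]{1,2})(?:\s*,\s*|\s+)(?<y>[0-9]{4}|[0-9]{2})\b",
        RegexOptions.Compiled);

    public static List<DateTime> Find(string s)
    {
        // Assumption: two-digit years map into 1930..2029.
        var calendar = new GregorianCalendar { TwoDigitYearMax = 2029 };
        DateTimeFormatInfo names = CultureInfo.InvariantCulture.DateTimeFormat;
        var result = new List<DateTime>();

        foreach (Match match in Candidate.Matches(s))
        {
            int month = match.Groups["m"].Success
                ? int.Parse(match.Groups["m"].Value, CultureInfo.InvariantCulture)
                : MonthFromName(match.Groups["name"].Value, names);
            int day = int.Parse(match.Groups["d"].Value, CultureInfo.InvariantCulture);
            int year = int.Parse(match.Groups["y"].Value, CultureInfo.InvariantCulture);
            if (match.Groups["y"].Length == 2)
            {
                year = calendar.ToFourDigitYear(year);
            }

            // Reject candidates that only look like dates (unknown month name, day 31 in June, ...).
            if (year >= 1 && month >= 1 && month <= 12 &&
                day >= 1 && day <= DateTime.DaysInMonth(year, month))
            {
                result.Add(new DateTime(year, month, day));
            }
        }
        return result;
    }

    // Returns 1..12 for an English full or abbreviated month name, 0 otherwise.
    private static int MonthFromName(string name, DateTimeFormatInfo names)
    {
        for (int i = 0; i < 12; i++)
        {
            if (name.Equals(names.MonthNames[i], StringComparison.OrdinalIgnoreCase) ||
                name.Equals(names.AbbreviatedMonthNames[i], StringComparison.OrdinalIgnoreCase))
            {
                return i + 1;
            }
        }
        return 0;
    }
}
```

Even at this size the pattern is hard to read, and every extra format means another alternative, which is a fair summary of why I would rather avoid this route.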

I could certainly hand-roll it, but I'd rather not re-solve a solved problem.

Suggestions?

UPDATE: Here is an example. Given this input:

This is a date 11/3/63. Here is
another one: November 03, 1963; and
another one Nov 03, 63 and some
more (11/03/1963). The dates could be
in any U.S. format. They might have
dashes like 11-2-1963 or weird extra
whitespace inside like this:
Nov   3,   1963,
and even maybe the comma is missing
like [Nov 3 63] but that's an edge
case.

The output should be an array of seven DateTimes. Six of them would be the same, 11/03/1963 00:00:00; the dashed example, 11-2-1963, as written would come out as 11/02/1963 00:00:00.

UPDATE: I totally hand-rolled this, and I am happy with the result. Instead of using Regex, I ended up using DateTime.TryParse with a custom DateTimeFormatInfo, which makes it very easy to fine-tune which formats are allowed and how 2-digit years are handled. Performance is quite acceptable given that this is handled asynchronously. The tricky part was tokenizing the input and testing sets of adjacent tokens in an efficient way.
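The tokenize-and-test idea can be sketched roughly as follows. This is not the code I actually ended up with, just a minimal illustration under some assumptions: the class name `TokenWindowDateRecognizer`, the format list, the window size of three tokens, and the two-digit-year pivot of 2029 are all hypothetical. It also swaps DateTime.TryParse for DateTime.TryParseExact with an explicit format list, and clones the invariant culture's DateTimeFormatInfo (English month names) rather than en-US, since the month-first formats are spelled out by hand anyway.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public interface IDateRecognizer
{
    DateTime[] Recognize(string s);
}

public sealed class TokenWindowDateRecognizer : IDateRecognizer
{
    // Hypothetical format list, month-first (U.S.) order only.
    // Commas and extra whitespace are dropped by the tokenizer, so no format needs them.
    private static readonly string[] Formats =
    {
        "M/d/yyyy", "M/d/yy", "M-d-yyyy", "M-d-yy",
        "MMMM d yyyy", "MMMM d yy", "MMM d yyyy", "MMM d yy",
    };

    // Month, day and year may arrive as up to three separate tokens.
    private const int MaxWindow = 3;

    private readonly DateTimeFormatInfo _format;

    public TokenWindowDateRecognizer(int twoDigitYearMax = 2029)
    {
        // Clone() returns a writable copy with its own calendar instance.
        _format = (DateTimeFormatInfo)CultureInfo.InvariantCulture.DateTimeFormat.Clone();
        _format.Calendar.TwoDigitYearMax = twoDigitYearMax; // 63 -> 1963 when the max is 2029
    }

    public DateTime[] Recognize(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        List<string> tokens = Tokenize(s);
        var dates = new List<DateTime>();
        int i = 0;
        while (i < tokens.Count)
        {
            int consumed = 0;
            // Try the longest window of adjacent tokens first, then shrink it.
            for (int len = Math.Min(MaxWindow, tokens.Count - i); len >= 1; len--)
            {
                string candidate = string.Join(" ", tokens.GetRange(i, len));
                if (DateTime.TryParseExact(candidate, Formats, _format,
                                           DateTimeStyles.None, out DateTime date))
                {
                    dates.Add(date);
                    consumed = len;
                    break;
                }
            }
            // Skip past a recognized date, otherwise slide forward by one token.
            i += Math.Max(consumed, 1);
        }
        return dates.ToArray();
    }

    // A token is a run of letters, digits, '/' or '-'; everything else separates tokens.
    private static List<string> Tokenize(string s)
    {
        var tokens = new List<string>();
        int start = -1;
        for (int i = 0; i <= s.Length; i++)
        {
            bool inToken = i < s.Length &&
                (char.IsLetterOrDigit(s[i]) || s[i] == '/' || s[i] == '-');
            if (inToken && start < 0)
            {
                start = i;
            }
            else if (!inToken && start >= 0)
            {
                tokens.Add(s.Substring(start, i - start));
                start = -1;
            }
        }
        return tokens;
    }
}
```

Calling `new TokenWindowDateRecognizer().Recognize(input)` on the example input above should pick up all seven dates, with 11-2-1963 coming back as November 2. Each start position costs at most three parse attempts, so the work grows linearly with the number of tokens; adding or removing entries in the format list is the main tuning knob.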
