Recognizing dates in a string
I want a class something like this:
public interface IDateRecognizer
{
DateTime[] Recognize(string s);
}
The dates might exist anywhere in the string and might be any format. For now, I could limit to U.S. culture formats. The dates would not be delimited in any way. They might have arbitrary amounts of whitespace between parts of the date. The ideas I have are:
- ANTLR
- Regex
- Hand rolled
I have never used ANTLR, so I would be learning from scratch. I wonder if there are libraries or code samples out there that do something similar that could jump start me. Is ANTLR too heavy for such a narrow use?
I have used Regex a lot before, but I hate it for all the reasons that most people hate it.
I could certainly hand roll it but I'd rather not re-solve a solved problem.
Suggestions?
UPDATE: Here is an example. Given this input:
This is a date 11/3/63. Here is
another one: November 03, 1963; and
another one Nov 03, 63 and some
more (11/03/1963). The dates could be
in any U.S. format. They might have
dashes like 11-2-1963 or weird extra
whitespace inside like this:
Nov   3,   1963,
and even maybe the comma is missing
like [Nov 3 63] but that's an edge
case.
The output should be an array of seven DateTimes. Each date would be the same: 11/03/1963 00:00:00.
UPDATE: I totally hand rolled this, and I am happy with the result. Instead of using Regex, I ended up using DateTime.TryParse with a custom DateTimeFormatInfo, which makes it very easy to fine-tune which formats are allowed and how 2-digit years are handled. Performance is quite acceptable given that this is handled async. The tricky part was tokenizing the input and testing sets of adjacent tokens in an efficient way.
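For readers who want to see what the tokenize-and-test idea could look like, here is a minimal sketch. It is not the asker's actual code: the class name, the format list, the punctuation set, and the three-token window are all assumptions made for illustration. It also swaps in DateTime.TryParseExact with an explicit format list instead of DateTime.TryParse, because the lenient parser accepts partial dates such as "Nov 3" and that makes the result harder to predict. Two-digit years are resolved through the culture calendar's TwoDigitYearMax.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TokenWindowDateRecognizer
{
    // Assumed format list; extend or trim it to control what counts as a date.
    private static readonly string[] Formats =
    {
        "M/d/yyyy", "M/d/yy", "M-d-yyyy", "M-d-yy",
        "MMMM d, yyyy", "MMMM d, yy", "MMM d, yyyy", "MMM d, yy",
        "MMMM d yyyy", "MMMM d yy", "MMM d yyyy", "MMM d yy"
    };

    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Punctuation stripped from both ends of a candidate (assumed set).
    private static readonly char[] EdgePunctuation = { '.', ',', ';', ':', '(', ')', '[', ']' };

    // "Nov 3, 1963" spans three whitespace-separated tokens at most.
    private const int MaxTokensPerDate = 3;

    private static readonly CultureInfo Culture = CreateCulture();

    private static CultureInfo CreateCulture()
    {
        // Writable en-US culture without user overrides.
        var culture = new CultureInfo("en-US", false);
        // Two-digit years map to 1930..2029, so "63" becomes 1963.
        culture.DateTimeFormat.Calendar.TwoDigitYearMax = 2029;
        return culture;
    }

    public static DateTime[] Recognize(string s)
    {
        // Splitting on whitespace also collapses runs of extra whitespace.
        string[] tokens = s.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
        var found = new List<DateTime>();
        int i = 0;
        while (i < tokens.Length)
        {
            int consumed = 0;
            // Try the longest window of adjacent tokens first.
            for (int len = Math.Min(MaxTokensPerDate, tokens.Length - i); len >= 1; len--)
            {
                string candidate = string.Join(" ", tokens, i, len).Trim(EdgePunctuation);
                if (DateTime.TryParseExact(candidate, Formats, Culture,
                                           DateTimeStyles.None, out DateTime date))
                {
                    found.Add(date);
                    consumed = len;
                    break;
                }
            }
            // Skip past a recognized date, otherwise advance by one token.
            i += consumed > 0 ? consumed : 1;
        }
        return found.ToArray();
    }
}
```

Calling TokenWindowDateRecognizer.Recognize on the sample input above should return seven values. Note that 11-2-1963 parses as November 2, 1963, so one of the seven differs from the others unless that sample date is a typo. Each start position costs at most three parse attempts, so the scan stays linear in the number of tokens.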
Comments (3)
I'd go for a hand-rolled solution that chops the input string into manageable pieces and lets a few regexes do the work. This also seems like a great case for starting with unit tests.
I'd suggest going with regex. I'd put each regex (matching one date format) into its own string and collect those strings in an array, then build the full regex at runtime. This makes the system more flexible. Depending on what you need, you could consider storing the individual date regexes in an (XML) file or a database.
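A rough sketch of that idea, under stated assumptions: the two patterns, the class name, and the method name below are illustrative and not taken from the answer, and the patterns are hard-coded here rather than loaded from a file or database. The sketch only finds candidate substrings; turning each match into a DateTime would be a separate step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombinedDateRegex
{
    // One pattern per date shape (assumed set; add more entries as needed).
    private static readonly string[] DatePatterns =
    {
        // Numeric: 11/3/63, 11/03/1963, 11-2-1963
        @"\d{1,2}\s*[/-]\s*\d{1,2}\s*[/-]\s*(?:\d{4}|\d{2})",
        // Month name: November 03, 1963 / Nov 03, 63 / Nov 3 63
        @"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}(?:\s*,\s*|\s+)(?:\d{4}|\d{2})"
    };

    // The full regex is assembled at runtime from the individual patterns.
    private static readonly Regex Combined = Build(DatePatterns);

    public static Regex Build(IEnumerable<string> patterns)
    {
        // Wrap each pattern in its own group so an inner '|' cannot leak out.
        string alternation = string.Join("|", patterns.Select(p => "(?:" + p + ")"));
        return new Regex(@"\b(?:" + alternation + @")\b",
                         RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    public static List<string> FindCandidates(string s)
    {
        var results = new List<string>();
        foreach (Match m in Combined.Matches(s))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Because each date shape lives in its own array entry, supporting a new format means adding one string rather than editing a single large expression. The month-name pattern is deliberately loose (it would also accept a word like "Novel" followed by numbers), so a real implementation would likely tighten it.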
Recognising dates seems like a straightforward and easy task for Regex. I cannot understand why you are trying to avoid it.
ANTLR is overkill for a case like this, where you have a very limited set of semantics.
While performance could be a potential issue, I really doubt that the other options would give you better performance.
So I would go with Regex.