Speeding up date pattern matching
I am writing some simple code that tries to deduce whether or not a specific String is actually a Java date and, if yes, identify its format (pattern).
Obviously, because there are many possible date formats, establishing which one is applicable for a string requires successive pattern matching, which is really time and CPU-consuming, given that the input string can have other values, too.
So, what I have ended up doing, for a String variable called input, is something like
    // Stays null when the input does not look like a date or matches no pattern
    String datePattern = null;
    if (isLikeDate(input)) {
        datePattern = matchAnyOfThePredefinedDatePatterns(input);
    }
where the isLike... method rejects obvious non-date strings and the match... method goes over about 40-50 predefined patterns, trying to parse the input with a SimpleDateFormat object built for each one. Parsing throws an exception if the input string is not a valid date for the pattern examined each time.
The exception handling slows things down dramatically, but there seems to be no avoiding it. The Apache Commons Date packages exhibit similar performance.
Is there any faster way of implementing this date pattern matching?
Comments (3)
Depending on the complexity of the patterns, you might want to match each potential pattern with a regex (or hand-written code) before trying to parse it properly as a date. For example, if the pattern is "yyyyMMddThh:mm:ss" you could check the length, the position of the T, the position of the colons, and that everything else is a digit before passing it on to the date parsing code.
This level of pattern matching can be very liberal - it's only trying to rule out definite infringements of the pattern. The important thing is that it doesn't reject any values which are actually valid.
The downside is that for any pattern which does match, you're doing work twice - but that may well still be easily balanced by significantly reducing the number of exceptions you throw.
EDIT: Just to clarify, you're currently testing whether it looks like it could match any of the patterns, and then testing all of them. I'm suggesting that you have a regex for each pattern, and only try parsing against patterns which have already matched the corresponding regex.
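As a rough illustration of the per-pattern check described above, here is a minimal sketch. The class name, the method name and the three pattern/regex pairs are assumptions made up for the example, not code from the question. The regexes are intentionally loose, and the example pattern is written as yyyyMMdd'T'HH:mm:ss because SimpleDateFormat needs the literal T quoted. It also relies on parse(String, ParsePosition), which returns null on failure rather than throwing.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: names and the pattern table are illustrative only.
class PrefilteredDateMatcher {
    // Each date pattern is paired with a deliberately liberal "shape" regex.
    private static final Map<String, Pattern> CANDIDATES = new LinkedHashMap<>();
    static {
        CANDIDATES.put("yyyy-MM-dd", Pattern.compile("\\d+-\\d{1,2}-\\d{1,2}"));
        CANDIDATES.put("dd/MM/yyyy", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d+"));
        CANDIDATES.put("yyyyMMdd'T'HH:mm:ss",
                Pattern.compile("\\d{8}T\\d{1,2}:\\d{1,2}:\\d{1,2}"));
    }

    /** Returns the first pattern that parses the whole input, or null if none does. */
    static String matchPattern(String input) {
        for (Map.Entry<String, Pattern> e : CANDIDATES.entrySet()) {
            // Cheap rejection: skip patterns whose shape cannot possibly fit.
            if (!e.getValue().matcher(input).matches()) {
                continue;
            }
            // Created per call only to keep the sketch short and thread-safe.
            SimpleDateFormat fmt = new SimpleDateFormat(e.getKey());
            fmt.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            // Returns null on failure instead of throwing ParseException.
            if (fmt.parse(input, pos) != null && pos.getIndex() == input.length()) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchPattern("2024-02-29"));
        System.out.println(matchPattern("not a date"));
    }
}
```

With this arrangement the full parse only runs for patterns whose shape already fits, so most of the 40-50 candidates would be discarded by a regex match alone.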
I'd also suggest trying Joda Time - not only is it a generally better API, but its formatters are thread-safe, so you can reuse them. Presumably you're currently creating new SimpleDateFormat objects each time you have something to parse.
Does this mean you are constructing new SimpleDateFormat objects in every call to match? That is quite expensive, don't do that.
Keep the format objects previously constructed. If I remember right, SimpleDateFormat.parse() is not thread-safe, so some extra work will be required.
Of course, you want to try the formats with higher chances of succeeding first, but I don't know if you have that insight into data patterns.
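One way to do that "extra work" is to keep a set of formatters per thread. The sketch below assumes a ThreadLocal cache. The class name, the method name and the two patterns are hypothetical, and the list order stands in for "most likely format first". It is one possible arrangement, not the only one.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: names and patterns are illustrative only.
class CachedDateFormats {
    // Assumed ordering: put the patterns you expect most often first.
    private static final List<String> PATTERNS = Arrays.asList("yyyy-MM-dd", "dd/MM/yyyy");

    // SimpleDateFormat is not thread-safe, so every thread builds its own
    // formatters once and then reuses them on later calls.
    private static final ThreadLocal<List<SimpleDateFormat>> FORMATS =
            ThreadLocal.withInitial(() -> {
                List<SimpleDateFormat> list = new ArrayList<>();
                for (String p : PATTERNS) {
                    SimpleDateFormat f = new SimpleDateFormat(p);
                    f.setLenient(false);
                    list.add(f);
                }
                return list;
            });

    /** Returns the first cached pattern that parses the whole input, or null. */
    static String matchPattern(String input) {
        for (SimpleDateFormat f : FORMATS.get()) {
            ParsePosition pos = new ParsePosition(0);
            // parse(String, ParsePosition) signals failure with null, not an exception.
            if (f.parse(input, pos) != null && pos.getIndex() == input.length()) {
                return f.toPattern();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchPattern("15/08/2021"));
    }
}
```

Synchronizing on a shared formatter would also work, at the cost of contention between threads.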
You might consider building a trie-like state machine, sort of like playing pachinko with the incoming string. This would fail relatively quickly on non-dates; it is basically a date grammar parser.
Not sure if it would always be faster, or enough faster to be worth the effort.
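A much simplified sketch of that idea, under the assumption that each date pattern can be described by a fixed "shape" in which every digit is written as 9 and every other character stands for itself. The class, its methods and the shapes are hypothetical. A real date grammar would also need variable-width fields and month names, which this does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a trie keyed on character classes rather than raw characters.
class DateShapeTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> patterns = new ArrayList<>(); // patterns whose shape ends here
    }

    private final Node root = new Node();

    // Collapse a character to its class: any digit becomes '9', others stay literal.
    private static char classOf(char c) {
        return (c >= '0' && c <= '9') ? '9' : c;
    }

    /** Registers a shape such as "9999-99-99" for the given date pattern. */
    void add(String shape, String pattern) {
        Node n = root;
        for (int i = 0; i < shape.length(); i++) {
            n = n.next.computeIfAbsent(shape.charAt(i), k -> new Node());
        }
        n.patterns.add(pattern);
    }

    /** Walks the input once and gives up as soon as no branch fits. */
    List<String> candidatesFor(String input) {
        Node n = root;
        for (int i = 0; i < input.length(); i++) {
            n = n.next.get(classOf(input.charAt(i)));
            if (n == null) {
                return Collections.emptyList(); // early exit for non-dates
            }
        }
        return n.patterns;
    }

    public static void main(String[] args) {
        DateShapeTrie trie = new DateShapeTrie();
        trie.add("9999-99-99", "yyyy-MM-dd");
        trie.add("99/99/9999", "dd/MM/yyyy");
        trie.add("99/99/9999", "MM/dd/yyyy");
        System.out.println(trie.candidatesFor("12/31/2023"));
    }
}
```

The walk only narrows the candidates. Shapes shared by several patterns (day-first versus month-first, for instance) still need a real parse to tell them apart.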