Differentiating and parsing dates in Java
I know this topic isn't new, but I have to dig it up again.
I have already searched the web numerous times (including some threads here on Stack Overflow) but haven't found a satisfying answer so far.
(Among others, I checked "Parsing Ambiguous Dates in Java" and
http://www.coderanch.com/t/375367/java/java/Handling-Multiple-Date-Formats-Elegantly.)
I am currently writing a date parser in Java that takes a date string and generates a format string which SimpleDateFormat can then use to parse that date.
The dates are extracted via regex (yes, it's an ugly one xD) from log files (IBM WebSphere, Tomcat, Microsoft Exchange, ...). Because we have customers in (at least two) different locales, there is no way to simply "throw" the string at the parse method of SimpleDateFormat and expect it to work properly.
Furthermore, there is the problem of the position of day and month (i.e. formats "dd/MM/yyyy" or "MM/dd/yyyy"), which cannot be solved unless I have at least two data sets in which the day digit has changed.
So my current approach would be to store the date formats for a specific software installed on a specific customer's systems in a database (MySQL / XML / ...) and to force the user to specify at least the customer name and the software name, so that there is enough context to narrow down the number of possible formats.
This "subset" would then be used to try to parse the log files of the specified software.
(The subset is stored as a HashMap inside a HashMap, in the form
HashMap<Integer, HashMap<String, ...>> map;
The Integer key is the length of the format string, and the String key of the second HashMap is a date signature containing only the separator characters,
i.e. ".. ::." for a date with the format "dd.MM.yyyy 11:11:11.111".)
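To make the lookup idea concrete, here is a rough sketch of such a two-level index. The class and method names are made up for illustration, and the inner value type (a list of candidate patterns) is an assumption, since the post only describes the two keys. It also assumes fixed-width patterns, so that a pattern and the date text it matches have the same length.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: indexes format patterns by length and separator signature.
final class DateSignatureIndex {
    // pattern length -> (separator signature -> candidate patterns)
    private final Map<Integer, Map<String, List<String>>> byLength = new HashMap<>();

    // Keeps only the characters that are neither letters nor digits,
    // e.g. "dd.MM.yyyy HH:mm:ss.SSS" becomes ".. ::."
    static String signatureOf(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Registers a fixed-width pattern under its length and signature.
    void register(String pattern) {
        byLength.computeIfAbsent(pattern.length(), k -> new HashMap<>())
                .computeIfAbsent(signatureOf(pattern), k -> new ArrayList<>())
                .add(pattern);
    }

    // Returns every registered pattern that has the same length and
    // signature as the date text; an empty list if there is none.
    List<String> lookup(String dateText) {
        Map<String, List<String>> bySignature = byLength.get(dateText.length());
        if (bySignature == null) {
            return Collections.emptyList();
        }
        return bySignature.getOrDefault(signatureOf(dateText), Collections.emptyList());
    }
}
```

Note that "dd/MM/yyyy" and "MM/dd/yyyy" share both length and signature, so a lookup for "03/05/2011" returns both candidates. The index narrows the search but cannot settle the day/month order by itself.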
I also take the values of the numbers into account, i.e. a number greater than 12 has to be a day because there is no 13th month. But this only works reliably for date strings later than the 12th of a month.
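The "greater than 12" check could be applied across a whole batch of date strings roughly like this. It is only a sketch: the names are hypothetical, and it assumes every string starts with two numeric fields (day and month in unknown order) followed by the year.

```java
import java.util.List;

// Hypothetical helper: guesses the day/month order from a batch of dates.
final class DayMonthOrder {
    enum Order { DAY_FIRST, MONTH_FIRST, AMBIGUOUS }

    static Order infer(List<String> dates) {
        boolean firstCanBeMonth = true;
        boolean secondCanBeMonth = true;
        for (String d : dates) {
            // Split on any run of non-digits, so "/", "." and "-" all work.
            String[] parts = d.split("[^0-9]+");
            int first = Integer.parseInt(parts[0]);
            int second = Integer.parseInt(parts[1]);
            // A value above 12 cannot be a month.
            if (first > 12) firstCanBeMonth = false;
            if (second > 12) secondCanBeMonth = false;
        }
        if (firstCanBeMonth && !secondCanBeMonth) return Order.MONTH_FIRST;
        if (!firstCanBeMonth && secondCanBeMonth) return Order.DAY_FIRST;
        // Either no value above 12 was seen, or the data is inconsistent.
        return Order.AMBIGUOUS;
    }
}
```

A batch that only covers the 1st to the 12th of a month still comes back as AMBIGUOUS, which is exactly the limitation described above.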
Is there any chance to avoid building in prior knowledge about the environment the log file came from, so that the parser can reliably parse a single date without having to refer to a second date string for comparison?
I've been stuck on this for almost 3 months now -.-
Any suggestions would be very welcome =)
Edit:
Okay guys, this thread can be closed. I have now come up with a different solution for my specific problem. For those who are interested:
I am writing a log reader in Java. As we have regular maintenance, I have to read many log files.
But it's not just the plain text information written in the file that matters.
Imagine a server has just crashed, it's Sunday night, and the next person to notice is the head of the customer's IT department. The following day I have to do maintenance and check the log files. Judging by the content, everything seems okay, nothing unusual. Half an hour after sending the maintenance report, I receive a mail from the above-mentioned head of IT, ranting that the server had crashed and that it seemed to have gone unnoticed.
The point is, you can't keep track of both the content and the timestamps in log files with several thousand lines. So I developed a component which reads a log file and calculates the time between two log entries. Each log line is parsed into a java.util.Date so that the date can later be used as a timestamp, giving high resolution for the log intervals. I then plot the differences on a line graph, which makes longer pauses between two log lines visible as a big spike relative to the rest of the file.
My solution now is to throw away the date half of the string completely and insert a dummy date with a predefined format. The date only has to change when the hours and minutes approach 23:59.
The original date is later shown on the graph, with the "fake data" lying beneath.
Thank you all for your suggestions and feedback =)
(And I hope my English has been understandable so far ;) )
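A minimal sketch of the time-only idea, for anyone curious. It uses java.time.LocalTime instead of a java.util.Date with a dummy date, which is a substitution on my part and not what the component described above does. It assumes the time part has already been cut out of each line as "HH:mm:ss" and that no real gap is 24 hours or longer; the class name is made up.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: gaps between log entries, ignoring the date part.
final class LogGaps {
    static List<Duration> gaps(List<String> timesInLogOrder) {
        List<Duration> result = new ArrayList<>();
        LocalTime prev = null;
        for (String t : timesInLogOrder) {
            LocalTime cur = LocalTime.parse(t); // expects ISO time, e.g. "23:59:50"
            if (prev != null) {
                Duration gap = Duration.between(prev, cur);
                // A negative gap means the clock wrapped past midnight,
                // so treat it as a rollover to the next day.
                if (gap.isNegative()) {
                    gap = gap.plusDays(1);
                }
                result.add(gap);
            }
            prev = cur;
        }
        return result;
    }
}
```

The resulting durations are what would go onto the line graph; an unusually large value is the spike.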
Comments (3)
I think the strategy you are going for (i.e. analysing a bigger set of data) is the best you can get.
From a single line of a log file you will never know whether 3/5/11 is the 3rd of May 2011 or the 5th of March 2011. (I guess there might also be locales that would interpret this as the 11th of May 2003...)
I had these problems myself some time ago, and I too could only try to introduce some sort of context by looking either at numbers > 12 or at what changes quickest (which must be the day). But you already stated that yourself...
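The "what changes quickest" heuristic could look roughly like the sketch below. The names are hypothetical, and it assumes the dates arrive in log order with two leading numeric fields followed by the year.

```java
import java.util.List;

// Hypothetical helper: finds which of the first two fields changes more often.
final class FastestChangingField {
    // Returns 0 if the first field changes more often, 1 if the second does,
    // and -1 if they tie (for example when the log covers less than a day).
    static int dayFieldIndex(List<String> datesInLogOrder) {
        int changesFirst = 0;
        int changesSecond = 0;
        String[] prev = null;
        for (String d : datesInLogOrder) {
            String[] cur = d.split("[^0-9]+");
            if (prev != null) {
                if (!cur[0].equals(prev[0])) changesFirst++;
                if (!cur[1].equals(prev[1])) changesSecond++;
            }
            prev = cur;
        }
        if (changesFirst == changesSecond) return -1;
        return changesFirst > changesSecond ? 0 : 1;
    }
}
```

The field that changes more often is taken to be the day. A log that happens to cross a single month boundary on one line changes both fields once and yields a tie, so this is a hint and not proof.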
My suggestion is to store all dates as 'ambiguous' until the ambiguity can be resolved. (This assumes that a particular customer will always supply data in the same format.) As soon as you get a log from a customer for which you can unambiguously identify the date format, you can retrospectively apply this format to previous files.
To do this, you would need a table mapping each customer to their date format, with some marker (e.g. NULL) to indicate that the format is not yet established. You will probably also need to create your own date representation so that you can model these ambiguous dates.
So, as an example, if the possible date formats are:
Given dates, you should always be able to identify the year (permitting two-digit years would make this problem considerably harder). So you should be able to map dates as follows:
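One way such an "ambiguous date" representation could look, purely as an illustration: this sketch assumes four-digit years in the last position and only two candidate orderings (day first or month first). Both are assumptions of the example, as are all the names in it.

```java
import java.time.LocalDate;

// Hypothetical representation of a date whose day/month order is not yet known.
final class AmbiguousDate {
    final int year;
    final int a; // first numeric field as it appeared in the log
    final int b; // second numeric field as it appeared in the log

    AmbiguousDate(int year, int a, int b) {
        this.year = year;
        this.a = a;
        this.b = b;
    }

    // Parses text such as "03/05/2011" without deciding the field order.
    static AmbiguousDate parse(String text) {
        String[] p = text.split("[^0-9]+");
        return new AmbiguousDate(Integer.parseInt(p[2]),
                Integer.parseInt(p[0]), Integer.parseInt(p[1]));
    }

    // Ambiguous only if both fields could be a month and they differ.
    boolean isAmbiguous() {
        return a <= 12 && b <= 12 && a != b;
    }

    // Applies the order once it is known for this customer.
    // LocalDate.of throws DateTimeException if that reading is not a valid date.
    LocalDate resolve(boolean dayFirst) {
        return dayFirst ? LocalDate.of(year, b, a) : LocalDate.of(year, a, b);
    }
}
```

Stored values stay unresolved until the customer's format is established; after that, resolve can be run over everything stored earlier.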
If possible, you can ask the customers to pass the date format string along with their actual date strings,
i.e. their log files would need one more column:
..... , '03/11/2011' , 'MM/DD/YYYY' , ...
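If the format column uses the spelling shown above, it cannot be fed to Java as-is: in both SimpleDateFormat and DateTimeFormatter patterns, "DD" means day of year and "YYYY" means week-based year. A small sketch of a translation step, assuming the declared format only ever contains the tokens MM, DD and YYYY (the class name is made up):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: parses a date using the format declared next to it.
final class DeclaredFormatParser {
    static LocalDate parse(String dateText, String declaredFormat) {
        // Map the customer-style tokens to Java pattern letters:
        // "DD" (day of month here) -> "dd", "YYYY" (calendar year) -> "yyyy".
        String javaPattern = declaredFormat.replace("DD", "dd").replace("YYYY", "yyyy");
        return LocalDate.parse(dateText, DateTimeFormatter.ofPattern(javaPattern));
    }
}
```

For example, DeclaredFormatParser.parse("03/11/2011", "MM/DD/YYYY") gives the 11th of March 2011.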