Parsing a non-standard date string from Stack Overflow into a .NET DateTime
I'm writing a screen-scraper for Stack Overflow. The bit I'm writing now takes the HTML and puts all the information into a model object. I've run into a bit of bother while parsing the information from an answer.
The problem is the date format that Stack Overflow uses to describe absolute times. DateTime.Parse doesn't work on them. I've tried fooling around with DateTime.ParseExact but I've had no success. Both throw a FormatException.
Here's some background:
If you look at the source HTML for an answer, you get this:
<div id="answer-{id}" class="answer">
<!-- ... -->
answered <span title="2009-06-18 13:21:16Z UTC" class="relativetime">Jun 18 at 13:21</span>
<!-- ... -->
</div>
Notice that the absolute time is stored in the span's title attribute. I've used the HTML Agility Pack from CodePlex to access the elements, and have extracted the value of the attribute.
Now I'm wondering how to get the "2009-06-18 13:21:16Z UTC" into a .NET DateTime object.
I'd like to be able to do this without Regexes, etc., but as the whole project is hackish and unstable, I don't really mind!
Finally, I can't use the data dump for these reasons:
- I can't use BitTorrent. Ever.
- If I could, the files are too big for my net connection.
- It's a bit out of date.
- It's not as fun!
Comments (3)
Having both "Z" and "UTC" in the same DateTime string seems redundant. If you remove "UTC" from the string, Parse works:
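A minimal sketch of the "remove UTC, then Parse" suggestion. The class and method names (TitleDateParser, ParseByStrippingUtc) are hypothetical, and the choice of InvariantCulture with DateTimeStyles.AdjustToUniversal is an assumption made so the result stays in UTC rather than being converted to local time:

```csharp
using System;
using System.Globalization;

public static class TitleDateParser
{
    // Hypothetical helper: drops a trailing " UTC" and lets DateTime.Parse handle the rest.
    public static DateTime ParseByStrippingUtc(string title)
    {
        const string suffix = " UTC";
        string trimmed = title.EndsWith(suffix, StringComparison.Ordinal)
            ? title.Substring(0, title.Length - suffix.Length)
            : title;

        // The remaining "Z" already marks the value as UTC;
        // AdjustToUniversal keeps it in UTC instead of converting to local time.
        return DateTime.Parse(
            trimmed,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AdjustToUniversal);
    }
}
```

Calling TitleDateParser.ParseByStrippingUtc("2009-06-18 13:21:16Z UTC") should give a DateTime for 18 June 2009, 13:21:16, with Kind set to Utc.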
Well, you'd never use regex for this, but I think that format is just "u" described here: http://msdn.microsoft.com/en-us/library/az4se3k1.aspx
So ParseExact should accept that (with some minor work).
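One way this could look, as a sketch only. It assumes the "minor work" means trimming the trailing " UTC" before handing the string to ParseExact with the "u" standard format; the names UniversalSortableParser and Parse are hypothetical:

```csharp
using System;
using System.Globalization;

public static class UniversalSortableParser
{
    // Hypothetical helper: trims " UTC" (assumed to be the "minor work"), then uses "u".
    public static DateTime Parse(string title)
    {
        const string suffix = " UTC";
        string trimmed = title.EndsWith(suffix, StringComparison.Ordinal)
            ? title.Substring(0, title.Length - suffix.Length)
            : title;

        // "u" is the universal sortable pattern: yyyy'-'MM'-'dd HH':'mm':'ss'Z'.
        // The Z is matched as a literal there, so the styles below mark the value as UTC.
        return DateTime.ParseExact(
            trimmed,
            "u",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
    }
}
```

Without the two DateTimeStyles flags, the parsed value would carry DateTimeKind.Unspecified, which may or may not matter for a scraper.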
I haven't found the magic to match the time zone part (Z UTC) here, but assuming they're all UTC, this should get you started:
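A sketch along those lines, not necessarily the code originally posted. It treats "Z UTC" as quoted literal text in a custom format string and assumes every timestamp is UTC; LiteralSuffixParser and its members are hypothetical names:

```csharp
using System;
using System.Globalization;

public static class LiteralSuffixParser
{
    // 'Z UTC' is quoted so it is matched as literal text, not as a time zone specifier.
    private const string Format = "yyyy-MM-dd HH:mm:ss'Z UTC'";

    // Hypothetical helper: assumes all scraped timestamps are UTC.
    public static DateTime Parse(string title)
    {
        return DateTime.ParseExact(
            title,
            Format,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
    }
}
```

This avoids any string manipulation, at the cost of throwing a FormatException if the site ever changes the suffix.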