Effective methods for dynamically getting an article's publish date / author?
I'm working on a referencing webapp as part of a course I'm studying. The aim is to let students quickly and easily reference the materials they find information in, and I'm running into a couple of issues.
The first is getting an article's or site's published date. With static HTML sites this is easy, as I can simply use document.lastModified to pull in the time the page was last modified. Issues arise with the much more common CMS-powered website: pages are generated dynamically, which causes document.lastModified to always return the equivalent of 'now', which isn't accurate at all.
Site developers can make this a bit easier by adopting HTML5, namely the <time> element, which can have additional attributes set to mark it as the time a post was published. Sites like these are fine, but the vast majority of sites aren't using HTML5 and I don't really see this changing any time soon. Does anyone have ideas on how to accurately identify when a post was created?
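For the sites that do mark up their dates this way, reading the value could look roughly like the sketch below. It uses only Python's standard-library HTMLParser. The names TimeElementParser and find_publish_date are made up for illustration, and the sketch assumes the date is carried in a datetime attribute, optionally flagged with pubdate.

```python
from html.parser import HTMLParser


class TimeElementParser(HTMLParser):
    """Collects the datetime attribute of every <time> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.published = []  # values from <time ... pubdate>
        self.other = []      # values from any other <time>

    def handle_starttag(self, tag, attrs):
        if tag != "time":
            return
        attr_map = dict(attrs)
        value = attr_map.get("datetime")
        if value is None:
            return
        # A valueless attribute such as `pubdate` is reported with value None,
        # so check for the key rather than the value.
        if "pubdate" in attr_map:
            self.published.append(value)
        else:
            self.other.append(value)


def find_publish_date(html):
    """Return the first pubdate-flagged value, else the first <time> value, else None."""
    parser = TimeElementParser()
    parser.feed(html)
    candidates = parser.published or parser.other
    return candidates[0] if candidates else None
```

Falling back to any <time> element is a guess on my part: an unflagged <time> may just as well belong to a comment or a sidebar item.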
The second is accurately identifying the author of a post or page. There are a couple of ways to do this. The first applies if a site uses the hAtom microformat to mark up its elements, which makes things easy, but as with post dates this isn't common.
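As a rough sketch of the hAtom case, the code below pulls the text of the first element with class "author" and the title attribute of the first element with class "published", which is how hAtom entries commonly carry that data. HAtomParser and extract_hatom are hypothetical names. The sketch does not check that the elements sit inside an "hentry" container, so it can be fooled by sites that reuse those class names for other purposes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HAtomParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.author_parts = []
        self.published = None
        self._depth = 0            # > 0 while inside the first class="author" element
        self._author_done = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "published" in classes and self.published is None:
            self.published = attr_map.get("title")
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif "author" in classes and not self._author_done:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self._author_done = True

    def handle_data(self, data):
        if self._depth:
            self.author_parts.append(data)


def extract_hatom(html):
    """Return (author, published); either may be None if the markup is absent."""
    parser = HAtomParser()
    parser.feed(html)
    author = " ".join("".join(parser.author_parts).split()) or None
    return author, parser.published
```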
The next is looking at the metadata of a site and identifying the author based on what is stored there. This is uncommon too, and the name given is generally the owner of the site or another person not responsible for the post, which makes it somewhat unreliable as a resource.
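Reading that metadata is the easy part. A minimal sketch, assuming the author is stored in a meta tag with name="author" (MetaAuthorParser and meta_author are illustrative names only):

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Remembers the content of the first <meta name="author"> tag."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.author is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "author":
            content = (attr_map.get("content") or "").strip()
            self.author = content or None


def meta_author(html):
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.author
```

Whatever this returns still has the reliability problem described above, so it is probably best treated as a fallback rather than a definitive answer.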
Comments (3)
If the website has an RSS feed and the article is recent enough to be included in it, you could extract metadata about the article from it.
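A sketch of what that could look like for an RSS 2.0 feed, using only the standard library. find_in_feed is a made-up name, and the sketch assumes the feed's item link matches the article URL apart from a trailing slash and that the author sits in either an author or a dc:creator element. Real feeds vary, and Atom feeds would need different element names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace prefix for Dublin Core elements such as dc:creator.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def find_in_feed(feed_xml, article_url):
    """Return {'author': ..., 'published': ...} for the matching item, or None."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link.rstrip("/") != article_url.rstrip("/"):
            continue
        author = item.findtext("author") or item.findtext(DC_NS + "creator")
        pub_date = item.findtext("pubDate")
        published = None
        if pub_date:
            try:
                # RSS 2.0 dates use the RFC 822 format, which this helper parses.
                published = parsedate_to_datetime(pub_date.strip())
            except (TypeError, ValueError):
                published = None  # malformed date; leave it unknown
        return {"author": author, "published": published}
    return None
```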
Sounds like a pretty tough thing to make, because there is absolutely no standardization for this information that I know of. Some sites might put it in their keywords, others not.
I did some scraping as part of a media criticism class, and I found that pretty much each CMS has to be processed individually. Overall, making something that finds the author info on random web pages sounds very difficult.
You might be able to make something specifically for capturing this info from WordPress blogs, since those have so many commonalities. But something designed to just hit up any site and grab specific pieces of info, that's pretty tough.
Not trying to discourage you at all, just saying that you've set a pretty high goal, imho.
Sorry I can't help very much, but what about using a regex to scan the page for 'By ___' or 'Source: ___' to get the author / source of the information?
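Something along these lines, as a very rough sketch. The pattern and the name guess_byline are just my illustration. It assumes you run it over the page's visible text rather than raw HTML, and it only accepts up to four capitalized words after the marker.

```python
import re

# "By" or "Source:" followed by one to four capitalized words.
BYLINE_RE = re.compile(
    r"\b(?:By|Source:)\s+([A-Z][\w'-]*(?:\s+[A-Z][\w'-]*){0,3})"
)


def guess_byline(text):
    """Return the first name-like phrase after 'By' or 'Source:', or None."""
    match = BYLINE_RE.search(text)
    return match.group(1) if match else None


print(guess_byline("Posted in News. By Jane Doe, 1 March 2010"))
print(guess_byline("Source: Reuters"))
```

Expect false positives (a sentence starting "By Monday" would match) and misses (initials such as "J. K." break the pattern), so the result is better treated as a suggestion for the user to confirm.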
As for the date last modified, as far as I know there's no easy way to grab this, as regex'ing for a date would also return recent articles in sidebars, links, etc. And yeah, as you said, document.lastModified wouldn't work. You could consider replacing this with a "date added" field in your referencer, or similar.
Hope this helps you at least a little bit, and if not, gives you an idea or two.
Of course, if there's any API / RSS available, you could scan it for the last updated / posted date, and use that?