Template removal/detection/diff utility for HTML and other text
I remember reading a while back, on some random website, about a program that looks at multiple pages on an HTML site and compares them to work out which parts are template "boilerplate" and which parts are new content, and then automatically spits out just the content parts.
Unfortunately, I don't remember enough details about this utility to actually find it on Google, so I wonder if any of you have run across anything like this and can remember its name.
Thanks.
Murphy's Law (or is it some other law?) has struck: I found it just moments after I'd given up and posted this question. The project I was thinking of is this:
http://code.google.com/p/boilerpipe/
Thanks.
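
For anyone landing here who just wants to try the idea from the question, here is a minimal sketch of the cross-page comparison approach. It is only an illustration and makes no claim about how boilerpipe works internally. The function name `strip_shared_lines` and the `min_share` threshold are made up for this example, and it assumes the template and the content sit on separate lines, which real HTML often does not guarantee.

```python
from collections import Counter


def strip_shared_lines(pages, min_share=0.8):
    """Return, for each page, the lines that are NOT shared across pages.

    pages:     list of page texts (strings), ideally from the same site.
    min_share: hypothetical threshold; a line seen on at least this
               fraction of the pages is treated as template boilerplate.
    """
    # Normalise each page into a list of non-empty, stripped lines.
    split_pages = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]

    # Count on how many pages each distinct line appears
    # (set() so a line repeated within one page counts once).
    pages_containing = Counter()
    for lines in split_pages:
        pages_containing.update(set(lines))

    cutoff = min_share * len(pages)

    # Keep only lines that appear on fewer pages than the cutoff.
    return [
        [line for line in lines if pages_containing[line] < cutoff]
        for lines in split_pages
    ]


if __name__ == "__main__":
    site = [
        "<nav>Home | About</nav>\n<p>First article text</p>\n<footer>Example footer</footer>",
        "<nav>Home | About</nav>\n<p>Second article text</p>\n<footer>Example footer</footer>",
        "<nav>Home | About</nav>\n<p>Third article text</p>\n<footer>Example footer</footer>",
    ]
    for content in strip_shared_lines(site):
        print(content)
```

Comparing whole lines keeps the sketch short; it needs several pages from the same site to be useful, and it will miss template parts that differ slightly from page to page (for example a highlighted menu entry).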