How to keep Nutch from crawling CGI-generated calendar pages
I am using Nutch to crawl a large website.
The pages are generated by a CGI program, and most of their URLs contain query strings such as ?id=2323&title=foo.
I want to crawl these pages because they contain a lot of useful information.
However, the site also has a calendar, so date-based pages are generated as well. That means Nutch will try to crawl pointless pages such as year=2030&month=12.
That is a waste of crawling effort.
How can I avoid this kind of trap in Nutch? Do I have to write many regular expressions?
Comments (1)
Add regex patterns to conf/regex-urlfilter.txt to specify rules for accepting or rejecting URLs.
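As a rough sketch of what such rules could look like for the calendar case in the question: each line in the file is a `+` (accept) or `-` (reject) followed by a Java regular expression, rules are checked from top to bottom, and the first one that matches decides. The parameter names `year` and `month` are taken from the example URL in the question; the real site may use different ones, so treat the pattern as an assumption to adapt. Note also that the stock file usually ships with a `-[?*!@=]` rule that rejects every URL containing a query string, which would have to be commented out for pages like ?id=2323&title=foo to be crawled at all.

```
# Stock rule that skips all query-style URLs; disabled here so that
# pages such as ?id=2323&title=foo are still fetched.
# -[?*!@=]

# Reject calendar pages (assumed parameter names: year, month).
-[?&](year|month)=\d+

# Accept anything else.
+.
```

A single pattern like this covers every year and month combination, so there is no need for one rule per date.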