Best practices for a URL retrieval service? How to avoid becoming an attack vector?
I'm tinkering with a web tool that, given a URL, will retrieve the text and give the user some statistics on the content.
I'm worried that giving users a way to initiate a GET request from my box to any arbitrary URL on the net may serve as a vector for attacks (e.g. to http://undefended.box/broken-sw/admin?do_something_bad).
Are there ways to minimize this risk? Any best practices when offering public URL retrieval capacity?
Some ideas I've thought about:
- honoring robots.txt
- accepting or rejecting only certain URL patterns
- checking blacklist/whitelist of appropriate sites (if such a thing exists)
- working through a well known 3rd party's public web proxy, on the assumption that they've already built in these safeguards
Thanks for your help.
Edit: It'll be evaluating only HTML or text content, without downloading or evaluating linked scripts, images, etc. If HTML, I'll be using an HTML parser.
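For illustration, the "accepting or rejecting only certain URL patterns" idea could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a complete defence: the names `check_url`, `resolve_host` and `ALLOWED_SCHEMES` are hypothetical, and the policy (http/https only, no embedded credentials, reject any host that resolves to a loopback, private or otherwise non-public address) is one possible choice.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumption: only plain web URLs are of interest to the tool.
ALLOWED_SCHEMES = {"http", "https"}


def is_public_address(ip_text):
    """Return True only for addresses that look publicly routable."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # unparsable address: treat as unsafe
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(host, port):
    """Resolve a host name to the list of IP addresses it maps to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_url(url, resolver=resolve_host):
    """Return (allowed, reason) for a user-supplied URL."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False, "malformed URL"
    if parts.scheme not in ALLOWED_SCHEMES:
        return False, "scheme not allowed"
    if parts.username or parts.password:
        return False, "credentials in URL"
    if not parts.hostname:
        return False, "missing host"
    if port is None:
        port = 443 if parts.scheme == "https" else 80
    try:
        addresses = resolver(parts.hostname, port)
    except OSError:
        return False, "host does not resolve"
    if not addresses:
        return False, "host does not resolve"
    if not all(is_public_address(a) for a in addresses):
        return False, "non-public address"
    return True, "ok"
```

A check like this only looks at the URL as submitted. Redirects, and a host name that resolves differently between the check and the actual fetch, would need separate handling.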
Comments (1)
Are the statistics going to be only about the text in the document? Are you going to evaluate it using an HTML parser?
If it's only the text that you're going to analyze, that is, without downloading further links, evaluating scripts, etc. then the risk is less severe.
It probably wouldn't hurt to pass each file you download through an antivirus program. You should also restrict the GETs to certain content types (i.e. don't download binaries; make sure the response is some sort of text encoding).
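As a rough sketch of the content-type restriction, assuming Python's standard `urllib.request` is used for the fetch: the names `fetch_text`, `ALLOWED_TYPES` and `MAX_BYTES`, the User-Agent string, and the one-megabyte cap are all hypothetical choices for illustration, not something the tool is known to use.

```python
import urllib.request

# Assumption: the tool only needs HTML and plain text.
ALLOWED_TYPES = {"text/html", "text/plain"}
# Assumption: an arbitrary cap so a huge response cannot exhaust memory.
MAX_BYTES = 1_000_000


def is_allowed_content_type(header_value):
    """Check a Content-Type header value against the allowed media types."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as text, or raise ValueError."""
    # urlopen also understands schemes such as file://, so refuse those here.
    if not url.lower().startswith(("http://", "https://")):
        raise ValueError("only http and https URLs are supported")
    request = urllib.request.Request(
        url, headers={"User-Agent": "url-stats-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if not is_allowed_content_type(response.headers.get("Content-Type")):
            raise ValueError("unsupported content type")
        # Read one byte past the cap so oversized bodies can be detected.
        body = response.read(MAX_BYTES + 1)
        charset = response.headers.get_content_charset() or "utf-8"
    if len(body) > MAX_BYTES:
        raise ValueError("response too large")
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label sent by the server: fall back to UTF-8.
        return body.decode("utf-8", errors="replace")
```

The Content-Type header is whatever the remote server claims, so this filters out the obvious binary downloads rather than proving the body is text.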