Is there a way to detect mistyped URLs in Python?

Posted 2024-11-16 16:53:25

My python program involves going to a user-supplied url and then doing stuff on the page. Ideally, mistyped urls would be recognized and pop up an error. But if they have the right syntax and just don't point anywhere, then either an ISP error page or an ad site is loaded instead.

For example:

"http://washingtonn.edu" --> http://search5.comcast.com/?cat=dnsr&con=dsqcy&url=washingtonn.edu

"http://www.amazdon.com/" --> http://www.amazdon.com/

Is there any way to detect these without knowing all the possible pages? The second one might be pretty hard because it's an actual site, but I'd be happy with catching the first.

Thanks!

Comments (4)

柠檬心 2024-11-23 16:53:25

Unless I am misunderstanding your question, what you ask for is impossible, doesn't make sense, or is far far from trivial.

If you think about it, other than a 404 error, where you detect that a page does not exist, if a page does exist there is no way of knowing whether the page is "good" or "bad", as this is subjective. It might be possible to apply some general rules, but you can't cover all the possibilities.

The only way would be something like what Google does with its suggestions, but that would imply a huge database ranking websites by popularity and a proximity test on every lookup, which is far from trivial and probably not necessary.

For handling 404 statuses in Python you could use something like httplib.
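
A minimal sketch of that 404 check, assuming Python 3 (where httplib has been renamed http.client); the hostname and path below are only placeholders:

```python
import http.client

def status_for(host, path="/"):
    """Return the HTTP status code for a plain GET request to host/path."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()

# A 404 only tells you the path is missing on a host that does answer;
# a hijacked DNS lookup like the Comcast example may still return 200.
print(status_for("example.com", "/no-such-page"))
```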

Good luck!

土豪我们做朋友吧 2024-11-23 16:53:25

You can check the HTTP status code of your requests. Probably the most interesting for you is the 404 - Not Found status. In the second case, you are right: if the response is a web page, you can't know whether it is what the user wanted or a typo.
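
As a sketch with the standard library (urllib here just stands in for whichever HTTP client the program already uses), a 404 surfaces as an HTTPError whose code you can inspect, while a hostname that does not resolve at all raises URLError:

```python
import urllib.error
import urllib.request

def check(url):
    """Return the HTTP status code, or None if the request failed entirely."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status               # 200 if the page loaded
    except urllib.error.HTTPError as e:
        return e.code                        # e.g. 404 - Not Found
    except urllib.error.URLError:
        return None                          # DNS failure, refused connection, ...

# The typo-squatted domain still loads, so the status code alone can't flag it.
print(check("http://www.amazdon.com/"))
```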

南城追梦 2024-11-23 16:53:25

What you're talking about is heuristics, and it's actually a very complex topic. You could have a list of common websites and common misspellings: if something cannot resolve (i.e., a 404 HTTP response), check the input against the list and pick the "closest" answer (which is a whole algorithm in and of itself). It wouldn't be too reliable, though, because a misspelled website may indeed resolve correctly (although to an unintended domain).
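
One way to sketch the "pick the closest answer" step is difflib from the standard library; the whitelist of hostnames and the 0.8 cutoff here are invented for illustration, not anything this answer prescribes:

```python
import difflib

# Hypothetical list of sites you expect users to mean.
COMMON_HOSTS = ["washington.edu", "amazon.com", "wikipedia.org", "comcast.com"]

def suggest(host, cutoff=0.8):
    """Return the closest known hostname, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(host, COMMON_HOSTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest("washingtonn.edu"))  # -> 'washington.edu'
print(suggest("amazdon.com"))      # -> 'amazon.com'
```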

A really simple solution, if you're very concerned about misspelled URLs, is to just ask for the URL twice.

不一样的天空 2024-11-23 16:53:25

You could use a regex to check for a valid URL, and also use httplib to check the response code, requiring a 200 to continue.

HTTPConnection.getresponse() gives you a response object whose status will be 200 if the URL is valid.
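
A rough sketch of that two-step check, assuming Python 3 (http.client instead of the old httplib) and a deliberately loose regex that only demands an http(s) scheme and a dotted hostname:

```python
import http.client
import re
from urllib.parse import urlparse

# Loose pattern: http(s) scheme, a dotted hostname, an optional path.
URL_RE = re.compile(r"^https?://[\w.-]+\.[a-zA-Z]{2,}(/\S*)?$")

def looks_valid(url):
    """True if the URL matches the pattern and the server answers with a 200."""
    if not URL_RE.match(url):
        return False
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status == 200
    finally:
        conn.close()

# A hijacked DNS answer may still return 200, and a legitimate site that
# redirects to https will not, so treat the result as a hint rather than proof.
print(looks_valid("http://www.amazdon.com/"))
```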
