Identifying "invalid" URLs without attempting to resolve them
I'm building a Facebook app that grabs URLs from various sources in a user's Facebook account, e.g., a user's likes.
A problem I've encountered is that many Facebook entries have strings that are not URLs in their "website" and "link" fields. Facebook does no checking on user input, so these fields can contain essentially any string.
I want to process the strings in these fields so that URLs like "http://google.com", "https://www.bankofamerica.com", "http://www.nytimes.com/2011/06/13/us/13fbi.html?_r=1&hp", "bit.ly", and "www.pbs.org" are all accepted,
and strings like "here is a random string of text the user entered" and "here'\s ano!!! #%#$^ther weird random string" are all rejected.
It seems to me the only way to be "sure" of a URL is to attempt to resolve it, but I believe that would be prohibitively resource intensive.
Can anyone think of a clever way to regex or otherwise analyze these strings so that "a lot" of the URLs are properly captured: 80%? 95%? 99.995%?
Thanks!
EDIT: FYI, I'm developing in Python, but a language-agnostic solution would be great as well.
Comments (2)
There are numerous tools for validating URLs, depending on your development language. Assuming you are developing in JavaScript, a quick Google search unearths many approaches, depending on how robust the check needs to be.
See http://www.w3.org/Addressing/URL/url-spec.txt for the authoritative specification.
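Since the question mentions Python, here is a minimal sketch of a structural check built on the standard library's `urllib.parse.urlsplit`. The helper name `split_candidate`, the default `http://` scheme, and the rules "no whitespace" and "host must contain a dot" are assumptions made for illustration; they are not taken from the specification and will misjudge some inputs.

```python
from urllib.parse import urlsplit


def split_candidate(text):
    """Return (scheme, host) for a URL-like string, or None if it does not look like one.

    Hypothetical helper: the rules below are illustrative heuristics only.
    """
    candidate = text.strip()
    # Assumption: a bare URL contains no whitespace.
    if not candidate or any(ch.isspace() for ch in candidate):
        return None
    # urlsplit only fills in the host when "//" is present, so add a default scheme.
    if "://" not in candidate:
        candidate = "http://" + candidate
    try:
        parts = urlsplit(candidate)
        host = parts.hostname
    except ValueError:
        return None
    # Assumption: only http(s) URLs with a dotted host name are of interest.
    if parts.scheme not in ("http", "https") or not host or "." not in host:
        return None
    return parts.scheme, host


if __name__ == "__main__":
    for sample in ("http://google.com", "bit.ly", "here is a random string"):
        print(repr(sample), "->", split_candidate(sample))
```

This only checks shape, not existence: a well-formed but nonexistent host still passes.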
I'd first match for
"^(?:https?://)?([A-Za-z0-9-\.]+)/"
and then do a (cached) DNS lookup for that hostname if you want to make sure the hostname isn't misspelled. The 95% technique uses a whitelist of top-level domains (or a regular expression for them), which you'd have to maintain as new ones (.info, .eu, .biz, .aero) become available. There are also certain characters that are not allowed (unescaped) in URLs. However, some people do enter URLs like
"http://example.com/I don't wanna go!!!"
and their browser then escapes it to the valid "...I%20don%27t%20wanna%20go%21%21%21".
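The steps in this answer could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the pattern is adapted from the one above with the trailing "/" made optional so that bare hosts such as "bit.ly" also match, `KNOWN_TLDS` is a tiny sample whitelist, and the function names are hypothetical.

```python
import re
import socket
from functools import lru_cache
from urllib.parse import quote

# Adapted from the answer's pattern. Assumption: the host may also be followed
# by "?", "#", or the end of the string, not only by "/".
HOST_RE = re.compile(r"^(?:https?://)?([A-Za-z0-9.-]+)(?:[/?#]|$)")

# Illustrative sample only; a real whitelist must be maintained as TLDs are added.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "ly", "info", "eu", "biz", "aero"}


def extract_host(text):
    """Return the lower-cased host part of a URL-like string, or None."""
    match = HOST_RE.match(text.strip())
    return match.group(1).lower() if match else None


def looks_like_url(text):
    """Cheap check: host pattern matches and the last label is a known TLD."""
    host = extract_host(text)
    if host is None or "." not in host:
        return False
    return host.rsplit(".", 1)[-1] in KNOWN_TLDS


@lru_cache(maxsize=4096)
def host_resolves(host):
    """Optional, slower check: cached DNS lookup of the host name."""
    try:
        socket.gethostbyname(host)
    except (OSError, UnicodeError):
        return False
    return True


def escape_path(path):
    """Percent-escape a path the way a browser would before requesting it."""
    return quote(path, safe="/")


if __name__ == "__main__":
    for sample in ("www.pbs.org", "here is a random string of text the user entered"):
        print(repr(sample), "->", looks_like_url(sample))
    print(escape_path("/I don't wanna go!!!"))
```

The whitelist check needs no network access, so it can run on every string. `host_resolves` can then be reserved for strings that pass it, with `lru_cache` keeping repeated lookups of the same host within one process from hitting DNS again.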