How can I tell whether a text string is a Facebook URL, an e-mail address, or some other URI?

Posted on 2024-11-01 08:02:46

I'm creating a system to sign up for different events. For each event it stores an address which can be one of the following:

  1. Facebook resource (basically URL starting with "facebook.com")
  2. E-mail address (any valid e-mail)
  3. Another URL
  4. (bogus/trash/etc.)

The 4th is not important.

I need to do different things depending on the type of address (call the FB API / send an e-mail / POST a form). I was thinking about just storing the type, but first I want to ask whether there is a regexp or something similar that can tell which type it is.

The first one is easy: just check whether it starts with "http://www.facebook.com". For the others I thought about looking for tokens like "http://" or "@", but then I realized that an e-mail address and a URL can each contain both of those.

Comments (1)

薄凉少年不暖心 2024-11-08 08:02:46

First, @zespri is correct in his comment - it's a much better design to store the actual type. Even if you use the regular expressions I suggest below, things could still break in the future.
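As a rough illustration of the "store the type" design, here is a minimal Python sketch. The names `AddressType`, `EventAddress`, `dispatch` and the action names are hypothetical and not part of the question; the point is only that the type is chosen once, when the address is saved, and never guessed later.

```python
from dataclasses import dataclass
from enum import Enum


class AddressType(Enum):
    """Hypothetical set of address kinds taken from the question."""
    FACEBOOK = "facebook"
    EMAIL = "email"
    URL = "url"


@dataclass
class EventAddress:
    value: str          # the raw address exactly as entered
    type: AddressType   # stored explicitly when the event is created


def dispatch(address: EventAddress) -> str:
    """Return the name of the (hypothetical) action to run for this address."""
    actions = {
        AddressType.FACEBOOK: "call_fb_api",
        AddressType.EMAIL: "send_email",
        AddressType.URL: "post_form",
    }
    return actions[address.type]


print(dispatch(EventAddress("someone@example.com", AddressType.EMAIL)))
```

With this layout no pattern matching is needed at dispatch time, so a future change in Facebook's URL formats cannot silently reroute an address.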

But yes, it's possible to use regex in this case:

The following regex is a typical email detector. It's much safer than just checking for an '@' sign:

([a-zA-Z]+[a-zA-Z0-9._+\-]{3,}(?:@|%40)[a-zA-Z0-9]+[a-zA-Z0-9\.\-]?(?:\.[a-zA-Z]+)+)
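Here is a small Python sketch of how that pattern might be applied. The helper name `looks_like_email` is made up for illustration, and using `fullmatch` (so the whole string must be an address, not just contain one) is an assumption about what the question needs.

```python
import re

# The email pattern from above, compiled unchanged.
EMAIL_RE = re.compile(
    r"([a-zA-Z]+[a-zA-Z0-9._+\-]{3,}(?:@|%40)"
    r"[a-zA-Z0-9]+[a-zA-Z0-9\.\-]?(?:\.[a-zA-Z]+)+)"
)


def looks_like_email(text: str) -> bool:
    """True if the whole (stripped) string matches the email pattern."""
    return EMAIL_RE.fullmatch(text.strip()) is not None


for candidate in ("john.doe@example.com", "john%40example.com",
                  "http://example.com/page", "bob@example.com"):
    print(candidate, looks_like_email(candidate))
```

Note that the pattern is stricter than the e-mail specification: as written it requires at least four characters before the '@', so a short but valid address such as "bob@example.com" is rejected. It also appears to reject hyphenated domains. Loosen it if that matters for your data.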

The following three find Facebook profiles and pages.
You can drop the suffix to match just the Facebook domain(s), or do some further research and editing to restrict the match to other kinds of Facebook resources:

facebook\.(?:com?\.|net\.)?[a-z]{2,3}/.+\?id=(\d+)
facebook\.(?:com?\.|net\.)?[a-z]{2,3}/p\.php.+i=(\d+)
facebook\.(?:com?\.|net\.)?[a-z]{2,3}/(\w[\w\.\-]+\w)(?:$|[/\?#])
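To show how the three patterns could be tried in order, here is a hedged Python sketch. The function name `facebook_id` is hypothetical, and returning the first captured group (a numeric id or a page/profile name) is just one way to use the matches.

```python
import re
from typing import Optional

# The three Facebook patterns from above, compiled unchanged.
FACEBOOK_PATTERNS = [
    re.compile(r"facebook\.(?:com?\.|net\.)?[a-z]{2,3}/.+\?id=(\d+)"),
    re.compile(r"facebook\.(?:com?\.|net\.)?[a-z]{2,3}/p\.php.+i=(\d+)"),
    re.compile(r"facebook\.(?:com?\.|net\.)?[a-z]{2,3}/(\w[\w\.\-]+\w)(?:$|[/\?#])"),
]


def facebook_id(text: str) -> Optional[str]:
    """Return the id or page name captured by the first matching pattern."""
    for pattern in FACEBOOK_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


print(facebook_id("http://www.facebook.com/profile.php?id=12345"))
print(facebook_id("https://m.facebook.co.uk/some.page/"))
print(facebook_id("https://example.com/about"))
```

Because the patterns are not anchored to the start of the string, `search` will also find a Facebook-looking fragment inside a longer string. Add a leading anchor if the input should be nothing but the address.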

Avoid relying on the 'http://www.' prefix: you never know which subdomain may be used, and the prefix is often omitted.
Also note that Facebook uses more TLDs than just .com.

For 'other' URLs, you could just look for the scheme anchored at the start of the string:

^https?://

It's unclear from your question whether users enter these into your system, or whether it's done in an uncontrolled manner. Note that people often omit the http prefix, so this isn't really a reliable way to detect URLs.
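Putting the pieces together, a classifier might look like the sketch below. This is only one possible arrangement: the name `classify`, the returned labels, the order of the checks, and the domain-only Facebook pattern (the Facebook regex with its suffix dropped and a start anchor plus optional scheme and subdomains added) are all assumptions for illustration.

```python
import re

# Email pattern from above; fullmatch means a URL that merely
# contains an '@' is not mistaken for an email address.
EMAIL_RE = re.compile(
    r"([a-zA-Z]+[a-zA-Z0-9._+\-]{3,}(?:@|%40)"
    r"[a-zA-Z0-9]+[a-zA-Z0-9\.\-]?(?:\.[a-zA-Z]+)+)"
)

# Assumption: domain-only Facebook check, anchored at the start,
# with optional scheme and any number of subdomains.
FACEBOOK_RE = re.compile(
    r"^(?:https?://)?(?:[\w\-]+\.)*facebook\.(?:com?\.|net\.)?[a-z]{2,3}(?:[/?#]|$)",
    re.IGNORECASE,
)

# Any other URL: require an explicit http(s) scheme.
URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def classify(address: str) -> str:
    """Return 'email', 'facebook', 'url' or 'unknown' for a raw address."""
    text = address.strip()
    if EMAIL_RE.fullmatch(text):
        return "email"
    if FACEBOOK_RE.match(text):
        return "facebook"
    if URL_RE.match(text):
        return "url"
    return "unknown"


for sample in ("john.doe@example.com",
               "facebook.com/some.page",
               "https://example.com/signup?contact=john.doe@example.com",
               "example.com/page"):
    print(sample, "->", classify(sample))
```

The last sample comes back as 'unknown' because the scheme is missing, which is exactly the weakness described above: a bare "example.com/page" cannot be told apart from arbitrary text by the scheme check alone.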

If you're looking for URLs as links within HTML pages, they can be detected more reliably by searching for anchor tags:

<a\s+(?:.*?)href=['"]?(https?://[^'"\s]+)(?:.*?)>
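For completeness, here is a short Python sketch of pulling absolute links out of an HTML fragment with an anchor-tag pattern of this shape. The function name `extract_links` and the sample markup are invented for the example; the character class in the capture group excludes quotes and whitespace.

```python
import re
from typing import List

# Anchor-tag pattern; the capture group stops at a quote or whitespace.
LINK_RE = re.compile(
    r"""<a\s+(?:.*?)href=['"]?(https?://[^'"\s]+)(?:.*?)>""",
    re.IGNORECASE,
)


def extract_links(html: str) -> List[str]:
    """Return every absolute http(s) URL found in an <a href=...> tag."""
    return LINK_RE.findall(html)


sample = (
    '<p><a class="x" href="https://example.com/a">A</a> and '
    "<a href='http://example.org/b?c=1'>B</a> "
    '<a href="/relative">C</a></p>'
)
print(extract_links(sample))
```

Relative links such as "/relative" are skipped because the pattern insists on an http(s) scheme. A regex like this is a quick heuristic; for messy real-world markup, an actual HTML parser (for example the standard library's `html.parser`) is usually the safer choice.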