When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
<a href="http://www.google.com">http://www.google.com</a>
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didn't, the end of the last sentence I typed and the beginning of this sentence would be a link to http://search.if. Even if you did, .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http://, then write a simple regex to capture them and add the links.
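To make the false-positive problem concrete, here is a minimal sketch comparing a naive bare-domain pattern with one that requires a www. or http:// prefix. Both patterns and the sample sentence are assumptions made up for illustration; they are not taken from any answer in this thread.

```php
<?php
// Sample text with a missing space after a full stop ("search.if").
$text = 'It would require using them in the search.if you did not, '
      . 'see www.example.com or http://example.org/page for more.';

// Assumed naive pattern: dot-separated labels ending in 2+ letters.
$bare = '~\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}\b~i';

// Assumed stricter pattern: only text starting with http(s):// or www.
$prefixed = '~\b(?:https?://|www\.)[^\s<>"]+~i';

preg_match_all($bare, $text, $bareMatches);
preg_match_all($prefixed, $text, $prefixedMatches);

// The naive pattern also picks up "search.if", which is not a link.
print_r($bareMatches[0]);
// The prefixed pattern only finds the two intended links.
print_r($prefixedMatches[0]);
```

The stricter pattern misses google.com-style input by design, which is the trade-off this answer recommends accepting.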
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for a leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
Well, I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into the HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems, and you'd probably need an HTML parser to determine which bits are markup, to avoid adding more markup inside them. Similarly, if you're doing more processing after this, inserting more tags, those steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
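As a rough sketch of the escape-then-replace order described above, assuming plain-text input: the function name linkifyPlainText() and the pattern are hypothetical (nico's actual pattern is not reproduced here). The lookahead stops the match at the entities that htmlspecialchars() produces for <, >, " and ', so escaped markup next to a URL does not end up inside the link.

```php
<?php
// Hypothetical helper: escape plain text, then wrap http(s) URLs in <a> tags.
function linkifyPlainText(string $text): string
{
    // 1. Escape first, so no <, & or quote from the input reaches the output raw.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // 2. Simple replacement on the escaped text. The negative lookahead ends
    //    the URL before &lt; &gt; &quot; or &#039;, while still allowing &amp;.
    $pattern = '~\bhttps?://(?:(?!&(?:quot|lt|gt|#039);)\S)+~i';

    return preg_replace($pattern, '<a href="$0">$0</a>', $escaped);
}

echo linkifyPlainText('Visit http://example.com <script>'), "\n";
```

Because the text is escaped before matching, the href and the link text are already safe to emit as they are.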
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc. after a link; these aren't supposed to be part of the link, but they are actually valid URL characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but it can be done in a replacement callback function.
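One way the callback idea could look, as a sketch under assumptions: the function name linkifyTrimmed(), the set of punctuation characters, and the "unbalanced closing bracket" rule are all illustrative choices, not something specified in the answer.

```php
<?php
// Hypothetical helper: linkify http(s) URLs, keeping trailing punctuation
// outside the link unless a ")" is balanced by a "(" inside the URL.
function linkifyTrimmed(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    $pattern = '~\bhttps?://(?:(?!&(?:quot|lt|gt|#039);)\S)+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        $trailing = '';
        // Peel characters off the end while they look like sentence punctuation.
        while ($url !== '') {
            $last = substr($url, -1);
            $isSentencePunct = strpos('.,!?:', $last) !== false;
            $isUnbalancedParen = $last === ')'
                && substr_count($url, ')') > substr_count($url, '(');
            if (!$isSentencePunct && !$isUnbalancedParen) {
                break;
            }
            $trailing = $last . $trailing;
            $url = substr($url, 0, -1);
        }
        return '<a href="' . $url . '">' . $url . '</a>' . $trailing;
    }, $escaped);
}

echo linkifyTrimmed('(see http://en.wikipedia.org/wiki/PHP_(disambiguation)).'), "\n";
```

The semicolon is deliberately left out of the stripped set here, since removing it could cut an escaped entity such as &amp; in half.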
HTML Purifier has a built-in linkify function to save you all the headaches.
Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not-so-fancy regexps should work.
Note that the last two would be impossible to do correctly, as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
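The shortening step could be sketched roughly as follows. This is not the answerer's snippet: the helper names shortenUrl() and linkHtml() and the 30-character default are assumptions for illustration. Lengths are counted in bytes, which is adequate for ASCII URLs.

```php
<?php
// Hypothetical helper: cut long display text and append "...".
function shortenUrl(string $url, int $max = 30): string
{
    if (strlen($url) <= $max) {
        return $url;
    }
    // Reserve three characters for the ellipsis.
    return substr($url, 0, $max - 3) . '...';
}

// Hypothetical helper: prepend http:// when missing, then build the <a> tag
// with the full URL as href and the shortened URL as link text.
function linkHtml(string $url, int $max = 30): string
{
    $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;
    $label = shortenUrl($href, $max);

    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo linkHtml('www.google.com'), "\n";
echo shortenUrl('http://www.google.com/reallyLongLink', 35), "\n";
```

Only the display text is shortened; the href keeps the full URL so the link still works.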
From http://www.exorithm.com/algorithm/view/markup_urls
I got this working exactly the way I want here: