How can I extract URLs from plain text with Perl?

Posted 2024-08-27 07:01:03, 394 characters, 11 views, 0 comments

I need a Perl regex that parses plain text input and converts all links to valid HTML HREF links. I've tried 10 different versions I found on the web, but none of them seem to work correctly. I also tested other solutions posted on StackOverflow, none of which seem to work. The correct solution should be able to find any URL in the plain text input and convert it to:

<a href="$1">$1</a>

Cases that the other regular expressions I tried did not handle correctly include:

  1. URLs at the end of a line which are followed by returns
  2. URLs that included question marks
  3. URLs that start with 'https'

I'm hoping that another Perl guy out there will already have a regular expression they are using for this that they can share. Thanks in advance for your help!

Comments (4)

長街聽風 2024-09-03 07:01:03

You want URI::Find. Once you extract the links, you should be able to handle the rest of the problem just fine.

This is answered in perlfaq9 under "How do I extract URLs?", by the way. There is a lot of good stuff in the perlfaq documents. :)
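As a rough illustration of the URI::Find approach, here is a minimal sketch that wraps each URL it finds in an anchor tag. The helper names `escape_html` and `linkify_text` are made up for this example, and the sample text is an assumption; only `URI::Find->new` with a callback and `find` with an optional escape function come from the module itself.

```perl
use strict;
use warnings;
use URI::Find;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: return a copy of $text with every URI wrapped in <a>.
sub linkify_text {
    my ($text) = @_;
    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;    # URI object, and the text as it appeared
        return sprintf '<a href="%s">%s</a>',
            escape_html("$uri"), escape_html($orig);
    });
    # The second argument escapes everything that is not part of a URI.
    $finder->find(\$text, \&escape_html);
    return $text;
}

print linkify_text("Tom & Jerry at https://www.example.com/?a=1&b=2 today"), "\n";
```

Because the module decides where a URL starts and ends, this should cover the `https`, query-string, and end-of-line cases from the question without a hand-written regex.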

孤独岁月 2024-09-03 07:01:03

Besides URI::Find, also check out the big regular expression database Regexp::Common. Its Regexp::Common::URI module gives you something as easy as:

my ($uri) = $str =~ /$RE{URI}{-keep}/;

If you want the individual pieces of that URI (hostname, query parameters, etc.), see the documentation of Regexp::Common::URI::http for what the $RE{URI} regular expression captures.
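To make the captured pieces concrete, here is a small sketch, assuming the capture order documented for `$RE{URI}{HTTP}{-keep}` (whole URI, scheme, host, port, path with query, and the query as the eighth group). The helper name `uri_parts` and the sample URL are invented for the example. Note that the HTTP pattern only matches `http` by default, so the `-scheme` flag is used to accept `https` as well.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# -scheme widens the default 'http' to also accept 'https';
# -keep turns on the capture groups.
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};

# Hypothetical helper: return the captured pieces of the first HTTP(S) URI
# in $str as a hash reference, or nothing if no URI is found.
sub uri_parts {
    my ($str) = @_;
    my @m = $str =~ /$http_re/ or return;
    my %parts;
    @parts{qw(uri scheme host port path_query)} = @m[0 .. 4];
    $parts{query} = $m[7];    # query string without the leading '?'
    return \%parts;
}

my $parts = uri_parts("Go to https://www.example.com/search?q=perl&lang=en now");
if ($parts) {
    for my $name (qw(uri scheme host path_query query)) {
        print "$name: $parts->{$name}\n" if defined $parts->{$name};
    }
}
```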

寄风 2024-09-03 07:01:03

When I tried URI::Find::Schemeless with the following text:

Here is a URL  and one bare URL with 
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)

Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://[email protected]/test/me
How about one without a protocol www.example.com?

it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:

#!/usr/bin/perl

use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;

# Heuristic pattern for URLs written without a scheme (e.g. www.example.com)
my $heuristic = URI::Find::Schemeless->schemeless_uri_re;

my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;

# Paragraph mode: read blank-line-separated paragraphs
local $/ = '';

# The sample text is expected in a __DATA__ section (not shown)
while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}

sub linkify {
    my ($str) = @_;
    # Prepend a scheme to schemeless matches
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}

This worked for the input shown. Of course, life is never that easy, as you can see by trying (http://example.org/(9.3)).
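One possible way to cope with the parenthesized case is to match greedily and then trim the tail. The sketch below uses only core Perl and is not the approach from the answer above: it grabs everything up to the next whitespace, then peels off trailing sentence punctuation and any closing parenthesis that has no matching opening one inside the URL. The helper names `trim_url` and `linkify_loose` and the loose `https?|ftp` pattern are assumptions made for illustration; text outside the URLs is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw match into (url, trailing text).
sub trim_url {
    my ($raw) = @_;
    my $tail = '';
    while (length $raw) {
        my $last = substr $raw, -1;
        if ($last =~ /[.,;:!?'"]/) {
            # sentence punctuation is treated as not part of the URL
        }
        elsif ($last eq ')') {
            my $open  = () = $raw =~ /\(/g;
            my $close = () = $raw =~ /\)/g;
            last if $close <= $open;    # balanced: keep the ')'
        }
        else {
            last;
        }
        $tail = $last . $tail;
        chop $raw;
    }
    return ($raw, $tail);
}

# Hypothetical helper: wrap loosely matched URLs in <a> tags.
sub linkify_loose {
    my ($text) = @_;
    $text =~ s{(\b(?:https?|ftp)://[^\s<>"]+)}{
        my ($url, $tail) = trim_url($1);
        (my $safe = $url) =~ s/&/&amp;/g;    # '&' is the only special character the pattern lets through
        qq|<a href="$safe">$safe</a>$tail|;
    }ge;
    return $text;
}

print linkify_loose("(http://example.org/(9.3))"), "\n";
```

This is a heuristic: a URL that really does end in a period or an unbalanced parenthesis will be cut short.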

送舟行 2024-09-03 07:01:03

Here is sample code showing how to extract URLs.
It reads lines from standard input,
checks whether each line contains a validly formatted HTTP URL,
and, if so, prints the URL.

use strict;
use warnings;

use Regexp::Common qw /URI/;

while (1)
{
        # Prompt for a line and read it from standard input.
        print "Enter the line: \n";
        my $line = <>;
        last unless defined $line;  # stop at end of input
        chomp ($line);              # remove the trailing newline
        # With -keep, the match returns the whole URI as its first capture.
        my ($uri) = $line =~ /$RE{URI}{HTTP}{-keep}/;
        if (defined $uri) {
                print "Contains an HTTP URI.\n";
                print "URL : $uri\n";
        }
}

The sample output I get is as follows:

Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com
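The output above also shows the limits of this check: `www.google.com` and a bare `http://` are not reported, and the default HTTP pattern does not accept `https`. As a hedged sketch, the variant below widens the scheme with `-scheme` and collects every URL on a line with a global match. The helper name `extract_urls` and the sample lines are assumptions made for this example.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# Accept both http and https; without -keep the pattern has no capture
# groups of its own, so the outer parentheses below are the only capture.
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'};

# Hypothetical helper: return every HTTP or HTTPS URI found in $line.
sub extract_urls {
    my ($line) = @_;
    my @urls = $line =~ /($http_re)/g;
    return @urls;
}

for my $line (
    'see http://www.google.com and https://stackoverflow.com/posts/2565350/',
    'www.google.com',
    'http://',
) {
    my @urls = extract_urls($line);
    print "$line => ", scalar(@urls), " URL(s)\n";
    print "  $_\n" for @urls;
}
```

Schemeless input such as `www.google.com` still needs a separate heuristic, for example the one from URI::Find::Schemeless used in the earlier answer.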