JavaScript regex to extract anchor text and URL from anchor tags

Posted on 2024-07-09 15:28:15

I have a paragraph of text in a JavaScript variable called 'input_content', and that text contains multiple anchor tags/links. I would like to match all of the anchor tags, extract the anchor text and URL of each, and put them into an array like (or similar to) this:

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!

情归归情 2024-07-16 15:28:15

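var matches = [];

// replace() is used only to run the callback once per match; its return value is ignored.
input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    // arguments[1..3] hold the whole anchor, the href value and the link text.
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});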

This assumes that your anchors will always be in the form <a href="...">...</a> i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function will receive three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:

arguments[1] is the entire anchor
arguments[2] is the href part
arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The arguments built-in variable is not a true JavaScript Array, so we'll have to apply the slice Array method on it to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.
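
As a small illustration of that slice call, the sketch below compares it with a rest parameter, which yields a real array directly. The function names viaSlice and viaRest and the sample values are made up for this example; only Array.prototype.slice and rest parameters are standard JavaScript.

```javascript
// Hypothetical helpers: both return items 1..3 of whatever they are called with.
function viaSlice() {
    // arguments is array-like, so borrow slice from Array.prototype.
    return Array.prototype.slice.call(arguments, 1, 4);
}

function viaRest(...args) {
    // args is a true array, so slice can be called on it directly.
    return args.slice(1, 4);
}

// Made-up values shaped like what replace() passes to its callback:
// full match, three captured groups, match offset, whole input string.
var sample = ["full match", "<a href=\"x\">X</a>", "x", "X", 0, "whole string"];

var fromSlice = viaSlice.apply(null, sample);
var fromRest = viaRest.apply(null, sample);
```

Both calls pick out the three captured groups and drop the full match, the offset and the input string.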

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
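
The answer notes that the pattern only handles anchors of the form <a href="...">...</a>. One possible way to relax that is sketched below. It assumes href values are always double-quoted and that the link text contains no nested tags; the names sampleHtml, anchorPattern and links are made up for this example, and String.prototype.matchAll needs an ES2020-capable engine.

```javascript
// Made-up sample input: the first anchor carries extra attributes around href.
var sampleHtml = 'blah <a target="_blank" href="http://yahoo.com" class="x">Yahoo</a> blah ' +
    '<a href="http://google.com">Google</a> blah';

// "<a", any attributes, whitespace + href="...", any remaining attributes, ">", text, "</a>".
var anchorPattern = /<a\b[^>]*?\shref="([^"]*)"[^>]*>([^<]*)<\/a>/gi;

// matchAll yields one match object per anchor; keep [whole anchor, URL, anchor text].
var links = Array.from(sampleHtml.matchAll(anchorPattern), function (m) {
    return [m[0], m[1], m[2]];
});
```

This is still a regular expression run against markup, so single-quoted or unquoted attributes, nested tags inside the link text, and similar variations are not handled.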

深海夜未眠 2024-07-16 15:28:15

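Since you're presumably running the JavaScript in a web browser, regex seems like a poor fit for this. If the paragraph came from the page in the first place, get a handle on its container, call .getElementsByTagName() to get the anchors, and extract the values you want that way.

If that's not possible, create a new HTML element object, assign your text to its .innerHTML property, and then call .getElementsByTagName().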

口干舌燥 2024-07-16 15:28:15

I think Joel has the right of it — regexes are notorious for playing poorly with markup, as there are simply too many possibilities to consider. Are there other attributes to the anchor tags? What order are they in? Is the separating whitespace always a single space? Seeing as you already have a browser's HTML parser available, best to put that to work instead.

function getLinks(html) {
    var container = document.createElement("p");
    container.innerHTML = html;

    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text = anchors[i].textContent;

        if (text === undefined) text = anchors[i].innerText;

        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }

    return list;
}

This will return an array like the one you describe regardless of how the links are stored. Note that you could change the function to work with a passed element instead of text by changing the parameter name to "container" and removing the first two lines. The textContent/innerText property gets the text displayed for the link, stripped of any markup (bold/italic/font/…). You could replace .textContent with .innerHTML and remove the inner if() statement if you want to preserve the markup.

甜心小果奶 2024-07-16 15:28:15

For the benefit of searchers: I created something that will work with additional attributes in the anchor tag. For those not familiar with Regex, the dollar ($1 etc) values are the regex group matches.

var text = 'This is my <a target="_blank" href="www.google.co.uk">link</a> Text';
var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;
var output = text.replace(urlPattern, "$1___$2___$3___$4___$5___$6");
alert(output);

See working jsFiddle and regex101.

Alternatively, you can get info out of the groups like this:

var returnText = text.replace(urlPattern, function (fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor) {
    return "The bits you want e.g. linkText";
});
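
To tie this back to the array the question asks for, the callback form can also collect the groups instead of building a replacement string. This is only a sketch: text and urlPattern are repeated from the answer so the block runs on its own, and the found array is a name made up for this example.

```javascript
var text = 'This is my <a target="_blank" href="www.google.co.uk">link</a> Text';
var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;

var found = []; // hypothetical name: one [whole anchor, URL, anchor text] entry per link

text.replace(urlPattern, function (fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor) {
    // Rebuild the whole anchor from opening tag, text and closing tag.
    found.push([anchorContent + linkText + endAnchor, lnkUrl, linkText]);
    return fullText; // return the match unchanged; only the side effect matters here
});
```

For the sample string this gives a single entry holding the full anchor, www.google.co.uk and link.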

淡莣 2024-07-16 15:28:15

I think jQuery would be your best bet. This isn't the best script, and I'm sure others can offer something better, but it creates an array of exactly what you're looking for.

<script type="text/javascript">
    // From http://brandonaaron.net Thanks!
    jQuery.fn.outerHTML = function() {
        return $('<div>').append( this.eq(0).clone() ).html();
    };    

    var items = new Array();
    var i = 0;

    $(document).ready(function(){
        $("a").each(function(){
            items[i] = {el:$(this).outerHTML(),href:this.href,text:this.text};
            i++;      
        });
    });

    function showItems(){
        alert(items);
    }

</script>

玉环 2024-07-16 15:28:15

To extract the url:

var pattern = /.*href="(.*)".*/;
var url = string.replace(pattern,'$1');

Demo:

//var string = '<a id="btn" target="_blank" class="button" href="https://yourdomainame.com:4089?param=751&2ndparam=2345">Buy Now</a>';
//uncomment the above as an example of link.outerHTML

var string = link.outerHTML
var pattern = /.*href="(.*)".*/;
var href = string.replace(pattern,'$1');
alert(href)

For "anchor text", why not use:
link.innerHTML
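
One caveat with the pattern above: the capture (.*) is greedy, so it only returns the bare URL when href is the last quoted attribute in the tag, as it is in the demo string. The sketch below uses made-up markup in which another attribute follows href, and compares the greedy capture with one bounded by [^"]*. The variable names are assumptions for this example.

```javascript
// Made-up sample markup: href is followed by another quoted attribute.
var html = '<a href="https://example.com/buy" class="button">Buy Now</a>';

// Greedy capture runs on to the last double quote in the string.
var greedy = html.replace(/.*href="(.*)".*/, '$1');

// Capture bounded by [^"]* stops at the quote that closes the href value.
var bounded = html.replace(/.*href="([^"]*)".*/, '$1');
```

Here greedy ends up containing the trailing class attribute as well, while bounded holds only https://example.com/buy.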
