Count and parse all href links in an HTML file

Posted on 2024-10-05 21:57:05

Following my previous question, I have been trying to parse the href strings out of an HTML file so that I can pass each string to the solution from that question.

This is what I have, but it doesn't work...

void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;

    while(strstr(begin, "href=\"") != NULL)
    {   
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));

            strncpy(url, begin, 100);
            printf("URL = %s\n", url);

            if(url) free(url);
        }

        total++;
        begin++;
    }

    printf("Total URLs = %d\n", total);
    return;
}

Basically, I need to extract the information in the href into a string, given something like:

<a href="http://www.w3schools.com">Visit W3Schools</a>

Any help is appreciated.

方觉久 2024-10-12 21:57:05

There are a lot of things wrong with this code.

You increment begin by only one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?
The strncpy will normally copy 100 characters (as the HTML will be longer) and so will not NUL-terminate the string. You want url[100] = '\0' somewhere.
Why do you allocate 1000 characters and use only 100?
You search for end starting with begin. This means if there's a </a> before the href=" you'll find that instead.
You don't use end for anything.
Why don't you search for the terminating quote at the end of the URL?
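
The termination point can be looked at in isolation. Below is a minimal sketch, assuming the caller provides a buffer with room for n + 1 bytes; the helper name copy_prefix is hypothetical and does not come from the question.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the question): copy at most n characters
   of src into dst and always NUL-terminate.
   Assumption: dst has room for at least n + 1 bytes. */
void copy_prefix(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* strncpy writes no NUL when src has n or more characters;
       when src is shorter it pads the rest of the n bytes with NULs. */
    strncpy(dst, src, n);

    /* Terminate explicitly so that printf("%s") cannot run off the end. */
    dst[n] = '\0';
}
```

In the question's code this corresponds to setting url[100] = '\0' right after the strncpy call, which is safe there because 1000 bytes were allocated.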

Those issues aside (and with the NUL termination of url added), it runs OK for me.

Given

"<a href=\"/email_services.php\">Email services</a> "

it prints

URL = <a href="/email_services.php">Email services</a> 
URL = a href="/email_services.php">Email services</a> 
URL =  href="/email_services.php">Email services</a> 
URL = href="/email_services.php">Email services</a> 
Total URLs = 4

For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start); the size you need is then end - start, plus 1 for the terminating NUL. Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.
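
One way the allocation advice could look in code is sketched below. It rests on a few assumptions that are mine rather than the question's: start is advanced past href=" so that only the URL is copied, end is the closing quote rather than the closing </a>, and the search resumes after end. The helper name next_href is hypothetical, and this ParseUrls variant takes a const char * and returns the count, unlike the original.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find the next href="..." at or after *cursor,
   return a malloc'd copy of the text between the quotes, and advance
   *cursor past the closing quote. Returns NULL when no complete
   href="..." remains or when allocation fails. Caller frees the result. */
char *next_href(const char **cursor)
{
    static const char key[] = "href=\"";
    const char *start, *end;
    size_t len;
    char *url;

    assert(cursor != NULL && *cursor != NULL);

    start = strstr(*cursor, key);
    if (start == NULL)
        return NULL;
    start += sizeof key - 1;       /* skip past href=" itself */

    end = strchr(start, '"');      /* terminating quote of the URL */
    if (end == NULL)
        return NULL;

    len = (size_t)(end - start);
    url = malloc(len + 1);         /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;

    strncpy(url, start, len);      /* copies exactly len characters */
    url[len] = '\0';               /* strncpy added no NUL here */

    *cursor = end + 1;             /* resume after this href */
    return url;
}

/* Variant of the question's function: prints each URL, returns the count. */
int ParseUrls(const char *Buffer)
{
    const char *cursor = Buffer;
    char *url;
    int total = 0;

    while ((url = next_href(&cursor)) != NULL) {
        printf("URL = %s\n", url);
        free(url);
        total++;
    }
    printf("Total URLs = %d\n", total);
    return total;
}
```

Because the cursor jumps past each closing quote, every href is reported once instead of once per character before it. This is plain string searching, not an HTML parser, so single-quoted or unquoted attribute values are not handled.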

Also, remember href= isn't unique to anchors. It can appear in some other tags too.
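
For instance, link, area and base elements carry an href too. A rough way to skip those, assuming tags are written as `<a` followed by whitespace and that no stray '<' sits inside an attribute value, is to look back for the nearest '<' and check the tag name. The helper name href_is_in_anchor is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: html points at the start of the buffer and href at
   an "href=" occurrence inside that same buffer. Returns 1 if the nearest
   preceding '<' opens an <a ...> tag, 0 otherwise.
   Assumption: no '<' appears inside attribute values or comments. */
int href_is_in_anchor(const char *html, const char *href)
{
    const char *p = href;

    assert(html != NULL && href != NULL && href >= html);

    /* Walk backwards to the '<' that opens the enclosing tag. */
    while (p > html && *p != '<')
        p--;
    if (*p != '<')
        return 0;

    /* Tag name must be exactly "a" (either case), followed by whitespace. */
    return (p[1] == 'a' || p[1] == 'A') && isspace((unsigned char)p[2]);
}
```

A real HTML parser is the more robust choice if the input is not under your control; this check only filters the common cases.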
