Count and parse all href links in an html file

Posted on 2024-10-05 21:57:05

Following my previous question I have been trying to parse the href strings out of a html file in order to send that string to the solution of my previous question.

this is what I have but it doesn't work...

void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;

    while(strstr(begin, "href=\"") != NULL)
    {   
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));

            strncpy(url, begin, 100);
            printf("URL = %s\n", url);

            if(url) free(url);
        }

        total++;
        begin++;
    }

    printf("Total URLs = %d\n", total);
    return;
}

basically I need to extract into a string the information of the href, something like:

<a href="http://www.w3schools.com">Visit W3Schools</a>

Any help is appreciated.

Comments (2)

方觉久 2024-10-12 21:57:05

There are a lot of things wrong with this code.

  • You increment begin only by one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?

  • The strncpy will normally copy 100 characters (as the HTML will be longer) and so will not nul-terminate the string. You want url[100] = '\0' somewhere.

  • Why do you allocate 1000 characters and use only 100?

  • You search for end starting with begin. This means if there's a </a> before the href=" you'll find that instead.

  • You don't use end for anything.

  • Why don't you search for the terminating quote at the end of the URL?

Given the above issues (and adding the termination of URL) it works OK for me.

Given

"<a href=\"/email_services.php\">Email services</a> "

it prints

URL = <a href="/email_services.php">Email services</a> 
URL = a href="/email_services.php">Email services</a> 
URL =  href="/email_services.php">Email services</a> 
URL = href="/email_services.php">Email services</a> 
Total URLs = 4

For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start); the size you need is then end - start (+1 for the terminating NUL). Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.

Also, remember href= isn't unique to anchors. It can appear in some other tags too.
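
Putting the points above together, a minimal sketch of a corrected scanner might look like the following. It is not the original poster's final code: the helper name NextHref, the const-qualified parameters and the int return value of ParseUrls are assumptions made for illustration, and memcpy is used in place of strncpy because the length is already known. It searches for the terminating quote rather than </a>, sizes the allocation as end - start + 1, and resumes after the end of each match.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the next href value found
 * at or after *cursor and advances *cursor past its closing quote.
 * Returns NULL when no further complete href="..." exists (or if malloc
 * fails; this sketch does not distinguish the two cases). */
char *NextHref(const char **cursor)
{
    static const char marker[] = "href=\"";
    const char *start = strstr(*cursor, marker);
    if (start == NULL)
        return NULL;
    start += sizeof marker - 1;            /* first character of the URL */

    const char *end = strchr(start, '"');  /* terminating quote of the URL */
    if (end == NULL)
        return NULL;                       /* unterminated attribute */

    size_t len = (size_t)(end - start);
    char *url = malloc(len + 1);           /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;
    memcpy(url, start, len);
    url[len] = '\0';

    *cursor = end + 1;                     /* continue after this href */
    return url;
}

/* Prints every href value in Buffer and returns how many were found. */
int ParseUrls(const char *Buffer)
{
    const char *cursor = Buffer;
    char *url;
    int total = 0;

    while ((url = NextHref(&cursor)) != NULL) {
        printf("URL = %s\n", url);
        free(url);
        total++;
    }

    printf("Total URLs = %d\n", total);
    return total;
}
```

This still matches href=" in any tag, not just anchors, and it ignores single-quoted or unquoted attribute values, so treat it as a starting point only.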

眼泪也成诗 2024-10-12 21:57:05

This does not really answer your question about this code, but it would probably be more reliable to use a C library to do this, such as HTMLParser from libxml2.

HTML parsing looks easy, but there are edge cases that make it easier to use something that is known to work than to work through them all yourself.
