Count and parse all href links in an HTML file
Following my previous question, I have been trying to parse the href strings out of an HTML file so that I can pass each string to the solution from that question.
This is what I have, but it doesn't work...
void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;
    while(strstr(begin, "href=\"") != NULL)
    {
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));
            strncpy(url, begin, 100);
            printf("URL = %s\n", url);
            if(url) free(url);
        }
        total++;
        begin++;
    }
    printf("Total URLs = %d\n", total);
    return;
}
Basically, I need to extract the href value into a string, given something like:
<a href="http://www.w3schools.com">Visit W3Schools</a>
Any help is appreciated.
Comments (2)
There are a lot of things wrong with this code.
- You increment begin by only one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?
- The strncpy will normally copy 100 characters (as the HTML will be longer) and so will not NUL-terminate the string. You want url[100] = '\0' somewhere.
- Why do you allocate 1000 characters and use only 100?
- You search for end starting with begin. This means if there's a </a> before the href=" you'll find that instead.
- You don't use end for anything.
- Why don't you search for the terminating quote at the end of the URL?
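To illustrate the strncpy point from the list above, here is a minimal sketch. The helper name copy_prefix is hypothetical and not part of the original code. It allocates exactly n + 1 bytes and writes the terminator itself, because strncpy does not add one when the source is at least n characters long.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy at most n characters of src into a new
 * heap buffer that is always NUL-terminated. The caller frees the
 * result. Returns NULL if the allocation fails. */
char *copy_prefix(const char *src, size_t n)
{
    char *dst = malloc(n + 1);      /* +1 for the terminating NUL */
    if (dst == NULL)
        return NULL;
    strncpy(dst, src, n);           /* may leave dst unterminated */
    dst[n] = '\0';                  /* so terminate it explicitly */
    return dst;
}
```

Sizing the buffer from the length actually copied also avoids the mismatch between the 1000 bytes allocated and the 100 bytes used.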
Given the above issues (and adding the termination of URL) it works OK for me.
Given
it prints
For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start); the size you need is then end - start (+1 for the terminating NUL). Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.
Also, remember href= isn't unique to anchors. It can appear in some other tags too.
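Putting the suggestions above together, one possible sketch looks like this. It is not the answerer's actual code: the names next_href and print_hrefs are hypothetical, start is taken to point just past href=" rather than at it, and memcpy is used in place of strncpy because the length is already known. It assumes double-quoted attribute values and, as noted above, will also match href= in tags other than anchors.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find the next href="..." at or after html.
 * Returns a newly allocated copy of the URL (caller frees), or NULL if
 * there is no further complete href. On success, *rest is set to just
 * past the closing quote so the next search skips this link. */
char *next_href(const char *html, const char **rest)
{
    static const char key[] = "href=\"";
    const char *start = strstr(html, key);
    if (start == NULL)
        return NULL;
    start += sizeof key - 1;                 /* skip past href=" */

    const char *end = strchr(start, '"');    /* closing quote of the URL */
    if (end == NULL)
        return NULL;

    size_t len = (size_t)(end - start);
    char *url = malloc(len + 1);             /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;
    memcpy(url, start, len);
    url[len] = '\0';

    *rest = end + 1;
    return url;
}

/* Print every href value found in buffer and return the count. */
int print_hrefs(const char *buffer)
{
    const char *cursor = buffer;
    char *url;
    int total = 0;

    while ((url = next_href(cursor, &cursor)) != NULL) {
        printf("URL = %s\n", url);
        free(url);
        total++;
    }
    printf("Total URLs = %d\n", total);
    return total;
}
```

Called on the sample from the question, <a href="http://www.w3schools.com">Visit W3Schools</a>, print_hrefs prints URL = http://www.w3schools.com followed by Total URLs = 1. Single-quoted or unquoted attribute values are not handled, which is one reason a real HTML parser is the more robust choice.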
This does not really answer your question about this code, but it would probably be more reliable to use a C library to do this, such as HTMLParser from libxml2.
HTML parsing looks easy, but there are edge cases that make it easier to use something that is known to work than to work through them all yourself.