Read a file and extract only certain parts
ifstream toOpen;
openFile.open("sample.html", ios::in);
if(toOpen.is_open()){
while(!toOpen.eof()){
getline(toOpen,line);
if(line.find("href=") && !line.find(".pdf")){
start_pos = line.find("href");
tempString = line.substr(start_pos+1); // i dont want the quote
stop_pos = tempString .find("\"");
string testResult = tempString .substr(start_pos, stop_pos);
cout << testResult << endl;
}
}
toOpen.close();
}
What I am trying to do is extract the "href" value, but I can't get it to work.
EDIT:
Thanks to Tony's hint, I now use this:
if(line.find("href=") != std::string::npos ){
// Process
}
It works!
I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.
Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
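As a small illustration of the encoding problem, consider a plain substring test for ".pdf". The helper name mentions_pdf is hypothetical and exists only for this sketch; the point is that the same attribute value can be spelled with a character reference such as &#46; (decimal 46 is the '.' character), which a raw text search does not see.

```cpp
#include <string>

// Hypothetical helper: naive check for the literal substring ".pdf".
// It looks at the raw markup only and does not decode character references.
bool mentions_pdf(const std::string& text) {
    return text.find(".pdf") != std::string::npos;
}
```

With this check, mentions_pdf("<a href=\"report.pdf\">") is true, while mentions_pdf("<a href=\"report&#46;pdf\">") is false, even though a browser treats both attribute values as report.pdf.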
You might want to look at the answers to this previous question for suggestions about an HTML parser to use.
As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:
As for the inner part that processes the line, there are several problems.
There is confusion between temp and tempString, etc.
string::find() returns a large positive integer to indicate invalid positions (the size_type is unsigned), so you will always enter the loop unless a match is found starting at character position 0, in which case you probably do want to enter the loop.
Here is a simple test content for sample.html.
Sticking the following inside the loop:
I actually get the output
However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.
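For an exercise of that kind, the extraction step might be sketched as follows. This is not the code from the answer: extract_href is a hypothetical helper, and it assumes the attribute is written as href="..." with double quotes, on a single line, with no character references in the value.

```cpp
#include <string>

// Hypothetical helper: copies the text between the double quotes that
// follow the first href= on the line into `out`.
// Returns false when there is no href=" or no closing quote.
bool extract_href(const std::string& line, std::string& out) {
    const std::string key = "href=\"";
    const std::string::size_type start = line.find(key);
    if (start == std::string::npos) {
        return false;  // find() reports "not found" as npos, not as 0
    }
    const std::string::size_type value_begin = start + key.size();
    const std::string::size_type value_end = line.find('"', value_begin);
    if (value_end == std::string::npos) {
        return false;  // opening quote without a closing one
    }
    // The second argument of substr is a length, not an end position.
    out = line.substr(value_begin, value_end - value_begin);
    return true;
}
```

Called on the line <a href="page.html">x</a>, it returns true and stores page.html in out. Comparing against std::string::npos is what makes a match at position 0 and a missing match distinguishable.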