Reading a file and extracting only certain parts

Posted on 2024-10-04 01:50:53

ifstream toOpen;
openFile.open("sample.html", ios::in);

if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){
            start_pos = line.find("href");
            tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString .find("\"");
            string testResult = tempString .substr(start_pos, stop_pos);
            cout << testResult << endl;
        }
    }

    toOpen.close();
}

What I am trying to do is extract the "href" value. But I cant get it works.

EDIT:

Thanks to Tony's hint, I now use this:

if(line.find("href=") != std::string::npos ){   
    // Process
}

It works!

Comments (2)

巴黎夜雨 2024-10-11 01:50:53

I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.

Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
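To make the encoding point concrete, here is a minimal sketch. The helper name `naive_mentions` is made up for illustration and is not from the question. It does the same kind of literal substring test the question relies on. A link whose dot is written as the numeric character reference `&#46;` is missed, while a link that sits inside an HTML comment is still reported.

```cpp
#include <string>

// Hypothetical helper for illustration only: a literal substring test,
// in the same spirit as the checks in the question.
bool naive_mentions(const std::string& line, const std::string& text)
{
    return line.find(text) != std::string::npos;
}

// A browser decodes "&#46;" to "." before following the link, so this
// line points at foo.pdf even though the text ".pdf" never appears in it.
const std::string encoded_link = "<a href=\"foo&#46;pdf\">report</a>";

// This link is commented out, so a browser ignores it, yet the literal
// search for href= still matches: a "phantom" link.
const std::string commented_link = "<!-- <a href=\"old.pdf\">old</a> -->";
```

With these inputs, `naive_mentions(encoded_link, ".pdf")` is false (a missed link) and `naive_mentions(commented_link, "href=")` is true (a phantom link).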

You might want to look at the answers to this previous question for suggestions about an HTML parser to use.

此生挚爱伱 2024-10-11 01:50:53

As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:

#include <cstdlib>   // EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <string>

int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    // getline() returns the stream, which tests false at end of file or on error.
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}

As for the inner part that processes the line, there are several problems.

  1. It doesn't compile. I suppose this is what you meant by "I cant get it works". When asking a question, this is the kind of information you might want to provide in order to get good help.
  2. There is confusion between variable names such as temp and tempString.
  3. string::find() returns std::string::npos, a large positive integer, to indicate that nothing was found (size_type is unsigned). Used directly as a condition, the result is therefore true whether or not there is a match, except when the match starts at character position 0, which is precisely a case where you do want to enter the if block.
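Point 3 can be seen in isolation with a small sketch. Both helper names below are made up for illustration. `buggy_contains` converts the position to a boolean the way the question's condition does, and `contains` compares against `std::string::npos`.

```cpp
#include <string>

// Hypothetical helper: the pattern from the question. The position returned
// by find() is converted to bool, so the result is false only when the match
// starts at index 0 and true in every other case, including "not found".
bool buggy_contains(const std::string& haystack, const std::string& needle)
{
    return static_cast<bool>(haystack.find(needle));
}

// Hypothetical helper: the corrected test. find() signals "not found" with
// std::string::npos, so that is the value to compare against.
bool contains(const std::string& haystack, const std::string& needle)
{
    return haystack.find(needle) != std::string::npos;
}
```

For a line with no link at all, such as `<p>no link</p>`, `buggy_contains(line, "href=")` is true while `contains(line, "href=")` is false. For a line that begins with `href=`, the buggy version is false even though the text is there.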

Here is some simple test content for sample.html:

<html>
    <a href="foo.pdf"/>
</html>

Sticking the following inside the loop:

if ((line.find("href=") != std::string::npos) &&
    (line.find(".pdf" ) != std::string::npos))
{
    const std::size_t start_pos = line.find("href");
    // Skip the 6 characters of href=" to land on the first character of the value.
    std::string temp = line.substr(start_pos+6);
    // The value ends at the next double quote.
    const std::size_t stop_pos = temp.find("\"");
    std::string result = temp.substr(0, stop_pos);
    std::cout << "'" << result << "'" << std::endl;
}

I actually get the output

'foo.pdf'

However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.
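If it is such an exercise, the fragment above could be wrapped in a function that also copes with several links on one line. This is only a sketch under stated assumptions: the name `extract_hrefs` is made up, every attribute is assumed to be written exactly as `href="..."` (lowercase, double quotes, no spaces around `=`) and each value is assumed to sit on a single line. Anything else is silently missed, which is one more reason to prefer a real HTML parser outside of exercises.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collects the text between href=" and the next
// double quote, for every occurrence on the given line.
std::vector<std::string> extract_hrefs(const std::string& line)
{
    static const std::string marker = "href=\"";
    std::vector<std::string> values;
    std::size_t pos = 0;
    while ((pos = line.find(marker, pos)) != std::string::npos) {
        const std::size_t begin = pos + marker.size();  // first character of the value
        const std::size_t end = line.find('"', begin);  // closing quote
        if (end == std::string::npos) {
            break;  // unterminated value: stop scanning this line
        }
        values.push_back(line.substr(begin, end - begin));
        pos = end + 1;  // continue after the closing quote
    }
    return values;
}
```

Inside the line loop shown earlier, the returned values can be printed directly, or filtered first (for example by checking whether a value ends in ".pdf").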
