How do I retrieve all indexes of a string in a large file?

Posted on 2024-12-09 07:39:30

Imagine a very large HTML file containing, naturally, a great many HTML tags. I cannot load the entire file into memory.

My goal is to find the index of every occurrence of the strings <p> and </p>. How should I go about it? Please suggest some directions.

Comments (5)

丿*梦醉红颜 2024-12-16 07:39:31

You should try Html Agility Pack.

同尘 2024-12-16 07:39:31

If your HTML is well-formed XHTML, you can treat it as an XML document. Load the XHTML into a System.Xml.XmlDocument and then call the GetElementsByTagName("p") method to get a list of the <p> elements. This is much safer and easier than trying to parse the HTML directly.
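Since XmlDocument builds the whole tree in memory, a streaming variant of the same idea may suit a very large file better. The sketch below swaps in XmlReader (not XmlDocument) and reports the line and column of each <p> and </p> through IXmlLineInfo rather than a flat file offset. The class name XhtmlParagraphs and the method Locate are hypothetical, and the code assumes the input really is well-formed XML; with the DTD ignored, named entities such as &nbsp; would be rejected.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: streams through XHTML and reports where <p> and </p> occur.
public static class XhtmlParagraphs
{
    // Yields (line, column, isClosingTag) for every p element boundary.
    // Line and column are 1-based; both are 0 if the reader has no position info.
    public static IEnumerable<Tuple<int, int, bool>> Locate(TextReader input)
    {
        XmlReaderSettings settings = new XmlReaderSettings
        {
            // Assumption: skip the DOCTYPE instead of resolving it.
            DtdProcessing = DtdProcessing.Ignore
        };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            IXmlLineInfo lineInfo = reader as IXmlLineInfo;
            while (reader.Read())
            {
                bool isOpen = reader.NodeType == XmlNodeType.Element;
                bool isClose = reader.NodeType == XmlNodeType.EndElement;
                if ((isOpen || isClose) && reader.LocalName == "p")
                {
                    bool hasInfo = lineInfo != null && lineInfo.HasLineInfo();
                    int line = hasInfo ? lineInfo.LineNumber : 0;
                    int column = hasInfo ? lineInfo.LinePosition : 0;
                    yield return Tuple.Create(line, column, isClose);
                }
            }
        }
    }
}
```

To use it on a file, pass a StreamReader opened on the path, for example XhtmlParagraphs.Locate(new StreamReader(path)), and enumerate the results with foreach so that only the current node is held in memory.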

梦里梦着梦中梦 2024-12-16 07:39:31

I would start by creating an HTML tokeniser, which is straightforward with IEnumerable, yield return and so on. It could read the file char by char using StreamReader.Read, with a state machine (a switch statement) tracking the current state and yielding a sequence of tokens or Tuples.

I found an old HTML tokenizer here (part of Chris Anderson's old BlogX blog engine) that could be adapted to become the basis of a streamable solution to the problem.
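As a rough illustration of the state-machine idea, here is a minimal sketch with only two states. It is not the BlogX tokenizer; the names TagTokenizer and ReadTags are hypothetical, and it deliberately ignores comments, script blocks and '>' characters inside attribute values. Offsets are counted in characters, which only equal byte offsets for single-byte encodings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical two-state tokenizer: yields each tag with the offset of its '<'.
public static class TagTokenizer
{
    private enum State { Text, InTag }

    public static IEnumerable<Tuple<long, string>> ReadTags(TextReader reader)
    {
        State state = State.Text;
        StringBuilder tag = new StringBuilder();
        long position = 0;  // character offset of the character being processed
        long tagStart = 0;  // offset of the '<' that opened the current tag
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            switch (state)
            {
                case State.Text:
                    if (c == '<')
                    {
                        state = State.InTag;
                        tagStart = position;
                        tag.Clear();
                        tag.Append(c);
                    }
                    break;

                case State.InTag:
                    tag.Append(c);
                    if (c == '>')
                    {
                        state = State.Text;
                        yield return Tuple.Create(tagStart, tag.ToString());
                    }
                    break;
            }
            position++;
        }
    }
}
```

Because the method is an iterator, the caller can filter lazily, for example keeping only tokens whose text is "</p>" or starts with "<p>" or "<p ", without ever holding more than one tag in memory.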

挥剑断情 2024-12-16 07:39:30

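Using file streams, you should be able to load the file in chunks of several KB each. Keep track of your current position in the file as you load each chunk. Scan the chunk for the string you are looking for, and add its offset within the chunk to your current position. Carry the last few characters of each chunk over to the next one so that a match spanning a chunk boundary is not missed. Keep a list of all the indexes you find.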
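The chunked approach can be sketched roughly as follows. The names ChunkedSearch and FindAll are hypothetical. The sketch keeps the last pattern.Length - 1 characters of each chunk so that a match straddling two chunks is still found, and it reports character offsets, which match byte offsets only for single-byte encodings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: finds every occurrence of a pattern while reading in chunks.
public static class ChunkedSearch
{
    public static IEnumerable<long> FindAll(TextReader reader, string pattern, int chunkSize = 4096)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("pattern must not be empty", nameof(pattern));
        if (chunkSize < 1)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        int overlap = pattern.Length - 1;
        char[] buffer = new char[overlap + chunkSize];
        int carried = 0;       // characters kept from the previous chunk
        long bufferStart = 0;  // offset in the input of buffer[0]

        while (true)
        {
            int read = reader.Read(buffer, carried, chunkSize);
            if (read <= 0) yield break;
            int available = carried + read;

            // Test every start position where the whole pattern fits.
            for (int i = 0; i + pattern.Length <= available; i++)
            {
                if (MatchesAt(buffer, i, pattern))
                    yield return bufferStart + i;
            }

            // Keep the tail that could begin a match crossing the chunk boundary.
            int keep = Math.Min(overlap, available);
            Array.Copy(buffer, available - keep, buffer, 0, keep);
            bufferStart += available - keep;
            carried = keep;
        }
    }

    private static bool MatchesAt(char[] buffer, int start, string pattern)
    {
        for (int j = 0; j < pattern.Length; j++)
        {
            if (buffer[start + j] != pattern[j]) return false;
        }
        return true;
    }
}
```

Calling it once with "<p>" and once with "</p>" on a StreamReader opened over the file gives both lists of indexes. Opening tags that carry attributes, such as <p class="x">, would need a looser pattern or a tokenizer.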

梦巷 2024-12-16 07:39:30

An example using file streams:

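// Requires: using System; using System.Collections.Generic; using System.IO;

/// <summary>
/// Get a collection of index,string for the content that follows each opening p tag in the html file
/// </summary>
/// <param name="htmlFilename">filename of the html file</param>
/// <returns>collection of index,string where index is the character offset of the content</returns>
private Dictionary<long, string> GetHtmlIndexes(string htmlFilename)
{
    Dictionary<long, string> result = new Dictionary<long, string>();

    // using disposes the reader even if an exception is thrown
    using (StreamReader sr = new StreamReader(htmlFilename))
    {
        long offsetIndex = 0; // characters consumed before the current line
        string line;
        while ((line = sr.ReadLine()) != null) // assuming html isn't condensed into 1 single line
        {
            int searchFrom = 0;
            int openingIndex;
            // a line may hold several p tags, so keep searching after each match
            while ((openingIndex = line.IndexOf("<p", searchFrom, StringComparison.OrdinalIgnoreCase)) > -1)
            {
                int nameEnd = openingIndex + 2;
                searchFrom = nameEnd;

                // skip tags that merely start with p, such as <pre> or <param>
                if (nameEnd < line.Length && line[nameEnd] != '>' && !char.IsWhiteSpace(line[nameEnd]))
                    continue;

                int tagEnd = line.IndexOf('>', nameEnd);
                if (tagEnd < 0)
                    break; // the opening tag itself continues on the next line; not handled here

                int contentIndex = tagEnd + 1;
                int closingIndex = line.IndexOf("</p", contentIndex, StringComparison.OrdinalIgnoreCase);

                // if there is no closing tag on this line, the paragraph continues on
                // subsequent lines and we only get the content from this line
                string pTagContent = closingIndex > -1
                    ? line.Substring(contentIndex, closingIndex - contentIndex)
                    : line.Substring(contentIndex);

                result.Add(offsetIndex + contentIndex, pTagContent);
                searchFrom = contentIndex;
            }

            // assuming 'index' you require is the character offset in the file and that
            // each line ends with a single '\n'; add 2 instead for "\r\n" line endings
            offsetIndex += line.Length + 1;
        } //end file loop
    }

    return result;
}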

This has limitations, which you can see from the comments.
I suspect LINQ would make it a lot neater. I hope this gives you a starting point.
