
How do I retrieve all indexes of a string in a large file?

Posted on 2024-12-09 07:39:30

Imagine a very large HTML file that, of course, contains a lot of HTML tags. I cannot load the whole file into memory.

My goal is to extract all the indexes of the <p> and </p> strings. How should I go about it? Please suggest some directions.


丿*梦醉红颜 2024-12-16 07:39:31

You should try Html Agility Pack.


同尘 2024-12-16 07:39:31

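If your HTML is pure XHTML, you can treat it as an XML document. Load the XHTML into a System.Xml.XmlDocument, then use the GetElementsByTagName("p") method to return a list of the <p> tags. This is safer and easier than trying to parse the HTML directly.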
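As a rough illustration of this suggestion, the sketch below loads a small, well-formed XHTML string and collects the text of every p element. The class and method names (XhtmlParagraphs, GetParagraphTexts) are made up for this example. Note that XmlDocument builds the whole tree in memory and does not report file offsets, so this only fits when the document is small enough and strict XML.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XhtmlParagraphs
{
    // Returns the inner text of every <p> element, in document order.
    // Assumes the input is well-formed XML without a DOCTYPE.
    public static List<string> GetParagraphTexts(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var texts = new List<string>();
        foreach (XmlNode node in doc.GetElementsByTagName("p"))
        {
            texts.Add(node.InnerText);
        }
        return texts;
    }
}
```

For a file on disk, doc.Load(path) could replace LoadXml, with the same memory caveat.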


梦里梦着梦中梦 2024-12-16 07:39:31

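I would start by creating an HTML tokenizer; with IEnumerable, yield return and so on, it would be simple. It could read the file character by character using StreamReader.Read, and a state-machine switch would track the current state and yield a sequence of tokens or tuples.

I found an old HTML tokenizer (from Chris Anderson's old BlogX blog engine) that could be adapted as the basis for a streaming solution to this problem.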
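A minimal sketch of that idea, not the BlogX tokenizer itself: a two-state machine that reads one character at a time and lazily yields each tag together with the character offset of its '<'. The names TagScanner and ReadTags are invented for this example, and the scanner deliberately ignores comments, scripts, and '>' inside quoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class TagScanner
{
    private enum State { Text, InTag }

    // Lazily yields (offset, tag) pairs; offset is the character index of '<'.
    // Only the current tag is buffered, never the whole input.
    public static IEnumerable<KeyValuePair<long, string>> ReadTags(TextReader reader)
    {
        State state = State.Text;
        var tag = new StringBuilder();
        long position = 0; // index of the character just read
        long tagStart = 0;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            switch (state)
            {
                case State.Text:
                    if (ch == '<')
                    {
                        state = State.InTag;
                        tagStart = position;
                        tag.Clear();
                        tag.Append(ch);
                    }
                    break;

                case State.InTag:
                    tag.Append(ch);
                    if (ch == '>')
                    {
                        state = State.Text;
                        yield return new KeyValuePair<long, string>(tagStart, tag.ToString());
                    }
                    break;
            }
            position++;
        }
    }
}
```

Passing a StreamReader opened on the file and keeping only the pairs whose value starts with "<p" or "</p" would give the indexes asked for. The offsets are character counts, which equal byte offsets only for single-byte encodings.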


挥剑断情 2024-12-16 07:39:30

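Using a file stream, you should be able to load the file in chunks of a few KB. As you load each chunk, keep an index of the current file position. Scan the chunk for the string you are looking for and add its offset within the chunk to that index. Keep a list of all the indexes you find. Bear in mind that a match can straddle two chunks, so the last few characters of each chunk need to be carried over into the next scan.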
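The chunked approach could be sketched as follows. This is one possible reading of the answer, with assumed names (ChunkedSearch, FindAll) and an assumed default chunk size of 4096 bytes. It searches raw bytes, so the reported indexes are byte offsets, and it keeps the last needle.Length - 1 bytes of each pass so that a match split across two chunks is still found.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class ChunkedSearch
{
    // Returns the byte offset of every occurrence of needle in stream,
    // reading at most chunkSize new bytes per pass.
    public static List<long> FindAll(Stream stream, byte[] needle, int chunkSize = 4096)
    {
        if (needle.Length == 0) throw new ArgumentException("needle must not be empty");
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        var hits = new List<long>();
        int overlap = needle.Length - 1;
        byte[] buffer = new byte[chunkSize + overlap];
        int carried = 0;      // bytes kept from the end of the previous pass
        long bufferStart = 0; // file offset of buffer[0]
        int read;

        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int valid = carried + read;

            // Check every position where a full match fits in the valid bytes.
            for (int i = 0; i + needle.Length <= valid; i++)
            {
                int j = 0;
                while (j < needle.Length && buffer[i + j] == needle[j]) j++;
                if (j == needle.Length) hits.Add(bufferStart + i);
            }

            // Carry the tail over so a match straddling the boundary is seen next pass.
            int keep = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
        return hits;
    }

    // Convenience overload; assumes the file and the search text are UTF-8.
    public static List<long> FindAll(string path, string text)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return FindAll(fs, Encoding.UTF8.GetBytes(text));
        }
    }
}
```

Calling ChunkedSearch.FindAll(path, "<p>") and then ChunkedSearch.FindAll(path, "</p>") would give the two index lists. Because fewer than needle.Length bytes are carried over, no match is reported twice.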


梦巷 2024-12-16 07:39:30

An example using file streams:

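// requires: using System; using System.Collections.Generic; using System.IO;

/// <summary>
/// Get a collection of index,string for everything inside p tags in the html file
/// </summary>
/// <param name="htmlFilename">filename of the html file</param>
/// <returns>collection of index,string</returns>
private Dictionary<long, string> GetHtmlIndexes(string htmlFilename)
{
    //init result
    Dictionary<long, string> result = new Dictionary<long, string>();

    // 'using' closes the reader even if an exception is thrown;
    // I/O errors are left to propagate to the caller instead of being swallowed
    using (StreamReader sr = new StreamReader(htmlFilename))
    {
        long lineStart = 0; // offset of the first character of the current line
        string line;
        while ((line = sr.ReadLine()) != null) //assuming html isn't condensed into 1 single line
        {
            int openingIndex = line.IndexOf("<p", StringComparison.OrdinalIgnoreCase);
            while (openingIndex > -1) // handle every <p on the line, not just the first
            {
                int afterName = openingIndex + 2;
                // accept <p> or <p attr...>, but not <pre>, <param>, ...
                bool isPTag = afterName < line.Length &&
                              (line[afterName] == '>' || char.IsWhiteSpace(line[afterName]));
                // look for the end of the opening tag after the tag start, not from the line start
                int tagEnd = isPTag ? line.IndexOf('>', afterName) : -1;
                if (tagEnd > -1)
                {
                    int contentIndex = tagEnd + 1;
                    int closingIndex = line.IndexOf("</p>", contentIndex, StringComparison.OrdinalIgnoreCase);
                    //if there is no </p> here, the tag finishes on a later line and we only get content from this line
                    string pTagContent = closingIndex > -1
                        ? line.Substring(contentIndex, closingIndex - contentIndex)
                        : line.Substring(contentIndex);

                    result.Add(lineStart + contentIndex, pTagContent);
                }
                //else, the opening tag itself spans lines and is skipped

                openingIndex = line.IndexOf("<p", afterName, StringComparison.OrdinalIgnoreCase);
            }

            //assuming 'index' you require is the character offset in the file
            //and that each line ends with a single '\n' (use + 2 for "\r\n" files)
            lineStart += line.Length + 1;
        } //end file loop
    }

    return result;
}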

This has limitations, which you can see from the comments.
I suspect using LINQ would be a lot neater. I hope this gives you a starting point.

