How do I retrieve all indexes of a string in a large file?
Imagine a very large HTML file that, of course, contains a lot of HTML tags. I cannot load the entire file into memory.
My intention is to extract the indexes of every occurrence of the <p> and </p> strings. How should I achieve this? Please suggest some directions.
You should try Html Agility Pack.
If your HTML is pure XHTML, you can treat it as an XML document. Load the XHTML into a System.Xml.XmlDocument and then use the GetElementsByTagName("p") method to return a list of <p> tags. This is much safer and easier than trying to parse the HTML directly.
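Since XmlDocument builds the whole tree in memory, a streaming variant of the same idea may fit the question better. The sketch below swaps in System.Xml.XmlReader, which is not what the answer above names, and reports line and column positions through IXmlLineInfo rather than absolute offsets. The class and method names (XhtmlScanner, ParagraphTags) are made up for illustration, and it assumes the input is well-formed XHTML without undeclared entities such as &nbsp;.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XhtmlScanner
{
    // Yields (line, column, isClosingTag) for every <p> start tag and </p> end tag.
    // Hypothetical helper: the names are illustrative, not from the original answer.
    public static IEnumerable<Tuple<int, int, bool>> ParagraphTags(TextReader input)
    {
        // Ignore a DOCTYPE instead of throwing on it (the default is Prohibit).
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            // Readers created over text normally expose line information.
            var info = reader as IXmlLineInfo;
            while (reader.Read())
            {
                if (reader.LocalName != "p") continue;
                bool isStart = reader.NodeType == XmlNodeType.Element;
                bool isEnd = reader.NodeType == XmlNodeType.EndElement;
                if (!isStart && !isEnd) continue;

                int line = 0, column = 0;
                if (info != null && info.HasLineInfo())
                {
                    line = info.LineNumber;
                    column = info.LinePosition;
                }
                yield return Tuple.Create(line, column, isEnd);
            }
        }
    }
}
```

Passing a StreamReader over the file keeps memory use flat, because the reader only holds the current node. An empty element written as <p/> produces a start entry with no matching end entry.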
I would start by creating an HTML tokeniser, which would be straightforward using IEnumerable, yield return, and so on. It could read the file char by char using StreamReader.Read, and a state machine switch would track the current state and yield a sequence of tokens or Tuples.
I found an old HTML tokenizer here (part of Chris Anderson's old BlogX blog engine) that could be adapted to become the basis of a streamable solution to the problem.
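A minimal sketch of such a tokeniser might look like the following. It is not the BlogX code, only an illustration of the state-machine idea, and the names (TagTokenizer, Tags, the State enum) are assumptions. It reports character offsets, which differ from byte offsets in multi-byte encodings, and it does not handle comments, script bodies, or a > inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TagTokenizer
{
    private enum State { Text, TagName, TagRest }

    // Yields (character offset of '<', lower-cased tag name, isClosingTag) per tag.
    // Hypothetical helper for illustration; tags without a name (e.g. <!...>) are skipped.
    public static IEnumerable<Tuple<long, string, bool>> Tags(TextReader reader)
    {
        State state = State.Text;
        var name = new StringBuilder();
        long position = -1;   // offset of the character just read
        long tagStart = 0;    // offset of the '<' that opened the current tag
        bool closing = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            position++;
            char ch = (char)c;
            switch (state)
            {
                case State.Text:
                    if (ch == '<')
                    {
                        state = State.TagName;
                        tagStart = position;
                        name.Clear();
                        closing = false;
                    }
                    break;

                case State.TagName:
                    if (ch == '/' && name.Length == 0 && !closing)
                    {
                        closing = true;
                    }
                    else if (char.IsLetterOrDigit(ch))
                    {
                        name.Append(char.ToLowerInvariant(ch));
                    }
                    else if (ch == '>')
                    {
                        if (name.Length > 0)
                            yield return Tuple.Create(tagStart, name.ToString(), closing);
                        state = State.Text;
                    }
                    else
                    {
                        state = State.TagRest; // attributes or other content follow
                    }
                    break;

                case State.TagRest:
                    if (ch == '>')
                    {
                        if (name.Length > 0)
                            yield return Tuple.Create(tagStart, name.ToString(), closing);
                        state = State.Text;
                    }
                    break;
            }
        }
    }
}
```

To use it, wrap the file in a StreamReader, enumerate TagTokenizer.Tags(reader), and keep the entries whose Item2 is "p". Because the sequence is lazy, only one character and one tag name are held in memory at a time.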
Using file streams, you should be able to load the file in chunks of several KB. Keep an index of your current file position as you load each chunk. Scan the chunk for the string you are looking for, and add its offset within the chunk to your index. Keep a list of all the indexes you find.
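One way this could be sketched is shown below. A match can straddle two chunks, so the sketch carries the last few bytes of each chunk over to the next one. The names (ChunkScanner, FindAll, chunkSize) are hypothetical, the offsets are byte offsets, and it assumes an ASCII-compatible encoding such as UTF-8. It matches the literal string only, so <p class="x"> is not found when searching for <p>.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkScanner
{
    // Yields the byte offset of every occurrence of `needle` in `stream`.
    // Hypothetical helper for illustration; assumes an ASCII-only needle.
    public static IEnumerable<long> FindAll(Stream stream, string needle, int chunkSize = 8192)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(needle);
        if (pattern.Length == 0) yield break;

        int overlap = pattern.Length - 1;            // bytes carried between chunks
        byte[] buffer = new byte[chunkSize + overlap];
        int carried = 0;                             // bytes kept from the previous chunk
        long bufferStart = 0;                        // file offset of buffer[0]
        int read;
        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int valid = carried + read;
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) yield return bufferStart + i;
            }

            // Keep the tail so a match split across two chunks is still seen.
            // The tail is shorter than the pattern, so no match is reported twice.
            int keep = Math.Min(overlap, valid);
            Buffer.BlockCopy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
    }
}
```

With a FileStream from File.OpenRead, calling ChunkScanner.FindAll(stream, "<p>") and then again (after rewinding) with "</p>" gives both index lists while holding only one chunk in memory.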
An example using file streams:
This has limitations which you can see from the comments.
I suspect using LINQ will be a lot neater. I hope this gives you a starting point?