Matching strings in a large text file?
I have a list of about 7 million strings in a 152MB text file. What would be the best way to implement a function that takes a single string and returns whether it is in that list?
Comments (2)
Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.

152MB of ASCII will end up as over 300MB of Unicode data in memory, since .NET strings are stored as UTF-16, but modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.

The absolute simplest way to do this is probably to use File.ReadAllLines to populate the set, although that will create an array which is then discarded; not great for memory usage, but probably not too bad.
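As a rough sketch of the two approaches described above (a one-off scan versus a HashSet<string> built once for repeated lookups), something like the following should work. The class name StringLookup and the method names ContainsLine and LoadSet are made up for illustration, and the code assumes one string per line with exact, case-sensitive matching.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class; the names are illustrative, not from the thread.
public static class StringLookup
{
    // One-off check: stream the file and stop at the first matching line.
    public static bool ContainsLine(string path, string candidate)
    {
        // File.ReadLines is lazy, so the whole file is never held in memory.
        return File.ReadLines(path).Any(line => line == candidate);
    }

    // Repeated checks: load every line into a set once, then query the set.
    public static HashSet<string> LoadSet(string path)
    {
        // File.ReadAllLines builds a temporary string[] before the set is filled.
        // Passing File.ReadLines(path) instead would avoid that intermediate array.
        return new HashSet<string>(File.ReadAllLines(path), StringComparer.Ordinal);
    }
}
```

With a set returned by LoadSet, each later lookup is a single Contains call, so the cost of reading the file is paid only once.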

Depends what you want to do. If you want to search for matches again and again, I'd load the whole file into memory (into a HashSet). Searching for matches there is very easy.