I am trying to write a small program that searches a key-value type structure. My goal is to find the FASTEST possible approach for these key-value lookups.
I would prefer to use C# for this program, unless another language gives me a significant advantage. Another limitation I am imposing is that everything has to be on the same computer. I don't want to use an Oracle or SQL Server database, because I believe the other options will be much faster. The data is mostly read and rarely written. Whenever there are changes or updates to the data, a new set is created, and it is OK if writing the data takes time.
Assumptions:
The data is sorted in a numeric order.
The structure is as simple as this:
Char3 file: (This file will only store 3 character keys)
Key|Value
100|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
999|2,5,6,7:9:3,4:3:2,5
Char5 file: (This file will only store 5 character keys)
Key|Value
A1000|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
Char3 and Char5 follow the same storage structure but have different types of keys. Within a given file, however, all keys are the same length.
I have multiple files like these, and each follows the same structure. The only variation is the key length in each file.
The task: given a set of 1-200 keys (of variable lengths), find all the data related to each key.
I am generating this data from a database and thus can create the data in any format.
For the FileStream test I am going to pad each line for a given file and then use FileStream.Seek to quickly jump to a given location based on the padding.
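A minimal sketch of that padded-record idea (the 64-byte record length, ASCII encoding, and file layout here are illustrative assumptions, not the actual format): because the keys are sorted and every line in a given file is padded to the same byte length, a binary search over `FileStream.Seek` offsets can find a key without scanning the file.

```csharp
using System;
using System.IO;
using System.Text;

class FixedWidthLookup
{
    // Assumption: every line is padded to RecordLength bytes,
    // including the newline, so record i starts at byte i * RecordLength.
    const int RecordLength = 64;

    static string ReadRecord(FileStream fs, long index)
    {
        var buffer = new byte[RecordLength];
        fs.Seek(index * RecordLength, SeekOrigin.Begin);
        fs.Read(buffer, 0, RecordLength); // sketch: assumes a full record is read
        return Encoding.ASCII.GetString(buffer).TrimEnd(' ', '\r', '\n');
    }

    // Binary search over the sorted, fixed-width records.
    static string FindValue(FileStream fs, string key)
    {
        long lo = 0, hi = fs.Length / RecordLength - 1;
        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            string line = ReadRecord(fs, mid);
            string recordKey = line.Substring(0, line.IndexOf('|'));
            int cmp = string.CompareOrdinal(recordKey, key);
            if (cmp == 0) return line.Substring(line.IndexOf('|') + 1);
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null; // key not present
    }
}
```

If you instead keep a line-number index on the side, the binary search collapses to a single `Seek` per key, which is presumably what the padding is meant to enable.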
What I want to do is find out which of these approaches would be the fastest:
- FileStream - I will eventually also look at memory-mapped files. (Open to other options)
- Embedded SQL - SQLite (Open to other options)
- NoSql - ?? (Looking for suggestions)
My question is what I should be using in each of these categories for a proper comparison. For example, if I was using FileStream without FileStream.Seek, then it would not be a proper comparison.
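For the embedded-SQL category, the analogous "proper" approach would be a parameterized lookup against a primary-key (indexed) column, so SQLite does a B-tree search rather than a table scan. A sketch, assuming the Microsoft.Data.Sqlite NuGet package and a hypothetical table `Char3(Key TEXT PRIMARY KEY, Value TEXT)` created during the data-generation step:

```csharp
using Microsoft.Data.Sqlite;

class SqliteLookup
{
    // Sketch only: one lookup per key against an indexed column.
    // Reuse the open connection and command across keys for fair timing.
    static string FindValue(SqliteConnection conn, string key)
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT Value FROM Char3 WHERE Key = @key";
            cmd.Parameters.AddWithValue("@key", key);
            return (string)cmd.ExecuteScalar(); // null if the key is not found
        }
    }
}
```

Without the PRIMARY KEY (or an explicit index), SQLite would scan the table and the comparison against FileStream.Seek would be as skewed as FileStream without Seek.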
I would eventually also like to run the searches in parallel as much as I can. My primary requirement is SEARCH performance.
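For the parallel part, a batch of 1-200 keys can be fanned out with `Parallel.ForEach`. In this sketch, `lookup` is a hypothetical stand-in for whichever single-key search wins the benchmark; note that each worker should open its own `FileStream` or SQLite connection, since those objects are not safe to share across threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchSearch
{
    // Look up a batch of keys in parallel; results land in a
    // thread-safe dictionary keyed by the search key.
    public static IDictionary<string, string> FindAll(
        IEnumerable<string> keys, Func<string, string> lookup)
    {
        var results = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(keys, key => results[key] = lookup(key));
        return results;
    }
}
```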
Any ideas or suggestions would be great.
Thanks,
UPDATE: I will list the option details and results as I process them
Find 5000 random entries (by line number or some other similar characteristic) in a file that contains 10K lines, 2.28 MB.
- FileStream option - Best time: 00:00:00.0398530
Your best bet is Berkeley DB, via the C# API (which uses key-value pair storage). Berkeley DB is a library, so it links into your application. There is no separate server to install and no client/server overhead. Berkeley DB is extremely fast, scalable, and reliable, and is designed to do exactly what you describe here.
Disclaimer: I'm the Product Manager for Berkeley DB, so I'm a little biased. But I'm serious when I say that this is exactly the scenario that Berkeley DB is designed for.
As far as I understand, your data is already in a database, indexed and ready to be searched. What you want to do is to extract it from the database and implement your custom search scheme, where you manually manipulate byte offsets in a file etc. IMHO this approach is bound to fail.
Not using a database because of one's beliefs is known to not be the best approach to performance tuning. :-)