快速过滤的数据结构(Delphi)?
我正在优化 Delphi 应用程序的一部分,其中经常使用不同的条件过滤对象列表。对象保存在 TObjectList
结构中,并且通常使用每个过滤器选择整个集合中很小的百分比(例如 1%)。对象总数可以在 100k 范围内,并且在计算过程中主集不会改变。尽管过滤器仅应用于少数属性,但列表无法以优化所有可能条件的方式进行排序。
我正在寻找有关如何组织对象(数据结构)或可用于解决此问题的算法的建议。谢谢你!
过滤器示例:
((Object.A between 5 and 15) AND
(Object.B < 20) AND
(Object.C(AParam) > 0)) OR
(Object.IsRoot(...))
I am optimizing a part of a Delphi application where lists of objects are frequently filtered using different criteria. The objects are kept in TObjectList
structures and it is common to select a very small percentage (ex. 1%) of the entire set with each filter. The total number of objects can be in the 100k range and during computations the main set doesn't change. Although the filters are applied for only a few properties, the lists cannot be sorted in such a way that would optimize all the possible criteria.
I am looking for suggestions on how to organize the objects (data structures) or an algorithm that can be used to approach this problem. Thank you!
Filter example:
((Object.A between 5 and 15) AND
(Object.B < 20) AND
(Object.C(AParam) > 0)) OR
(Object.IsRoot(...))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
您可以使用字段来使用排序列表,这最多会限制您的搜索/过滤,然后执行二分搜索以获取此条件的索引。
参考上面的示例:生成一个
A
排序列表,搜索5
和15
的索引,因此您将得到(很多)较小的列表(两者之间的所有索引),您必须在其中检查其他字段(B
、C
和IsRoot
)。Delphi (2009+) 排序列表的实现: DeHL.Collections.SortedList
You can use a sorted list using the field, which constricts your search/filtering at most, and then perform a binary search to get the index/indexes of this criteria.
Referring to your example above: Generate an
A
-sorted list, search for the indexes of5
and15
and so you will get a (much) smaller list (all the indexes between the two), where you have to check the other fields (B
,C
andIsRoot
).Delphi (2009+) implementation of a sorted list: DeHL.Collections.SortedList
我知道你不使用 tiOPF,但是从这个项目的源代码中有很多东西可以学习。既然您可能会循环遍历过滤后的对象,为什么不创建一个过滤器迭代器呢?
本单元是一个好的开始,但我认为下载完整的源代码并浏览文件更容易:http://tiopf.svn.sourceforge.net/viewvc/tiopf/tiOPF2/Trunk/Core/tiFilteredObjectList.pas?revision=1469& view=markup
此解决方案可能使用大量 RTTI:只需创建一个包含过滤器(属性名称和值)的列表并循环遍历对象。它将为您提供最大的灵活性,但会牺牲一定的速度。如果你需要速度,我认为 Ulrichb 提供的解决方案会更好。
I know you don't use tiOPF, but there is a lot to learn from the source code of this project. Since you probably will be looping over the filtered objects, why not create a filter iterator?
This unit is a good start, but I think it is easier to download the complete source and browse through the files: http://tiopf.svn.sourceforge.net/viewvc/tiopf/tiOPF2/Trunk/Core/tiFilteredObjectList.pas?revision=1469&view=markup
This solution probably uses a lot of RTTI: just create a list with filters (property names and values) and loop over the objects. It will provide you with maximum flexibility but at the cost of some speed. If you need speed I think the solution provided by Ulrichb will be better.
想法#1
在分析器中运行代码。看看是否有任何缓慢的地方。
想法#2
也许您可以通过将对象按顺序存储在内存中来利用缓存效果。 (我假设您从头到尾按顺序遍历列表。)
一种方法可能是使用记录数组而不是对象列表。如果你的情况可能的话。请记住,Delphi 2006 中的记录可以有方法(但不能有虚拟方法)。
另一个想法可能是编写自己的类分配器。我从未尝试过,但这是我找到的一篇文章。也许尝试使用指针而不是使用 TObjectList 来遍历对象。
Idea #1
Run your code in a profiler. Find out if there are any slow spots.
Idea #2
Possibly you could take advantage of cache effects by storing your objects sequentially in memory. (I'm assuming you are walking your list sequentially from beginning to end.)
One way to do it might be to use an array of records instead of a list of objects. If that's possible in your case. Remember that records in Delphi 2006 can have methods (but not virtual ones).
Another idea might be to write your own class allocator. I've never tried that, but here's an article I found. Maybe try walking the objects using a pointer instead of using the TObjectList.
如果对象的数量很小,那么搜索的效率可能并不重要,但如果经常这样做,缓存结果可能会有所帮助。
如果对象数量很大,我会考虑使用内存数据库并使用SQL来进行查询。然后,数据库可以使用索引来尽可能快地查找内容,然后您将负担交给经过验证的工具。我个人使用 DBISAM 或 ElevateDB,但其他人也会使用内存数据库。通过使用真正的数据库工具,如果数据变得非常大,您可以轻松地将数据移动到磁盘。
If the quantity of objects is small, it probably doesn't matter too much how efficient the search is, though caching a result may help if done often.
If the quantity of objects is large, I'd consider using an in-memory database and using SQL to do the query. The database can then use indexes to find things as fast as possible, and you pass the burden to a proven tool. Personally I use DBISAM or ElevateDB, but others will do in-memory databases too. By using a real database tool, you can move the data to disk easily if it gets really large.
如果要过滤的属性数量很少并且可以排序,为什么不拥有多个列表,每个列表都按另一个属性排序呢?每个列表的引用花费每个对象 4 个字节,加上列表本身的少量开销。当然一切都取决于能否满足内存需求。如果您要处理很多对象,2 GB 并不算多......
If the number of attributes to filter on is small and they can be sorted, why not have multiple lists, each if which is sorted by another attribute? Each list costs you 4 bytes per object for the reference plus a small overhead for the list itself. Of course all depends on whether the memory requirements can be fullfilled. 2 GB is not that much, if your are dealing with a lot of objects...
这是一个非常非常非常难以以通用方式解决的问题。有一种软件整天都在执行此操作,它被称为SQL 查询优化器:每个现代 SQL 引擎中存在的那段代码将查看您想要的内容(查询),然后查看可用于您的数据的索引、可用索引的选择性,并且必须找出使用所有这些索引来为您提供结果集的最佳方法。只是为了证明这个问题非常困难,SQL 查询优化器有时会失败并产生明显低效的计划。
我很确定您不想实现一个成熟的查询优化器,因此这里有一些关于如何使您的查询足够快的提示:
(1)选择一些在查询中经常使用的字段并在其上设置索引他们。这些字段需要提供良好的选择性:不要对“布尔”值进行索引,当您可以同样快速(或更快)查看整个列表时,您只会浪费时间遍历复杂的二分搜索结构!
(2) 对于每个给定的查询,选择一个单个索引来预过滤数据,并一一应用所有其他过滤器(不进行优化)。
在您的示例中:在“A”和“B”字段上创建索引。 “C”似乎是一个函数,因此无法索引。 “IsRoot”似乎返回一个布尔值,不值得索引。
用于数据的数据结构完全取决于您的数据。如果性能至关重要,请实施多个并进行测试。如果不重要,只需使用您最喜欢的列表排序算法即可完成!
This is a very, very very difficult problem to solve in a generic way. There's one kind of software that does this all day long, and it's called an SQL query optimizer: that bit of code present in every single modern SQL engine will take a look at what you want (query), then take a look at the indexes available for your data, the selectivity of available indexes and it has to figure out the optimal way to use all that to give you your result set. Just to prove the problem is very difficult, the SQL query optimizer sometimes fails and produces visibly inefficient plans.
I'm pretty sure you don't want to implement an full-fledged query optimizer so here are some tips on how to make your queries fast enough:
(1) Select some fields that are frequently used in your queries and set up indexes on them. Those fields need to provide good selectivity: Don't index on a "boolean" value, you'd just loose time traversing complex binary search structures when you might just as fast (or faster) look at the whole list!
(2) For every given query select ONE single index to pre-filter the data and the apply all other filters one-by-one (without optimization).
In your example: Create indexes on the "A" and "B" fields. "C" seems to be a function so that's impossible to index. "IsRoot" seems to return an Boolean, that's not worth indexing.
The data structures to use for your data depend entirely on your data. If performance is critical implement several and do tests. If it's not critical just your favorite list sorting algorithm and be done!