A method for indexing an object database

Posted on 2024-11-23 16:52:54

I'm using an object database (ZODB) in order to store complex relationships between many objects but am running into performance issues. As a result I started to construct indexes in order to speed up object retrieval and insertion. Here is my story and I hope that you can help.

Initially, when I added an object to the database, I inserted it into a branch dedicated to that object type. To prevent multiple objects from representing the same entity, I added a method that iterated over the existing objects in the branch to find duplicates. This worked at first, but every insert had to load each existing object into memory and check its attributes, so as the database grew the time per insert kept growing and became unacceptable.

To solve that issue I started to create indexes based on the attributes of the object, so that when an object is added it is saved in the type branch as well as in an attribute value index branch. For example, if I were saving a person object with attributes firstName = 'John' and lastName = 'Smith', the object would be appended to the person object type branch and would also be appended to the lists within the attribute index branch under the keys 'John' and 'Smith'.

This saved a lot of time on duplicate checking, since a new object can be analysed and only the objects that appear in the intersection of the matching attribute index entries need to be checked.
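
The indexing scheme described above can be sketched as an inverted index whose entries are intersected to find duplicate candidates. This is a minimal illustration, not the poster's code: it uses plain dicts and sets in place of persistent ZODB structures, the class and method names are hypothetical, and it keys entries by (attribute name, value) rather than by value alone so that a first name and a last name with the same spelling do not collide.

```python
from collections import defaultdict


class AttributeIndex:
    """Inverted index: (attribute name, value) -> set of object ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, obj_id, attrs):
        # Register the object under every attribute value it carries.
        for name, value in attrs.items():
            self._postings[(name, value)].add(obj_id)

    def candidates(self, attrs):
        """Return ids of objects that match every given attribute value."""
        sets = [self._postings.get((name, value), set())
                for name, value in attrs.items()]
        if not sets:
            return set()
        # Intersect smallest-first so the working set shrinks quickly.
        sets.sort(key=len)
        return set.intersection(*sets)


index = AttributeIndex()
index.add(1, {"firstName": "John", "lastName": "Smith"})
index.add(2, {"firstName": "John", "lastName": "Doe"})
index.add(3, {"firstName": "Jane", "lastName": "Smith"})

# Only objects in the intersection need a full duplicate comparison.
duplicates = index.candidates({"firstName": "John", "lastName": "Smith"})
```

Only the ids in `duplicates` need to be loaded and compared in full, which is what keeps the duplicate check independent of the size of the type branch.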

However, I quickly ran into another issue when updating objects. The indexes need to be updated, because their entries may no longer be accurate. This requires either remembering the old values, so that the index entries can be accessed directly and the object removed, or iterating over all values of an attribute type in order to find and then remove the object. Either way, performance is quickly beginning to degrade again and I can't figure out a way to solve it.
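
The "remember the old values" option can be made cheap by storing a reverse mapping next to the index, so that an update never has to scan all values. The sketch below shows that general technique under stated assumptions: plain dicts stand in for persistent structures, there is one index per attribute, and the `FieldIndex` name and its methods are hypothetical.

```python
class FieldIndex:
    """One index per attribute, with a reverse map for cheap updates."""

    def __init__(self):
        self._forward = {}  # value -> set of object ids
        self._reverse = {}  # object id -> value currently indexed

    def unindex(self, obj_id):
        # The reverse map tells us exactly which entry to clean up.
        if obj_id not in self._reverse:
            return
        old_value = self._reverse.pop(obj_id)
        ids = self._forward[old_value]
        ids.discard(obj_id)
        if not ids:
            del self._forward[old_value]

    def index(self, obj_id, value):
        # Re-indexing is "remove the old entry, then add the new one".
        self.unindex(obj_id)
        self._forward.setdefault(value, set()).add(obj_id)
        self._reverse[obj_id] = value

    def lookup(self, value):
        return set(self._forward.get(value, ()))


last_name = FieldIndex()
last_name.index(1, "Smith")
last_name.index(2, "Smith")
last_name.index(1, "Jones")  # object 1 was edited: Smith -> Jones
```

The cost of an update is then one lookup in the reverse map plus one set removal and one set insertion, regardless of how many distinct values the attribute has.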

Have you had this kind of issue before? What did you do to solve it, or is this just something that I have to deal with when using an OODBMS?

Thanks in advance for the help.

Comments (2)

疯狂的代价 2024-11-30 16:52:55

Think about using an attribute hash (something like Java's hashCode()), then use the 32-bit hash value as the key. Python has a hash function, but I am not really familiar with it.
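
A rough sketch of the attribute-hash idea follows. One caveat: Python's built-in hash() is salted per process for strings by default, so its values are not suitable as keys in a persistent store. The sketch therefore swaps in zlib.crc32, which is stable and 32 bits wide. The helper names, the normalisation rule (trim and lowercase) and the bucket layout are all assumptions made for illustration.

```python
import zlib


def normalise(values):
    """Canonical form used both for hashing and for exact comparison."""
    return tuple(str(v).strip().lower() for v in values)


def attribute_key(values):
    """Stable 32-bit key derived from a sequence of attribute values."""
    # Join with a separator that is unlikely to occur inside a value.
    joined = "\x1f".join(normalise(values))
    return zlib.crc32(joined.encode("utf-8"))  # unsigned, fits in 32 bits


def add(buckets, obj_id, values):
    key = attribute_key(values)
    buckets.setdefault(key, []).append((normalise(values), obj_id))


def find(buckets, values):
    # Different values can share a key, so compare the real values too.
    wanted = normalise(values)
    return [obj_id
            for stored, obj_id in buckets.get(attribute_key(values), [])
            if stored == wanted]


buckets = {}  # 32-bit key -> list of (normalised values, object id)
add(buckets, 1, ("John", "Smith"))
add(buckets, 2, ("Jane", "Smith"))
matches = find(buckets, ("john", "SMITH "))
```

Because a 32-bit hash can collide, each bucket keeps the normalised values alongside the object id and `find` compares them exactly before reporting a match.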

你怎么敢 2024-11-30 16:52:54

Yes, repoze.catalog is nice, and well documented.

In short: don't make indexing part of your site structure!

  1. Look at using a container/item hierarchy to store and traverse content item objects; plan to be able to traverse content by either (a) path (graph edges look like a filesystem) or (b) by identifying singleton containers at some distinct location.

  2. Identify your content using either RFC 4122 UUIDs (uuid.UUID type) or 64-bit integers.

  3. Use a central catalog to index (e.g. repoze.catalog); the catalog should be at a known location relative to the root application object of your ZODB. Your catalog will likely index attributes of objects and return record ids (usually integers) on query. Your job is to map those integer ids (perhaps indirecting via UUIDs) to some physical traversal path in the database where you are storing content. It helps if you use zope.location and zope.container as common interfaces for traversing your object graph from the root/application downward.

  4. Use zope.lifecycleevent handlers to index content and keep things fresh.
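
The four points above can be sketched together in plain Python. This is only an illustration of the shape of the design, not the repoze.catalog or zope.lifecycleevent API: nested dicts stand in for persistent containers, and the `Catalog` class, the `on_added` / `on_modified` / `on_removed` handlers and the single last-name index are hypothetical names chosen for the example.

```python
import uuid


class Person:
    def __init__(self, first_name, last_name):
        self.uid = uuid.uuid4()  # identity that does not depend on location
        self.first_name = first_name
        self.last_name = last_name


class Catalog:
    """Index structure kept apart from the content tree."""

    def __init__(self):
        self._next_docid = 0
        self._docid_for_uid = {}    # UUID -> integer record id
        self._path_for_docid = {}   # record id -> traversal path
        self._ids_for_value = {}    # last name -> set of record ids
        self._value_for_docid = {}  # record id -> last name as indexed

    def _forget_value(self, docid):
        old = self._value_for_docid.pop(docid)
        ids = self._ids_for_value[old]
        ids.discard(docid)
        if not ids:
            del self._ids_for_value[old]

    def index(self, obj, path):
        docid = self._docid_for_uid.get(obj.uid)
        if docid is None:
            self._next_docid += 1
            docid = self._next_docid
            self._docid_for_uid[obj.uid] = docid
        else:
            self._forget_value(docid)  # drop the stale entry first
        self._path_for_docid[docid] = path
        self._value_for_docid[docid] = obj.last_name
        self._ids_for_value.setdefault(obj.last_name, set()).add(docid)

    def unindex(self, obj):
        docid = self._docid_for_uid.pop(obj.uid, None)
        if docid is not None:
            self._forget_value(docid)
            del self._path_for_docid[docid]

    def paths(self, last_name):
        ids = self._ids_for_value.get(last_name, ())
        return [self._path_for_docid[d] for d in sorted(ids)]


# Site structure: the catalog lives at a known place next to the content.
root = {"app": {"people": {}}, "catalog": Catalog()}
catalog = root["catalog"]


def traverse(path):
    """Walk from the root along a tuple of container keys."""
    node = root
    for name in path:
        node = node[name]
    return node


# Handlers to call after each edit; an event system would dispatch these.
def on_added(obj, path):
    catalog.index(obj, path)


def on_modified(obj, path):
    catalog.index(obj, path)


def on_removed(obj):
    catalog.unindex(obj)


john = Person("John", "Smith")
john_path = ("app", "people", str(john.uid))
root["app"]["people"][str(john.uid)] = john
on_added(john, john_path)

john.last_name = "Jones"
on_modified(john, john_path)

found = [traverse(p) for p in catalog.paths("Jones")]
```

The content tree never stores index data, and the catalog never stores content objects, only record ids and paths. That separation is what lets either side be rebuilt or restructured without touching the other.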

The problem -- generalized

ZODB is too flexible: it is just a persistent object graph with transactions, but this leaves room for you to sink or swim in your own data-structures and interfaces.

The solution -- generalized

Usually, just picking pre-existing idioms from the community around the ZODB will work: zope.lifecycleevent handlers, "containerish" traversal using zope.container and zope.location, and something like repoze.catalog.

More particularly

Only when you exhaust the generalized idioms and know why they won't work, try to build your own indexes using the various flavors of BTrees in ZODB. I actually do this more than I care to admit, but usually have good cause.

In all cases, keep your indexes (search, discovery) and site (traversal and storage) structure distinct.

The idioms for the problem domain

  • Master ZODB BTrees: you likely want:

    • To store content objects as subclasses of Persistent in containers that are subclasses of OOBTree providing container interfaces (see below).
    • To store BTrees for your catalog or global indexes, or use packages like repoze.catalog and zope.index to abstract that detail away. Hint: catalog solutions typically store indexes as OIBTrees that will yield integer record ids for search results; you then typically have some sort of document mapper utility that translates those record ids into something resolvable in your application, like a UUID (provided you can traverse the graph to the UUID) or a path (the way the Zope2 catalog does).
  • IMHO, don't bother working with intids and key-references and such (these are less idiomatic and more difficult if you don't need them). Just use a Catalog and DocumentMap from repoze.catalog to get results in integer to uuid or path form, and then figure out how to get your object. Note, you likely want some utility/singleton that has the job of retrieving your object given an id or uuid returned from a search.

  • Use zope.lifecycleevent or a similar package that provides synchronous event callback (handler) registrations. These handlers are what you should call whenever an atomic edit is made on your object (likely once per transaction, but not from within the transaction machinery).

  • Learn the Zope Component Architecture; not an absolute requirement, but surely helpful, even if just to understand the zope.interface interfaces of upstream packages like zope.container.

  • Understand how Zope2 (ZCatalog) does this: a catalog fronts for multiple indexes of various sorts, each of which searches for its part of a query, has its own specialized data structures, and returns a sequence of integer record ids. The catalog merges these across indexes by doing set intersections and returns the result as a lazy mapping of "brain" objects containing metadata stubs (each brain has a getObject() method to get the actual content object). Getting actual objects from a catalog search relies upon the Zope2 idiom of using paths from the root application object to identify the location of the cataloged item.
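
The catalog-plus-brains pattern described in the last bullet can be illustrated with a small stand-in. This is not ZCatalog's implementation or API: `MiniCatalog`, `add`, `search` and the `resolver` callback are hypothetical, plain dicts replace BTrees, and only `getObject()` is named after the method mentioned above.

```python
class Brain:
    """Lightweight search result: metadata now, real object on demand."""

    def __init__(self, docid, metadata, resolver):
        self.docid = docid
        self.metadata = metadata
        self._resolver = resolver

    def getObject(self):
        # Only here is the (possibly expensive) content object loaded.
        return self._resolver(self.docid)


class MiniCatalog:
    def __init__(self, resolver):
        self._indexes = {}   # index name -> value -> set of record ids
        self._metadata = {}  # record id -> metadata stub
        self._resolver = resolver

    def add(self, docid, values, metadata):
        self._metadata[docid] = metadata
        for name, value in values.items():
            by_value = self._indexes.setdefault(name, {})
            by_value.setdefault(value, set()).add(docid)

    def search(self, **query):
        """Yield brains for records matching every query term."""
        result = None
        for name, value in query.items():
            ids = self._indexes.get(name, {}).get(value, set())
            # Each index answers separately; the catalog intersects.
            result = set(ids) if result is None else result & ids
        for docid in sorted(result or ()):
            yield Brain(docid, self._metadata[docid], self._resolver)


content = {1: "<Person John Smith>", 2: "<Person John Doe>"}
loads = []  # records which objects were actually fetched


def resolver(docid):
    loads.append(docid)
    return content[docid]


catalog = MiniCatalog(resolver)
catalog.add(1, {"first_name": "John", "last_name": "Smith"}, {"title": "John Smith"})
catalog.add(2, {"first_name": "John", "last_name": "Doe"}, {"title": "John Doe"})

brains = list(catalog.search(first_name="John", last_name="Smith"))
titles = [b.metadata["title"] for b in brains]  # no content loaded yet
```

Listing results from the metadata stubs touches no content objects at all; `resolver` runs only when a caller asks a brain for `getObject()`.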
