How can I do bulk index lookups efficiently?
I have these entity kinds:
- Molecule
- Atom
- MoleculeAtom
Given a list(molecule_ids) whose length is in the hundreds, I need to get a dict of the form {molecule_id: list(atom_ids)}. Likewise, given a list(atom_ids) whose length is in the hundreds, I need to get a dict of the form {atom_id: list(molecule_ids)}.
Both of these bulk lookups need to happen really fast. Right now I'm doing something like:
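    from google.appengine.ext import db

    atom_ids_by_molecule_id = {}
    for molecule_id in molecule_ids:
        # One datastore query per molecule, so hundreds of round trips per request.
        moleculeatoms = MoleculeAtom.all().filter('molecule =', db.Key.from_path('molecule', molecule_id)).fetch(1000)
        atom_ids_by_molecule_id[molecule_id] = [
            # Read the referenced key's ID without loading the Atom entity itself.
            MoleculeAtom.atom.get_value_for_datastore(ma).id() for ma in moleculeatoms
        ]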
Like I said, len(molecule_ids) is in the hundreds. I need to do this kind of bulk index lookup on almost every single request, and I need it to be FAST, and right now it's too slow.
Ideas:
- Will using a Molecule.atoms ListProperty do what I need? Consider that I am storing additional data on the MoleculeAtom node, and remember it's equally important for me to do the lookup in the molecule->atom and atom->molecule directions.
- Caching? I tried memcaching lists of atom IDs keyed by molecule ID, but I have tons of atoms and molecules, and the cache can't fit it.
- How about denormalizing the data by creating a new entity kind whose key name is a molecule ID and whose value is a list of atom IDs? The idea is, calling db.get on 500 keys is probably faster than looping through 500 fetches with filters, right?
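To make the last idea concrete, here is a minimal sketch of the lookup shape it implies: build all key names up front, issue one batch get, and zip the results back onto the IDs. Everything here is a stand-in for illustration. FAKE_STORE, batch_get, the "mol:" key-name prefix, and lookup_atom_ids are hypothetical names, and the plain dict only imitates the behaviour of a batch get by keys (results in key order, None for missing keys). It does not call the datastore.

```python
# Hypothetical in-memory stand-in for a denormalized index kind:
# key name -> stored list of atom IDs. Not a real datastore call.
FAKE_STORE = {
    "mol:1": [10, 11],
    "mol:2": [11, 12, 13],
}


def batch_get(key_names):
    """Imitate a batch get: one call, results in key order, None for misses."""
    return [FAKE_STORE.get(name) for name in key_names]


def lookup_atom_ids(molecule_ids, get=batch_get):
    """Return {molecule_id: list(atom_ids)} using a single batch get."""
    key_names = ["mol:%d" % mid for mid in molecule_ids]
    rows = get(key_names)  # one round trip instead of one query per ID
    # Treat a missing index entity as "no atoms" rather than an error.
    return dict((mid, row or []) for mid, row in zip(molecule_ids, rows))


result = lookup_atom_ids([1, 2, 3])
```

The same shape would apply in the other direction with an atom-keyed index kind. Whether it is actually faster than the filter loop would need to be measured in the real app.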
Comments (1)
Your third approach (denormalizing the data) is, generally speaking, the right one. In particular, db.get by keys is indeed about as fast as the datastore gets.
Of course, you'll need to denormalize the other way around too (an entity whose key name is an atom ID and whose value is a list of molecule IDs), and you will need to update everything carefully when atoms or molecules are altered, added, or deleted. If you need that to be transactional (multiple such modifications potentially being in play at the same time), you need to arrange ancestor relationships, but I don't see how to do it for both molecules and atoms at the same time, so that could be a problem. If modifications are rare enough (and depending on other aspects of your application), you could perhaps serialize the modifications in queued tasks.
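As a rough illustration of keeping both denormalized directions in step, the sketch below applies every change to a molecule-keyed index and an atom-keyed index together, and drains changes one at a time from a queue so that no two modifications interleave. The two dicts, apply_change, and the pending deque are hypothetical stand-ins for the two index kinds and a task queue. This is a plain-Python sketch of the idea, not App Engine code.

```python
from collections import defaultdict, deque

# Stand-ins for the two denormalized index kinds (assumed names).
atoms_by_molecule = defaultdict(set)   # molecule ID -> atom IDs
molecules_by_atom = defaultdict(set)   # atom ID -> molecule IDs


def apply_change(op, molecule_id, atom_id):
    """Apply one link change to BOTH indexes so they never disagree."""
    if op == "add":
        atoms_by_molecule[molecule_id].add(atom_id)
        molecules_by_atom[atom_id].add(molecule_id)
    elif op == "remove":
        atoms_by_molecule[molecule_id].discard(atom_id)
        molecules_by_atom[atom_id].discard(molecule_id)
    else:
        raise ValueError("unknown op: %r" % (op,))


# Stand-in for queued tasks: changes are processed strictly one by one.
pending = deque([
    ("add", 1, 10),
    ("add", 1, 11),
    ("add", 2, 11),
    ("remove", 1, 10),
])

while pending:
    apply_change(*pending.popleft())
```

Processing changes from a single queue trades write latency for simplicity: readers may briefly see slightly stale indexes, which is only acceptable if modifications really are rare.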