Python: an in-memory object database that supports indexing?

Posted on 2024-10-19 16:58:53


Comments (11)

淡忘如思 2024-10-26 16:58:53

What about using an in-memory SQLite database via the sqlite3 standard library module, using the special value :memory: for the connection? If you don't want to write your own SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.
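As a rough illustration of the :memory: approach, here is a minimal sketch using only the standard library. The table name, column names, and sample rows are assumptions made up for this example, not anything taken from the question.

```python
import sqlite3

# ":memory:" creates a database that lives only in RAM for this connection.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name

# Hypothetical schema, chosen only for illustration.
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, favourite_color TEXT)")
# Indices speed up lookups on the indexed columns.
conn.execute("CREATE INDEX idx_people_age ON people (age)")
conn.execute("CREATE INDEX idx_people_color ON people (favourite_color)")

conn.executemany(
    "INSERT INTO people (name, age, favourite_color) VALUES (?, ?, ?)",
    [("Joe", 16, None), ("Jane", None, "red")],
)

# People aged 16 or over; rows with a NULL age never match the comparison.
aged_16_plus = [row["name"] for row in conn.execute(
    "SELECT name FROM people WHERE age >= ?", (16,))]

# People who have a favourite colour at all.
with_color = [row["name"] for row in conn.execute(
    "SELECT name FROM people WHERE favourite_color IS NOT NULL")]
```

Missing values are stored as NULL here, so "has this attribute" becomes an IS NOT NULL test.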

EDIT: I noticed you stated that the values may be Python objects, and also that you require avoiding serialization. Requiring arbitrary Python objects be stored in a database also necessitates serialization.
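To make the serialization point concrete, here is a small sketch that stores a Python object in an in-memory SQLite table by pickling it. The table layout is an assumption for illustration; the point is only that what comes back is an equal copy, not the original object.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one BLOB column holding pickled objects.
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, payload BLOB)")

joe = {"name": "Joe", "age": 16}
# The object has to be turned into bytes before SQLite can hold it.
conn.execute("INSERT INTO objects (payload) VALUES (?)", (pickle.dumps(joe),))

(payload,) = conn.execute("SELECT payload FROM objects").fetchone()
restored = pickle.loads(payload)

same_data = restored == joe    # the contents survive the round trip
same_object = restored is joe  # but the database hands back a copy
```

Mutating `restored` would leave `joe` untouched, which is exactly the identity problem that avoiding serialization is meant to prevent.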

Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices; figure out what values you're going to query on, then write a function to generate an index for each. The possible values for one key in your list of dicts will be the keys for an index; the values of the index will be a list of dictionaries. Query the index by giving the value you're looking for as the key.

import collections
import itertools

def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index


def make_data_dicts():
    ...


data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)
# Everyone who has a favorite color: just flatten the index's values
with_color_dicts = list(
        itertools.chain.from_iterable(color_index.values()))
# Everyone over 16: keep the buckets whose key (the age) exceeds 16
over_16 = list(
        itertools.chain.from_iterable(
            v for k, v in age_index.items() if k > 16)
)
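
The age query above still walks every key in the index. If range queries are common, one possible refinement is to keep the keys sorted and use the standard bisect module. This is only a sketch: `SortedIndex` and its method names are hypothetical, and it assumes the indexed values are mutually comparable and that the index is built once rather than updated incrementally.

```python
import bisect
import collections

class SortedIndex:
    """Hypothetical helper: maps one dict key to records, with range lookups."""

    def __init__(self, dicts, field):
        self._buckets = collections.defaultdict(list)
        for d in dicts:
            if field in d:
                self._buckets[d[field]].append(d)
        # Sorted once at build time; values must be mutually comparable.
        self._keys = sorted(self._buckets)

    def equal_to(self, value):
        # .get() avoids creating an empty bucket for unknown values.
        return list(self._buckets.get(value, []))

    def greater_than(self, value):
        # bisect_right finds the first position whose key is > value.
        start = bisect.bisect_right(self._keys, value)
        return [d for k in self._keys[start:] for d in self._buckets[k]]

people = [
    {"name": "Joe", "age": 16},
    {"name": "Ann", "age": 30},
    {"name": "Jane", "favorite_color": "red"},
]
age_index = SortedIndex(people, "age")
over_16 = age_index.greater_than(16)
```

The index holds references to the original dictionaries, so nothing is copied or serialized.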

策马西风 2024-10-26 16:58:53

If the in-memory database solution ends up being too much work, here is a way to do the filtering yourself that you may find useful.

The get_filter function takes arguments that define how you want to filter a dictionary, and returns a function that can be passed to the built-in filter function to filter a list of dictionaries.

import operator

def get_filter(key, op=None, comp=None, inverse=False):
    # This will invert the boolean returned by the function 'op' if 'inverse == True'
    result = lambda x: not x if inverse else x
    if op is None:
        # Without any function, just see if the key is in the dictionary
        return lambda d: result(key in d)

    if comp is None:
        # If 'comp' is None, assume the function takes one argument
        return lambda d: result(op(d[key])) if key in d else False

    # Use 'comp' as the second argument to the function provided
    return lambda d: result(op(d[key], comp)) if key in d else False

people = [{'age': 16, 'name': 'Joe'}, {'name': 'Jane', 'favourite_color': 'red'}]

# filter() returns a lazy iterator on Python 3, so wrap it in list() to print the results
print(list(filter(get_filter("age", operator.gt, 15), people)))
# [{'age': 16, 'name': 'Joe'}]
print(list(filter(get_filter("name", operator.eq, "Jane"), people)))
# [{'name': 'Jane', 'favourite_color': 'red'}]
print(list(filter(get_filter("favourite_color", inverse=True), people)))
# [{'age': 16, 'name': 'Joe'}]

This is pretty easily extensible to more complex filtering, for example to filter based on whether or not a value is matched by a regex:

import re

p = re.compile("[aeiou]{2}")  # matches two lowercase vowels in a row
print(list(filter(get_filter("name", p.search), people)))
# [{'age': 16, 'name': 'Joe'}]

月寒剑心 2024-10-26 16:58:53

The only solution I know of is a package I stumbled across a few years ago on PyPI, PyDbLite. It's okay, but there are a few issues:

  1. It still wants to serialize everything to disk, as a pickle file. But that was simple enough for me to rip out. (It's also unnecessary. If the objects inserted are serializable, so is the collection as a whole.)
  2. The basic record type is a dictionary, into which it inserts its own metadata, two ints under keys __id__ and __version__.
  3. The indexing is very simple, based only on the values in the record dictionary. If you want something more complicated, such as an index based on an attribute of an object in the record, you'll have to code it yourself. (Something I've meant to do myself, but never got around to.)
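
The attribute-based indexing mentioned in item 3 could be sketched independently of PyDbLite roughly as follows. Everything here (`build_index`, `Address`, the sample records) is hypothetical and is not part of PyDbLite's API; it only shows the idea of indexing on a key function instead of on a raw record value.

```python
import collections

def build_index(records, key_func):
    """Group records by key_func(record); records where it fails are skipped."""
    index = collections.defaultdict(list)
    for record in records:
        try:
            key = key_func(record)
        except (KeyError, AttributeError):
            continue  # record lacks the field or attribute being indexed
        index[key].append(record)
    return index

class Address:
    """Hypothetical nested object stored inside a record."""
    def __init__(self, city):
        self.city = city

records = [
    {"name": "Joe", "address": Address("Oslo")},
    {"name": "Jane", "address": Address("Lima")},
    {"name": "Sam"},  # no address at all
]

# Index on an attribute of an object held in the record, not on a record value.
by_city = build_index(records, lambda r: r["address"].city)
in_oslo = by_city.get("Oslo", [])
```

The key function must return something hashable, since its result is used as a dictionary key.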

The author does seem to be working on it occasionally. There are some new features since I last used it, including some nice syntax for complex queries.

Assuming you rip out the pickling (and I can tell you what I did), your example would be (untested code):

from PyDbLite import Base

db = Base()
db.create("name", "age", "favourite_color")

# You can insert records as either named parameters
# or in the order of the fields
db.insert(name="Joe", age=16, favourite_color=None)
db.insert("Jane", None, "red")

# These should return an object you can iterate over
# to get the matching records.  These are unindexed queries.
#
# The first might throw because of the None in the second record
over_16 = db("age") > 16
with_favourite_colors = db("favourite_color") != None

# Or you can make an index for faster queries
db.create_index("favourite_color")
with_favourite_color_red = db._favourite_color["red"]

Hopefully it will be enough to get you started.

随波逐流 2024-10-26 16:58:53

As far as "identity" goes, anything that is hashable can be compared, which lets you keep track of object identity.

Zope Object Database (ZODB):
http://www.zodb.org/

PyTables works well:
http://www.pytables.org/moin

Also Metakit for Python works well:
http://equi4.com/metakit/python.html
It supports columns and sub-columns, but not unstructured data.

Research "Stream Processing", if your data sets are extremely large this may be useful:
http://www.trinhhaianh.com/stream.py/

Any in-memory database that can be serialized (written to disk) is going to have your identity problem. I would suggest representing the data you want to store as native types (list, dict) instead of objects if at all possible.

Keep in mind that NumPy was designed to perform complex operations on in-memory data structures, and could possibly be a part of your solution if you decide to roll your own.

我的鱼塘能养鲲 2024-10-26 16:58:53

I wrote a simple module called Jsonstore that solves (2) and (3). Here's how your example would go:

from jsonstore import EntryManager
from jsonstore.operators import GreaterThan, Exists

db = EntryManager(':memory:')
db.create(name='Joe', age=16)
db.create({'name': 'Jane', 'favourite_color': 'red'})  # alternative syntax

db.search({'age': GreaterThan(16)})
db.search(favourite_color=Exists())  # again, 2 different syntaxes

余罪 2024-10-26 16:58:53

Not sure if it meets all your requirements, but TinyDB (using in-memory storage) is probably also worth a try:

>>> from tinydb import TinyDB, Query
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'John', 'age': 22})
>>> User = Query()
>>> db.search(User.name == 'John')
[{'name': 'John', 'age': 22}]

Its simplicity and powerful query engine make it a very interesting tool for some use cases. See http://tinydb.readthedocs.io/ for more details.

断肠人 2024-10-26 16:58:53

If you are willing to work around serializing, MongoDB could work for you. PyMongo provides an interface almost identical to what you describe. If you decide to serialize, the performance hit won't be as bad, since MongoDB is memory-mapped.

ぶ宁プ宁ぶ 2024-10-26 16:58:53

It should be possible to do what you are wanting to do with just isinstance(), hasattr(), getattr() and setattr().

However, things are going to get fairly complicated before you are done!

I suppose one could store all the objects in a big list, then run a query on each object, determining what it is and looking for a given attribute or value, then return the value and the object as a list of tuples. Then you could sort on your return values pretty easily. copy.deepcopy will be your best friend and your worst enemy.
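
A minimal sketch of that scan-everything idea might look like the following. The `query` function, its parameters, and the `Person` class are all made up for illustration; this is one way the description above could be read, not a tested design.

```python
def query(objects, attr, cls=object, predicate=lambda value: True):
    """Scan every object; return (value, obj) pairs for those that match."""
    results = []
    for obj in objects:
        # Work out what the object is and whether it has the attribute at all.
        if not isinstance(obj, cls) or not hasattr(obj, attr):
            continue
        value = getattr(obj, attr)
        if predicate(value):
            results.append((value, obj))
    return results

class Person:
    """Hypothetical stored type; 'age' is only set when it is known."""
    def __init__(self, name, age=None):
        self.name = name
        if age is not None:
            self.age = age

# One big list holding everything, including unrelated objects.
store = [Person("Joe", 16), Person("Ann", 30), Person("Jane"), "not a person"]

matches = query(store, "age", cls=Person, predicate=lambda age: age >= 16)
# Sort on the value only, so the objects themselves never need to be comparable.
matches.sort(key=lambda pair: pair[0], reverse=True)
names = [obj.name for _, obj in matches]
```

Every query is a full pass over the list, and the tuples hold references to the stored objects rather than copies, which is where copy.deepcopy would come in if callers must not mutate the store.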

Sounds like fun! Good luck!

雨夜星沙 2024-10-26 16:58:53

I started developing one yesterday and it isn't published yet. It indexes your objects and allows you to run fast queries. All data is kept in RAM and I'm thinking about smart load and save methods. For testing purposes it is loading and saving through cPickle.

Let me know if you are still interested.

酒解孤独 2024-10-26 16:58:53

ducks is exactly what you are describing.

  • It builds indexes on Python objects
  • It does not serialize or persist anything
  • Missing attributes are handled correctly
  • It uses C libraries so it's very fast and RAM-efficient

pip install ducks

from ducks import Dex, ANY

objects = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
]


# Build the index
dex = Dex(objects, ['name', 'age', 'favourite_color'])

# Look up by any combination of attributes
dex[{'age': {'>=': 16}}]  # Returns Joe

# Match the special value ANY to find all objects with the attribute
dex[{'favourite_color': ANY}] # Returns Jane

This example uses dicts, but ducks works on any object type.

阳光①夏 2024-10-26 16:58:53

Throwing another option into the mix: odex (I’m the author)

from odex import IndexedSet, attr, and_

class X:
    def __init__(self, a, b):
        self.a = a
        self.b = b

iset = IndexedSet(
    [
        X(a=1, b=4),
        X(a=2, b=5),
        X(a=2, b=6),
        X(a=3, b=7),
    ], 
    indexes=["a"]
)

# Filter objects with SQL-like expressions:
iset.filter("a = 2 AND b = 5") == {X(a=2, b=5)}

# Or, using the fluent interface:
iset.filter(
    and_(
        attr("a").eq(2),
        attr("b").eq(5)
    )
) == {X(a=2, b=5)}

Similar to ducks, this builds indexes on Python objects and doesn't serialize or persist anything.

Similar to SQLite, odex supports full-fledged logical expressions.
