Load a pickled list only once - Django\Python

Posted on 2024-10-16 07:07:04, 196 words, 5 views, 0 comments

I have a pickle file that contains a list of compiled regexes and other data.

It takes about 1-1.5 seconds to load.

What would be a good way to use this list in my views while having pickle read the file just once?

Edit:

Would importing it into settings.py be considered OK?

Any ideas?

Comments (3)

以为你会在 2024-10-23 07:07:04

How you can do it

Create a module called cache.py, then:

import cache
# Reuse the value stored on the cache module; otherwise load it and store it there.
cache.data = data = getattr(cache, 'data', '') or get_my_data()

This will load the data only once per server process (how many processes there are depends on your setup, your web server and whether you use WSGI or CGI). In the dev web server (./manage.py runserver), every time you modify a file the server restarts, so the cache is invalidated.

How it works

Modules in Python are imported only once per Python process. If you use import several times, it only returns a reference to the already imported module. So if you have an Apache running mod_wsgi with 4 workers, get_my_data() will be called only 4 times, as there are only 4 Python processes running. Remember that a worker can die, be reloaded, be killed, etc., but this should still keep calls to get_my_data() to a minimum.
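The once-per-process behaviour can be sketched as follows. This is only an illustration under assumptions, not the exact code from this answer: load_pickled_list, _data, load_count and the temporary file are hypothetical names, and in a real project the function would live in cache.py and be imported by the views.

```python
import os
import pickle
import re
import tempfile

_data = None      # module-level slot; lives as long as the process
load_count = 0    # counts how often the pickle file is actually read


def load_pickled_list(path):
    """Return the unpickled list, reading the file only on the first call."""
    global _data, load_count
    if _data is None:
        with open(path, "rb") as fh:
            _data = pickle.load(fh)
        load_count += 1
    return _data


# Build a small stand-in pickle file (hypothetical contents).
fd, demo_path = tempfile.mkstemp(suffix=".pkl")
with os.fdopen(fd, "wb") as fh:
    pickle.dump([re.compile(r"\d+"), "other data"], fh)

first = load_pickled_list(demo_path)
second = load_pickled_list(demo_path)   # served from memory, no file read
os.remove(demo_path)
```

Both calls return the same list object, and the file is read a single time in this process. Each additional worker process would do its own single read.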

Gotcha: if one process modifies the cached data, the others won't know about it. If your data is meant to be static, that's OK. If you need to keep it up to date, it won't work. This is true for this method and for any method that relies on a singleton, unless you can ensure you have only one Python process running (which you can, but that is not the purpose of this answer).

About the syntax:

getattr(cache, 'data', '') returns the attribute named 'data' of the object 'cache'. If it doesn't exist, it returns the last parameter, here an empty string.

In Python, or is lazy (short-circuiting): it stops evaluating its operands as soon as the result is known and returns the operand that decided it. In our case, if 'data' is an attribute of cache and holds a non-empty value, it is truthy in a boolean context, so or returns that value without running get_my_data(). However, if 'data' is not an attribute of cache, getattr returns the empty string, or treats it as false, and get_my_data() is run and its result returned.
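A small sketch of that evaluation order, assuming a stand-in object in place of the cache module and a placeholder get_my_data() that records its calls (both are hypothetical, for illustration only):

```python
import types

cache = types.SimpleNamespace()   # stand-in for the imported cache module
calls = []


def get_my_data():
    """Placeholder for the slow pickle load; records each call."""
    calls.append(1)
    return ["loaded"]


# 'data' is missing: getattr returns '', which is falsy, so get_my_data() runs.
data = getattr(cache, "data", "") or get_my_data()
cache.data = data                 # remember the result for later lookups

# 'data' now exists and is truthy: or returns it without calling get_my_data().
again = getattr(cache, "data", "") or get_my_data()
```

Note that or yields the stored value itself, not the boolean True, and that an empty (falsy) stored value would trigger another load.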

Why you probably don't want to do it anyway

  1. If, for every page of your website, you load something that takes 2 seconds to generate on each request, something is wrong. You may want to rethink your architecture.
  2. If the data is not meant to be returned as a value, but rather to drive a process after a user action, then it's probably better to run an asynchronous function, using a tool such as Celery.
  3. The re module caches compiled regexes anyway, so you probably don't need to store them compiled. The other data can probably be expressed as primitives. Store all of it as strings and other primitives in a cache backend such as memcached or redis; it's going to be much cleaner. Plus, if one Python process updates the cache, the others will be aware of it. They won't with the above snippet.
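Point 3 might look like the sketch below, which keeps only pattern strings and compiles them after loading. The payload, its keys and the sample patterns are assumptions made for illustration; JSON stands in for whatever cache backend is actually used.

```python
import json
import re

# Hypothetical payload: plain strings are easy to keep in memcached, redis or a file.
stored = json.dumps({"patterns": [r"\d{4}-\d{2}-\d{2}", r"[A-Z]{3}"], "label": "demo"})

payload = json.loads(stored)
# In CPython, re.compile() consults the re module's internal cache, so compiling
# the same pattern string again later is cheap.
compiled = [re.compile(p) for p in payload["patterns"]]

match = compiled[0].search("released on 2024-10-16")
```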

Last word about settings.py

You should not put it in the settings.py file:

  • If you hardcode it, your settings file is going to be unreadable, and annoying to keep in a source control tool.
  • You can't put it there dynamically, as the settings module is read-only in Django, unless you use some ugly hacks that can lead to unexpected problems.

少年亿悲伤 2024-10-23 07:07:04

I'd write a Python module: a singleton class with an init method that reads the pickled data into a Python object, plus whatever 'get' methods you need to get the info out.

Then in your settings.py you just call the initialisation method. Anything that needs to get info from it just imports the module and uses the get methods.
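One possible shape for such a class is sketched below. The names PickledStore, init and get are hypothetical, and an in-memory buffer stands in for the real pickle file.

```python
import io
import pickle


class PickledStore:
    """Process-wide holder for the unpickled data (hypothetical name)."""

    _instance = None

    def __new__(cls):
        # Always hand back the same instance within one process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._items = None
        return cls._instance

    def init(self, fileobj):
        """Read the pickle once; later calls are no-ops."""
        if self._items is None:
            self._items = pickle.load(fileobj)

    def get(self, index):
        """Return one item of the loaded list."""
        return self._items[index]


# Stand-in for the real pickle file.
buf = io.BytesIO(pickle.dumps(["first", "second"]))
PickledStore().init(buf)          # would be called once at startup
value = PickledStore().get(1)     # any other module can fetch data this way
```

As with any singleton, the instance is shared only within a single Python process.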

深爱成瘾 2024-10-23 07:07:04

You could load it and then use the Django caching framework to store it; that way it would only be loaded once.

http://docs.djangoproject.com/en/dev/topics/cache/
