Flags in Python
I'm working with a large matrix (250x250x30 = 1,875,000 cells), and I'd like a way to set an arbitrary number of flags for each cell in this matrix, in some manner that's easy to use and reasonably space efficient.
My original plan was a 250x250x30 list array, where each element was something like: ["FLAG1","FLAG8","FLAG12"]. I then changed it to storing just integers instead: [1,8,12]. These integers are mapped internally by getter/setter functions to the original flag strings. This only uses 250 MB with 8 flags per point, which is fine in terms of memory.
My question is: am I missing another obvious way to structure this sort of data?
Thanks all for your suggestions. I ended up rolling a few suggestions into one. Sadly I can only pick one answer and have to live with upvoting the others:
EDIT: erm, the initial code I had here (using sets as the base element of a 3d numpy array) used A LOT of memory. This new version uses around 500 MB when filled with randint(0,2**1000).
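import numpy

# Each flag is a distinct power of two, so it occupies its own bit.
FLAG1 = 2**0
FLAG2 = 2**1
FLAG3 = 2**2
FLAG4 = 2**3

(x, y, z) = (250, 250, 30)
# dtype=object stores Python ints, so the number of flags per cell is unbounded.
array = numpy.zeros((x, y, z), dtype=object)

def setFlag(location, flag):
    array[location] |= flag

def unsetFlag(location, flag):
    array[location] &= ~flag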
Comments (6)
Taking Robbie's suggestion one step further...
You can also create a helper class.
You could also implement Python's special methods like __contains__ to make it easier to work with.
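The code that originally accompanied this answer is not preserved here. A minimal sketch of what such a helper class might look like, assuming one integer bitmask per cell; the class name Flags and its set/unset methods are hypothetical, not taken from the original answer:

```python
class Flags:
    """Hypothetical helper that wraps one integer bitmask."""

    def __init__(self, value=0):
        self.value = value

    def set(self, flag):
        # Turn the bits of `flag` on.
        self.value |= flag

    def unset(self, flag):
        # Turn the bits of `flag` off.
        self.value &= ~flag

    def __contains__(self, flag):
        # `flag in flags` is True only when every bit of `flag` is set.
        return (self.value & flag) == flag


FLAG1 = 2**0
FLAG2 = 2**1
FLAG3 = 2**2

cell = Flags()
cell.set(FLAG1)
cell.set(FLAG3)
print(FLAG1 in cell)  # True
print(FLAG2 in cell)  # False
```

With __contains__ defined, membership tests read like plain Python (`FLAG1 in cell`) instead of explicit bit arithmetic at every call site.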
Your solution is fine if every single cell is going to have a flag. However, if you are working with a sparse dataset where only a small subset of your cells will have flags, what you really want is a dictionary. You would set up the dictionary so that the key is a tuple giving the location of the cell and the value is a list of flags, like you have in your solution.
For example, the 1,1,1 cell could have the flags 1, 2, and 3, and the cell 250,250,30 the flags 4, 5, and 6.
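The example dictionary from the original answer is missing from this copy. A sketch of what it could look like, using the two cells mentioned above; the helper names get_flags and set_flag are assumptions added for illustration:

```python
# Sparse storage: only cells that actually carry flags get an entry.
flags = {
    (1, 1, 1): [1, 2, 3],
    (250, 250, 30): [4, 5, 6],
}


def get_flags(location):
    """Return the flag list for a cell, or an empty list if none is stored."""
    return flags.get(location, [])


def set_flag(location, flag):
    """Add a flag to a cell, creating the entry on first use."""
    cell = flags.setdefault(location, [])
    if flag not in cell:
        cell.append(flag)


print(get_flags((1, 1, 1)))  # [1, 2, 3]
print(get_flags((2, 2, 2)))  # []
```

Memory then scales with the number of flagged cells rather than with the full 1,875,000-cell grid.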
Edit: fixed key tuples (thanks, Andre) and dictionary syntax.
I would generally use a numpy array (presumably of short ints, 2 bytes each, since you may need more than 256 distinct values) -- that would take less than 4MB for the <2 million cells.
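A quick way to check the size estimate, as a sketch assuming one unsigned 16-bit value per cell (the variable name cells is illustrative):

```python
import numpy as np

# One unsigned 16-bit integer (2 bytes) per cell.
cells = np.zeros((250, 250, 30), dtype=np.uint16)

cells[10, 20, 5] = 300  # values above 255 fit in 16 bits

# 250 * 250 * 30 cells * 2 bytes = 3,750,000 bytes, i.e. under 4 MB.
print(cells.nbytes)  # 3750000
```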
If for some reason I couldn't afford the numpy dependency (e.g. on App Engine, which doesn't support numpy), I'd use the standard library array module. It only supports 1-dimensional arrays, but it's just as space-efficient as numpy for large homogeneous arrays, and the getter/setter routines you mention can perfectly well "linearize" the 3-item tuple that is your natural index into a single integer index into the 1-D array.
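The linearization could be sketched as follows, assuming row-major ordering; the function names index, get_cell and set_cell are hypothetical:

```python
from array import array

X, Y, Z = 250, 250, 30

# 'H' is an unsigned short (at least 2 bytes per item).
cells = array("H", [0]) * (X * Y * Z)


def index(x, y, z):
    """Map a 3-D location to its position in the flat 1-D array (row-major)."""
    return (x * Y + y) * Z + z


def get_cell(location):
    return cells[index(*location)]


def set_cell(location, value):
    cells[index(*location)] = value


set_cell((249, 249, 29), 300)
print(get_cell((249, 249, 29)))  # 300
```

The last valid location, (249, 249, 29), maps to the last slot of the flat array, so every cell has exactly one position.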
In general, consider numpy (or array) any time you have large homogeneous, dense vectors or matrices of numbers. Python's built-in lists are highly wasteful of space in this use case (due to their generality, which you're not using and don't need here!), and saving memory indirectly translates to saving time too (better caching, fewer levels of indirection, etc.).
You can define some constants with distinct power-of-two values, and use them with bitwise logic to store all the flags in a single integer, e.g. by combining them with the | operator.
To check if a flag is enabled, you can use the & operator. If the flag is enabled, the expression returns a non-zero value, which is evaluated as True in any boolean context. If the flag is disabled, the expression returns 0, which is evaluated as False.
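The code snippets from the original answer did not survive in this copy. A minimal sketch of the idea, with the constant names borrowed from the question:

```python
# Each constant is a distinct power of two, so each owns one bit.
FLAG1 = 2**0  # 0b0001
FLAG2 = 2**1  # 0b0010
FLAG3 = 2**2  # 0b0100
FLAG4 = 2**3  # 0b1000

# Combine flags with bitwise OR to store them in one integer.
flags = FLAG1 | FLAG3  # 0b0101 == 5

# Bitwise AND is non-zero (truthy) only if the flag's bit is set.
if flags & FLAG3:
    print("FLAG3 is enabled")
if not flags & FLAG2:
    print("FLAG2 is disabled")
```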
Consider using the Flyweight pattern to share cell properties:
http://en.wikipedia.org/wiki/Flyweight_pattern
BitSet is what you want, since it allows you to store many flags at once using only a fixed-size integer (Int type).
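Python has no built-in BitSet type, so the following is only a sketch of the same idea using a numpy array of fixed-size integers, assuming at most 32 distinct flags; the names grid and has_flag3 are illustrative:

```python
import numpy as np

FLAG1, FLAG2, FLAG3 = 1 << 0, 1 << 1, 1 << 2

# uint32 gives each cell room for 32 independent flags (4 bytes per cell).
grid = np.zeros((250, 250, 30), dtype=np.uint32)

grid[0, 0, 0] |= FLAG1 | FLAG3
grid[1, 2, 3] |= FLAG3

# Invert as uint32 so the mask stays in the unsigned 32-bit range.
grid[0, 0, 0] &= ~np.uint32(FLAG1)

# Vectorized query: which cells have FLAG3 set?
has_flag3 = (grid & FLAG3) != 0
print(int(has_flag3.sum()))  # 2
```

A fixed-size dtype keeps the whole grid at 7.5 MB, at the cost of capping the number of distinct flags at the integer's bit width.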