Changing the data type (dtype) of a tabular.tabarray or numpy.recarray
I want to represent data as a spreadsheet would in Python. Thinking "well, someone's certainly written such a module!" I went to PyPI, where I found Tabular, which wraps NumPy's recarrays with powerful data manipulation functions. Great! Sadly, it doesn't seem to act like a spreadsheet at all when it comes to strings.
>>> import tabular as tb
>>> t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)], names=['a','b','c'])
>>> t
tabarray([('bork', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>> t['a'][0] = 'gorkalork, but not mork'
>>> t
tabarray([('gorka', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
Um...tabarray! You truncated my string there! Really?! The NumPy dtype '|S5' means a string of 5 or fewer characters, but come on! Update the dtype. Reformat the entire column, if need be. Whatever. But don't silently throw away my data!
I tried several other approaches, none of which do the trick. E.g., it intuits the data type/size on tabarray creation, but not when adding records:
>>> t.addrecords(('mushapushalussh', 3, 4.44))
tabarray([('gorka', 1, 3.5), ('stork', 2, -4.0), ('musha', 3, 4.44)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
I tried slicing out the entire column, changing its type, setting the value, and reassigning it:
>>> firstcol = t['a']
>>> firstcol_long = firstcol.astype('|S15')
>>> firstcol_long
tabarray(['gorka', 'stork'],
dtype='|S15')
>>> firstcol_long[0] = 'morkapork'
>>> firstcol_long
tabarray(['morkapork', 'stork'],
dtype='|S15')
>>> t['a'] = firstcol_long
>>> t
tabarray([('morka', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>>
It does the value assignment correctly, but the original datatype is still in force, and my previously-correct data is again silently truncated. I even tried an explicit data type setting:
>>> t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)], dtype=[('a', str),('b', int),('c', float)])
>>> t
tabarray([('', 1, 3.5), ('', 2, -4.0)],
dtype=[('a', '|S0'), ('b', '<i8'), ('c', '<f8')])
Good Lord! That's worse! It correctly mapped the int and float types, but it guessed that str meant I wanted 0-length strings, and truncated all of the data to nothing. Long story short, not only does tabular not act like a spreadsheet out of the box, but I can't find a way to make it work. Performance is not a huge issue for me. My spreadsheets might have hundreds or thousands of rows, max, and I'd gladly have the system do a bit of data copying to make my code easy. Tabular seems in many other respects to fit the bill very nicely.
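(For what it's worth, spelling the width out avoids the 0-length guess, assuming tabarray passes NumPy-style dtype strings straight through, which the constructor calls above suggest. But that only postpones the problem, since a string can still outgrow whatever width you pick:)

>>> t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)],
...                 dtype=[('a', '|S64'), ('b', int), ('c', float)])
>>> t['a'][0] = 'gorkalork, but not mork'   # fits in 64 bytes... until it doesn't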
I guess I could subclass tabular with something that defaults all strings to something improbably large (1024 or 4096 bytes, say), with a __setitem__ method that raises an exception should a larger string be assigned. Rather sloppy...but are there better alternatives? I rooted around numpy.recarray and such, a bit, and didn't see a clear way...but I'll be the first to admit that I'm completely inexpert at NumPy. The reality is that data manipulation programs may increase the length of strings beyond their initial max. Surely high-function modules should accommodate that. The "just truncate it!" approach common in record-oriented databases of 1974 cannot be the right state-of-the-art for Python in 2011!
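(A minimal sketch of that guard idea, on a plain NumPy structured array rather than a tabular subclass; safe_set is a name I just made up, and it leans on the fact that an '|Sn' field's itemsize is its maximum byte length:)

import numpy as np

def safe_set(arr, field, idx, value):
    # Refuse the assignment instead of letting NumPy silently truncate.
    maxlen = arr.dtype[field].itemsize      # byte width of an '|Sn' field
    if len(value) > maxlen:
        raise ValueError('%r is longer than the %d-byte field %r'
                         % (value, maxlen, field))
    arr[field][idx] = value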
Thoughts and suggestions?
2 Answers
As one of the designers of tabular ... I have to say that I largely think the first answerer sort of hits the nail on the head.
OP, the "truncation" behavior that you deplore is a fundamental issue with NumPy, on which Tabular is based. But it's not really accurate to say that it's a "bug" that should fixed, it's more a "limitation" that echoes / reinforces the whole point of NumPy (and Tabular) to begin with.
As the first answerer noted, NumPy has an absolute requirement for data structures to be uniform in their size. Once you allocate a numpy array of a given datatype, the array must remain that datatype -- or otherwise, a new array with new memory must be initialized. With string datatypes, the length of the string is an integral fixed part of the datatype -- you can't just "convert" an array of length-N strings to an array of length-M strings.
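(To make that concrete in plain NumPy: astype on a structured array with matching field names allocates a fresh array and copies field by field, which is one sanctioned way to "widen" a string column:)

>>> import numpy as np
>>> t = np.array([('bork', 1, 3.5), ('stork', 2, -4.0)],
...              dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>> t2 = t.astype([('a', '|S15'), ('b', '<i8'), ('c', '<f8')])  # new memory
>>> t2['a'][0] = 'gorkalork'   # fits; the original t is untouched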
Fixed datatypes are critical for the way NumPy achieves huge performance gains over standard Python objects. This is because, with fixed datatypes, NumPy objects know how many bytes have been allocated to each object, and can just "jump" in memory space out to where a given entry "should" be, without having to read and process the contents of all the intervening entries, unlike Python lists. Of course, this limits the kinds of objects that can naturally BE numpy arrays ... or really, it limits the kinds of operations that can be done to a numpy array. Unlike a Python list which is completely mutable (e.g. you can replace any element with any other Python object, without disturbing the memory allocation of all the other objects in the list), you can't mutate a numpy array's value to an object of a different datatype -- because how would byte accounting work then? If suddenly the Nth item gets larger than all the other items in the array, what happens to the data/locations of all the remaining items?
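(The byte accounting is easy to see: every record occupies the same number of bytes, so record i lives at offset i * itemsize from the start of the buffer, and NumPy can jump straight there:)

>>> import numpy as np
>>> dt = np.dtype([('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>> dt.itemsize                     # 5 + 8 + 8 bytes per record
21
>>> np.zeros(4, dtype=dt).strides   # step, in bytes, between records
(21,)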
You may not like NumPy's default behavior for what happens when you TRY to make an "illegal" assignment that breaks the datatype -- perhaps you want an error to be issued instead of silent truncation? If so, you should post on the NumPy list about this, since I think it's a more fundamental issue than Tabular can handle -- and regardless of our personal feelings about error handling, we'd want to be consistent with whatever NumPy does here.
You may also not like how Tabular does datatype inference. In fact, NumPy stays away from dtype inference and basically always requires the user to explicitly specify datatypes. This is good in the sense that it demands the user think about these issues, but it's annoying in that it is quite cumbersome at times. Tabular tries to hit the happy medium that is useful most of the time, but sometimes this will fail -- in which case, the defaults can be overridden by just specifying the same keyword arguments as NumPy constructors.
I do think that you're not quite right when you say that the "approach common in record-oriented databases of 1974 cannot be the right state-of-the-art for Python in 2011". In fact, the foundations of NumPy memory management are indeed the exact same tools as used in the 1970's -- it may be surprising, but big pieces of optimized NumPy are still built on Fortran! The memory allocation issues of those days are not really avoidable even today, though NumPy does provide a much cleaner and simpler interface most of the time. But it must be said that if you would "gladly have the system do a bit of data copying to make my code easy" -- then probably NumPy and Tabular are not for you, since silent data copying, and everything it represents, is explicitly counter to the design intent of these packages.
So the question becomes: what is your objective? If you really need performance with array-like operations, then use NumPy -- in which case, Tabular provides spreadsheet-like operations -- but live within NumPy's limitations. If you don't need performance, there's no point in having array-like objects to begin with, and you can be more flexible. However, Tabular's spreadsheet-like operations don't extend to general Python objects -- and it's not even exactly clear how to make that extension.
And, let me add one more (quite important) thing -- OP, if performance is not your main issue, but you still want to use Tabular as a source of spreadsheet operations, you could just do all the operations that you want that might change datatypes with new calls to the Tabular array constructor. That is, if in a given operation you might need to make an assignment to a new larger string datatype, just construct a new tabarray every time. This is obviously not as good for performance, but if that's not your limitation, it should be no problem.
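(A sketch of that pattern; widen_str_field is a made-up name, and I'm assuming the records/dtype keywords behave as in the constructor calls from the question:)

import tabular as tb

def widen_str_field(t, field, nbytes):
    # Rebuild the tabarray with `field` widened to an nbytes-long string,
    # copying every record into fresh memory.
    new_dtype = [(name, '|S%d' % nbytes if name == field else t.dtype[name].str)
                 for name in t.dtype.names]
    return tb.tabarray(records=t.tolist(), dtype=new_dtype)

t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)],
                names=['a', 'b', 'c'])
t = widen_str_field(t, 'a', 64)
t['a'][0] = 'gorkalork, but not mork'   # fits now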
The key point here is that Tabular and NumPy set certain standards for what counts as "fast" or "slow" -- and then force you to be explicit about operations that are going to be slow. They never allow you to hide very slow operations under the hood (the way, e.g., Matlab does). Something that's easy syntactically should be fast -- and if you want to do something that's going to be slow, you should have to work a bit harder in your code to do it, and therefore pay attention to what is going on. As a result, your code ends up being cleaner and better, but still easier to write than if you had been working directly in C or Fortran. In fact, this principle largely applies to all of Python itself as well -- though with somewhat different standards for what counts as "fast" or "slow".
HTH,
D
I'm not trying to be rude here, but you're misunderstanding what numpy is. You don't want numpy. You want a dict.

Numpy arrays are not general purpose data containers. They're memory-efficient containers for uniform data. The whole point is to be a low-level data container that can store things with essentially no memory overhead (i.e. 100 64-bit floats uses 800 bytes of memory as a numpy array. It uses twice this as a list.)
If you want to store non-uniform data, then don't use a numpy array.
Python has a very rich set of built-in data structures. In this case, a dict or a list is what you want.
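(With the OP's data, the plain-Python version is about as simple as it gets, and nothing ever truncates:)

rows = [
    {'a': 'bork',  'b': 1, 'c': 3.5},
    {'a': 'stork', 'b': 2, 'c': -4.0},
]
rows[0]['a'] = 'gorkalork, but not mork'                  # strings grow freely
rows.append({'a': 'mushapushalussh', 'b': 3, 'c': 4.44})  # rows too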