NumPy: Reading a ragged structure with loadtxt or genfromtxt
I need to read an ASCII file into Python, where an excerpt of the file looks like this:
E M S T N...
...
9998 1 1 128 10097 10098 10199 10198 20298 20299 20400 20399
9999 1 1 128 10098 10099 10200 10199 20299 20300 20401 20400
10000 1 1 128 10099 10100 10201 10200 20300 20301 20402 20401
10001 1 2 44 2071 2172 12373 12272
10002 1 2 44 2172 2273 12474 12373
Ideally, the above would be loaded into the following NumPy structure:
array([(9998, 1, 1, 128, (10097, 10098, 10199, 10198, 20298, 20299, 20400, 20399)),
(9999, 1, 1, 128, (10098, 10099, 10200, 10199, 20299, 20300, 20401, 20400)),
(10000, 1, 1, 128, (10099, 10100, 10201, 10200, 20300, 20301, 20402, 20401)),
(10001, 1, 2, 44, (2071, 2172, 12373, 12272)),
(10002, 1, 2, 44, (2172, 2273, 12474, 12373))],
dtype=[('E', '<i4'), ('M', '<i4'), ('S', '<i4'), ('T', '<i4'), ('N', '|O4')])
where the last field, N, is a tuple of between 2 and 8 integers.
I would like to load this ragged structure using either np.loadtxt or np.genfromtxt, but I'm not sure whether this is possible. Are there any built-in tricks, or do I need to write a custom split-cast for loop?
Comments (1)
You do need a custom "split-cast" for loop, as far as I know.
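A minimal sketch of such a split-cast loop, assuming whitespace-separated integers, one header line, and four leading scalar columns. The helper name read_ragged and its skip_header parameter are made up for illustration, and the object field is written as plain `object` rather than '|O4':

```python
import io
import numpy as np

# Four integer fields plus an object field that holds a tuple of any length.
RAGGED_DTYPE = np.dtype(
    [("E", "i4"), ("M", "i4"), ("S", "i4"), ("T", "i4"), ("N", object)]
)

def read_ragged(fileobj, skip_header=1):
    """Parse integer rows whose trailing columns vary in count (hypothetical helper)."""
    rows = []
    for lineno, line in enumerate(fileobj):
        if lineno < skip_header:
            continue  # skip the "E M S T N..." header line
        values = [int(tok) for tok in line.split()]
        if not values:
            continue  # ignore blank lines
        # First four values are scalar fields; the rest become one tuple.
        rows.append(tuple(values[:4]) + (tuple(values[4:]),))
    return np.array(rows, dtype=RAGGED_DTYPE)

sample = io.StringIO(
    "E M S T N...\n"
    "10000 1 1 128 10099 10100 10201 10200 20300 20301 20402 20401\n"
    "10001 1 2 44 2071 2172 12373 12272\n"
)
elements = read_ragged(sample)
print(elements["N"][1])
```

With a real file, the same function could be called as read_ragged(open(path)), where path is whatever your file is named.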
In fact, NumPy can read nested structures like yours, but they must have a fixed shape.
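To illustrate the fixed-shape case, here is a small sketch that assumes every row carries exactly eight trailing integers, so N can be declared as a sub-array of shape (8,). The variable names are illustrative only:

```python
import io
import numpy as np

# N is a fixed-length sub-array here, so every row must have 4 + 8 columns.
fixed_dtype = np.dtype(
    [("E", "i4"), ("M", "i4"), ("S", "i4"), ("T", "i4"), ("N", "i4", (8,))]
)

fixed_rows = io.StringIO(
    "9998 1 1 128 10097 10098 10199 10198 20298 20299 20400 20399\n"
    "9999 1 1 128 10098 10099 10200 10199 20299 20300 20401 20400\n"
)

fixed = np.loadtxt(fixed_rows, dtype=fixed_dtype)
print(fixed["N"].shape)
print(fixed["N"][0])
```

This only works because both rows have the same number of columns; the 44-type rows in your file, with four trailing values, would not fit this dtype.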
When trying to read your data with the dtype that you need, NumPy only reads the first number of each tuple.
So I would say go ahead and use a for loop instead of numpy.loadtxt().
You might also use an intermediate approach that might be faster: let NumPy load the file as above, and then manually "correct" the 'N' field.
This approach might be faster than parsing the whole array in a for loop, and it produces the result you want.
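One possible shape for that intermediate approach, as a sketch rather than a definitive implementation: it assumes np.loadtxt is restricted to the first four columns with usecols (a substitution, since those columns are the only fixed-width part), and the 'N' field is then filled in with a short loop. The function name read_hybrid is hypothetical, and whether this is actually faster than the plain loop would need to be measured on your data:

```python
import io
import numpy as np

def read_hybrid(text, skiprows=1):
    """Load the four scalar columns with loadtxt, then fill 'N' by hand (sketch)."""
    # usecols limits loadtxt to the columns that every row is assumed to have.
    head = np.loadtxt(
        io.StringIO(text), dtype="i4", skiprows=skiprows,
        usecols=(0, 1, 2, 3), ndmin=2,
    )
    out = np.empty(
        len(head),
        dtype=[("E", "i4"), ("M", "i4"), ("S", "i4"), ("T", "i4"), ("N", object)],
    )
    for k, name in enumerate(("E", "M", "S", "T")):
        out[name] = head[:, k]
    # Second pass over the text: everything after column 4 becomes the tuple.
    data_lines = [ln for ln in text.splitlines()[skiprows:] if ln.strip()]
    for i, line in enumerate(data_lines):
        out["N"][i] = tuple(int(tok) for tok in line.split()[4:])
    return out

text = (
    "E M S T N...\n"
    "10000 1 1 128 10099 10100 10201 10200 20300 20301 20402 20401\n"
    "10001 1 2 44 2071 2172 12373 12272\n"
    "10002 1 2 44 2172 2273 12474 12373\n"
)
result = read_hybrid(text)
print(result["N"][2])
```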