I'm trying to read a large text file (a few million lines) into Matlab. Initially I was using importdata(file_name), which seemed like a concise solution. However, I need to use Matlab 7 (yeah, I know it's old), and it seems importdata isn't supported there. So I tried the following:
    fid = fopen(file_name);
    lno = 1;
    while ~feof(fid)
        fline = fgetl(fid);
        fdata{1,lno} = fline;   % the cell array grows by one element per line
        lno = lno + 1;
    end
    fclose(fid);
But this is really slow. I'm guessing it's because it's resizing the array on each iteration. Is there a better way of doing this? Bear in mind that the first 20 lines of the input data are strings and the remainder of the data is 3 to 6 columns of hexadecimal values.
You will have to do some reshaping, but one option is to use fread.
But, as was mentioned, this essentially locks you into a rectangular import. So another option would be to use textscan. As I mention in another note, I'm not 100% sure when it was implemented; all I know is that you don't have importdata(). With textscan, you will be able to get a cell array holding each line as a string, which you can then manipulate however you want. And as I say in my comments, it no longer matters whether the lines are the same length or not. Now you can parse the cell array more quickly. But as gnovice mentions (and he does have a very elegant solution), you may have to concern yourself with memory requirements.
The one thing you want to avoid in Matlab whenever you can is looping structures. They are fast in C/C++ etc., but in Matlab they tend to be the slowest way of getting where you are going.
EDIT: Just looked it up, and it looks like textscan was implemented in version 7 (R14), so if that's what you have, you should be good to use it.
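To make the textscan suggestion concrete, here is a minimal sketch. The file name 'junk.txt' and its contents are made up for illustration, and the header count is 2 here only to keep the sample short (the question has 20). I am assuming the 'delimiter' and 'whitespace' options behave in Matlab 7 as they do in later releases, so check that on your version.

```matlab
% Write a small made-up sample file: 2 header lines, then hex data.
fid = fopen('junk.txt', 'w');
fprintf(fid, 'Header line one\nHeader line two\n');
fprintf(fid, '1A 2B 3C\nFF 00 10 20\n');
fclose(fid);

nHeaderLines = 2;   % 20 in the question; 2 for this sample

fid = fopen('junk.txt', 'r');
if fid == -1
    error('Could not open junk.txt');
end
% '%s' with a newline delimiter yields one string per line. Clearing
% 'whitespace' stops the spaces inside a line from splitting it.
C = textscan(fid, '%s', 'delimiter', '\n', 'whitespace', '');
fclose(fid);

allLines    = C{1};                          % N-by-1 cell array of strings
headerLines = allLines(1:nHeaderLines);      % text lines
dataLines   = allLines(nHeaderLines+1:end);  % hex lines, still as strings

% Convert one data line from hex text to a numeric row vector.
firstRow = sscanf(dataLines{1}, '%x', [1 Inf]);
```

The whole file is read in one call, so there is no per-line growth of the cell array as in the fgetl loop.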
I see two options:
One solution is to read the entire contents of the file as a string of characters with FSCANF, split the string into individual cells at the points where newline characters occur using MAT2CELL, remove extra white space on the ends with STRTRIM, then process the string data in each cell as needed. For example, using this sample text file
'junk.txt':
The following code will put each line in a cell of a cell array cellData:
Now if you want to convert all of the hexadecimal data (lines 3 through 6 in my sample data file) from strings to vectors of numbers, you can use CELLFUN and SSCANF like so:
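The sample file and the two snippets did not survive in this copy, so here is a minimal sketch of both steps. The contents written to 'junk.txt' are made up (2 header lines followed by 4 hex lines, to match "lines 3 through 6"), and the helper name hexToVec is hypothetical. The function-handle form of CELLFUN with 'UniformOutput' may need a later Matlab 7 service pack, so treat that line as an assumption to check.

```matlab
% Made-up stand-in for 'junk.txt': 2 header lines, 4 lines of hex values.
fid = fopen('junk.txt', 'w');
fprintf(fid, 'Header line one\nHeader line two\n');
fprintf(fid, '1A 2B 3C\nFF 00 10 20\n0F F0 AA BB CC\n01 02 03 04 05 06\n');
fclose(fid);

nHeaderLines = 2;

% Step 1: read the whole file as one string of characters.
fid = fopen('junk.txt', 'r');
strData = fscanf(fid, '%c');
fclose(fid);
strData = strData(:).';                      % force a row vector

% Step 2: count the characters on each line (newline included).
newlineIdx = find(strData == char(10));
if isempty(newlineIdx) || newlineIdx(end) ~= numel(strData)
    newlineIdx(end+1) = numel(strData);      % last line has no newline
end
nCharPerLine = diff([0 newlineIdx]);

% Step 3: split into one cell per line and trim the ends.
cellData = strtrim(mat2cell(strData, 1, nCharPerLine));

% Step 4: convert the hex lines from strings to numeric row vectors.
hexToVec = @(s) sscanf(s, '%x', [1 Inf]);
cellData(nHeaderLines+1:end) = cellfun(hexToVec, ...
    cellData(nHeaderLines+1:end), 'UniformOutput', false);
```

After this, the first nHeaderLines cells of cellData still hold strings and the remaining cells hold row vectors of varying length.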
NOTE: Since you are dealing with such large arrays, you will have to be mindful of the amount of memory being used by your variables. The above solution is vectorized, but may take up a lot of memory. You may have to overwrite or clear large variables like strData when you create cellData. Alternatively, you could loop over the elements in nCharPerLine and individually process each segment of the larger string strData into the vectors you need, which you can preallocate now that you know how many lines of data you have (i.e. nDataLines = numel(nCharPerLine)-nHeaderLines;).
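The loop alternative from the note can be sketched as follows. To keep it self-contained, strData is built from made-up text rather than read from a file, and the names hexData, lineStart and lineEnd are hypothetical. The output cell array is allocated once before the loop, so it is not resized on each iteration.

```matlab
% Made-up stand-in for the file contents: 2 header lines, 4 hex lines.
strData = sprintf(['Header one\nHeader two\n' ...
                   '1A 2B 3C\nFF 00 10 20\n0F F0 AA BB CC\n01 02 03 04 05 06\n']);
nHeaderLines = 2;

% Characters per line (newline included), as in the vectorized version.
newlineIdx = find(strData == char(10));
if isempty(newlineIdx) || newlineIdx(end) ~= numel(strData)
    newlineIdx(end+1) = numel(strData);      % last line has no newline
end
nCharPerLine = diff([0 newlineIdx]);

% The number of data lines is known, so preallocate the output once.
nDataLines = numel(nCharPerLine) - nHeaderLines;
hexData = cell(nDataLines, 1);

lineEnd   = cumsum(nCharPerLine);            % last character of each line
lineStart = lineEnd - nCharPerLine + 1;      % first character of each line
for k = 1:nDataLines
    iLine = nHeaderLines + k;
    segment = strData(lineStart(iLine):lineEnd(iLine));
    hexData{k} = sscanf(segment, '%x', [1 Inf]);   % hex text to numbers
end
```

This never builds the intermediate cellData, so only strData and the numeric results are held in memory at the same time.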