Reading and processing a large text file in Matlab

Posted on 2024-11-06 09:26:41

I'm trying to read a large text file (a few million lines) into Matlab. Initially I was using importdata(file_name), which seemed like a concise solution. However, I need to use Matlab 7 (yeah, I know it's old), and it seems importdata isn't supported there. As such, I tried the following:

while ~feof(fid)    
    fline = fgetl(fid);
    fdata{1,lno} =  fline ;
    lno = lno + 1;
end

But this is really slow. I'm guessing it's because it's resizing the array on each iteration. Is there a better way of doing this? Bear in mind that the first 20 lines of the input data are string type data and the remainder of the data is 3 to 6 columns of hexadecimal values.

森林散布 2024-11-13 09:26:41

You will have to do some reshaping, but one option is to use fread.
As was mentioned, though, this essentially locks you into a rectangular import. So another option would be to use textscan. As I mention in another note, I'm not 100% sure when it was introduced; all I know is that you don't have importdata().

fid = fopen('textfile.txt', 'r');
Out = textscan(fid, '%s', 'delimiter', sprintf('\n'));  % Out{1} holds one string per line
fclose(fid);

With textscan you get a cell array of strings, one per line (in Out{1}), which you can then manipulate however you want. As I say in my comments, it no longer matters whether the lines are the same length or not. Now you can parse the cell array more quickly. But as gnovice mentions (and he does have a very elegant solution), you may have to concern yourself with memory requirements.

The one thing you want to avoid in Matlab whenever you can is looping structures. They are fast in C/C++ and the like, but in Matlab they are usually the slowest way of getting where you are going.

EDIT: Just looked it up, and it looks like textscan was introduced in version 7 (R14), so if that's what you have, you should be good to use it.
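
As a rough sketch of what "manipulate however you want" could look like for the asker's data, the lines read by textscan can be split into header strings and hexadecimal rows. The sample file contents and the names fname, nHeaderLines, fileLines, headerLines and hexData are made up for illustration (the real file has 20 header lines), and the 'whitespace' option is an addition not present in the snippet above, used here so that leading spaces are left untouched.

```matlab
% Build a tiny sample file (hypothetical contents) so the sketch runs on its own.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header one\nheader two\n1 2 3\nFF 00 FF\n12 A6 22 20 20 20\n');
fclose(fid);

nHeaderLines = 2;   % assumption for this sample; 20 in the asker's file

% Read every line as a string; C{1} is an N-by-1 cell array of lines.
fid = fopen(fname, 'r');
C = textscan(fid, '%s', 'delimiter', sprintf('\n'), 'whitespace', '');
fclose(fid);
delete(fname);
fileLines = C{1};

% Keep the header lines as strings.
headerLines = fileLines(1:nHeaderLines);

% Convert the remaining lines from hex text to numeric row vectors.
% The output cell array is preallocated, so the loop never resizes it.
nDataLines = numel(fileLines) - nHeaderLines;
hexData = cell(nDataLines, 1);
for k = 1:nDataLines
    hexData{k} = sscanf(fileLines{nHeaderLines + k}, '%x', [1 inf]);
end
```

A cell array is used for hexData because rows have between 3 and 6 values; if every row had the same width, a preallocated numeric matrix would be the more compact choice.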

爱她像谁 2024-11-13 09:26:41

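I see two options:

Instead of growing the array by 1 on every iteration, double its size only when it runs out of room. This greatly reduces the number of reallocations needed.
Use a two-pass approach. The first pass only counts the lines, without storing them. The second pass actually fills the array, which has been preallocated to the right size.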
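
Both options can be sketched with plain fgetl loops, which should also suit older releases. This is only an illustration under assumptions: the sample file, the starting capacity of 4, and the names fname, capacity, fdata2 and nLines are invented here; fdata and lno follow the question's code.

```matlab
% Sample file (hypothetical contents) so the sketch runs on its own.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'line %d\n', 1:10);
fclose(fid);

% Option 1: double the capacity whenever the cell array is full.
capacity = 4;                       % arbitrary starting size (assumption)
fdata = cell(1, capacity);
lno = 0;
fid = fopen(fname, 'r');
while true
    fline = fgetl(fid);
    if ~ischar(fline), break; end   % fgetl returns -1 at end of file
    lno = lno + 1;
    if lno > capacity
        capacity = 2 * capacity;
        fdata{1, capacity} = [];    % a single assignment grows the array
    end
    fdata{1, lno} = fline;
end
fclose(fid);
fdata = fdata(1:lno);               % trim the unused cells

% Option 2: count the lines first, then fill a preallocated array.
fid = fopen(fname, 'r');
nLines = 0;
while ischar(fgetl(fid))
    nLines = nLines + 1;
end
frewind(fid);                       % back to the start for the second pass
fdata2 = cell(1, nLines);
for k = 1:nLines
    fdata2{1, k} = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Doubling reads the file once at the cost of some temporarily unused cells; the two-pass version reads it twice but allocates exactly once.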

情仇皆在手 2024-11-13 09:26:41

One solution is to read the entire contents of the file as a string of characters with FSCANF, split the string into individual cells at the points where newline characters occur using MAT2CELL, remove extra white space on the ends with STRTRIM, then process the string data in each cell as needed. For example, using this sample text file 'junk.txt':

hi
hello
1 2 3
FF 00 FF
12 A6 22 20 20 20
FF FF FF

The following code will put each line in a cell of a cell array cellData:

>> fid = fopen('junk.txt','r');
>> strData = fscanf(fid,'%c');
>> fclose(fid);
>> nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);
>> cellData = strtrim(mat2cell(strData,1,nCharPerLine))

cellData = 

    'hi'    'hello'    '1 2 3'    'FF 00 FF'    '12 A6 22 20 20 20'    'FF FF FF'

Now if you want to convert all of the hexadecimal data (lines 3 through 6 in my sample data file) from strings to vectors of numbers, you can use CELLFUN and SSCANF like so:

>> cellData(3:end) = cellfun(@(s) {sscanf(s,'%x',[1 inf])},cellData(3:end));
>> cellData{3:end}    %# Display contents

ans =

     1     2     3

ans =

   255     0   255

ans =

    18   166    34    32    32    32

ans =

   255   255   255

NOTE: Since you are dealing with such large arrays, you will have to be mindful of the amount of memory being used by your variables. The above solution is vectorized, but may take up a lot of memory. You may have to overwrite or clear large variables like strData when you create cellData. Alternatively, you could loop over the elements in nCharPerLine and individually process each segment of the larger string strData into the vectors you need, which you can preallocate now that you know how many lines of data you have (i.e. nDataLines = numel(nCharPerLine)-nHeaderLines;).
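
The loop-based alternative from the note might look roughly like the following. It is a sketch under assumptions rather than a tested solution for multi-million-line files: the sample file mirrors 'junk.txt' above, nHeaderLines is set to 2 for that sample, and fname, lineStart, segment and hexData are names introduced here for illustration.

```matlab
% Recreate the sample file from above (no trailing newline) in a temp location.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'hi\nhello\n1 2 3\nFF 00 FF\n12 A6 22 20 20 20\nFF FF FF');
fclose(fid);

nHeaderLines = 2;   % assumption for this sample; 20 in the asker's file

% Read the whole file as one character row vector.
fid = fopen(fname, 'r');
strData = fscanf(fid, '%c');
fclose(fid);
delete(fname);

% Length of each line, including its trailing newline character.
nCharPerLine = diff([0 find(strData == char(10)) numel(strData)]);
nDataLines = numel(nCharPerLine) - nHeaderLines;

% Index of the first character of every line.
lineStart = cumsum([1 nCharPerLine(1:end-1)]);

% Preallocate the output, then convert one segment at a time so that
% no second full copy of the text (such as cellData) is ever created.
hexData = cell(1, nDataLines);
for k = 1:nDataLines
    iLine = nHeaderLines + k;
    segment = strData(lineStart(iLine):lineStart(iLine) + nCharPerLine(iLine) - 1);
    hexData{k} = sscanf(segment, '%x', [1 inf]);   % sscanf skips the newline
end

clear strData       % release the raw text once it has been parsed
```

If the file ends with a newline, the last entry of nCharPerLine is 0 and the final cell comes back empty, so it may need to be dropped.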