How do I read a file using textscan?
I have a large tab-delimited file (10000 rows, 15000 columns) and would like to import it into Matlab.
I've tried to import it using the textscan function as follows:
function [C_text, C_data] = ReadDataFile(filename, header, attributesCount, delimiter, ...
        attributeFormats, attributeFormatCount)
    AttributeTypes = SetAttributeTypeMatrix(attributeFormats, attributeFormatCount);
    fid = fopen(filename);
    if(header == 1)
        %read column headers
        C_text = textscan(fid, '%s', attributesCount, 'delimiter', delimiter);
        C_data = textscan(fid, AttributeTypes{1, 1}, 'headerlines', 1);
    else
        C_text = '';
        C_data = textscan(fid, AttributeTypes{1, 1});
    end
    fclose(fid);
AttributeTypes{1, 1} is a string which describes the variable type of each column. In this case there are 14740 float and 260 string variables, so the value of AttributeTypes{1, 1} is '%f%f......%f%s%s...%s', where %f is repeated 14740 times and %s 260 times.
When I try to execute
>> [header, data] = ReadDataFile('data/orange_large_train.data.chunk1', 1, 15000, '\t', types, size);
the header array seems to be correct (the column names have been read correctly).
data is a 1 x 15000 array (only the first row has been imported instead of all 10000), and I don't know what is causing this behavior.
I guess the problem is caused by this line:
C_data = textscan(fid, AttributeTypes{1, 1});
but I don't know what could be wrong, because a similar example is described in the help reference.
I would be very thankful if anyone could suggest a fix: how can I read all 10000 rows?
Comments (1)
I believe all your data are there. If you look inside data, every cell should contain a whole column (10000x1). You can extract the i-th cell as an array with data{i}.
You would probably want to separate the double and string data. I don't know what attributeFormats is, but you can probably use that array; you can also use AttributeTypes{1, 1}. The string columns can then be combined into one cell array of strings.
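A minimal sketch of that separation, assuming a hypothetical miniature of the file (one header line, two numeric and two string columns). The file contents and the names fmt, isStr, numData and strData are illustrative and do not come from the original post. Rather than parsing the format string, it classifies the columns by the type textscan returned, which is one possible approach among several:

```matlab
% Write a tiny tab-delimited stand-in for the real file (hypothetical data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'n1\tn2\ts1\ts2\n');      % header line
fprintf(fid, '1.5\t2.5\ta\tb\n');
fprintf(fid, '3.5\t4.5\tc\td\n');
fprintf(fid, '5.5\t6.5\te\tf\n');
fclose(fid);

% Build the format string; the question uses 14740 x '%f' and 260 x '%s'.
nNum = 2;
nStr = 2;
fmt  = [repmat('%f', 1, nNum), repmat('%s', 1, nStr)];

% Read every data row, skipping the header line.
fid  = fopen(fname);
data = textscan(fid, fmt, 'Delimiter', '\t', 'HeaderLines', 1);
fclose(fid);
delete(fname);

disp(size(data));      % 1-by-4: one cell per COLUMN, not per row
disp(size(data{1}));   % 3-by-1: each cell holds the whole column

% '%s' columns come back as cell arrays, '%f' columns as double vectors.
isStr   = cellfun(@iscell, data);
numData = [data{~isStr}];   % rows-by-nNum double matrix
strData = [data{isStr}];    % rows-by-nStr cell array of strings
```

With the real file, nNum and nStr would be 14740 and 260, and data should be a 1 x 15000 cell array in which each cell holds 10000 rows.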