Matlab:如何读取以逗号作为小数分隔符的数字?

发布于 2024-12-17 03:43:31 字数 414 浏览 0 评论 0原文

我有很多(数十万)相当大(>0.5MB)的文件,其中数据是数字,但以逗号作为小数分隔符。 对于我来说使用 sed "s/,/./g" 这样的外部工具是不切实际的。 当分隔符是点时,我只使用 textscan(fid, '%f%f%f'),但我看不到更改小数点分隔符的选项。 如何有效地读取这样的文件?

文件中的示例行:

5,040000    18,040000   -0,030000

注意:有一个 R 的类似问题,但我使用 Matlab。

I have a whole lot (hundreds of thousands) of rather large (>0.5MB) files, where data are numerical, but with a comma as decimal separator.
It's impractical for me to use an external tool like sed "s/,/./g".
When the separator is a dot, I just use textscan(fid, '%f%f%f'), but I see no option to change the decimal separator.
How can I read such a file in an efficient manner?

Sample line from a file:

5,040000    18,040000   -0,030000

Note: There is a similar question for R, but I use Matlab.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

讽刺将军 2024-12-24 03:43:31

通过测试脚本,我发现系数小于 1.5。我的代码如下所示:

tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

A = txt2mat(filename, tmco{:});

请注意不同的“ReplaceChar”值和“ReadMode”“块”。

我在我的(不是太新)机器上得到了约 5MB 文件的以下结果:

  • txt2mat test comma avg。时间:0.63231
  • txt2mat 测试点平均值。时间:0.45715
  • textscan 测试点平均值。时间:0.4787

我的测试脚本的完整代码:

%% generate sample files

fdot = 'C:\temp\cDot.txt';
fcom = 'C:\temp\cCom.txt';

c = 5;       % # columns
r = 100000;  % # rows
test = round(1e8*rand(r,c))/1e6;
tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.'); % '
tdot = ['a header line', char([13,10]), tdot];

tcom = strrep(tdot,'.',',');

% write dot file
fid = fopen(fdot,'w');
fprintf(fid, '%s', tdot);
fclose(fid);
% write comma file
fid = fopen(fcom,'w');
fprintf(fid, '%s', tcom);
fclose(fid);

disp('-----')

%% read back sample files with txt2mat and textscan

% txt2mat-options with comma decimal sep.
tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

% txt2mat-options with dot decimal sep.
tmdo = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block'} ;

% textscan-options
tsco = {'HeaderLines'   , 1      , ...
        'CollectOutput' , true   } ;


A = txt2mat(fcom, tmco{:});
B = txt2mat(fdot, tmdo{:});

fid = fopen(fdot);
C = textscan(fid, repmat('%f',1,c) , tsco{:} );
fclose(fid);
C = C{1};

disp(['txt2mat  test comma (1=Ok): ' num2str(isequal(A,test)) ])
disp(['txt2mat  test dot   (1=Ok): ' num2str(isequal(B,test)) ])
disp(['textscan test dot   (1=Ok): ' num2str(isequal(C,test)) ])
disp('-----')

%% speed test

numTest = 20;

% A) txt2mat with comma
tic
for k = 1:numTest
    A = txt2mat(fcom, tmco{:});
    clear A
end
ttmc = toc;
disp(['txt2mat  test comma avg. time: ' num2str(ttmc/numTest) ])

% B) txt2mat with dot
tic
for k = 1:numTest
    B = txt2mat(fdot, tmdo{:});
    clear B
end
ttmd = toc;
disp(['txt2mat  test dot   avg. time: ' num2str(ttmd/numTest) ])

% C) textscan with dot
tic
for k = 1:numTest
    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};
    clear C
end
ttsc = toc;
disp(['textscan test dot   avg. time: ' num2str(ttsc/numTest) ])
disp('-----')

With a test script I've found a factor of less than 1.5. My code would look like:

tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

A = txt2mat(filename, tmco{:});

Note the different 'ReplaceChar' value and 'ReadMode' 'block'.

I get the following results for a ~5MB file on my (not too new) machine:

  • txt2mat test comma avg. time: 0.63231
  • txt2mat test dot avg. time: 0.45715
  • textscan test dot avg. time: 0.4787

The full code of my test script:

%% generate sample files

fdot = 'C:\temp\cDot.txt';
fcom = 'C:\temp\cCom.txt';

c = 5;       % # columns
r = 100000;  % # rows
test = round(1e8*rand(r,c))/1e6;
tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.'); % '
tdot = ['a header line', char([13,10]), tdot];

tcom = strrep(tdot,'.',',');

% write dot file
fid = fopen(fdot,'w');
fprintf(fid, '%s', tdot);
fclose(fid);
% write comma file
fid = fopen(fcom,'w');
fprintf(fid, '%s', tcom);
fclose(fid);

disp('-----')

%% read back sample files with txt2mat and textscan

% txt2mat-options with comma decimal sep.
tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

% txt2mat-options with dot decimal sep.
tmdo = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block'} ;

% textscan-options
tsco = {'HeaderLines'   , 1      , ...
        'CollectOutput' , true   } ;


A = txt2mat(fcom, tmco{:});
B = txt2mat(fdot, tmdo{:});

fid = fopen(fdot);
C = textscan(fid, repmat('%f',1,c) , tsco{:} );
fclose(fid);
C = C{1};

disp(['txt2mat  test comma (1=Ok): ' num2str(isequal(A,test)) ])
disp(['txt2mat  test dot   (1=Ok): ' num2str(isequal(B,test)) ])
disp(['textscan test dot   (1=Ok): ' num2str(isequal(C,test)) ])
disp('-----')

%% speed test

numTest = 20;

% A) txt2mat with comma
tic
for k = 1:numTest
    A = txt2mat(fcom, tmco{:});
    clear A
end
ttmc = toc;
disp(['txt2mat  test comma avg. time: ' num2str(ttmc/numTest) ])

% B) txt2mat with dot
tic
for k = 1:numTest
    B = txt2mat(fdot, tmdo{:});
    clear B
end
ttmd = toc;
disp(['txt2mat  test dot   avg. time: ' num2str(ttmd/numTest) ])

% C) textscan with dot
tic
for k = 1:numTest
    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};
    clear C
end
ttsc = toc;
disp(['textscan test dot   avg. time: ' num2str(ttsc/numTest) ])
disp('-----')
小嗲 2024-12-24 03:43:31

我的解决方案(假设逗号仅用作小数占位符,并且空格界定列):

fid = fopen("FILENAME");
indat = fread(fid, '*char');
fclose(fid);
indat = strrep(indat, ',', '.');
[colA, colB] = strread(indat, '%f %f');

如果您碰巧需要删除单个标题行,就像我所做的那样,那么这应该有效:

fid = fopen("FILENAME");                  %Open file
indat = fread(fid, '*char');              %Read in the entire file as characters
fclose(fid);                              %Close file
indat = strrep(indat, ',', '.');          %Replace commas with periods
endheader=strfind(indat,13);              %Find first newline
indat=indat(endheader+1:size(indat,2));   %Extract all characters after first new line
[colA, colB] = strread(indat, '%f %f');   %Convert string to numerical data

My solution (assumes commas are only used as decimal place holders and that white space delineates columns):

fid = fopen("FILENAME");
indat = fread(fid, '*char');
fclose(fid);
indat = strrep(indat, ',', '.');
[colA, colB] = strread(indat, '%f %f');

If you should happen to need to remove a single header line, as I did, then this should work:

fid = fopen("FILENAME");                  %Open file
indat = fread(fid, '*char');              %Read in the entire file as characters
fclose(fid);                              %Close file
indat = strrep(indat, ',', '.');          %Replace commas with periods
endheader=strfind(indat,13);              %Find first newline
indat=indat(endheader+1:size(indat,2));   %Extract all characters after first new line
[colA, colB] = strread(indat, '%f %f');   %Convert string to numerical data
忆沫 2024-12-24 03:43:31

您可以使用txt2mat

A = txt2mat('data.txt');

它将自动处理数据。但你可以明确地说:

A = txt2mat('data.txt','ReplaceChar',',.');

PS 它可能效率不高,但如果你只需要特定的数据格式,你可以从源文件中复制该部分。

You may use txt2mat.

A = txt2mat('data.txt');

It will handle the data automatically. But you can explicitly say:

A = txt2mat('data.txt','ReplaceChar',',.');

P.S. It may not be efficient, but you can copy the part from the source file if you need it only for your specific data formats.

童话里做英雄 2024-12-24 03:43:31

您可以尝试通过添加标题行数以及(如果可能)列数作为输入来绕过其文件分析来加速 txt2mat。与使用点分隔小数的 textscan 导入相比,因子不应为 25。 (您也可以使用 mathworks 网站上的作者页面与我联系。)
如果您找到在 matlab 中处理逗号分隔小数的更有效方法,请告诉我们。

You may try to speed up txt2mat by also adding the number of header lines, and, if possible, the number of columns as inputs to bypass its file analysis. There shouldn't be a factor of 25 compared to a textscan import with dot-separated decimals then. (You may also contact me using the author page on the mathworks site.)
Please let us know if you find a more efficient way to handle comma-separated decimals in matlab.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文