MATLAB: How do I read numbers that use a comma as the decimal separator?

Posted on 2024-12-17 03:43:31


I have a whole lot (hundreds of thousands) of rather large (>0.5MB) files, where data are numerical, but with a comma as decimal separator.
It's impractical for me to use an external tool like sed "s/,/./g".
When the separator is a dot, I just use textscan(fid, '%f%f%f'), but I see no option to change the decimal separator.
How can I read such a file in an efficient manner?

Sample line from a file:

5,040000    18,040000   -0,030000

Note: There is a similar question for R, but I use Matlab.

讽刺将军 2024-12-24 03:43:31

With a test script I've found a slowdown factor of less than 1.5 (txt2mat reading comma decimals compared with textscan reading dot decimals). My code would look like:

tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

A = txt2mat(filename, tmco{:});

Note the different 'ReplaceChar' value and 'ReadMode' 'block'.

I get the following results (average time per read, in seconds) for a ~5 MB file on my (not too new) machine:

txt2mat test comma avg. time: 0.63231
txt2mat test dot avg. time: 0.45715
textscan test dot avg. time: 0.4787

The full code of my test script:

%% generate sample files

fdot = 'C:\temp\cDot.txt';
fcom = 'C:\temp\cCom.txt';

c = 5;       % # columns
r = 100000;  % # rows
test = round(1e8*rand(r,c))/1e6;
tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.');
tdot = ['a header line', char([13,10]), tdot];

tcom = strrep(tdot,'.',',');

% write dot file
fid = fopen(fdot,'w');
fprintf(fid, '%s', tdot);
fclose(fid);
% write comma file
fid = fopen(fcom,'w');
fprintf(fid, '%s', tcom);
fclose(fid);

disp('-----')

%% read back sample files with txt2mat and textscan

% txt2mat-options with comma decimal sep.
tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

% txt2mat-options with dot decimal sep.
tmdo = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block'} ;

% textscan-options
tsco = {'HeaderLines'   , 1      , ...
        'CollectOutput' , true   } ;


A = txt2mat(fcom, tmco{:});
B = txt2mat(fdot, tmdo{:});

fid = fopen(fdot);
C = textscan(fid, repmat('%f',1,c) , tsco{:} );
fclose(fid);
C = C{1};

disp(['txt2mat  test comma (1=Ok): ' num2str(isequal(A,test)) ])
disp(['txt2mat  test dot   (1=Ok): ' num2str(isequal(B,test)) ])
disp(['textscan test dot   (1=Ok): ' num2str(isequal(C,test)) ])
disp('-----')

%% speed test

numTest = 20;

% A) txt2mat with comma
tic
for k = 1:numTest
    A = txt2mat(fcom, tmco{:});
    clear A
end
ttmc = toc;
disp(['txt2mat  test comma avg. time: ' num2str(ttmc/numTest) ])

% B) txt2mat with dot
tic
for k = 1:numTest
    B = txt2mat(fdot, tmdo{:});
    clear B
end
ttmd = toc;
disp(['txt2mat  test dot   avg. time: ' num2str(ttmd/numTest) ])

% C) textscan with dot
tic
for k = 1:numTest
    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};
    clear C
end
ttsc = toc;
disp(['textscan test dot   avg. time: ' num2str(ttsc/numTest) ])
disp('-----')


小嗲 2024-12-24 03:43:31

My solution (assumes commas are only used as decimal place holders and that white space delineates columns):

fid = fopen('FILENAME');
indat = fread(fid, '*char')';             %Transpose: fread returns a column, strrep/strread expect a row
fclose(fid);
indat = strrep(indat, ',', '.');
[colA, colB] = strread(indat, '%f %f');

If you should happen to need to remove a single header line, as I did, then this should work:

fid = fopen('FILENAME');                  %Open file
indat = fread(fid, '*char')';             %Read in the entire file as a row of characters
fclose(fid);                              %Close file
indat = strrep(indat, ',', '.');          %Replace commas with periods
endheader = find(indat == 10, 1);         %Find the first line feed (end of the header line)
indat = indat(endheader+1:end);           %Extract all characters after the header line
[colA, colB] = strread(indat, '%f %f');   %Convert string to numerical data
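
The same read-replace-parse idea can be sketched for the three-column sample line from the question. MathWorks documents strread as not recommended in recent releases, so this sketch swaps in sscanf. The function name readCommaDecimal and its arguments numCols and numHeaderLines are hypothetical, introduced only for illustration. It assumes that commas appear only as decimal separators, that columns are separated by whitespace, and that lines end in LF or CR LF.

```matlab
function data = readCommaDecimal(filename, numCols, numHeaderLines)
% readCommaDecimal  Hypothetical helper (not part of MATLAB or txt2mat).
% Reads a whitespace-delimited numeric text file whose numbers use a
% comma as the decimal separator and returns an N-by-numCols matrix.

fid = fopen(filename, 'r');
if fid < 0
    error('Cannot open file: %s', filename);
end
txt = fread(fid, '*char')';          % whole file as one row of characters
fclose(fid);

% Drop the header lines by cutting after the numHeaderLines-th line feed.
if numHeaderLines > 0
    nl = find(txt == char(10), numHeaderLines);
    if numel(nl) == numHeaderLines
        txt = txt(nl(end)+1:end);
    else
        txt = '';                    % file has no data after the header
    end
end

txt(txt == ',') = '.';               % comma -> dot via logical indexing

% sscanf skips all whitespace (including line breaks) between numbers and
% fills the result column by column, so read numCols-by-N and transpose.
data = sscanf(txt, '%f', [numCols, Inf]).';
end
```

A call such as data = readCommaDecimal('data.txt', 3, 0) would then return one row per line of the file. Supplying the column count up front avoids any per-file format detection, which may matter when looping over many files.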


忆沫 2024-12-24 03:43:31

You may use txt2mat.

A = txt2mat('data.txt');

It will handle the data automatically. But you can explicitly say:

A = txt2mat('data.txt','ReplaceChar',',.');

P.S. It may not be efficient, but if you only need it for your specific data format, you can copy the relevant part from the txt2mat source file.


童话里做英雄 2024-12-24 03:43:31

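You could try to speed up txt2mat by bypassing its file analysis: supply the number of header lines and, if possible, the number of columns as inputs. Compared with a textscan import of dot-separated decimals, the factor should then not be 25. (You can also contact me through my author page on the MathWorks site.)
If you find a more efficient way to handle comma-separated decimals in MATLAB, please let us know.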
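
One built-in alternative that none of the answers in this thread mention is readmatrix, which accepts a 'DecimalSeparator' option for text files. The sketch below assumes MATLAB R2019a or later (when readmatrix was introduced) and a file laid out like the sample line in the question, with one header line added to match the test script. It makes no claim about speed relative to txt2mat or textscan; that would need to be timed on the actual files.

```matlab
% Write a small sample file: one header line, then comma-decimal numbers
% separated by runs of spaces (file name and contents are illustrative).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'a header line\n');
fprintf(fid, '5,040000    18,040000   -0,030000\n');
fprintf(fid, '1,500000     2,250000   -3,000000\n');
fclose(fid);

% Read it back, telling readmatrix that ',' is the decimal separator and
% that consecutive spaces count as a single column delimiter.
A = readmatrix(fname, ...
    'FileType', 'text', ...
    'NumHeaderLines', 1, ...
    'Delimiter', ' ', ...
    'ConsecutiveDelimitersRule', 'join', ...
    'DecimalSeparator', ',');

delete(fname);                       % clean up the temporary file
```

Passing the header count and delimiter explicitly follows the same reasoning as the advice above: the less the reader has to infer about each file, the less per-file overhead there should be.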
