Transposing csv files with C# or any other program

Posted on 2024-11-05 10:39:04

I'm using C# and I write my data into csv files (for further use). However, my files have grown very large and I have to transpose them. What's the easiest way to do that, in any program?

Gil

Comments (3)

神妖 2024-11-12 10:39:04

In increasing order of complexity (and also increasing order of ability to handle large files):

  • Read the whole thing into a 2-D array (or jagged array aka array-of-arrays); a minimal sketch follows this list.
    • Memory required: equal to size of file

  • Track the file offset within each row. Start by finding each (non-quoted) newline, storing the current position into a List<Int64>. Then iterate across all rows; for each row: seek to the saved position, copy one cell to the output, save the new position. Repeat until you run out of columns (all rows reach a newline). A sketch of this approach appears at the end of this answer.
    • Memory required: eight bytes per row
    • Frequent file seeks scattered across a file much larger than the disk cache result in disk thrashing and miserable performance, but it won't crash.

  • Like above, but working on blocks of e.g. 8k rows. This will create a set of files each with 8k columns. The input block and output all fit within disk cache, so no thrashing occurs. After building the stripe files, iterate across the stripes, reading one row from each and appending to the output. Repeat for all rows. This results in sequential scan on each file, which also has very reasonable cache behavior.
    • Memory required: 64k for first pass, (column count/8k) file descriptors for second pass.
    • Good performance for tables of up to several million in each dimension. For even larger data sets, combine just a few (e.g. 1k) of the stripe files together, making a smaller set of larger stripes, repeat until you have only a single stripe with all data in one file.
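
To make the first option concrete, here is a minimal C# sketch of the in-memory approach. It is only a sketch: it assumes the whole file fits in memory, that no field contains quoted or embedded commas, and the file names (input.csv, output.csv) are made up.

using System.IO;
using System.Linq;

class InMemoryTranspose
{
    static void Main()
    {
        // Read every line and split it into cells (a jagged array: one string[] per row).
        string[][] rows = File.ReadLines("input.csv")
                              .Select(line => line.Split(','))
                              .ToArray();

        using (var writer = new StreamWriter("output.csv"))
        {
            // Output row j is built from cell j of every input row.
            for (int col = 0; col < rows[0].Length; col++)
                writer.WriteLine(string.Join(",", rows.Select(r => r[col])));
        }
    }
}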

Final comment: You might squeeze out more performance by using C++ (or any language with proper pointer support), memory-mapped files, and pointers instead of file offsets.
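
To illustrate the second, seek-per-cell option from the list above, here is a rough C# sketch. It assumes plain ASCII data with no quoted fields, the file names are made up, and it deliberately performs one seek per cell, which is exactly the access pattern that causes the thrashing described above.

using System.Collections.Generic;
using System.IO;
using System.Text;

class SeekTranspose
{
    static void Main()
    {
        // Pass 1: record the byte offset at which each row starts.
        var offsets = new List<long>();
        using (var scan = new FileStream("input.csv", FileMode.Open, FileAccess.Read))
        {
            offsets.Add(0);
            int b;
            while ((b = scan.ReadByte()) != -1)
                if (b == '\n' && scan.Position < scan.Length)
                    offsets.Add(scan.Position);
        }

        // Pass 2: repeatedly pull one cell from every row and emit those cells as one output row.
        var done = new bool[offsets.Count];
        int remaining = offsets.Count;
        using (var data = new FileStream("input.csv", FileMode.Open, FileAccess.Read))
        using (var writer = new StreamWriter("output.csv"))
        {
            while (remaining > 0)
            {
                var cells = new List<string>();
                for (int row = 0; row < offsets.Count; row++)
                {
                    if (done[row]) continue;
                    data.Seek(offsets[row], SeekOrigin.Begin);  // jump back to where this row left off
                    var cell = new StringBuilder();
                    int c;
                    while ((c = data.ReadByte()) != -1 && c != ',' && c != '\r' && c != '\n')
                        cell.Append((char)c);
                    offsets[row] = data.Position;               // resume after the delimiter next time
                    if (c != ',')                               // newline or end of file: row exhausted
                    {
                        done[row] = true;
                        remaining--;
                    }
                    cells.Add(cell.ToString());
                }
                writer.WriteLine(string.Join(",", cells));
            }
        }
    }
}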

灯下孤影 2024-11-12 10:39:04

It really depends. Are you getting these out of a database? Then you could use a MySQL import statement: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Or you could loop through the data and add it to a file stream using a StreamWriter object.

using (StreamWriter sw = new StreamWriter("pathtofile"))  // double quotes: the path is a string, not a char
{
    foreach (String[] value in lstValueList)
    {
        // Build one comma-separated line from the desired fields and write it out.
        String something = value[1] + "," + value[2];
        sw.WriteLine(something);
    }
}
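
If the rows have more than the two hard-coded fields, a slightly more general sketch of the same loop could join every field with string.Join. The lstValueList below is just sample data standing in for whatever collection actually holds the rows.

using System.Collections.Generic;
using System.IO;

// Sample data standing in for the answer's lstValueList.
var lstValueList = new List<string[]>
{
    new[] { "a", "b", "c" },
    new[] { "d", "e", "f" },
};

using (var sw = new StreamWriter("pathtofile"))
{
    foreach (string[] row in lstValueList)
        sw.WriteLine(string.Join(",", row));  // note: fields are not quoted or escaped
}
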
苄①跕圉湢 2024-11-12 10:39:04

I wrote a little proof-of-concept script here in Python. I admit it's buggy and there are likely some performance improvements to be made, but it will do it. I ran it against a 40x40 file and got the desired result. I started to run it against something more like your example data set, but it took too long for me to wait.

# Transpose by appending each input field to a per-column temp file,
# then concatenating those files into the output.
import csv
from glob import glob
from os.path import join
from shutil import rmtree
from tempfile import mkdtemp

path = mkdtemp()
try:
    with open('/home/user/big-csv', newline='') as instream:
        reader = csv.reader(instream)
        for i, row in enumerate(reader):
            for j, field in enumerate(row):
                # The file "new row NN" accumulates output row j (input column j).
                with open(join(path, 'new row {0:0>2}'.format(j)), 'a') as new_row_stream:
                    new_row_stream.write('{0},'.format(field))
            print('read row {0:0>2}'.format(i))
    with open('/home/user/transpose-csv', 'w') as outstream:
        files = glob(join(path, '*'))
        files.sort()  # lexicographic sort; the 2-digit padding breaks past 99 columns
        for filename in files:
            with open(filename) as row_file:
                outstream.write(row_file.read() + '\n')
finally:
    print("done")
    rmtree(path)