Reading csv files in scipy/numpy in Python



I am having trouble reading a CSV file, delimited by tabs, in Python. I use the following function:

from numpy import genfromtxt, array

def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            # parse_header is a helper defined elsewhere in my code
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

The problem is that genfromtxt complains about my files, e.g. with the error:

Line #27100 (got 12 columns instead of 16)

I am not sure where these errors come from. Any ideas?

Here's an example file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin

Is there a better way to write a generic csv2array function? Thanks.
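
For reference, here is a minimal sketch for locating the offending lines before calling genfromtxt (assuming the same tab delimiter; 'file.txt' stands in for the actual data file):

import csv

# Count the fields on every line and report rows that don't match the header width.
with open('file.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    expected = len(next(reader))  # width of the header row
    for lineno, row in enumerate(reader, start=2):
        if row and len(row) != expected:
            print('Line %d: got %d columns instead of %d' % (lineno, len(row), expected))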


秋风の叶未落 2024-09-09 07:50:23


Check out the Python csv module: http://docs.python.org/library/csv.html

import csv
reader = csv.reader(open("myfile.csv", "rb"),
                    delimiter='\t', quoting=csv.QUOTE_NONE)

header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()  # thereIsAHeader: True if the file starts with a header row

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

# do numpy stuff.
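
The "numpy stuff" might, for example, split the filtered records into gene IDs and a float array (a sketch; the column layout is an assumption taken from the example file above):

import numpy as np

genes = [r[0] for r in records]  # first column: gene IDs
# Nine numeric expression columns follow the gene ID.
values = np.array([[float(x) for x in r[1:10]] for r in records])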
卷耳 2024-09-09 07:50:23


May I ask why you're not using the built-in csv reader?
http://docs.python.org/library/csv.html

I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it's owned by my employer; it should be very straightforward to write your own, though.
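
As a rough outline, a minimal csv-to-numpy reader might look like this (a sketch, not the code referred to above; the tab delimiter and the slice of numeric columns are assumptions based on the example file in the question):

import csv
import numpy as np

def read_numeric_csv(filename, delimiter='\t', numeric_cols=slice(1, 10)):
    """Return (rows, float array) for a delimited text file."""
    with open(filename) as f:
        # Keep rows that are non-empty and not comment/header lines.
        rows = [row for row in csv.reader(f, delimiter=delimiter)
                if row and not row[0].startswith('#')]
    values = np.array([[float(x) for x in row[numeric_cols]] for row in rows])
    return rows, values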

谢绝鈎搭 2024-09-09 07:50:23


I have successfully used two methodologies: (1) if I simply need to read an arbitrary CSV, I use the csv module (as pointed out by other users), and (2) if I require repeated processing of a known CSV (or any other) format, I write a simple parser.

It seems that your problem fits in the second category, and a parser should be very simple:

f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff

You can add a line in the reader for skipping comments (`if tokens[0].startswith('#'): continue`, since the header here starts with "#Gene") or to handle blank lines (`if not line.strip(): continue`), as in the sketch below. You get the idea.
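
Putting those guards together (a sketch; 'file.txt' and the column layout are carried over from the example above):

for line in open('file.txt', 'r'):
    if not line.strip():  # skip blank lines
        continue
    tokens = line.strip().split('\t')
    if tokens[0].startswith('#'):  # skip comment lines such as the "#Gene ..." header
        continue
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff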

眼角的笑意。 2024-09-09 07:50:23


I think Nick T's approach would be the better way to go, with one change: I would replace the following code:

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

with

import numpy as np

rows = list(reader)  # an iterator has no len(), so materialize it into a list first
records = np.asarray([row for row in rows if len(row) == fields])
print('Number of skipped records: %i' % (len(rows) - len(records)))

Passing the list comprehension to np.asarray returns a numpy array and takes advantage of pre-compiled libraries, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print statement, as the former is the standard in Python 3, which is most likely the future, and I would use logging over print.
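
A minimal version of the logging suggestion might be (a sketch; the logger name is arbitrary):

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('csv2array')

# Report skipped records through logging instead of print().
log.info('Number of skipped records: %i', len(rows) - len(records))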

燃情 2024-09-09 07:50:23


Likely it came from Line 27100 in your data file... and it had 12 columns instead of 16. I.e. it had:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

And it was expecting something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:

f = open('file.txt', 'r')  # your data file
lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline
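
If the goal is still a rectangular numpy array, the irregular rows can then be padded or truncated to a fixed width (a sketch; 16 columns and '' as the pad value are assumptions):

import numpy as np

expected = 16
rows = [line.split(',') for line in lines]
# Truncate long rows and pad short ones so every row has 'expected' fields.
padded = [row[:expected] + [''] * (expected - len(row)) for row in rows]
arr = np.array(padded)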