Slicing specific characters from a CSV using Python

Posted on 2024-10-05 22:55:09

I have data in tab delimited format that looks like:

0/0:23:-1.03,-7.94,-83.75:69.15    0/1:34:-1.01,-11.24,-127.51:99.00    0/0:74:-1.02,-23.28,-301.81:99.00

I am only interested in the first 3 characters of each entry (i.e. 0/0 and 0/1). I figured the best way to do this would be to use re.match together with genfromtxt from NumPy. This example is as far as I have gotten:

import re
from numpy import genfromtxt

csvfile = 'home/python/batch1.hg19.table'
# dtype=str keeps every cell as text so that re.match can be applied to it
data = genfromtxt(csvfile, delimiter="\t", dtype=str)
for i in data[1]:
    m = re.match('[0-9]/[0-9]', i)
    if m:
        print(m.group(0), end=" ")
    else:
        print("NA", end=" ")

This works for the first row of the data, but I am having a hard time figuring out how to extend it to every row of the input file.

Should I make it a function and apply it to each row separately, or is there a more Pythonic way to do this?

桃扇骨 2024-10-12 22:55:09

Unless you really want to use NumPy, try this:

with open('home/python/batch1.hg19.table') as f:
    for line in f:
        # drop the trailing newline, then split the line into cells on tabs
        for cell in line.rstrip('\n').split('\t'):
            print(cell[:3])

This iterates over each line of the file, splits the line on the tab character, and prints the first three characters of each cell.
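If the slices are needed for further processing rather than just printed, the same loop can collect them into a list of rows. The following is only a minimal sketch: the function name `first_three_chars` and the in-memory sample are made up for illustration and are not part of the original answer.

```python
import io

def first_three_chars(lines):
    """Return, for every line, the first three characters of each tab-separated cell."""
    rows = []
    for line in lines:
        # Strip the trailing newline so it never ends up inside the last cell.
        cells = line.rstrip("\n").split("\t")
        rows.append([cell[:3] for cell in cells])
    return rows

# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\n"
    "0/1:34:-1.01,-11.24,-127.51:99.00\t0/0:74:-1.02,-23.28,-301.81:99.00\n"
)
rows = first_three_chars(sample)
print(rows)  # [['0/0', '0/1'], ['0/1', '0/0']]
```

Any iterable of lines works as input, so an open file object can be passed in place of the `io.StringIO` sample.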

尬尬 2024-10-12 22:55:09

NumPy is great when you want to load an array of numbers.
The format you have here is too complicated for NumPy to parse, so you just get an array of strings. That does not really play to NumPy's strengths.

Here's a simple way to do it without NumPy:

import re

csvfile = 'home/python/batch1.hg19.table'
result = []
with open(csvfile, 'r') as f:
    for line in f:
        row = []
        for text in line.split('\t'):
            # look for a digit/digit pattern such as 0/1 in the cell
            match = re.search('([0-9]/[0-9])', text)
            if match:
                row.append(match.group(1))
            else:
                row.append("NA")
        result.append(row)
print(result)

yields

# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]

on this data:

0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00   0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00   0/0:74:-1.02,-23.28,-301.81:99.00
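On the question of whether to wrap this in a function: one possible arrangement is a small helper applied to every cell through a nested list comprehension. This is a sketch, not the answerer's code. The names `extract_genotype` and `extract_table` are invented here, and it swaps `re.search` for `re.match` on the assumption that only a pattern at the very start of a cell should count.

```python
import re

# A digit, a slash, a digit; used with match(), so it must start the cell.
GENOTYPE = re.compile(r"[0-9]/[0-9]")

def extract_genotype(cell, missing="NA"):
    """Return the leading 'd/d' pattern of a cell, or `missing` if absent."""
    m = GENOTYPE.match(cell)  # match() only looks at the start of the string
    return m.group(0) if m else missing

def extract_table(lines):
    """Apply extract_genotype to every tab-separated cell of every line."""
    return [
        [extract_genotype(cell) for cell in line.rstrip("\n").split("\t")]
        for line in lines
    ]

# Hypothetical sample lines standing in for the real file.
sample_lines = [
    "0/0:23:-1.03,-7.94,-83.75:69.15\t0/1:34:-1.01,-11.24,-127.51:99.00\n",
    "---:23:-1.03,-7.94,-83.75:69.15\t0/0:74:-1.02,-23.28,-301.81:99.00\n",
]
table = extract_table(sample_lines)
print(table)  # [['0/0', '0/1'], ['NA', '0/0']]
```

Compiling the pattern once avoids repeating the pattern string in the loop, and keeping the per-cell logic in its own function makes it easy to test separately.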

醉酒的小男人 2024-10-12 22:55:09

It's pretty easy to parse the whole file without regular expressions:

with open('yourfile') as f:
    for line in f.read().split('\n'):
        for token in line.split('\t'):
            # empty cells are reported as N/A
            print(token[:3] if token else 'N/A')
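The same regex-free idea can also be expressed with the standard library's `csv` module, which handles the tab splitting and line endings. The sketch below is an illustration under assumptions: `read_prefixes`, its `width` and `missing` parameters, and the sample data are hypothetical, and the fields are assumed to contain no quote characters.

```python
import csv
import io

def read_prefixes(handle, width=3, missing="N/A"):
    """Yield one list per row with the first `width` characters of each field."""
    reader = csv.reader(handle, delimiter="\t")
    for fields in reader:
        # Empty fields are replaced by the `missing` marker.
        yield [field[:width] if field else missing for field in fields]

# Hypothetical sample with one empty field in the first row.
sample = io.StringIO("0/0:23:-1.03\t\t0/1:34:-1.01\n0/1:99\t0/0:74\t0/0:12\n")
prefixes = list(read_prefixes(sample))
print(prefixes)  # [['0/0', 'N/A', '0/1'], ['0/1', '0/0', '0/0']]
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=""`, for example `open(path, newline="")`.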

左秋 2024-10-12 22:55:09

I haven't written Python in a while, but I would probably write it like this:

with open("home/python/batch1.hg19.table") as f:
    for line in f:
        columns = line.split("\t")
        for column in columns:
            print(column[:3])

Of course, if you need to validate the first three characters, you'll still need the regex.
