Writing a csv file with exact formatting parameters using Python

Posted 2024-09-04 03:47:36, 2399 words, 6 views, 0 comments

I'm having trouble with processing some csv data files for a project. Someone suggested using python/csv reader to help break down the files, which I've had some success with, but not in a way I can use.

This code is a little different from what I was trying before. I am essentially attempting to create an array. In the raw data format, the first 7 rows contain no data, and then each column contains 50 experiments, each with 4000 rows, for some 200,000 rows total. What I want to do is take each column and make it an individual csv file, with each experiment in its own column. So it would be an array of 50 columns and 4000 rows for each data type. The code here does break down the correct values, and I think the logic is okay, but the output is the opposite of what I want. I want the separators (the commas and spaces) without quotes, and I want the element values in quotes. Right now it is doing just the opposite for both: element values with no quotes, and the separators in quotes. I've spent several hours trying to figure out how to do this, to no avail.

import csv  

ifile  = open('00_follow_maverick.csv')  
epistemicfile = open('00_follower_maverick_EP.csv', 'w')  

reader = csv.reader(ifile)  

colnum = 0  
rownum = 0  
y = 0  
z = 8   
for column in reader:  
    rownum = 4000 * y + z  
    for element in column:  
        writer = csv.writer(epistemicfile)  
        if y <= 50:  
            y = y + 1  
            writer.writerow([element])  
            writer.writerow(',')  
            rownum = x * y + z  
        if y > 50:  
            y = 0  
            z = z + 1  
            writer.writerow(' ')  
            rownum = x * y + z  
        if z >= 4008:  
            break  

What is going on: I am taking each row in the raw data file in iterations of 4000, so that I can separate them with commas for the 50 experiments. When y, the experiment indicator here, reaches 50, it resets back to experiment 0, and adds 1 to z, which tells it which row to look at, by the formula of 4000 * y + z. When it completes the rows for all 50 experiments, it is finished. The problem here is that I don't know how to get python to write the actual values in quotes, and my separators outside of quotes.
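For reference, the csv module decides the quoting itself. A call such as writer.writerow(',') treats the string as a row holding a single field, the comma, and that field has to be quoted because it contains the delimiter. A minimal sketch, assuming the goal is every value in quotes with plain comma separators, passes each output row as one list and sets quoting=csv.QUOTE_ALL. The helper name write_quoted_rows and the sample values are made up for illustration:

```python
import csv
import io


def write_quoted_rows(rows, out):
    """Write each row with every value quoted; the writer adds the commas."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)


# An in-memory buffer stands in for the output file in this sketch.
buf = io.StringIO()
write_quoted_rows([["1.5", "2.5"], ["3.5", "4.5"]], buf)
print(buf.getvalue(), end="")
```

The writer is created once and receives a complete row at a time, rather than one writerow call per element and per separator.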

Any help will be most appreciated. Apologies if this seems a stupid question, I have no programming experience, this is my first attempt ever. Thank you.

Sorry, I'll try to make this clearer. The original csv file has several columns, each of which is a different set of data.

A miniature example of the raw file looks like:

column1             column2            column3
exp1data1time1      exp1data2time1     exp1data3time1
exp1data1time2      exp1data2time2     exp1data3time2
exp2data1time1      exp2data2time1     exp2data3time1
exp2data1time2      exp2data2time2     exp2data3time2
exp3data1time1      exp3data2time1     exp3data3time1
exp3data1time2      exp3data2time2     exp3data3time2

So, the actual version has 4000 rows instead of 2 for each new experiment. There are 40 columns in the actual version, but basically, the data type in the raw file matches the column number. I want to separate each data type or column into an individual csv file.

This would look like:

csv file1

exp1data1time1   exp2data1time1   exp3data1time1   
exp1data1time2   exp2data1time2   exp3data1time2

csv file2

exp1data2time1   exp2data2time1   exp3data2time1   
exp1data2time2   exp2data2time2   exp3data2time2

csv file3

exp1data3time1   exp2data3time1   exp3data3time1   
exp1data3time2   exp2data3time2   exp3data3time2

So, I'd move the raw data in the file to a new column, and each data type to its own file. Right now I'm only going to do one file, until I can move the separate experiments to separate columns in the new file. So, in the code, the above would make the 4000 into 2. I hope this makes more sense, but if not, I will try again.
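The rearrangement described above can be sketched as follows, assuming the values of one raw column have already been read into a flat list (experiment 1's rows first, then experiment 2's, and so on). The function name column_to_wide and its parameters are hypothetical, and the miniature example uses 2 rows per experiment where the real data would use 4000:

```python
def column_to_wide(values, rows_per_experiment):
    """Reshape one flat column into rows holding one value per experiment."""
    if len(values) % rows_per_experiment != 0:
        raise ValueError("column length is not a multiple of rows_per_experiment")
    # Cut the flat column into consecutive blocks, one block per experiment.
    experiments = [
        values[start:start + rows_per_experiment]
        for start in range(0, len(values), rows_per_experiment)
    ]
    # zip(*...) transposes the blocks: row i holds time i of every experiment.
    return [list(row) for row in zip(*experiments)]


column1 = [
    "exp1data1time1", "exp1data1time2",
    "exp2data1time1", "exp2data1time2",
    "exp3data1time1", "exp3data1time2",
]
wide = column_to_wide(column1, rows_per_experiment=2)
for row in wide:
    print(row)
```

Each resulting row can then be handed to csv.writer.writerow as a whole list.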

Comments (3)

落叶缤纷 2024-09-11 03:47:36

If I had a cat for each time I saw a bio or psych or chem database in this state:

"each column contains 50 experiments,
each with 4000 rows, for 200000 some
rows total. What I want to do is take
each column, and make it an individual
csv file, with each experiment in its
own column. So it would be an array of
50 columns and 4000 rows for each data
type"

I'd have way too farking many cats.

I didn't even look at your code because the re-mangling you are proposing is just another problem that will have to be solved. I don't fault you, you claim to be a novice and all your peers make the same sort of error. Beginning programmers who have yet to understand how to use arrays often wind up with variable declarations like:

integer response01, response02, response03, response04, ...

and then very, very redundant code when they try to see if every response is - say - 1. I think this is such a seductive error in bio-informatics because it actually models the paper notations they come from rather well. Unfortunately, the sheet-of-paper model isn't the best way to model data.

You should read and understand why database normalization was developed and codified, and why it has come to dominate how people think about structured data. One Wikipedia article may not be sufficient. Using the example I excerpted, let me try to explain how I think of it. Your data consists of observations; put the other way, the primary datum is a singular observation. That observation has a context, though: it is one of a set of 4000 observations, where each set belongs to one of 50 experiments. If you had to attach a context to each observation, you'd wind up with an addressing scheme that looks like:

<experiment_number, observation_number, value>

In database jargon, that's a tuple, and it is capable of representing, with no ambiguity and perfect symmetry, the entirety of your data. I'm not certain that I've understood the exact structure of your data, so perhaps it is something more like:

<experiment_number, protocol_number, observation_number, value>

where the protocol may be some form of variable treatment type - let's say pH. But note that I didn't call the protocol a pH and I don't record it as such in the database. What I would then need is an ancillary table showing the relevant parameters of the protocol, e.g.:

<protocol_number, acidity, temperature, pressure>

Now we've just built a "relation" that those database people like to talk about; we've also begun normalizing the data. If you need to know the pH for a given protocol, there is one and only one place to find it: in the proper row of the protocol table. Note that I've divorced the data that fit so nicely together on a data-sheet, and from the observation table I can't see the pH for a particular datum. But that's okay, because I can just look it up in my protocol table if needed. This is a "relational join", and if I needed to, I could coalesce all the various parameters from all the various tables and reconstitute the original datasheet in its original, unstructured glory.
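To make the join concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table names follow the tuples above, but the choice of SQLite and every numeric value are assumptions made purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protocol ("
    "protocol_number INTEGER PRIMARY KEY, acidity REAL, temperature REAL, pressure REAL)"
)
conn.execute(
    "CREATE TABLE observation ("
    "experiment_number INTEGER, protocol_number INTEGER, "
    "observation_number INTEGER, value REAL)"
)

# Made-up sample data: two protocols, three observations.
conn.executemany(
    "INSERT INTO protocol VALUES (?, ?, ?, ?)",
    [(1, 6.5, 20.0, 1.0), (2, 7.4, 37.0, 1.0)],
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 0.12), (1, 1, 2, 0.15), (2, 2, 1, 0.30)],
)

# The join looks up each observation's acidity in the protocol table.
joined = conn.execute(
    "SELECT o.experiment_number, o.observation_number, o.value, p.acidity "
    "FROM observation AS o "
    "JOIN protocol AS p ON p.protocol_number = o.protocol_number "
    "ORDER BY o.experiment_number, o.observation_number"
).fetchall()

for row in joined:
    print(row)
```

The acidity is stored once per protocol and only attached to observations at query time.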

I hope this answer is of some use to you. I'm certain that I don't even know what field of study your data is from, but these principles apply across domains from drug trials to purchase requisition processing. Please understand that I'm trying to inform, per your request, and there is zero condescension intended. I welcome further questions on the matter.

吝吻 2024-09-11 03:47:36

Normalization of the dataset

Thanks for giving the example. You already have the context I described; perhaps I can make it clearer.

column1             column2            column3
exp1data1time1      exp1data2time1     exp1data3time1
exp1data1time2      exp1data2time2     exp1data3time2

The columns are an artifice made by the last guy; that is, they carry no relevant information. When parsed into a normal form, your data looks just like my first proposed tuple:

<experiment_number, time, response_number, response>

where I suspect time may actually mean "subject_id" or "trial_number". It may very well look incongruous to you to conjoin all the different response values into the same dataset; indeed, based on your desired output, I suspect that it does. At first blush, the objection "but the subject's response to a question about epistemic properties of chairs has no connection to their meta-epistemic beliefs regarding color" seems reasonable, but it would be mistaken. The data are related because they have a common experimental subject, and self-correlation is an important concept in sociological analytics.

For example, you may find that respondent A gives the same responses as respondent B, except all of A's responses are biased one higher because of how the subject understood the criteria. This would make a very real difference in the absolute values of the data, but I hope you can see that the question "do A and B actually have different epistemic models?" is salient and valid. One method of data modeling allows this question to be answered easily; your desired method does not.

Working parsing code to follow shortly.

少女七分熟 2024-09-11 03:47:36

The normalizing code

#!/usr/bin/env python3

"""Parses a csv file containing a particular data layout and normalizes it.

    The raw data set is a csv file of the form::

        column1                column2               column3
        exp01data01time01      exp01data02time01     exp01data03time01
        exp01data01time02      exp01data02time02     exp01data03time02

    where there are 40 such columns and the literal column title
    is added as context to the output row.

    It is assumed that the columns are comma separated but
    the lexical form of the subcolumns is unspecified.

    Output will consist of a single CSV output stream
    on stdout of the form::

        exp01, data01, time01, column1

    for varying actual values of each field.
"""

import csv
import sys

def split_subfields(s):
    """Returns a list of subfields of s.
       This function is expected to be re-written to match the actual,
       unspecified lexical structure of s."""
    return [s[0:5], s[5:11], s[11:17]]


def normalise_data(reader, writer):
    """Writes one normalized row per datum, tagged with its column heading."""

    # obtain the headings for use in normalization
    names = next(reader)

    # get the data rows, split them out by column, add the column name
    for row in reader:
        for column, datum in enumerate(row):
            fields = split_subfields(datum)
            fields.append(names[column])
            writer.writerow(fields)

def main():
    if len(sys.argv) != 2:
        print('usage: %s input.csv' % sys.argv[0], file=sys.stderr)
        sys.exit(1)

    in_file = sys.argv[1]

    # newline='' is what the csv module documentation recommends for csv files
    with open(in_file, newline='') as f:
        reader = csv.reader(f)
        writer = csv.writer(sys.stdout)
        normalise_data(reader, writer)

if __name__ == '__main__': main()

Such that the command python epistem.py raw_data.csv > cooked_data.csv yields excerpted output looking like:

exp01,data01,time01,column1
...
exp01,data40,time01,column40
exp01,data01,time02,column1
exp01,data01,time03,column1
...
exp02,data40,time15,column40
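
If the per-data-type layout from the question is still needed later, normalized records can be pivoted back into it. The sketch below is not part of the script above: it assumes each record has the shape (experiment, time, data_type, value), and the function name pivot_wide and the sample records are hypothetical:

```python
from collections import defaultdict


def pivot_wide(records):
    """Build one table per data type: rows are times, columns are experiments."""
    tables = defaultdict(dict)
    for experiment, time, data_type, value in records:
        tables[data_type].setdefault(time, {})[experiment] = value

    wide = {}
    for data_type, by_time in tables.items():
        # Collect every experiment seen for this data type, in sorted order.
        experiments = sorted({e for row in by_time.values() for e in row})
        # Missing cells are filled with an empty string.
        wide[data_type] = [
            [by_time[t].get(e, "") for e in experiments]
            for t in sorted(by_time)
        ]
    return wide


records = [
    (1, 1, "data1", "a"), (1, 2, "data1", "b"),
    (2, 1, "data1", "c"), (2, 2, "data1", "d"),
    (1, 1, "data2", "e"),
]
wide = pivot_wide(records)
print(wide["data1"])
```

Each table in the result could then be written to its own file with csv.writer.writerows.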