Compare/extract data from matrices using Python (2.6.1)
I have two .csv files containing correlation matrices exported from R. One file contains the P-values and one contains the r-values. The row and column headers match exactly between the two files.
I am trying to extract the r-values and corresponding row and column header for pairs only when the P-value < 0.05. Here is a sample of what the data in the r-value input file looks like (I have 1700+ correlated items, rather than only the two shown):
Species1 Species2
Species1 1 0.9
Species2 0.9 1
The P-value input file is identical, except containing P-values in place of r-values.
I am relatively new to Python and am not sure how to handle files of this type. I have tried a few strategies, including using the csv library to iterate through the files. I looked into using numpy, but it doesn't seem that it will work for me (?). I also looked into using scipy to calculate r- and P-values (Pearson's) in Python, but it seems that this only works for comparing two one-dimensional arrays (I have 1700+ columns of data to correlate).
Code I am starting with, to show you what I have imported:
import csv
infileP = open('AllcorrP.csv', 'rU')
infileR = open('AllcorrR.csv', 'rU')
The question
Can anyone help me extract the column and row headers and r-values from my r-value file based on significant (< 0.05) P-values from my p-value file?
OR
Calculate the r- and P-values for all possible correlations between many columns of data directly using Python and extract only the results with significant P-values?
In the end, I would like output in two files.
First file:
Species1 Species2 Species4 ...
Species2 Species1 Species7 ...
etc. (where "Species1" is the first species with significant correlations, and the next items on the line are the species it is significantly correlated with: Species2, Species4, etc.)
Second file:
Species1 (corr) Species2 = 0.87
Species2 (corr) Species7 = 0.72
...
etc. which shows each pairwise correlation and the r-value that goes with it
At this point, I'd be happy to just be able to extract a list of the r-values and species that I want and figure out the final two file formatting later. Thank you!
Comments (2)
To read the data, you should be able to use numpy.genfromtxt. See the documentation; there is a ton of functionality within this function. To read your example above, you might do:
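A minimal sketch of such a call, assuming the file is comma-delimited with a header row and a leading label column. The in-memory text is a made-up stand-in for 'AllcorrP.csv'; with a real file you would pass the filename instead of the StringIO object.

```python
import io
import numpy as np

# Hypothetical stand-in for the contents of 'AllcorrP.csv' (values are made up).
p_text = """,Species1,Species2,Species3
Species1,0.0,0.01,0.20
Species2,0.01,0.0,0.03
Species3,0.20,0.03,0.0
"""

# skip_header=1 drops the header row. The label column is not numeric,
# so genfromtxt reads it as nan; [:, 1:] slices that column away.
pdata = np.genfromtxt(io.StringIO(p_text), delimiter=",", skip_header=1)[:, 1:]

print(pdata.shape)
```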
The [:,1:] is to ignore the first column of data when read in. The function doesn't have an input to "ignore the first x columns" like it does for rows (via skip_header). Not sure why they didn't implement this, it always bugged me.
This would just read the data for P (you can do the same for r). Then you can filter the data pretty easily. You could read in the first row and first column separately to get the headings. Or, if you look at the genfromtxt documentation, you could also name them (create a recarray).
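One way to pick up the headings separately is the standard csv module. This is only a sketch, again assuming a comma-delimited layout with an empty top-left cell; the sample text is made up.

```python
import csv
import io

# Hypothetical stand-in for the contents of 'AllcorrR.csv' (values are made up).
r_text = """,Species1,Species2,Species3
Species1,1.0,0.9,0.2
Species2,0.9,1.0,0.7
Species3,0.2,0.7,1.0
"""

reader = csv.reader(io.StringIO(r_text))
# The first row holds the column headers; skip its empty top-left cell.
col_headers = next(reader)[1:]
# The first cell of every remaining row is that row's header.
row_headers = [row[0] for row in reader]

print(col_headers)
print(row_headers)
```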
To find the indices (values) where r is less than 0.50, you can simply do a comparison and numpy automagically creates a boolean array for you:
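As a sketch of that comparison, here it is applied to the P < 0.05 threshold from the question; the same expression works for an r cutoff such as rdata < 0.50. The small array is made up for illustration.

```python
import numpy as np

# Made-up P-values for three species.
pdata = np.array([[0.0, 0.01, 0.20],
                  [0.01, 0.0, 0.03],
                  [0.20, 0.03, 0.0]])

# Element-wise comparison: the result is a boolean array of the same shape.
significant = pdata < 0.05

print(significant)
```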
This can be used as an index into rdata (make sure there are the same number of rows/columns):
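A small sketch of boolean indexing, using made-up r- and P-value arrays of matching shape. Note that indexing with a 2-D mask returns a flat 1-D array of the selected values in row-major order.

```python
import numpy as np

# Made-up matrices; rdata and pdata must have the same shape.
rdata = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 0.7],
                  [0.2, 0.7, 1.0]])
pdata = np.array([[0.0, 0.01, 0.20],
                  [0.01, 0.0, 0.03],
                  [0.20, 0.03, 0.0]])
assert rdata.shape == pdata.shape

# Select the r-values whose matching P-value is below the threshold.
significant = pdata < 0.05
r_significant = rdata[significant]

print(r_significant)
```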
You can do something like this to get a list of tuples, containing the row and column headers, and the r and P values of the data elements you're interested in:
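One possible sketch of that idea, with made-up headers and values. It assumes the matrices are symmetric, so it only looks above the diagonal: each pair is reported once and self-correlations are skipped. The printed format follows the second output file described in the question.

```python
import numpy as np

# Made-up headers and matrices; in practice these come from the two input files.
headers = ["Species1", "Species2", "Species3"]
rdata = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 0.7],
                  [0.2, 0.7, 1.0]])
pdata = np.array([[0.0, 0.01, 0.20],
                  [0.01, 0.0, 0.03],
                  [0.20, 0.03, 0.0]])

# Keep only significant entries strictly above the diagonal (k=1).
mask = np.triu(pdata < 0.05, k=1)
rows, cols = np.nonzero(mask)

# One tuple per pair: (row header, column header, r-value, P-value).
pairs = [(headers[i], headers[j], float(rdata[i, j]), float(pdata[i, j]))
         for i, j in zip(rows, cols)]

for row_name, col_name, r, p in pairs:
    print("%s (corr) %s = %.2f" % (row_name, col_name, r))
```

If each species should also be listed against all of its partners, as in the first output file, the mask could be taken without np.triu and the diagonal excluded separately.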