Which is easier: merging or indicator variables?

Posted on 2024-11-26 11:42:00

I have two sets of data that I would like to investigate. The first is gene/genome-related data measured under different 'cell-states'. The second relates the genes to biological pathways. I believe my question is a relational database one.

How can I take the data in one data frame and relate it to the data in another? In other words, I want to graph the cell-state data and relate it to pathways and their specific genes. (I think in pictures, so here goes.)
dataframe1: data from an Affymetrix gene chip
gene, cell-state1, cell-state2...
gene1, x1, y1,...
gene2, x2, y2,...
gene.x, ... ...

"1" "gene" "log_b" "log_b_rich" "Fc_cdt_rich_tot" "fc_Etoh_CDT_tot_mono" "fc_Etoh_CDT_tot_poly" "fc_Etoh_CDT_mono_poly" "fc_Etoh_Rich_tot_mono" "fc_Etoh_Rich_tot_poly" "fc_Etoh_Rich_mono_poly"
"2" "PHF13" -2.712616698 -1.47923545 -0.791138043 -0.549610558 0.143808182 0.69341874 0.320812876 1.089260116 0.76844724
"3" "SPSB1" -1.808348454 -1.965601198 -1.349135752 -0.780105329 0.410647447 1.190752776 0.587287796 1.260350195 0.673062399

dataframe2: data from the KEGG database
pathway1, gene-x1, gene-x2, ...
pathway2, gene-y1, gene-y2, ...
pathway3, gene-z1, ...

"1" "KEGG_GLYCOLYSIS_GLUCONEOGENESIS" "PHF13" "LDHB" "LDHA" "PGAM1" "ADH1C" "PGAM2" "ADH1B" "ADH1A" "ACSS2" "PDHB" "ACSS1" "PGAM4" "PDHA2" "PDHA1" "LDHAL6B" "PFKL" "LDHAL6A" "FBP1" "PFKP" "ALDH3B2" "FBP2" "PFKM" "ALDH3B1" "PGM2" "G6PC" "ALDH7A1" "ALDH1B1" "PKM2" "PGM1" "DLD" "PKLR" "ALDH9A1" "ALDOA" "ALDOC" "ALDOB" "ADH5" "HK2" "HK1" "ADH6" "ADH7" "ALDH3A2" "G6PC2" "ALDH3A1" "GALM" "TPI1" "AKR1A1" "ADH4" "HK3" "ALDH1A3" "ENO2" "ENO3" "GAPDH" "ENO1" "BPGM" "DLAT" "PCK2" "PCK1" "GPI" "GCK" "ALDH2" "PGK1" "PGK2"
"2" "KEGG_CITRATE_CYCLE_TCA_CYCLE" "PHF13" "OGDHL" "OGDH" "PDHB" "IDH3G" "LOC283398" "IDH2" "IDH1" "PDHA2" "PDHA1" "SUCLA2" "FH" "DLST" "ACO2" "SUCLG2" "ACO1"

"PHF13" is highlighted to show relevance in each step.

What I want to do is see whether 'cell-state1' (in-)activates different genes/pathways than 'cell-state2'. Furthermore, I would like to test for correlation (t-test and maybe graphing) between cell-states 1 and 2 for specific pathways.
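To make the per-pathway comparison concrete, here is a minimal sketch in plain Python. It assumes the data have already been arranged as one row per (pathway, gene) pair with one value per cell state; the pathway names, gene names, and numbers are hypothetical toy values, not taken from the tables above. It computes a paired t statistic because the same genes are measured in both states; whether a paired test suits the actual experimental design is an assumption.

```python
import math
import statistics

# Toy long-format rows: (pathway, gene, state1, state2). All values are hypothetical.
rows = [
    ("pathwayA", "geneA", 1.0, 2.0),
    ("pathwayA", "geneB", 2.0, 4.0),
    ("pathwayA", "geneC", 3.0, 6.0),
    ("pathwayB", "geneA", 1.0, 2.0),
]


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    diffs = [a - b for a, b in zip(x, y)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


def t_by_pathway(rows):
    """Group rows by pathway and test state1 against state2 within each group."""
    grouped = {}
    for pathway, _gene, s1, s2 in rows:
        xs, ys = grouped.setdefault(pathway, ([], []))
        xs.append(s1)
        ys.append(s2)
    # Pathways with a single measured gene cannot be tested and are skipped.
    return {p: paired_t(xs, ys) for p, (xs, ys) in grouped.items() if len(xs) >= 2}


results = t_by_pathway(rows)
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package would be the usual route.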

My question is: which commands or method would let me do this most easily and efficiently, merging the data frames or using dummy variables?
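For illustration, the two options can be sketched side by side in plain Python. The dictionaries below are shortened copies of the sample rows above (only the first two value columns and the first three genes of each pathway), and the function names `merge_long` and `add_indicators` are hypothetical helpers, not part of any library.

```python
# Shortened copies of the two tables from the post.
expression = {
    "PHF13": {"log_b": -2.712616698, "log_b_rich": -1.47923545},
    "SPSB1": {"log_b": -1.808348454, "log_b_rich": -1.965601198},
}
pathways = {
    "KEGG_GLYCOLYSIS_GLUCONEOGENESIS": ["PHF13", "LDHB", "LDHA"],
    "KEGG_CITRATE_CYCLE_TCA_CYCLE": ["PHF13", "OGDHL", "OGDH"],
}


def merge_long(expression, pathways):
    """Merge approach: one row per (pathway, gene) pair that has expression data."""
    return [
        {"pathway": pathway, "gene": gene, **expression[gene]}
        for pathway, genes in pathways.items()
        for gene in genes
        if gene in expression  # inner join: genes without expression data are dropped
    ]


def add_indicators(expression, pathways):
    """Dummy-variable approach: one row per gene, one 0/1 column per pathway."""
    table = {}
    for gene, values in expression.items():
        flags = {pathway: int(gene in genes) for pathway, genes in pathways.items()}
        table[gene] = {**values, **flags}
    return table


merged = merge_long(expression, pathways)
flagged = add_indicators(expression, pathways)
```

In this sketch the merged result repeats PHF13 once per pathway and drops SPSB1, which belongs to neither pathway shown, while the indicator table keeps each gene exactly once and grows by one column per pathway.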

HTH
