精炼存储在 SQLite 中的数据 - 如何连接多个联系人?
我正在存储不同元素之间的联系。我想消除某种类型的元素并存储由消除的元素互连的元素的新联系人。
问题背景
想象一下这个问题。你有一个水分子与其他分子接触(如果接触是氢键,那么我的水周围可能有 4 个其他分子)。如下图所示(A、B、C、D是其他一些原子,点表示接触)。
A B
| |
H H
. .
O
/ \
H H
. .
C D
我有所有点的信息,我需要消除中心的水并创建描述 AC、AD、AB、BC、BD 和 CD 接触的记录。
数据库结构
目前,我在数据库中有以下结构:
表atoms
:
“id”整数PRIMARY KEY,
“amino”char(3)NOT NULL,
(HOH 表示水或其他值)- 标识原子的其他列表
contacts
:
“acceptor_id”整数NOT NULL,
(靠近我的氢的原子,这里C 或 D)"donor_id" 整数 NOT NULL,
(此处为 A 或 B)"directness" char(1) NOT NULL,
(这应该是 D 代表直接,W 代表 W对于水介导的)- 有关接触的其他列,例如距离
编辑: 前面描述的案例中的数据会是什么样子?
atoms
:
id|atom|amino
1 | O | HOH
2 | N | ARG <- atom A from image
3 | S | CYS <- B
4 | O | SER <- C
5 | N | ARG <- D
contacts
:
donor_id|acceptor_id|directness
1 4 D
1 5 D
2 1 D
3 1 D
我需要从中创建
contacts
:
donor_id|acceptor_id|directness
3 4 W <- B-C
3 5 W <- B-D
2 4 W <- A-C
2 5 W <- A-D
2 3 X <- A-B (These last two rows are escaping me,
4 5 X <- C-D there could be also row D-C, but not
both C-D and D-C. A char 'X' could
be used to mark "no donor/acceptor")
当前解决方案(不足)
现在,我正在浏览所有具有以下功能的联系人: 捐赠者.amino =“HOH”
。在此示例案例中,这将从 C 和 D 中选择联系人。对于每个选定的联系人,我都会查找与当前的 donor_id
具有相同 acceptor_id
的联系人。选定的联系人。根据此信息,我创建了新联系人。最后,我删除了 HOH 的所有联系人。
这样,我显然无法创建 CD 和 AB 联系人(其他 4 个都可以)。
如果我尝试类似的方法 - 试图找到两个具有相同捐赠者 ID 的联系人,我最终会得到重复的联系人(CD 和 DC)。
有没有一种简单的方法可以检索所有六个联系人而不重复?
我梦想着一个一页长的 SQL 查询,它只检索这六个想要的行。 :-)
但是,欢迎任何其他想法。
最好保存有关捐赠者的信息(如果可能),但并非绝对必要。
非常感谢到目前为止阅读此问题的所有人。
I'm storing contacts between different elements. I want to eliminate elements of certain type and store new contacts of elements which were interconnected by the eliminated element.
Problem background
Imagine this problem. You have a water molecule which is in contact with other molecules (if the contact is a hydrogen bond, there can be 4 other molecules around my water). Like in the following picture (A, B, C, D are some other atoms and dots mean the contact).
A B
| |
H H
. .
O
/ \
H H
. .
C D
I have the information about all the dots and I need to eliminate the water in the center and create records describing contacts of A-C, A-D, A-B, B-C, B-D, and C-D.
Database structure
Currently, I have the following structure in the database:
Table atoms
:
"id" integer PRIMARY KEY,
"amino" char(3) NOT NULL,
(HOH for water or other value)- other columns identifying the atom
Table contacts
:
"acceptor_id" integer NOT NULL,
(the atom near to my hydrogen, here C or D)"donor_id" integer NOT NULL,
(here A or B)"directness" char(1) NOT NULL,
(this should be D for direct and W for water-mediated)- other columns about the contact, such as the distance
EDIT:
How would look the data in the case depicted earlier.
atoms
:
id|atom|amino
1 | O | HOH
2 | N | ARG <- atom A from image
3 | S | CYS <- B
4 | O | SER <- C
5 | N | ARG <- D
contacts
:
donor_id|acceptor_id|directness
1 4 D
1 5 D
2 1 D
3 1 D
From which I need to make
contacts
:
donor_id|acceptor_id|directness
3 4 W <- B-C
3 5 W <- B-D
2 4 W <- A-C
2 5 W <- A-D
2 3 X <- A-B (These last two rows are escaping me,
4 5 X <- C-D there could be also row D-C, but not
both C-D and D-C. A char 'X' could
be used to mark "no donor/acceptor")
Current solution (insufficient)
Now, I'm going through all the contacts which have donor.amino = "HOH"
. In this sample case, this would select contacts from C and D. For each of these selected contacts, I look up contacts having the same acceptor_id
as is the donor_id
in the currently selected contact. From this information, I create the new contact. At the end, I delete all contacts to or from HOH.
This way, I am obviously unable to create C-D and A-B contacts (the other 4 are OK).
If I try a similar approach - trying to find two contacts having the same donor_id, I end up with duplicate contacts (C-D and D-C).
Is there a simple way to retrieve all six contacts without duplicates?
I'm dreaming about some one page long SQL query which retrieves just these six wanted rows. :-)
However, any other ideas are welcome.
It is preferable to conserve information about who is donor (where possible), but not strictly necessary.
Big thanks to all of you who read this question to this point.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
你的解释有一个困难。
首先是一个有向图,其中每条边代表一个连接 X=>Y,其中 X 是供体,Y 是受体。表
atoms
是该图的 SQL 表示形式。您似乎想要的是无向的东西。因此,连接 XY 意味着 X 和 Y 通过水分子(当然也可以是其他一些物质)连接,但 X 和 Y 都可以是供体或受体。因此,您的最后一个表格有歧义(您注意到),因此某些链接可能以任何方式出现。在我看来,这意味着最后一列中的列标题
donor_id
和acceptor_id
没有您所解释的任何含义。这当然可能是我的困惑。如果您想要的只是一张包含所有 6 个链接的表,每个链接作为一行,但不要太担心跟踪捐赠者/接受者的事情,那么这对我在 sqlite3 中有用:
我使用了临时视图让事情变得更清楚。如果您愿意,您可以将所有这些放入一个大查询中,方法是用第一个查询替换对
hoh_view
的每个引用。我觉得有点恶心,也许有办法整理一下。如果您确实想跟踪供体/受体关系,您需要解释当两个氨基酸都是受体或供体时您如何决定做什么(即示例中的最后两行)。
如果这没有达到你想要的效果,那么也许我可以修复它,这样就可以了。
There's one difficulty with your explanation.
What you start with is a directed graph where each edge represents a connection X=>Y where X is a donor and Y an acceptor. The table
atoms
is the SQL representation of that graph.What you seem to want is something that is undirected. So that a link X-Y means that X and Y are linked via a water molecule (or some other species of course) but that X and Y could both be donors or acceptors. For this reason your last table has an ambiguity (that you note) so that some links could occur either way around. It seems to me that means that the column headings
donor_id
andacceptor_id
on your final column don't have any meaning that you have explained. This may be my confusion of course.If all you want is a table with all 6 links in it, each as one row, but don't worry too much about keeping track of the donor/acceptor thing then this works for me in sqlite3:
Where I have used a temporary view to make things clearer. You can put this all into one big query if you like by replacing each reference to
hoh_view
by the first query. It feels a bit nasty to me and there may be a way of tidying it up.If you do want to keep track of donor/acceptor relationships you need to explain how you decide what to do when both amino acids are acceptors or donors (i.e. the last two rows in your example).
If that doesn't do what you want, then maybe I can fix it up so it does.
好吧,很难在评论中提供示例,我决定发布一个答案:
如果您必须遵循原始数据,则无法区分第一个结构的数据和第二个结构的数据。应该有一个附加的分组条件来消除第一和第二结构之间的方向。
UPD
这是原始数据:
这是查询:
这是结果:
Well, its hard to provide examples in comments, I decided to post an answer:
If you have to following original data, there is no way to distinguish data from the first structure from those of the second. There should be an additional grouping condition to eleminate directions between the first and the second structure.
UPD
Here is the original data:
Here is the query:
Here is the result: