Converting a nested dictionary/XML into a flat file for SQLite
I've scoured the net and can't seem to find an appropriate example, so I thought I'd ask...
(By the way, much of this is new to me: not all, just most.)
Problem: I'm trying to convert a nested Python dictionary (or XML) of PubMed citation data into a flat (normalized) structure, e.g. SQLite. The citation data was fetched from PubMed using Biopython and parsed into a dictionary, but I can also retrieve it as XML if needed.
Not all citations will have all fields/keys, and not all fields/keys will have the same number of items (authors, MeSH terms, refs, etc.). I understand that handling this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first pull out all the single-valued fields (those that occur once per paper, e.g. title, abstract, date, citation, etc., but not, say, affiliation, as that would be linked to the first author). Could papers with no abstract be filled in as NULL?
Then move on to, say, authors and create a separate table, again using the PMID as the foreign key, and then do the same for the various other fields/keys/items in separate tables, e.g. MeSH headings, EC numbers, refs, etc.
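To make the plan above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The sample records and their keys (`PMID`, `TI`, `AB`, `DP`, `AU`, `MH`) are assumptions modeled on MEDLINE-style field tags, and the table and column names are hypothetical; adjust them to whatever the parsed dictionaries actually contain. A missing abstract is stored as NULL because `dict.get` returns `None`.

```python
import sqlite3

# Hypothetical sample records; the keys are assumed MEDLINE-style tags.
records = [
    {"PMID": "101", "TI": "First paper", "AB": "An abstract.", "DP": "2009",
     "AU": ["Smith J", "Lee K"], "MH": ["Humans", "Mice"]},
    # No abstract and no MeSH terms on this one.
    {"PMID": "102", "TI": "Second paper", "DP": "2010", "AU": ["Doe A"]},
]

# One row per paper, plus one child table per repeating field.
SCHEMA = """
CREATE TABLE paper (
    pmid     TEXT PRIMARY KEY,
    title    TEXT,
    abstract TEXT,
    pub_date TEXT
);
CREATE TABLE author (
    pmid     TEXT NOT NULL REFERENCES paper(pmid),
    position INTEGER NOT NULL,
    name     TEXT NOT NULL,
    PRIMARY KEY (pmid, position)
);
CREATE TABLE mesh_heading (
    pmid    TEXT NOT NULL REFERENCES paper(pmid),
    heading TEXT NOT NULL
);
"""

def load(conn, records):
    """Create the tables and insert every record."""
    conn.executescript(SCHEMA)
    for rec in records:
        pmid = rec["PMID"]
        # Single-valued fields: .get() yields None (NULL) when a key is absent.
        conn.execute(
            "INSERT INTO paper (pmid, title, abstract, pub_date) VALUES (?, ?, ?, ?)",
            (pmid, rec.get("TI"), rec.get("AB"), rec.get("DP")),
        )
        # Repeating fields: one child row per item, keyed by PMID.
        conn.executemany(
            "INSERT INTO author (pmid, position, name) VALUES (?, ?, ?)",
            [(pmid, i, name) for i, name in enumerate(rec.get("AU", []), start=1)],
        )
        conn.executemany(
            "INSERT INTO mesh_heading (pmid, heading) VALUES (?, ?)",
            [(pmid, heading) for heading in rec.get("MH", [])],
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, records)
```

The `position` column keeps author order, which makes it possible to attach an affiliation to the first author later. Other repeating fields (EC numbers, refs) would follow the same pattern as `mesh_heading`.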
Is there a way to do this that removes (pops?) keys/items from the master dictionary, so that I can see at a glance what's been done and what still needs to be done (obviously leaving the PMID)?
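One possible way to do the popping, sketched under the same assumptions (MEDLINE-style keys; `split_record` is a hypothetical helper name): work on a copy of each record, `pop` every key as it is handled, and whatever is left besides the PMID is what has not been captured yet.

```python
import copy

def split_record(record):
    """Pop handled fields off a copy; return them plus the leftovers."""
    remaining = copy.deepcopy(record)   # leave the master dictionary intact
    pmid = remaining["PMID"]            # read, not popped, so it stays behind
    paper = {
        "pmid": pmid,
        "title": remaining.pop("TI", None),
        "abstract": remaining.pop("AB", None),  # None when there is no abstract
    }
    authors = remaining.pop("AU", [])
    mesh = remaining.pop("MH", [])
    return paper, authors, mesh, remaining

# Hypothetical record with one field (RN) that the function does not handle.
record = {"PMID": "101", "TI": "First paper", "AU": ["Smith J"],
          "RN": ["EC 3.4.21.4 (Trypsin)"]}

paper, authors, mesh, leftover = split_record(record)
unhandled = sorted(key for key in leftover if key != "PMID")
print(unhandled)  # ['RN']
```

Collecting `unhandled` across all records (for example into a `set`) gives a checklist of the keys that still need a table of their own.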
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated. I do understand that you can't fit a nested structure into a flat space; I'm just looking for the least boneheaded way of going about this, and hopefully one that will let me make sure that everything was properly captured.
Many thanks,
chris
Comments (1)
A quick question: if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint, and is transactional, recoverable, and highly reliable. It also has HA (high-availability) features, if those are required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.
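As a rough illustration of querying the XML directly, here is a sketch that uses Python's standard-library `xml.etree.ElementTree` as a stand-in. It is not the Berkeley DB XML API, and ElementTree supports only a limited subset of XPath. The XML below is a simplified, made-up fragment loosely modeled on PubMed's element names, and `pmids_with_mesh` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; real PubMed XML has many more elements.
XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>101</PMID>
      <Article>
        <ArticleTitle>First paper</ArticleTitle>
        <AuthorList>
          <Author><LastName>Smith</LastName></Author>
          <Author><LastName>Lee</LastName></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>102</PMID>
      <Article><ArticleTitle>Second paper</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(XML.strip())

def pmids_with_mesh(root, heading):
    """Return the PMIDs of citations tagged with the given MeSH heading."""
    hits = []
    for citation in root.iter("MedlineCitation"):
        # Path expression relative to each citation; empty list if no MeSH terms.
        names = [d.text for d in
                 citation.findall("./MeshHeadingList/MeshHeading/DescriptorName")]
        if heading in names:
            hits.append(citation.findtext("PMID"))
    return hits

# Last names of all authors anywhere in the document, in document order.
last_names = [e.text for e in root.findall(".//Author/LastName")]
humans = pmids_with_mesh(root, "Humans")
```

Citations with missing elements simply yield empty results here, so no schema has to anticipate which fields are present.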