Where can I find a large tabbed hierarchical data set for parser testing?
First, apologies, as I realize this is only tangentially related to parser programming.
I've spent hours looking for a text file containing something like the following, but with hundreds (hopefully thousands) of sub-entries. A complete biological classification file would be perfect. A massive version of the following would be great, as my parser parses simple tabbed files:
TL;DR - I need a massive single-file hierarchical data set, something like the following:
Kingdoms
    Monera
    Protista
    Fungi
    Plants
    Animals
        Porifera
            Sponges
        Coelenterates
            Hydra
            Coral
            Jellyfish
        Platyhelminthes
            Flatworms
            Flukes
        Nematodes
            Roundworms
            Tapeworms
        Chordates
            Urochordates
            Cephalochordates
            Vertebrates
                Fish
                Amphibians
                Reptiles
                Birds
                Mammals
The best I've been able to find are tree-of-life images (from which I transcribed the sample data set above). A single file with a TON of real data would be awesome. It doesn't have to be a biological classification data set, but I would really like the data to reflect something in the real world. (My parser feeds a menu, so it would be great if the rest of my testing used a data set that actually meant something!) Even a file that isn't tabbed would be great, as long as the data could be converted to a tabbed format fairly easily with a regex.
Any ideas? Thanks!
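For anyone unsure what "simple tabbed file" means here, the following is a minimal Python sketch of reading such a file into a nested structure, one tab per level of depth. It is an illustration only, not the parser from the question; the function name parse_tabbed and the (name, children) tuple layout are assumptions made up for the example.

```python
def parse_tabbed(text):
    """Parse tab-indented lines into a nested list of (name, children) nodes."""
    root = []
    # stack[d] is the child list that receives nodes found at depth d
    stack = [root]
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name = line.lstrip("\t")
        depth = len(line) - len(name)  # number of leading tabs
        if depth > len(stack) - 1:
            raise ValueError(f"line indented too deeply: {line!r}")
        children = []
        stack[depth].append((name.strip(), children))
        # Forget deeper levels, then make this node the parent for depth + 1
        del stack[depth + 1:]
        stack.append(children)
    return root


sample = "Kingdoms\n\tMonera\n\tAnimals\n\t\tPorifera\n\t\t\tSponges\n\t\tChordates\n"
tree = parse_tabbed(sample)
```

A stack of "current child lists" keeps the parser to a single pass, which matters once the file has thousands of entries.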
Comments (2)
The XML layout may have changed since the last answer, but the code submitted above is no longer accurate. The resulting dump contains extraneous entries: some nodes have aliases (denoted as 'othername') that are reported as distinct nodes themselves.
I used the script below to generate the correct dump.
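What follows is not the poster's script, just a minimal Python sketch of the alias fix described above: take each node's own name and ignore names nested under its aliases. The element names (TREE, NODE, NAME, NODES, OTHERNAMES, OTHERNAME) and the inline sample are assumptions made for illustration, so check them against the actual feed before relying on this.

```python
import xml.etree.ElementTree as ET


def dump_nodes(node, depth=0, out=None):
    """Collect one tab-indented line per NODE, ignoring OTHERNAME aliases."""
    if out is None:
        out = []
    # findtext("NAME") matches direct children only, so a NAME nested
    # inside OTHERNAMES/OTHERNAME is never mistaken for the node's own name.
    out.append("\t" * depth + (node.findtext("NAME") or "").strip())
    for child in node.findall("NODES/NODE"):
        dump_nodes(child, depth + 1, out)
    return out


# Made-up sample in the assumed layout; not real tolweb.org output.
SAMPLE = """<TREE>
  <NODE ID="1">
    <NAME>Animals</NAME>
    <OTHERNAMES><OTHERNAME><NAME>Metazoa</NAME></OTHERNAME></OTHERNAMES>
    <NODES>
      <NODE ID="2"><NAME>Porifera</NAME>
        <OTHERNAMES><OTHERNAME><NAME>Sponges</NAME></OTHERNAME></OTHERNAMES>
      </NODE>
      <NODE ID="3"><NAME>Chordates</NAME></NODE>
    </NODES>
  </NODE>
</TREE>"""

root = ET.fromstring(SAMPLE)
lines = dump_nodes(root.find("NODE"))
print("\n".join(lines))
```

A streaming reader that prints every NAME element it encounters would also print the alias names; walking only NODES/NODE children avoids that, at the cost of loading the whole document into memory.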
This turned out to be such a pain in the ass. I finally tracked down a data feed from "The Tree of Life Web Project" at tolweb.org. I made the PHP script below to provide the basic functionality my post was looking for.
Change the node_id to have it print a tabbed representation of any of tolweb.org's data - just take the id from the page you're browsing on their site and change the node_id below.
Be aware though - their data feeds serve up large files, so definitely download the file to your own server (and change the "open" method below to point to the local file) if you're going to hit it more than once or twice.
More info on tolweb.org data feeds can be found here:
http://tolweb.org/tree/home.pages/downloadtree.html
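The advice above about downloading the feed once and reading a local copy afterwards can be sketched in Python as follows. The helper name fetch_once, the FEED_URL placeholder, and the local file name are all hypothetical; the real feed URL has to be taken from the tolweb.org page linked above.

```python
import os
import urllib.request


def fetch_once(url, local_path):
    """Download url to local_path unless a copy already exists; return the path."""
    if not os.path.exists(local_path):
        # Download to a temporary name first so an interrupted transfer
        # never leaves a truncated file at local_path.
        tmp_path = local_path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, local_path)
    return local_path


# Hypothetical usage (FEED_URL is a placeholder, not a real endpoint):
# path = fetch_once(FEED_URL, "tolweb_dump.xml")
# ...then point the parser at `path` instead of the remote URL.
```

Every run after the first reads from disk, so repeated parser tests do not hit their server again.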