Processing XML files with Hadoop
I'm new to Hadoop and know very little about it.
My case is as follows:
I have a set of XML files (700 GB+) that all share the same schema:
<article>
  <title>some title</title>
  <abstract>some abstract</abstract>
  <year>2000</year>
  <id>E123456</id>
  <authors>
    <author id="1">
      <firstName>some name1</firstName>
      <lastName>some name1</lastName>
      <email>[email protected]</email>
      <affiliations affid="123">
        <org>some organization1</org>
        <org>some organization2</org>
      </affiliations>
    </author>
    <author id="2">
      <firstName>some name2</firstName>
      <lastName>some name2</lastName>
      <email>[email protected]</email>
      <affiliations affid="123">
        <org>some organization1</org>
        <org>some organization2</org>
      </affiliations>
    </author>
  </authors>
  <tags>
    <tag>medicin</tag>
    <tag>inheritance</tag>
  </tags>
  <references>
    <reference>some reference text1</reference>
    <reference>some reference text2</reference>
  </references>
</article>
I convert the data in the XML files into a relational database containing the following tables (a parsing sketch follows the list):
- Articles
- Authors
- Tags
- References
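For context, a minimal sketch of such an extraction using the standard Java StAX API could look like the code below. The table and field names in the output are assumptions based on the schema above, not the asker's actual tooling.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ArticleExtractor {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            String element = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    element = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS
                        && element != null && !reader.isWhiteSpace()) {
                    String text = reader.getText().trim();
                    // Route each field to its (hypothetical) target table.
                    switch (element) {
                        case "title":     System.out.println("Articles.title = " + text); break;
                        case "id":        System.out.println("Articles.id = " + text); break;
                        case "tag":       System.out.println("Tags.tag = " + text); break;
                        case "reference": System.out.println("References.text = " + text); break;
                        // year, authors, affiliations, ... handled similarly
                    }
                }
            }
            reader.close();
        }
    }
}

In a real pipeline the println calls would be replaced by batched inserts into the corresponding tables.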
I have a set of tools that work on these tables to generate statistical reports and do some other work. Because one of the tools runs a full-text search on the References table, I also store that table in a Lucene/Solr index.
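As an aside, indexing and querying such a References table through Solr's Java client (SolrJ) could look roughly like this sketch; the core name "references" and the field names are assumptions for illustration, not the asker's actual setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ReferenceIndex {
    public static void main(String[] args) throws Exception {
        // Core name and field names are assumed for this sketch.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/references").build()) {
            // Index one reference row.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "E123456-ref-1");
            doc.addField("article_id", "E123456");
            doc.addField("text", "some reference text1");
            solr.add(doc);
            solr.commit();

            // Full-text query against the indexed reference text.
            SolrQuery query = new SolrQuery("text:reference");
            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}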
My question is:
can I use Hadoop for:
- Storing the data that is in the XML files
- Making some kind of separation between the entities listed above (Authors, Articles, Tags, and References)
- Running my tools, which perform a very complex set of queries on the data? And if that can be done using Hadoop, will the performance be good?
If Hadoop is not a good candidate for this case, would another NoSQL database such as MongoDB or Cassandra be a better solution? (My big problem with the relational system is the very poor performance of the complex algorithms I use to do my job.)
Comments (1)
What you are asking for sounds very similar to what Google, Yahoo, Bing, etc. do with the web: suck in documents as some form of markup, store them, process them to extract the relevant information, and provide a query interface on top of that. I'd suggest looking into how these search engines leverage MapReduce and BigTable implementations (like HBase and Cassandra) to do exactly that.
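To make the MapReduce side concrete, here is a minimal mapper sketch. It assumes an XmlInputFormat-style input format (such as the one that shipped with older Mahout releases) configured to hand each <article>...</article> element to the mapper as a single record; the per-year key and the crude string extraction are illustrative choices, not a prescribed design:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes an XmlInputFormat-style reader configured with <article> /
// </article> as start/end tags, so each map() call sees one full article.
public class ArticleMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String xml = value.toString();
        String id = between(xml, "<id>", "</id>");
        String year = between(xml, "<year>", "</year>");
        if (id != null && year != null) {
            // Emit article id keyed by year, e.g. to count articles per year
            // in the reducer.
            context.write(new Text(year), new Text(id));
        }
    }

    // Crude substring extraction; a real job would use a proper XML parser.
    private static String between(String s, String open, String close) {
        int i = s.indexOf(open);
        int j = (i < 0) ? -1 : s.indexOf(close, i + open.length());
        return (j < 0) ? null : s.substring(i + open.length(), j).trim();
    }
}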