Processing XML files with Hadoop
I'm new to Hadoop and know very little about it.
My case is as follows:
I have a set of XML files (700 GB+) that all share the same schema:
<article>
  <title>some title</title>
  <abstract>some abstract</abstract>
  <year>2000</year>
  <id>E123456</id>
  <authors>
    <author id="1">
      <firstName>some name1</firstName>
      <lastName>some name1</lastName>
      <email>[email protected]</email>
      <affiliations affid="123">
        <org>some organization1</org>
        <org>some organization2</org>
      </affiliations>
    </author>
    <author id="2">
      <firstName>some name2</firstName>
      <lastName>some name2</lastName>
      <email>[email protected]</email>
      <affiliations affid="123">
        <org>some organization1</org>
        <org>some organization2</org>
      </affiliations>
    </author>
  </authors>
  <tags>
    <tag>medicin</tag>
    <tag>inheritance</tag>
  </tags>
  <references>
    <reference>some reference text1</reference>
    <reference>some reference text2</reference>
  </references>
</article>
I convert the data in the XML files into a relational database containing the following tables (a parsing sketch follows the list):
- Articles
- Authors
- Tags
- References
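For context, a minimal sketch of such an extraction using the standard Java StAX API could look like the code below. The table and field names in the output are assumptions based on the schema above, not the asker's actual tooling.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ArticleExtractor {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            String element = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    element = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS
                        && element != null && !reader.isWhiteSpace()) {
                    String text = reader.getText().trim();
                    // Route each field to its (hypothetical) target table.
                    switch (element) {
                        case "title":     System.out.println("Articles.title = " + text); break;
                        case "id":        System.out.println("Articles.id = " + text); break;
                        case "tag":       System.out.println("Tags.tag = " + text); break;
                        case "reference": System.out.println("References.text = " + text); break;
                        // year, authors, affiliations, ... handled similarly
                    }
                }
            }
            reader.close();
        }
    }
}

In a real pipeline the println calls would be replaced by batched inserts into the corresponding tables.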
I have a set of tools that work on these tables to generate statistical reports and do some other work. Because one of the tools runs a full-text search on the References table, I also store that table in a Lucene/Solr index.
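As an aside, indexing and querying such a References table through Solr's Java client (SolrJ) could look roughly like this sketch; the core name "references" and the field names are assumptions for illustration, not the asker's actual setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ReferenceIndex {
    public static void main(String[] args) throws Exception {
        // Core name and field names are assumed for this sketch.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/references").build()) {
            // Index one reference row.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "E123456-ref-1");
            doc.addField("article_id", "E123456");
            doc.addField("text", "some reference text1");
            solr.add(doc);
            solr.commit();

            // Full-text query against the indexed reference text.
            SolrQuery query = new SolrQuery("text:reference");
            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}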
My question is:
can I use Hadoop for:
- Storing the data that is in the XML files
- Making some kind of separation between the entities listed above (Authors, Articles, Tags, and References)
- Running my tools, which perform a very complex set of queries on the data? And if that can be done using Hadoop, will the performance be good?
If Hadoop is not a good candidate for this case, would another NoSQL database such as MongoDB or Cassandra be a better solution? (My big problem with the relational system is the very poor performance of the complex algorithms I use to do my job.)
Comments (1)
What you are asking for sounds very similar to what Google, Yahoo, Bing, etc. do with the web: suck in documents as some form of markup, store them, process them to extract the relevant information, and provide a query interface on top of that. I'd suggest looking into how these search engines leverage MapReduce and BigTable implementations (like HBase and Cassandra) to do exactly that.
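To make the MapReduce side concrete, here is a minimal mapper sketch. It assumes an XmlInputFormat-style input format (such as the one that shipped with older Mahout releases) configured to hand each <article>...</article> element to the mapper as a single record; the per-year key and the crude string extraction are illustrative choices, not a prescribed design:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes an XmlInputFormat-style reader configured with <article> /
// </article> as start/end tags, so each map() call sees one full article.
public class ArticleMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String xml = value.toString();
        String id = between(xml, "<id>", "</id>");
        String year = between(xml, "<year>", "</year>");
        if (id != null && year != null) {
            // Emit article id keyed by year, e.g. to count articles per year
            // in the reducer.
            context.write(new Text(year), new Text(id));
        }
    }

    // Crude substring extraction; a real job would use a proper XML parser.
    private static String between(String s, String open, String close) {
        int i = s.indexOf(open);
        int j = (i < 0) ? -1 : s.indexOf(close, i + open.length());
        return (j < 0) ? null : s.substring(i + open.length(), j).trim();
    }
}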