How to index different kinds of XML with DIH in Solr?

Posted on 2024-09-24 02:02:40

I need to index 5 different kinds of XML files. They share a similar structure, but each differs slightly from the others.

Example 1:

<?xml version="1.0"?>
<manifest>
        <metadata>
                <isbn>9780815341291</isbn>
                <title>Essential Cell Biology,Third Edition</title>
                <authors>
                        <author>Alberts;Bruce</author>
                        <author>Bray;Dennis</author>
                </authors>
                <categories>
                        <category>SCABC</category>
                        <category>SCDEF</category>
                </categories>
        </metadata>

        <resources>
                <audioresource>
                        <uuid>123456789</uuid>
                        <source>03_Mutations_Origin_Cancer.mp3</source>
                        <mimetype>audio/mpeg</mimetype>
                        <title>Part Three - Mutations and the Origin of Cancer</title>
                        <description>123</description>
                        <chapters>
                                <chapter>1</chapter>
                        </chapters>
                </audioresource>
        </resources>
</manifest>

Example 2:

<?xml version="1.0"?> 
<manifest> 
        <metadata> 
                <isbn>9780815341291</isbn> 
                <title>Essential Cell Biology,Third Edition</title> 
                <authors> 
                        <author>FN:Alberts;Bruce</author> 
                        <author>FN:Bray;Dennis</author> 
                </authors> 
                <categories> 
                        <category>SCABC</category> 
                        <category>SCGHI</category> 
                </categories> 
        </metadata> 

        <resources> 
                <glossaryresource> 
                        <uuid>123456789</uuid> 
                        <term>A subunit </term> 
                        <definition>The portion of a bacterial exotoxin that interferes with normal host cell function. </definition> 
                        <chapters> 
                                <chapter>10</chapter> 
                        </chapters> 
                </glossaryresource> 
        </resources> 
</manifest> 

My dih-config.xml is below:

<dataConfig>
        <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/>
        <document>
                <!-- Outer entity only lists the files; rootEntity="false" lets the inner entity produce the documents -->
                <entity name="dir" rootEntity="false" dataSource="null" processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="true" baseDir="X:/tmp/npr">
                        <entity name="audioresource"
                                        rootEntity="true"
                                        dataSource="fileReader"
                                        url="${dir.fileAbsolutePath}"
                                        stream="false"
                                        logTemplate=" processing ${dir.fileAbsolutePath}"
                                        logLevel="debug"
                                        processor="XPathEntityProcessor"
                                        forEach="/manifest/metadata | /manifest/metadata/authors | /manifest/metadata/categories | /manifest/metadata/resources | /manifest/resources/audioresource | /manifest/resources/audioresource/chapters"
                                        transformer="DateFormatTransformer">

                                        <field column="category" xpath="/manifest/metadata/categories/category" />
                                        <field column="author" xpath="/manifest/metadata/authors/author" />
                                        <field column="book_title" xpath="/manifest/metadata/title" />
                                        <field column="isbn" xpath="/manifest/metadata/isbn"/>
                                        <field column="id" xpath="/manifest/resources/audioresource/uuid"/>
                                        <field column="mimetype" xpath="/manifest/resources/audioresource/mimetype" />
                                        <field column="title" xpath="/manifest/resources/audioresource/title"/>
                                        <field column="description" xpath="/manifest/resources/audioresource/description"/>
                                        <field column="chapter" xpath="/manifest/resources/audioresource/chapters/chapter"/>
                                        <field column="source" xpath="/manifest/resources/audioresource/source"/>
                        </entity>
                </entity>
        </document>
</dataConfig>

I'm not very familiar with XPath. I can't use a wildcard in an element name, can I? I tried it and it didn't work.

Many thanks in advance.

Comments (1)

内心旳酸楚 2024-10-01 02:02:40

I'm currently investigating a similar issue. Have you tried writing an XSLT stylesheet? The entity element has an optional "xsl" attribute.
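
To make the XSLT suggestion concrete, here is a minimal sketch of a stylesheet that might be used as a preprocessor. It renames every direct child of <resources> (audioresource, glossaryresource, and so on) to one common <resource> element and keeps the original element name in a type attribute. The file name normalize-resources.xsl, the element name resource, and the attribute name type are all assumptions made for illustration; none of them come from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- normalize-resources.xsl (hypothetical file name) -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy every node and attribute unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Any direct child of <resources>, whatever its name, becomes <resource>.
       The original element name is preserved in the (assumed) type attribute. -->
  <xsl:template match="/manifest/resources/*">
    <resource type="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </resource>
  </xsl:template>

</xsl:stylesheet>
```

With something like this in place, the inner entity could point its xsl attribute at the stylesheet and address /manifest/resources/resource/uuid, /manifest/resources/resource/chapters/chapter, and so on, instead of one hard-coded element name per file type. Fields that exist in only some types (term and definition versus title and description) would each need their own field mapping and would simply stay empty where absent. As far as I know, applying xsl means the whole file is transformed before parsing rather than streamed, so this is probably better suited to small manifests like the ones shown.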
