How do I index different types of XML with DIH in Solr?
I need to index five different kinds of XML files. They share a similar structure, with slight differences in each.
Example 1:
<?xml version="1.0"?>
<manifest>
  <metadata>
    <isbn>9780815341291</isbn>
    <title>Essential Cell Biology,Third Edition</title>
    <authors>
      <author>Alberts;Bruce</author>
      <author>Bray;Dennis</author>
    </authors>
    <categories>
      <category>SCABC</category>
      <category>SCDEF</category>
    </categories>
  </metadata>
  <resources>
    <audioresource>
      <uuid>123456789</uuid>
      <source>03_Mutations_Origin_Cancer.mp3</source>
      <mimetype>audio/mpeg</mimetype>
      <title>Part Three - Mutations and the Origin of Cancer</title>
      <description>123</description>
      <chapters>
        <chapter>1</chapter>
      </chapters>
    </audioresource>
  </resources>
</manifest>
Example 2:
<?xml version="1.0"?>
<manifest>
  <metadata>
    <isbn>9780815341291</isbn>
    <title>Essential Cell Biology,Third Edition</title>
    <authors>
      <author>FN:Alberts;Bruce</author>
      <author>FN:Bray;Dennis</author>
    </authors>
    <categories>
      <category>SCABC</category>
      <category>SCGHI</category>
    </categories>
  </metadata>
  <resources>
    <glossaryresource>
      <uuid>123456789</uuid>
      <term>A subunit </term>
      <definition>The portion of a bacterial exotoxin that interferes with normal host cell function. </definition>
      <chapters>
        <chapter>10</chapter>
      </chapters>
    </glossaryresource>
  </resources>
</manifest>
My dih-config.xml is below:
<dataConfig>
  <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="dir" rootEntity="false" dataSource="null" processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="true" baseDir="X:/tmp/npr">
      <entity name="audioresource"
              rootEntity="true"
              dataSource="fileReader"
              url="${dir.fileAbsolutePath}"
              stream="false"
              logTemplate=" processing ${dir.fileAbsolutePath}"
              logLevel="debug"
              processor="XPathEntityProcessor"
              forEach="/manifest/metadata | /manifest/metadata/authors | /manifest/metadata/categories | /manifest/metadata/resources | /manifest/resources/audioresource | /manifest/resources/audioresource/chapters"
              transformer="DateFormatTransformer">
        <field column="category" xpath="/manifest/metadata/categories/category" />
        <field column="author" xpath="/manifest/metadata/authors/author" />
        <field column="book_title" xpath="/manifest/metadata/title" />
        <field column="isbn" xpath="/manifest/metadata/isbn"/>
        <field column="id" xpath="/manifest/resources/audioresource/uuid"/>
        <field column="mimetype" xpath="/manifest/resources/audioresource/mimetype" />
        <field column="title" xpath="/manifest/resources/audioresource/title"/>
        <field column="description" xpath="/manifest/resources/audioresource/description"/>
        <field column="chapter" xpath="/manifest/resources/audioresource/chapters/chapter"/>
        <field column="source" xpath="/manifest/resources/audioresource/source"/>
      </entity>
    </entity>
  </document>
</dataConfig>
I'm not very familiar with XPath. I can't use a wildcard in an element name, can I? I tried it and it didn't work.
Many thanks in advance.
Comments (1)
I'm currently investigating a similar issue. Have you tried creating an XSLT? The entity element has an optional "xsl" attribute.
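A rough, untested sketch of what that could look like, assuming the goal is to rename every child of <resources> (audioresource, glossaryresource, and so on) to one common element before the XPath fields are evaluated. The <resource> element name and its type attribute are made up for illustration; they are not part of the original files.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rename each child of <resources> to <resource> (hypothetical name),
       keeping the original element name in a "type" attribute. -->
  <xsl:template match="/manifest/resources/*">
    <resource type="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </resource>
  </xsl:template>
</xsl:stylesheet>
```

The inner entity could then point at the stylesheet and use a single set of paths for all five file types. The stylesheet path and the resource_type column are placeholders, and I have not verified how commonField behaves with the multi-valued author and category fields, so treat this as a starting point only:

```xml
<entity name="resource"
        rootEntity="true"
        dataSource="fileReader"
        url="${dir.fileAbsolutePath}"
        stream="false"
        processor="XPathEntityProcessor"
        xsl="xslt/normalize-resources.xsl"
        forEach="/manifest/resources/resource">
  <!-- One Solr document per normalized <resource> element. -->
  <field column="id" xpath="/manifest/resources/resource/uuid"/>
  <field column="resource_type" xpath="/manifest/resources/resource/@type"/>
  <field column="chapter" xpath="/manifest/resources/resource/chapters/chapter"/>
  <!-- Book-level values sit outside the forEach path, hence commonField. -->
  <field column="isbn" xpath="/manifest/metadata/isbn" commonField="true"/>
  <field column="book_title" xpath="/manifest/metadata/title" commonField="true"/>
</entity>
```

Type-specific fields such as term, definition, or mimetype would be added the same way under /manifest/resources/resource/ and simply stay empty for file types that do not have them.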