XML parser + indexing data

Posted 2024-11-16 04:16:19

I need to index some XML documents with Lucene, but before that I need to parse them and extract some information from their tags.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="es" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
        <styling>
            <style id="bl" tts:fontWeight="bold" tts:color="#FFFFFF" tts:fontSize="15" tts:fontFamily="sansSerif"/>
       </styling>
  </head>

  <body>
    <div xml:lang="es">
            <p begin="00:00.50" end="00:04.02" style="bl">Info</p>
            <p begin="00:04.32" end="00:07.68" style="bl">Different words,<br />and phrases to index</p>
            <p begin="00:11.76" end="00:16.04" style="bl">Text</p>
            <p begin="00:18.52" end="00:22.88" style="bl">More and<br />more text</p>
   </div>
  </body>
</tt>

I need to extract only the timestamps in the begin and end attributes, and then index the text inside the p tags. The goal is to query the indexed text and find out which timestamp interval each hit falls in.

For example, if I query the word "Text", the output should say something like: "2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88"

I started by indexing the entire XML with Lucene. Now I want to parse the file, but I'm not sure what the best approach to this problem is.

Any help or advice is welcome :)
Thank you all!


雄赳赳气昂昂 2024-11-23 04:16:19

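I've used the SAX libraries (i.e. subclasses of org.xml.sax.helpers.DefaultHandler) to parse XML files. I extract the information I need from each XML document into my own Document class and then index that Document instance. (The indirection is there because several document formats have to be parsed separately but indexed in the same index.) In your case, if the contents of each <div> element represent one logical document, you could store the date (timestamp) information as payloads associated with specific tokens. Parse the XML down to the <div> level, enumerate the paragraph instances, and for each one add a new Field instance with the same name, where the value is the text and the payload is the date information in a suitable representation. (Payloads are binary, so you could, for example, store two long values corresponding to the start and end times.) When you add multiple Field instances with the same name to a document, they are indexed as the same field, but you can assign a different payload to each instance, adjust the position where the text starts, and so on.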
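A minimal sketch of the SAX step, assuming the timestamps always look like mm:ss.ff as in the sample file. The names CueHandler, Cue, toMillis and toPayload are made up for illustration and are not part of SAX or Lucene. toPayload only shows one possible way to pack the two times into bytes; attaching those bytes to tokens in Lucene is not shown here.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one Cue per <p> element.
class CueHandler extends DefaultHandler {
    // Hypothetical holder for one paragraph's timing attributes and text.
    static final class Cue {
        final String begin, end, text;
        Cue(String begin, String end, String text) {
            this.begin = begin; this.end = end; this.text = text;
        }
    }

    final List<Cue> cues = new ArrayList<>();
    private StringBuilder text;   // non-null only while inside a <p>
    private String begin, end;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            begin = atts.getValue("begin");
            end = atts.getValue("end");
            text = new StringBuilder();
        } else if ("br".equals(qName) && text != null) {
            text.append(' ');     // treat <br/> as a word separator
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            cues.add(new Cue(begin, end, text.toString().trim()));
            text = null;
        }
    }

    // Parses an XML string with the default (non-namespace-aware) SAX parser.
    static List<Cue> parse(String xml) {
        CueHandler handler = new CueHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.cues;
    }

    // Assumes "mm:ss.ff"; converts to milliseconds.
    static long toMillis(String timestamp) {
        String[] parts = timestamp.split(":");
        return Long.parseLong(parts[0]) * 60_000
            + Math.round(Double.parseDouble(parts[1]) * 1000);
    }

    // Packs begin and end as two long values (16 bytes), one possible payload layout.
    static byte[] toPayload(Cue cue) {
        return ByteBuffer.allocate(2 * Long.BYTES)
            .putLong(toMillis(cue.begin)).putLong(toMillis(cue.end)).array();
    }

    public static void main(String[] args) {
        String xml = "<tt><body><div>"
            + "<p begin=\"00:18.52\" end=\"00:22.88\">More and<br />more text</p>"
            + "</div></body></tt>";
        for (Cue cue : parse(xml)) {
            System.out.println(cue.begin + "-" + cue.end + " " + cue.text);
        }
    }
}
```

Each Cue then holds exactly what the indexing step needs: the text to put in the field and the two times to attach to it.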

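If you don't need the contents of each <div> element as a single document, you could treat each <p> as a separate document and set the payload on that. Or you could store the dates as separate fields.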
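To illustrate the one-document-per-<p> idea without tying it to a particular Lucene version, here is a plain-Java stand-in. This is not Lucene: the hypothetical CueIndex class uses an in-memory map in place of the index and keeps the begin and end values next to each entry, much as separate fields would sit next to the text. It is only meant to show the shape of the data and of the query result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an index with one entry per <p>; not Lucene.
class CueIndex {
    // term -> "begin-end" interval of every <p> that contains the term
    private final Map<String, List<String>> postings = new HashMap<>();

    // Adds one <p> as its own "document": the text is tokenized,
    // begin/end are kept alongside as the value to report for a hit.
    void add(String begin, String end, String text) {
        String interval = begin + "-" + end;
        // Split on anything that is not a letter or digit (keeps accented letters).
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            List<String> hits = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!hits.contains(interval)) hits.add(interval); // count each <p> once per term
        }
    }

    // Returns the hit count followed by the interval of each matching <p>.
    String query(String word) {
        List<String> hits =
            postings.getOrDefault(word.toLowerCase(Locale.ROOT), new ArrayList<>());
        return hits.isEmpty() ? "0 hits" : hits.size() + " hits, " + String.join(", ", hits);
    }

    public static void main(String[] args) {
        CueIndex index = new CueIndex();
        index.add("00:00.50", "00:04.02", "Info");
        index.add("00:04.32", "00:07.68", "Different words, and phrases to index");
        index.add("00:11.76", "00:16.04", "Text");
        index.add("00:18.52", "00:22.88", "More and more text");
        System.out.println(index.query("Text"));
        // prints: 2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88
    }
}
```

Because every <p> is its own entry, a hit maps directly to one begin/end pair, which is the output the question asks for.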
