XML parser + indexing data

Posted 2024-11-16 04:16:19

I need to index some XML documents with Lucene, but first I need to parse them and extract some information from their tags.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="es" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
        <styling>
            <style id="bl" tts:fontWeight="bold" tts:color="#FFFFFF" tts:fontSize="15" tts:fontFamily="sansSerif"/>
       </styling>
  </head>

  <body>
    <div xml:lang="es">
            <p begin="00:00.50" end="00:04.02" style="bl">Info</p>
            <p begin="00:04.32" end="00:07.68" style="bl">Different words,<br />and phrases to index</p>
            <p begin="00:11.76" end="00:16.04" style="bl">Text</p>
            <p begin="00:18.52" end="00:22.88" style="bl">More and<br />more text</p>
   </div>
  </body>
</tt>

I need to extract the timestamps from the begin and end attributes, and then index the text inside the p elements. The goal is to query the indexed text and find out which timestamp interval each hit falls in.

For example, if I query the word "Text", the output should be something like: "2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88"

I started by indexing the entire XML file with Lucene. Now I want to parse the file, but I'm not sure what the best approach is.

Any help or advice is welcome :)
Thank you all!

Comments (2)

雄赳赳气昂昂 2024-11-23 04:16:19

I used SAX (i.e., a subclass of org.xml.sax.helpers.DefaultHandler) to parse the XML files, extracted the desired information from each XML document into my own Document class, and then indexed that Document instance. (The indirection was there because several document formats had to be parsed separately but indexed in the same index.)

In your case, if the contents of each <body> element represent one logical document, you can store the timing information as payloads associated with specific tokens. Parse the XML down to the <p> level, enumerate the paragraph instances, and for each one add a new Field instance with the same name, where the value is the text and the payload is the begin/end information, suitably represented. (Payloads are binary, so, for example, you could store two long values corresponding to the start and end times.) When you add multiple Field instances with the same name to a document, they are indexed as the same field, but you can assign a different payload to each instance, adjust the position at which its text starts, and so on.
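As a rough illustration of the parsing step, here is a minimal sketch using only the JDK's SAX classes. The class name CueExtractor, the Cue holder, and the helpers toMillis and toPayloadBytes are hypothetical names made up for this example. It assumes the timestamps always look like mm:ss.ff, as in the sample file, and it only produces the 16 bytes that a payload could carry; attaching them to tokens in Lucene is not shown.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects one Cue per <p> element.
class CueExtractor extends DefaultHandler {
    /** One <p> element: its begin/end attributes and its flattened text. */
    static final class Cue {
        final String begin, end, text;
        Cue(String begin, String end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final List<Cue> cues = new ArrayList<>();
    private StringBuilder text;   // non-null only while inside a <p>
    private String begin, end;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            begin = atts.getValue("begin");
            end = atts.getValue("end");
            text = new StringBuilder();
        } else if ("br".equals(qName) && text != null) {
            text.append(' ');     // treat <br/> as a word separator
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            cues.add(new Cue(begin, end, text.toString().trim()));
            text = null;
        }
    }

    /** Parses the XML with the default (non-namespace-aware) SAX parser. */
    static List<Cue> parse(String xml) {
        CueExtractor handler = new CueExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse subtitle XML", e);
        }
        return handler.cues;
    }

    /** Converts "mm:ss.ff" (assumed format, taken from the sample) to milliseconds. */
    static long toMillis(String stamp) {
        String[] parts = stamp.split(":");
        long minutes = Long.parseLong(parts[0]);
        double seconds = Double.parseDouble(parts[1]);
        return minutes * 60_000 + Math.round(seconds * 1000);
    }

    /** Packs begin and end as two longs (16 bytes), e.g. for use as a payload. */
    static byte[] toPayloadBytes(Cue cue) {
        return ByteBuffer.allocate(2 * Long.BYTES)
            .putLong(toMillis(cue.begin))
            .putLong(toMillis(cue.end))
            .array();
    }

    public static void main(String[] args) {
        String xml = "<tt xmlns=\"http://www.w3.org/2006/04/ttaf1\"><body><div>"
            + "<p begin=\"00:11.76\" end=\"00:16.04\" style=\"bl\">Text</p>"
            + "<p begin=\"00:18.52\" end=\"00:22.88\" style=\"bl\">More and<br />more text</p>"
            + "</div></body></tt>";
        for (Cue cue : parse(xml)) {
            System.out.println(cue.begin + "-" + cue.end + " " + cue.text
                + " (" + toPayloadBytes(cue).length + " payload bytes)");
        }
    }
}
```

Replacing each <br /> with a space keeps "More and" and "more text" from being glued together into one token when the text is later analyzed.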

If you don't need the contents of each <body> element as a single document, you can treat each <p> as a separate document and set the payload on that. Alternatively, you can store the times in a separate field.
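To make the one-document-per-<p> idea concrete, here is a toy sketch in plain Java. It is not Lucene: CueIndex and CueDoc are hypothetical names, and the map of postings is only a stand-in for a real index. In Lucene, each cue would presumably become a Document with an analyzed text field plus stored begin and end fields, but that wiring is left out here. The cue values are passed in directly; in practice they would come from the XML parser.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for an index: one "document" per <p>, timestamps kept alongside.
class CueIndex {
    /** The searchable text plus the begin/end values stored with it. */
    static final class CueDoc {
        final String begin, end, text;
        CueDoc(String begin, String end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    // token -> documents containing it, in insertion order
    private final Map<String, List<CueDoc>> postings = new LinkedHashMap<>();

    void add(String begin, String end, String text) {
        CueDoc doc = new CueDoc(begin, end, text);
        // Lowercase and split on anything that is not a letter or digit:
        // a crude substitute for a real analyzer.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            List<CueDoc> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!docs.contains(doc)) docs.add(doc);   // list each document once per token
        }
    }

    /** Returns a summary such as "2 hits, begin-end, begin-end". */
    String search(String term) {
        List<CueDoc> hits = postings.getOrDefault(
            term.toLowerCase(Locale.ROOT), Collections.<CueDoc>emptyList());
        StringBuilder out = new StringBuilder(hits.size() + " hits");
        for (CueDoc d : hits) {
            out.append(", ").append(d.begin).append('-').append(d.end);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CueIndex index = new CueIndex();
        index.add("00:00.50", "00:04.02", "Info");
        index.add("00:04.32", "00:07.68", "Different words, and phrases to index");
        index.add("00:11.76", "00:16.04", "Text");
        index.add("00:18.52", "00:22.88", "More and more text");
        // prints "2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88"
        System.out.println(index.search("Text"));
    }
}
```

Because every hit is a whole cue, the interval comes back with the hit itself and no payload decoding is needed, which is the main attraction of this variant.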

南巷近海 2024-11-23 04:16:19

I can highly recommend storing all your XML in an eXist database, which has Lucene built in. I've been using this combination for a few months now, and it solves a lot of search and retrieval problems quite easily.
