Search and replace: converting square brackets to XML tags

Posted 2024-11-17 01:30:51

I will try to keep this short and to the point.

Given the following:

#!/usr/bin/python
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')
para = etree.SubElement(sect,'para')
para.text = 'this is a [b]long[/b] block of text. Much longer than this example makes it out to be.'

How would I best go about converting the output to what I have below? Notice that the [b] markers became <b> elements.

<root> 
  <sect>
    <para>
       this is a <b>long</b> block of text. 
      Much longer than this example makes it out to be.
    </para>
  </sect>
</root>

My real input and XML are considerably more complex, but this is the gist of it. I have taken a text document with a standard format and I am converting it to XML. The structure of the document is fairly static, so this is not as crazy as it sounds. I currently have it broken into lines. This is relevant because, as I go through each line, I have no trouble identifying a <sect> or a <title>, but a <para> will often have some extra formatting in its line. In this example it is a [b], which needs to be converted as well. What would be the best way of accomplishing this?

Items to keep in mind

  1. The authors of my input texts are not always consistent. It would therefore be best to develop a loose regexp that finds [b] WORD [/b] as well as author errors such as [b[WORD[/b]. My current idea is to match something like [b or b].

  2. I am currently processing my input file line by line, and I have removed any blank lines. Should I consider handling this in a later pass instead? I have no strong preference, but I feel it can be done in a single loop through the text.

  3. This needs to play well with lxml when I output my document. For example, see the edit below with my comment on the BBCode parser.
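For item 1, a loose pattern could look something like the sketch below. It is only an illustration under stated assumptions: the name `normalize_bold` is hypothetical, only the `b` tag is handled, and the only author error covered is a wrong or missing bracket after `[b` or `[/b`. It rewrites each sloppy pair into the canonical `[b]WORD[/b]` form so that later steps have one shape to deal with.

```python
import re

# Opening tag: "[b" followed by "]" or a mistyped "[".
# Closing tag: "[/b" followed by "]", a mistyped "[", or nothing at all.
# Whitespace just inside the tags is dropped from the captured text.
LOOSE_BOLD = re.compile(r"\[b[\]\[]\s*(.*?)\s*\[/b[\]\[]?")


def normalize_bold(line):
    """Rewrite loosely written bold markers as canonical [b]...[/b]."""
    return LOOSE_BOLD.sub(r"[b]\1[/b]", line)


print(normalize_bold("see [b[WORD[/b] and [b] WORD [/b] here"))
# -> see [b]WORD[/b] and [b]WORD[/b] here
```

Because the pattern works on a single line, it fits the existing line-by-line loop. Markers that span several lines would need a different approach.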

I have worked on this most of the afternoon, and can discuss more of the routes I have taken. I will be working on this throughout the evening so if I come across other items to keep in mind I will update this question accordingly.

EDIT: my problem with the BBCode parser

Paul thoughtfully suggested postmarkup-1.1.4. However, as you can see, it does not play well with lxml: the elements are converted to entities. I ran into the same problem this afternoon when I did this through a search and replace. Ultimately, as was pointed out, this is a perfect job for sed. However, I was hoping not to be the end user of this script, and I would rather have everything contained within one command.

>>> p.text = render_bbcode(p.text)
>>> p.text
'this is a <strong>long</strong> text string'
>>> etree.tostring(root)
'<root><p>this is a &lt;strong&gt;long&lt;/strong&gt; text string</p></root>'

Doing this in reverse returns equally poor results:

 >>> p.text
 'this is a [b]long[/b] text string'
 >>> render_bbcode(etree.tostring(root))
 u'&lt;root&gt;&lt;p&gt;this is a <strong>long</strong> string&lt;/p&gt;&lt;/root&gt;'
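One way to read the two transcripts above: anything assigned to `.text` is character data, so its angle brackets are escaped on output. In lxml, inline markup is instead a child element, and the text that follows the child lives in its `.tail`. The following is a minimal sketch of that idea, assuming the markers are already well-formed `[b]...[/b]` pairs. The helper name `set_marked_up_text` is hypothetical, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same text/tail API
    import xml.etree.ElementTree as etree

BOLD = re.compile(r"\[b\](.*?)\[/b\]")


def set_marked_up_text(elem, raw):
    """Fill elem from raw, turning each [b]...[/b] span into a <b> child."""
    # With one capture group, split() alternates plain text and bold text:
    # [text, bold, text, bold, ..., text]
    pieces = BOLD.split(raw)
    elem.text = pieces[0]
    for i in range(1, len(pieces), 2):
        child = etree.SubElement(elem, "b")
        child.text = pieces[i]       # the bold words
        child.tail = pieces[i + 1]   # plain text following </b>
    return elem


root = etree.Element("root")
sect = etree.SubElement(root, "sect")
para = etree.SubElement(sect, "para")
set_marked_up_text(para, "this is a [b]long[/b] block of text.")
print(etree.tostring(root, encoding="unicode"))
```

Since the `<b>` nodes are real elements, nothing is escaped on output, and the function can be called from inside the existing per-line loop.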

Comments (1)

戒ㄋ 2024-11-24 01:30:51

The postmarkup library seems to come closest to what you want to do.

http://pypi.python.org/pypi/postmarkup/1.1.4

Unfortunately it hasn't seen a lot of development recently, but I don't see any other libraries that look tons better.

Starting from there and modifying the existing elements to fit your syntax is probably faster than reinventing the parsing wheel from scratch.

If that isn't a good direction, you might look at lower-level lexing and parsing of the syntax, but that will rapidly get complex, to the point that you might be better off with simple repetitive regexes and hand correction. How big is your corpus?

The final item of note is that tasks like this are precisely what sed was written to do. It can be amazingly powerful if you're willing to learn how to use it. If you're not already comfortable with it, though, Python might be easier.
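For what it's worth, the sed-style substitution can also be done inside Python. The rough sketch below assumes well-formed `[b]...[/b]` pairs, and `para_from_line` is a hypothetical helper name, not part of postmarkup or lxml. The line is escaped first, the markers are then replaced with tags, and the result is parsed as a fragment so the tags become real elements rather than entities.

```python
import re
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:  # stdlib fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

BOLD = re.compile(r"\[b\](.*?)\[/b\]")


def para_from_line(line):
    """Build a <para> element from one line of text with [b] markers."""
    safe = escape(line)                # escape &, < and > in the raw text
    xml = BOLD.sub(r"<b>\1</b>", safe)  # same idea as a sed s/.../.../g
    return etree.fromstring("<para>%s</para>" % xml)


para = para_from_line("this is a [b]long[/b] text string & more")
print(etree.tostring(para, encoding="unicode"))
```

Escaping before the substitution matters: a stray `<` or `&` in the source text would otherwise make the fragment unparseable.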
