Search and replace: converting square brackets to XML tags

Posted on 2024-11-17 01:30:51

I will try and keep this short and to the point.

Given the following

#!/usr/bin/python
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
para = etree.SubElement(sect, 'para')
para.text = 'this is a [b]long[/b] block of text. Much longer than this example makes it out to be.'

How would I best go about converting the output to what I have below? Notice that the [b] markers became <b> elements.

<root> 
  <sect>
    <para>
       this is a <b>long</b> block of text. 
      Much longer than this example makes it out to be.
    </para>
  </sect>
</root>

My real input and XML are considerably more complex. However, this is the gist of it. I have taken a text document with a standard format and am converting it to XML. The structure of the document is rather static, so this is not as crazy as it sounds. I currently have it broken into lines. This is relevant because, as I go through each line, I have no trouble identifying a <sect> or a <title>, but a <para> will often have some extra formatting in its line. In this example that is a [b], which needs to be converted in turn. What would be the best way of accomplishing this?

Items to keep in mind

The authors of my input texts are not always consistent. Therefore, it would be best to develop a loose regexp that finds [b] WORD [/b] or, when an author makes a mistake, something like [b[WORD[/b]. My current idea is to match something like [b or b].
I am currently processing my input file line by line, and I have removed any blank lines. Should I consider doing this conversion afterwards instead? I have no strong preference, but I feel it can be done in a single loop through the text.
This will need to play well with lxml when I output my document. For an example, see the edit below with my comment on the BBCode parser.
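
To make the "loose regexp" idea concrete, here is a minimal sketch that only normalizes sloppy markers into canonical [b] and [/b] before any XML is built. The names `LOOSE_B` and `normalize_bold` are made up for illustration. It assumes every marker still starts with "[" (so the bare "b]" case is not covered) and it would also rewrite a literal "[b]" that was never meant as markup.

```python
import re

# Hypothetical loose pattern: "[", an optional "/", the letter "b" as a whole
# word, then an optional closing bracket that may have been mistyped as "[".
LOOSE_B = re.compile(r"\[\s*(/?)\s*b\b(?:\s*[\]\[])?", re.IGNORECASE)


def normalize_bold(line):
    """Rewrite sloppy bold markers as canonical [b] and [/b]."""
    return LOOSE_B.sub(lambda m: "[/b]" if m.group(1) else "[b]", line)


for sample in ("[b] WORD [/b]", "[b[WORD[/b]", "[B]WORD[/b"):
    print(normalize_bold(sample))
```

Because the pattern requires a word boundary after the "b", a tag such as [bold] is left alone. Normalizing first keeps the later conversion step simple, since it only has to deal with well-formed [b]...[/b] pairs.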

I have worked on this most of the afternoon, and can discuss more of the routes I have taken. I will be working on this throughout the evening so if I come across other items to keep in mind I will update this question accordingly.

EDIT: my problem with the BBCode parser

Paul thoughtfully suggested postmarkup-1.1.4. However, as you can see, it does not play well with lxml: the elements get converted to entities. This is the same problem I ran into this afternoon when I did this through a search and replace. Ultimately, as was pointed out, this is a perfect job for sed. However, I was hoping not to be the end user of this script, and would rather have everything contained within one command.

>>> p.text = render_bbcode(p.text)
>>> p.text
'this is a <strong>long</strong> text string'
>>> etree.tostring(root)
'<root><p>this is a &lt;strong&gt;long&lt;/strong&gt; text string</p></root>'
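
The escaping happens because an element's .text is character data, so any markup assigned to it is serialized as entities. In the lxml model an inline element is a child node: the text before it lives in the parent's .text and the text after it lives in the child's .tail. The following is a minimal sketch of building the tree that way, assuming the markers are already well-formed [b]...[/b] pairs. `set_inline_bold` and `BOLD` are hypothetical names, and the stdlib fallback import is only there so the sketch runs without lxml installed.

```python
import re

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used below
    import xml.etree.ElementTree as etree

BOLD = re.compile(r"\[b\](.*?)\[/b\]", re.DOTALL)


def set_inline_bold(elem, text):
    """Fill elem with text, turning each [b]...[/b] run into a <b> child."""
    pos = 0
    last = None  # most recently created <b> child, if any
    for m in BOLD.finditer(text):
        plain = text[pos:m.start()]
        if last is None:
            elem.text = plain      # text before the first <b>
        else:
            last.tail = plain      # text between two <b> children
        last = etree.SubElement(elem, "b")
        last.text = m.group(1)
        pos = m.end()
    # Whatever is left goes after the last child, or is the whole text.
    if last is None:
        elem.text = text[pos:]
    else:
        last.tail = text[pos:]


root = etree.Element("root")
sect = etree.SubElement(root, "sect")
para = etree.SubElement(sect, "para")
set_inline_bold(para, "this is a [b]long[/b] block of text.")
print(etree.tostring(root, encoding="unicode"))
```

This fits a single line-by-line loop, because the conversion happens at the moment the paragraph text would otherwise be assigned to para.text.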

Doing this in reverse returns equally poor results:

>>> p.text
'this is a [b]long[/b] text string'
>>> render_bbcode(etree.tostring(root))
u'&lt;root&gt;&lt;p&gt;this is a <strong>long</strong> string&lt;/p&gt;&lt;/root&gt;'
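
The reverse direction fails for the mirror-image reason: the serialized document is handed to the renderer as plain text, so its own angle brackets get escaped. If the conversion is done as a pass over the finished tree instead, one option that keeps the search-and-replace idea is to escape the paragraph text, replace the bracket pairs, and parse the result as a small fragment. This is only a sketch under assumptions: `expand_bold` and the `wrapper` element are invented for illustration, only the element's own .text is handled (not the tails of existing children), and the markers are assumed to be well-formed [b]...[/b] pairs.

```python
import re
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used below
    import xml.etree.ElementTree as etree

BOLD = re.compile(r"\[b\](.*?)\[/b\]", re.DOTALL)


def expand_bold(root, tag="para"):
    """Post-process a finished tree: expand [b]...[/b] inside every <tag>."""
    for elem in list(root.iter(tag)):
        if not elem.text or "[b]" not in elem.text:
            continue
        # Escape first so stray <, > and & in the prose stay literal text,
        # then turn only the matched bracket pairs into real markup.
        markup = BOLD.sub(r"<b>\1</b>", escape(elem.text))
        fragment = etree.fromstring("<wrapper>%s</wrapper>" % markup)
        elem.text = fragment.text
        for child in list(fragment):
            elem.append(child)  # each child carries its .tail along
    return root


root = etree.Element("root")
p = etree.SubElement(root, "p")
p.text = "this is a [b]long[/b] text string"
expand_bold(root, tag="p")
print(etree.tostring(root, encoding="unicode"))
```

Unmatched markers are left in place as literal text, since only complete pairs are replaced before parsing.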
