从 scala 中的文本和元素偏移量创建 XML
我需要从一段纯文本以及应插入的每个 XML 元素的开始和结束偏移量创建一个 XML 文档。以下是我希望它通过的一些测试用例:
val text = "The dog chased the cat."
val spans = Seq(
(0, 23, <xml/>),
(4, 22, <phrase/>),
(4, 7, <token/>))
val expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
assert(expected === spansToXML(text, spans))
val text = "aabbccdd"
val spans = Seq(
(0, 8, <xml x="1"/>),
(0, 4, <ab y="foo"/>),
(4, 8, <cd z="42>3"/>))
val expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
assert(expected === spansToXML(text, spans))
val spans = Seq(
(0, 1, <a/>),
(0, 0, <b/>),
(0, 0, <c/>),
(1, 1, <d/>),
(1, 1, <e/>))
assert(<a><b/><c/> <d/><e/></a> === spansToXML(" ", spans))
我的部分解决方案(请参阅下面的答案)通过字符串连接和 XML.loadString
工作。这看起来很奇怪,而且我也不能 100% 确定这个解决方案在所有极端情况下都能正常工作......
还有更好的解决方案吗? (就其价值而言,我很乐意切换到 anti-xml 如果这能让这项任务变得更容易.)
于 2011 年 8 月 10 日更新以添加更多测试用例并提供更清晰的规范。
I need to create an XML document from a piece of plain text and the begin and end offsets of each XML element that should be inserted. Here are a few test cases I'd like it to pass:
val text = "The dog chased the cat."
val spans = Seq(
(0, 23, <xml/>),
(4, 22, <phrase/>),
(4, 7, <token/>))
val expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
assert(expected === spansToXML(text, spans))
val text = "aabbccdd"
val spans = Seq(
(0, 8, <xml x="1"/>),
(0, 4, <ab y="foo"/>),
(4, 8, <cd z="42>3"/>))
val expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
assert(expected === spansToXML(text, spans))
val spans = Seq(
(0, 1, <a/>),
(0, 0, <b/>),
(0, 0, <c/>),
(1, 1, <d/>),
(1, 1, <e/>))
assert(<a><b/><c/> <d/><e/></a> === spansToXML(" ", spans))
My partial solution (see my answer below) works by string concatenation and XML.loadString
. That seems hacky, and I'm also not 100% sure this solution works correctly in all the corner cases...
Any better solutions? (For what it's worth, I'd be happy to switch to anti-xml if that would make this task easier.)
Updated 10 Aug 2011 to add more test cases and provide a cleaner specification.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
鉴于您提出的赏金,我研究了您的问题一段时间并提出了以下解决方案,该解决方案在您的所有测试用例上都取得了成功。
我真的很希望我的答案被接受 - 请告诉我我的解决方案是否有任何问题。
一些评论:
如果您想了解执行期间发生了什么,我将注释掉的打印语句留在里面。
除了您的规范之外,我确实保留了他们现有的孩子(如果有的话) - 有一条评论是这样做的。
我不手动构建 XML 节点,而是修改传入的节点。为了避免分割开始和结束标记,我不得不对算法进行大量更改,但是通过
begin
进行排序的想法和-end
来自您的解决方案。该代码有些高级 Scala,特别是当我构建我需要的不同
Orderings
时。我确实从我得到的第一个版本中稍微简化了它。我通过使用
SortedMap
避免创建表示间隔的树,并在提取后过滤间隔。这个选择有点次优;然而,我听说有“更好”的数据结构来表示嵌套间隔,例如间隔树(它们在计算几何中研究),但它们实现起来相当复杂,我认为这里不需要它。Given the bounty you put forward, I studied your problem for some time and came up with the following solution, which succeeds on all your testcases.
I would really like getting my answer accepted - please tell me if there's anything wrong with my solution.
Some comments:
I left the commented out print statement inside, if you wanna figure what's going on during execution.
In addition to your specification, I do preserve their existing children (if any) - there's a comment where this is done.
I do not build the XML nodes manually, I modify the ones passed in. To avoid splitting the opening and closing tag, I had to change the algorithm quite a lot, but the idea of sorting spans by
begin
and-end
comes from your solution.The code is somewhat advanced Scala, especially when I build the different
Orderings
I need. I did simplify it somewhat from the first version I got.I avoided creating a tree representing the intervals, by using a
SortedMap
, and filtering the intervals after extraction. This choice is somewhat suboptimal; however, I heard that there are "better" data structures for representing nested intervals, like interval trees (they are studied in computational geometry), but they're quite complex to implement, and I don't think it's needed here.这很有趣!
我采取了与史蒂夫类似的方法。
通过对“开始标签”和“结束标签”中的元素进行排序,然后计算将它们放在哪里。
我无耻地从 Blaisorblade 窃取了测试,并添加了更多有助于我开发代码的测试。
于 2011-08-14 编辑
我对 test-5 中插入空标签的方式感到不满意。然而,这个位置是 test-3 如何制定的结果,
这使得很难在跨越标签之间放置空标签,这可能是有用的。
因此,我稍微改变了一些测试并提供了替代解决方案。
在替代解决方案中,我以相同的方式启动,但有 3 个单独的列表:开始、空和结束标签。我不仅仅进行排序,还进行了第三步,将空标签放入标签列表中。
第一个解决方案:
替代解决方案:
This was fun!
I took an approche similar to Steve's.
By sorting up the element's in "start tags" and "end tags", then calculating where to put them.
I shamelessly stole the test's from Blaisorblade, and added a couple of more that helped me developing the code.
EDITED on 2011-08-14
I was unhappy how the empty tag where inserted in test-5. However this placement where a consequence of how test-3 where formulated
This make it hard to place empty tags between spanning tags which could be useful.
So, I changed some test's around a little and provided an alternative solution.
In the alternative solution I start up in the same way, but have 3 separate lists, start, empty and closing tags. And instead of only sorting I have a third step where the empty tags are placed into the tag list.
First solution:
Alternative solution:
我的解决方案是递归的。我根据需要对输入
Seq
进行排序,并将其转换为List
。之后就是根据规格需求进行基本的模式匹配。我的解决方案的最大缺点是,虽然.toString
在测试方法中生成相同的字符串==
却不会产生 true。My solution is a recursive one. I sort the input
Seq
acording to my needs and convert it to aList
. After that it's basic pattern matching according to the specifications needs. The biggest drawback to my solution is that, while.toString
produces the same Strings in the test method==
doesn't yield true.您可以轻松地动态创建 XML 节点:
这是 Elem.apply 签名
def apply (前缀: String, 标签: String, 属性: MetaData, 范围: NamespaceBinding, child: Node*) : Elem
我认为这种方法的唯一问题是您需要首先构建内部节点。
一些让它变得更容易的东西:
You can easily creates XML nodes dynamically:
Here is the Elem.apply signature
def apply (prefix: String, label: String, attributes: MetaData, scope: NamespaceBinding, child: Node*) : Elem
The only problem I see with this approach is that you will need to build the inner nodes first.
Something to make it easier:
下面是一个使用字符串连接和 XML.loadString 接近正确的解决方案:
它通过了前两个测试用例,但在第三个测试用例上失败了。
Here's a solution that is close to correct using string concatenation and
XML.loadString
:This passes both of the first two test cases, but fails on the third.