Creating XML from text and element offsets in Scala

Posted 2024-10-01 07:45:04

I need to create an XML document from a piece of plain text and the begin and end offsets of each XML element that should be inserted. Here are a few test cases I'd like it to pass:

val text = "The dog chased the cat."
val spans = Seq(
    (0, 23, <xml/>),
    (4, 22, <phrase/>),
    (4, 7, <token/>))
val expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
assert(expected === spansToXML(text, spans))

val text = "aabbccdd"
val spans = Seq(
    (0, 8, <xml x="1"/>),
    (0, 4, <ab y="foo"/>),
    (4, 8, <cd z="42>3"/>))
val expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
assert(expected === spansToXML(text, spans))

val spans = Seq(
    (0, 1, <a/>),
    (0, 0, <b/>),
    (0, 0, <c/>),
    (1, 1, <d/>),
    (1, 1, <e/>))
assert(<a><b/><c/> <d/><e/></a> === spansToXML(" ", spans))

My partial solution (see my answer below) works by string concatenation and XML.loadString. That seems hacky, and I'm also not 100% sure this solution works correctly in all the corner cases...

Any better solutions? (For what it's worth, I'd be happy to switch to anti-xml if that would make this task easier.)

Updated 10 Aug 2011 to add more test cases and provide a cleaner specification.

西瓜 2024-10-08 07:45:04

Given the bounty you put forward, I studied your problem for some time and came up with the following solution, which succeeds on all your test cases.
I would really like to get my answer accepted, so please tell me if there's anything wrong with my solution.

Some comments:
I left the commented-out print statements in, in case you want to see what is going on during execution.
Beyond your specification, I also preserve the existing children (if any) of the elements passed in; a comment marks where this is done.

I do not build the XML nodes manually, I modify the ones passed in. To avoid splitting the opening and closing tag, I had to change the algorithm quite a lot, but the idea of sorting spans by begin and -end comes from your solution.
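
To illustrate the "modify the nodes passed in" step in isolation, here is a minimal sketch. It assumes scala-xml is on the classpath (bundled with older Scala releases, a separate module in newer ones); `CopyDemo` and `wrap` are hypothetical names used only for this illustration.

```scala
import scala.xml.{Elem, Node, Text}

object CopyDemo {
  // Hypothetical helper: returns a copy of `elem` whose children are its
  // existing children followed by `children`. Label and attributes are kept,
  // so the start and end tags never have to be assembled by hand.
  def wrap(elem: Elem, children: Seq[Node]): Elem =
    elem.copy(child = elem.child ++ children)
}
```

For example, `CopyDemo.wrap(<token a="1"/>, Seq(Text("dog")))` keeps the `a` attribute and adds the text as a child.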

The code is somewhat advanced Scala, especially when I build the different Orderings I need. I did simplify it somewhat from the first version I got.
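
The Orderings are probably the least familiar part, so here is a small standalone sketch of how they compose. It uses plain String labels instead of Elem so that it needs only the standard library; `OrderingDemo` and its members are names made up for this illustration.

```scala
object OrderingDemo {
  private val ints = Ordering[Int] // the standard ordering on Ints

  // Begin descending, then end ascending.
  val intervalOrder: Ordering[(Int, Int)] =
    Ordering.Tuple2(ints.reverse, ints)

  // Same, with the label (a String in this sketch) as the final tie-breaker.
  val spanOrder: Ordering[(Int, Int, String)] =
    Ordering.Tuple3(ints.reverse, ints, Ordering[String])

  def sortIntervals(spans: Seq[(Int, Int)]): Seq[(Int, Int)] =
    spans.sorted(intervalOrder)

  def sortSpans(spans: Seq[(Int, Int, String)]): Seq[(Int, Int, String)] =
    spans.sorted(spanOrder)
}
```

With this ordering a span is always visited after every span nested strictly inside it, which is what lets the loop build children before their parents.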

I avoided building a tree to represent the intervals by using a SortedMap and filtering the intervals after extraction. This choice is somewhat suboptimal: I have heard that there are "better" data structures for representing nested intervals, such as interval trees (studied in computational geometry), but they are quite complex to implement, and I don't think they are needed here.
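
As a rough sketch of the "SortedMap plus filter" idea, the following standalone snippet keeps already-processed intervals in a SortedMap and picks out those contained in a new span. It is a simplification: values are Strings rather than node sequences, the `until` pre-selection from the real code is skipped, and `NestedDemo`, `mapOf` and `nestedIn` are hypothetical names.

```scala
import scala.collection.immutable.SortedMap

object NestedDemo {
  private val ints = Ordering[Int]
  // Same key ordering as in the solution: begin descending, end ascending.
  val intervalOrder: Ordering[(Int, Int)] = Ordering.Tuple2(ints.reverse, ints)

  // Convenience constructor for a map that uses intervalOrder on its keys.
  def mapOf(entries: ((Int, Int), String)*): SortedMap[(Int, Int), String] =
    SortedMap(entries: _*)(intervalOrder)

  // Keys of `done` lying entirely inside [start, end], in left-to-right order.
  // Containment is only a partial order, so it is applied as a filter and
  // cannot serve as the map's ordering.
  def nestedIn(done: SortedMap[(Int, Int), String], start: Int, end: Int): Seq[(Int, Int)] =
    done.keys.filter { case (s, e) => start <= s && e <= end }.toSeq.sorted
}
```

The filter scans every stored interval, which is the suboptimal part; an interval tree would avoid the scan at the cost of much more code.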

/**
 * User: pgiarrusso
 * Date: 12/8/2011
 */

import collection.mutable.ArrayBuffer
import collection.immutable.SortedMap // the code below relies on immutable updates (+ and --)
import scala.xml._

object SpansToXmlTest {
    def spansToXML(text: String, spans: Seq[(Int, Int, Elem)]) = {
        val intOrdering = implicitly[Ordering[Int]] // Retrieves the standard ordering on Ints.

        // Sort spans decreasingly on begin and increasingly on end and their label - this processes spans outwards.
        // The sorting on labels matches the given examples.
        val spanOrder = Ordering.Tuple3(intOrdering.reverse, intOrdering, Ordering.by((_: Elem).label))

        //Same sorting, excluding labels.
        val intervalOrder = Ordering.Tuple2(intOrdering.reverse, intOrdering)
        //Map intervals of the source string to the sequence of nodes which match them - it is a sequence because
        //multiple spans for the same interval are allowed.
        var intervalMap = SortedMap[(Int, Int), Seq[Node]]()(intervalOrder)

        for ((start, end, elem) <- spans.sorted(spanOrder)) {
            //Only nested intervals. Interval nesting is a partial order, therefore we cannot use the filter function as an ordering for intervalMap, even if it would be nice.
            val nestedIntervalsMap = intervalMap.until((start, end)).filter {
                case ((intStart, intEnd), _) => start <= intStart && intEnd <= end
            }
            //println("intervalMap: " + intervalMap)
            //println("beforeMap: " + nestedIntervalsMap)

            //We call sorted to use a standard ordering this time.
            val before = nestedIntervalsMap.keys.toSeq.sorted

            // text.slice(start, end) must be split into fragments, some of which are represented by text node, some by
            // already computed xml nodes.
            val intervals = start +: (for {
                (intStart, intEnd) <- before
                boundary <- Seq(intStart, intEnd)
            } yield boundary) :+ end

            val xmlChildren = ArrayBuffer[Node]() // mutated in place, so a val suffices
            var useXmlNode = false

            for (interv <- intervals.sliding(2)) {
                val intervStart = interv(0)
                val intervEnd = interv(1)
                xmlChildren.++=(
                    if (useXmlNode)
                        intervalMap((intervStart, intervEnd)) //Precomputed nodes
                    else
                        Seq(Text(text.slice(intervStart, intervEnd))))
                useXmlNode = !useXmlNode //The next interval will be of the opposite kind.
            }
            //Remove intervals that we just processed
            intervalMap = intervalMap -- before

            // By using elem.child, you also preserve existing xml children. "elem.child ++" can be also commented out.
            val tree = elem.copy(child = elem.child ++ xmlChildren)
            intervalMap += (start, end) -> (intervalMap.getOrElse((start, end), Seq.empty) :+ tree)
            //println(tree)
        }
        intervalMap((0, text.length)).head
    }

    def test(text: String, spans: Seq[(Int, Int, Elem)], expected: Node): Unit = {
        val res = spansToXML(text, spans)
        print("Text: \"%s\", expected:\n%s\nResult:\n%s\n\n" format (text, expected, res))
        assert(expected == res)
    }
    def test1() =
        test(
            text = "The dog chased the cat.",
            spans = Seq(
                (0, 23, <xml/>),
                (4, 22, <phrase/>),
                (4, 7, <token/>)),
            expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
        )

    def test2() =
        test(
            text = "aabbccdd",
            spans = Seq(
                (0, 8, <xml x="1"/>),
                (0, 4, <ab y="foo"/>),
                (4, 8, <cd z="42>3"/>)),
            expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
        )

    def test3() =
        test(
            text = " ",
            spans = Seq(
                (0, 1, <a/>),
                (0, 0, <b/>),
                (0, 0, <c/>),
                (1, 1, <d/>),
                (1, 1, <e/>)),
            expected = <a><b/><c/> <d/><e/></a>
        )

    def main(args: Array[String]): Unit = {
        test1()
        test2()
        test3()
    }
}

爱的那么颓废 2024-10-08 07:45:04

This was fun!

I took an approach similar to Steve's:
split the elements into "start tags" and "end tags", then calculate where to put them.
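
The start-tag/end-tag idea can be sketched on its own as follows. Note that this is not the code from this answer: it swaps in a single key-based sort instead of the two-pass `sortWith`, uses String labels instead of Elem, and assumes all spans are non-empty and properly nested. `TagEvents`, `Event` and `events` are names invented for the sketch.

```scala
object TagEvents {
  sealed trait Kind
  case object Open extends Kind
  case object Close extends Kind
  final case class Event(pos: Int, kind: Kind, label: String)

  // Turn each (start, end, label) span into an Open and a Close event and
  // order them by insertion point. At the same position:
  //  - Close events come before Open events,
  //  - among Close events, the span that started later closes first,
  //  - among Open events, the span that ends later opens first,
  //  - spans with identical offsets nest in input order.
  def events(spans: Seq[(Int, Int, String)]): Seq[Event] = {
    val keyed = spans.zipWithIndex.flatMap { case ((start, end, label), idx) =>
      Seq(
        (start, 1, -end, idx) -> Event(start, Open, label),
        (end, 0, -start, -idx) -> Event(end, Close, label))
    }
    keyed.sortBy(_._1).map(_._2)
  }
}
```

Empty spans (start == end) are deliberately left out here, since where they belong relative to neighbouring tags is exactly the ambiguity discussed in the edit below.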

I shamelessly stole the tests from Blaisorblade, and added a couple more that helped me develop the code.

EDITED on 2011-08-14

I was unhappy with how the empty tag was inserted in test-5. However, this placement is a consequence of how test-3 is formulated:
even though the empty tags (d, e) come after the spanning tag (a) in the spans list, and have the same insertion point as the end tag of a, they go inside a.
This makes it hard to place empty tags between spanning tags, which could be useful.

So, I changed some tests around a little and provided an alternative solution.

In the alternative solution I start out the same way, but keep 3 separate lists: start, empty and closing tags. Instead of only sorting, I have a third step where the empty tags are placed into the tag list.
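
The three-list split can be sketched as follows. This is a simplified illustration with String labels instead of Elem, and `SplitSpans` and `split` are made-up names; the real placement of the empty tags happens in a later step that is not shown here.

```scala
object SplitSpans {
  type Span = (Int, Int, String)   // (start, end, label)
  type Tag = (Int, String)         // (insertion point, label)

  // Returns (start tags, empty tags, closing tags), each sorted by
  // insertion point. sortBy is stable, so ties keep their list order;
  // the closing tags are reversed first so that, on ties, a span listed
  // later (normally the inner one) closes first.
  def split(spans: Seq[Span]): (Seq[Tag], Seq[Tag], Seq[Tag]) = {
    val (empty, spanning) = spans.partition { case (s, e, _) => s == e }
    val starts = spanning.sortBy(_._1).map { case (s, _, l) => (s, l) }
    val ends = spanning.reverse.sortBy(_._2).map { case (_, e, l) => (e, l) }
    val empties = empty.sortBy(_._1).map { case (s, _, l) => (s, l) }
    (starts, empties, ends)
  }
}
```

Keeping the empty tags apart is what makes it possible to decide their position from their place in the original spans list rather than from their offset alone.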

First solution:

import xml.{XML, Elem, Node}
import annotation.tailrec

object SpanToXml {
    def spansToXML(text: String, spans: Seq[(Int, Int, Elem)]): Node = {
        // Create a Seq of elements, sorted by where it should be inserted
        //  differentiate start tags ('s) and empty tags ('e)
        val startElms = spans sorted Ordering[Int].on[(Int, _, _)](_._1) map {
            case e if e._1 != e._2 => (e._1, e._3, 's)
            case e => (e._1, e._3, 'e)
        }
        //Create a Seq of closing tags ('c), sorted by where they should be inserted
        // filter out all empty tags
        val endElms = spans.reverse.sorted(Ordering[Int].on[(_, Int, _)](_._2))
            .filter(e => e._1 != e._2)
            .map(e => (e._2, e._3, 'c))

        //Combine the Seqs and sort by insertion point
        val elms = startElms ++ endElms sorted Ordering[Int].on[(Int, _, _)](_._1)
        //The sorting needs to be refined:
        // - end tags need to come before start tags if the insertion point is the same
        val sorted = elms.sortWith((a, b) => a._1 == b._1 && a._3 == 'c && b._3 == 's )

        //Adjust the insertion point to what it should be in the final string
        // then insert the tags into the text by folding left
        // - there are different rules depending on start, empty or close
        val txt = adjustInset(sorted).foldLeft(text)((tx, e) => {
            val s = tx.splitAt(e._1)
            e match {
                case (_, elem, 's) => s._1 + "<" + elem.label + elem.attributes + ">" + s._2
                case (_, elem, 'e) => s._1 + "<" + elem.label + elem.attributes + "/>" + s._2
                case (_, elem, 'c) => s._1 + "</" + elem.label + ">" + s._2
            }
        })
        //Sanity check
        //println(txt)

        //Convert to XML
        XML.loadString(txt)
    }

    def adjustInset(elems: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] = {
        @tailrec
        def adjIns(elems: Seq[(Int, Elem, Symbol)], tmp: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] =
            elems match {
                case Nil => tmp // no elements at all (e.g. an empty spans list)
                case t :: Nil => tmp :+ t
                case t :: ts => {
                    //calculate offset due to current element
                    val offset = t match {
                        case (_, e, 's) => e.label.size + e.attributes.toString.size + 2
                        case (_, e, 'e) => e.label.size + e.attributes.toString.size + 3
                        case (_, e, 'c) => e.label.size + 3
                    }
                    //add offset to all elements in the tail, and recurse
                    adjIns(ts.map(e => (e._1 + offset, e._2, e._3)), tmp :+ t)
                }
            }

        adjIns(elems.toList, Nil) // the :: patterns above only match a List
    }

    def test(text: String, spans: Seq[(Int, Int, Elem)], expected: Node): Unit = {
        val res = spansToXML(text, spans)
        print("Text: \"%s\", expected:\n%s\nResult:\n%s\n\n" format (text, expected, res))
        assert(expected == res)
    }

    def test1() =
        test(
            text = "The dog chased the cat.",
            spans = Seq(
                (0, 23, <xml/>),
                (4, 22, <phrase/>),
                (4, 7, <token/>)),
            expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
        )

    def test2() =
        test(
            text = "aabbccdd",
            spans = Seq(
                (0, 8, <xml x="1"/>),
                (0, 4, <ab y="foo"/>),
                (4, 8, <cd z="42>3"/>)),
            expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
        )

    def test3() =
        test(
            text = " ",
            spans = Seq(
                (0, 1, <a/>),
                (0, 0, <b/>),
                (0, 0, <c/>),
                (1, 1, <d/>),
                (1, 1, <e/>)),
            expected = <a><b/><c/> <d/><e/></a>
        )

    def test4() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aabb</ab><cd><ok>cc</ok>dd</cd></xml>
        )

    def test5() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (2, 4, <b/>),
                        (4, 4, <empty/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aa<b>bb<empty/></b></ab><cd><ok>cc</ok>dd</cd></xml>
        )

    def test6() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (2, 4, <b/>),
                        (2, 4, <c/>),
                        (3, 4, <d/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aa<b><c>b<d>b</d></c></b></ab><cd><ok>cc</ok>dd</cd></xml>
        )

    def test7() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab a="a" b="b"/>),
                        (4, 8, <cd c="c" d="d"/>)),
            expected = <xml><ab a="a" b="b">aabb</ab><cd c="c" d="d">ccdd</cd></xml>
        )

    def invalidSpans() = {
        val text = "aabbccdd"
        val spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (4, 6, <err/>),
                        (4, 8, <cd/>))
        try {
            val res = spansToXML(text, spans)
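            assert(false) // reached only if no exception was thrown
        } catch {
            // catch only Exception, so the AssertionError from assert(false) is not swallowed
            case e: Exception => {
                println("This generates invalid XML:")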
                println("<xml><ab>aabb</ab><err><cd>cc</err>dd</cd></xml>")
                println(e.getMessage)
            }
        }
    }

    def main(args: Array[String]): Unit = {
        test1()
        test2()
        test3()
        test4()
        test5()
        test6()
        test7()
        invalidSpans()
    }
}

SpanToXml.main(Array())

Alternative solution:

import xml.{XML, Elem, Node}
import annotation.tailrec

object SpanToXmlAlt {
    def spansToXML(text: String, spans: Seq[(Int, Int, Elem)]): Node = {
        // Create a Seq of start tags, sorted by where it should be inserted
        // filter out all empty tags
        val startElms = spans.sorted(Ordering[Int].on[(Int, _, _)](_._1))
            .filterNot(e => e._1 == e._2)
            .map(e => (e._1, e._3, 's))
        //Create a Seq of closing tags, sorted by where they should be inserted
        // filter out all empty tags
        val endElms = spans.reverse.sorted(Ordering[Int].on[(_, Int, _)](_._2))
            .filterNot(e => e._1 == e._2)
            .map(e => (e._2, e._3, 'c))

        //Create a Seq of empty tags, sorted by where they should be inserted
        val emptyElms = spans.sorted(Ordering[Int].on[(Int, _, _)](_._1))
            .filter(e => e._1 == e._2)
            .map(e => (e._1, e._3, 'e))

        //Combine the Seq's and sort by insertion point
        val elms = startElms ++ endElms sorted Ordering[Int].on[(Int, _, _)](_._1)
        //The sorting needs to be refined:
        // - end tags must come before start tags when the insertion point is the same
        val sorted = elms.sortWith((a, b) => a._1 == b._1 && a._3 == 'c && b._3 == 's )

        //Insert empty tags
        val allSorted = insertEmpyt(spans, sorted, emptyElms) sorted Ordering[Int].on[(Int, _, _)](_._1)
        //Adjust the insertion point to what it should be in the final string
        // then insert the tags into the text by folding left
        // - there are different rules depending on start, empty or close
        val str = adjustInset(allSorted).foldLeft(text)((tx, e) => {
            val s = tx.splitAt(e._1)
            e match {
                case (_, elem, 's) => s._1 + "<" + elem.label + elem.attributes + ">" + s._2
                case (_, elem, 'e) => s._1 + "<" + elem.label + elem.attributes + "/>" + s._2
                case (_, elem, 'c) => s._1 + "</" + elem.label + ">" + s._2
            }
        })
        //Sanity check
        //println(str)
        //Convert to XML
        XML.loadString(str)
    }

    def insertEmpyt(spans: Seq[(Int, Int, Elem)],
        sorted: Seq[(Int, Elem, Symbol)],
        emptys: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] = {

        //Find all tags that should be before the empty tag
        @tailrec
        def afterSpan(empty: (Int, Elem, Symbol),
            spans: Seq[(Int, Int, Elem)],
            after: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] = {
            var result = after
            spans match {
                case t :: _ if t._1 == empty._1 && t._2 == empty._1 && t._3 == empty._2 => after //break
                case t :: ts if t._1 == t._2 => afterSpan(empty, ts, after :+ (t._1, t._3, 'e))
                case t :: ts => {
                    if (t._1 <= empty._1) result = result :+ (t._1, t._3, 's)
                    if (t._2 <= empty._1) result = result :+ (t._2, t._3, 'c)
                    afterSpan(empty, ts, result)
                }
            }
        }

        //For each empty tag, insert it in the sorted list
        var result = sorted
        emptys.foreach(e => {
            val afterSpans = afterSpan(e, spans, Seq[(Int, Elem, Symbol)]())
            var emptyInserted = false
            result = result.foldLeft(Seq[(Int, Elem, Symbol)]())((res, s) => {
                if (afterSpans.contains(s) || emptyInserted) {
                    res :+ s
                } else {
                    emptyInserted = true
                    res :+ e :+ s
                }
            })
        })
        result
    }

    def adjustInset(elems: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] = {
        @tailrec
        def adjIns(elems: Seq[(Int, Elem, Symbol)], tmp: Seq[(Int, Elem, Symbol)]): Seq[(Int, Elem, Symbol)] =
            elems match {
                case Nil => tmp // nothing to adjust for an empty input
                case t :: Nil => tmp :+ t
                case t :: ts => {
                    //calculate offset due to current element
                    val offset = t match {
                        case (_, e, 's) => e.label.size + e.attributes.toString.size + 2
                        case (_, e, 'e) => e.label.size + e.attributes.toString.size + 3
                        case (_, e, 'c) => e.label.size + 3
                    }
                    //add offset to all elm's in tail, and recurse
                    adjIns(ts.map(e => (e._1 + offset, e._2, e._3)), tmp :+ t)
                }
            }

            adjIns(elems, Nil)
    }

    def test(text: String, spans: Seq[(Int, Int, Elem)], expected: Node): Unit = {
        val res = spansToXML(text, spans)
        print("Text: \"%s\", expected:\n%s\nResult:\n%s\n\n" format (text, expected, res))
        assert(expected == res)
    }

    def test1() =
        test(
            text = "The dog chased the cat.",
            spans = Seq(
                (0, 23, <xml/>),
                (4, 22, <phrase/>),
                (4, 7, <token/>)),
            expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
        )

    def test2() =
        test(
            text = "aabbccdd",
            spans = Seq(
                (0, 8, <xml x="1"/>),
                (0, 4, <ab y="foo"/>),
                (4, 8, <cd z="42>3"/>)),
            expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
        )

    def test3alt() =
        test(
            text = "  ",
            spans = Seq(
                (0, 2, <a/>),
                (0, 0, <b/>),
                (0, 0, <c/>),
                (1, 1, <d/>),
                (1, 1, <e/>)),
            expected = <a><b/><c/> <d/><e/> </a>
        )

    def test4() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aabb</ab><cd><ok>cc</ok>dd</cd></xml>
        )

    def test5alt() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (2, 4, <b/>),
                        (4, 4, <empty/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aa<b>bb</b></ab><empty/><cd><ok>cc</ok>dd</cd></xml>
        )

    def test5b() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (2, 2, <empty1/>),
                        (4, 4, <empty2/>),
                        (2, 4, <b/>),
                        (2, 2, <empty3/>),
                        (4, 4, <empty4/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aa<empty1/><b><empty3/>bb<empty2/></b></ab><empty4/><cd><ok>cc</ok>dd</cd></xml>
        )

    def test6() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (2, 4, <b/>),
                        (2, 4, <c/>),
                        (3, 4, <d/>),
                        (4, 8, <cd/>),
                        (4, 6, <ok/>)),
            expected = <xml><ab>aa<b><c>b<d>b</d></c></b></ab><cd><ok>cc</ok>dd</cd></xml>
        )

    def test7() =
        test(
            text = "aabbccdd",
            spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab a="a" b="b"/>),
                        (4, 8, <cd c="c" d="d"/>)),
            expected = <xml><ab a="a" b="b">aabb</ab><cd c="c" d="d">ccdd</cd></xml>
        )

    def failedSpans() = {
        val text = "aabbccdd"
        val spans = Seq((0, 8, <xml/>),
                        (0, 4, <ab/>),
                        (4, 6, <err/>),
                        (4, 8, <cd/>))
        try {
            val res = spansToXML(text, spans)
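            assert(false) // reached only if no exception was thrown
        } catch {
            // catch only Exception, so the AssertionError from assert(false) is not swallowed
            case e: Exception => {
                println("This generates invalid XML:")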
                println("<xml><ab>aabb</ab><err><cd>cc</err>dd</cd></xml>")
                println(e.getMessage)
            }
        }

    }

    def main(args: Array[String]): Unit = {
        test1()
        test2()
        test3alt()
        test4()
        test5alt()
        test5b()
        test6()
        test7()
        failedSpans()
    }
}

SpanToXmlAlt.main(Array())
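
The adjustInset step in both versions exists because every inserted tag shifts the offsets of all tags that follow it. One way to avoid that bookkeeping, sketched here under the assumption that the tags are already rendered to strings and sorted by insertion point, is to insert from the right so that earlier offsets never move. The names TagInsert and insertAll are hypothetical and not part of the answer's code.

```scala
object TagInsert {
  // Assumes `inserts` is sorted ascending by offset, and that tags sharing an
  // offset are already in the order they should appear in the output.
  def insertAll(text: String, inserts: Seq[(Int, String)]): String =
    inserts.foldRight(text) { case ((at, tag), acc) =>
      // Everything left of `at` is still untouched, so `at` needs no adjustment.
      val (before, after) = acc.splitAt(at)
      before + tag + after
    }
}
```

For example, TagInsert.insertAll("ab", Seq((0, "<x>"), (1, "<y/>"), (2, "</x>"))) yields "<x>a<y/>b</x>". Because each new tag lands in front of anything previously inserted at the same offset, processing the list in reverse keeps the sorted order intact.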

巴黎盛开的樱花 2024-10-08 07:45:04

My solution is recursive. I sort the input Seq into the order I need and convert it to a List; after that it is basic pattern matching that follows the specification. The biggest drawback is that, although .toString produces identical strings in the test method, == does not yield true.

import scala.xml.{NodeSeq, Elem, Text}

object SpansToXml {
  type NodeSpan = (Int, Int, Elem)

  def adjustIndices(offset: Int, spans: List[NodeSpan]) = spans.map {
    case (spanStart, spanEnd, spanNode) => (spanStart - offset, spanEnd - offset, spanNode)
  }

  def sortedSpansToXml(text: String, spans: List[NodeSpan]): NodeSeq = {
    spans match {
      // current span starts and ends at index 0, thus no inner text exists
      case (0, 0, node) :: rest => node +: sortedSpansToXml(text, rest)

      // current span starts at index 0 and ends somewhere greater than 0
      case (0, end, node) :: rest =>
        // partition the text and the remaining spans in inner and outer and process both independently
        val (innerSpans, outerSpans) = rest.partition {
          case (spanStart, spanEnd, spanNode) => spanStart <= end && spanEnd <= end
        }
        val (innerText, outerText) = text.splitAt(end)

        // prepend the generated node to the outer xml
        node.copy(child = node.child ++ sortedSpansToXml(innerText, innerSpans)) +: sortedSpansToXml(outerText, adjustIndices(end, outerSpans))

      // current span starts at an index larger than 0: convert the text prefix to a text node
      case (start, end, node) :: rest =>
        val (pre, spanned) = text.splitAt(start)
        Text(pre) +: sortedSpansToXml(spanned, adjustIndices(start, spans))

      // all spans consumed: we can just return the text as node
      case Nil =>
        Text(text)
    }
  }

  def spansToXml(xmlText: String, nodeSpans: Seq[NodeSpan]) = {
    val sortedSpans = nodeSpans.toList.sortBy {
      case (start, end, _) => (start, -end)
    }
    sortedSpansToXml(xmlText, sortedSpans)
  }

  // test code stolen from Blaisorblade and david.rosell

  def test(text: String, spans: Seq[(Int, Int, Elem)], expected: NodeSeq): Unit = {
    val res = spansToXml(text, spans)
    print("Text: \"%s\", expected:\n%s\nResult:\n%s\n\n" format (text, expected, res))
    // == does not hold here, so compare the serialized strings instead.
    assert(expected.toString == res.toString)
  }

  def test1() =
        test(
            text = "The dog chased the cat.",
            spans = Seq((0, 23, <xml/>),(4, 22, <phrase/>),(4, 7, <token/>)),
            expected = <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>
        )

  def test2() =
        test(
            text = "aabbccdd",
            spans = Seq(
                (0, 8, <xml x="1"/>),
                (0, 4, <ab y="foo"/>),
                (4, 8, <cd z="42>3"/>)),
            expected = <xml x="1"><ab y="foo">aabb</ab><cd z="42>3">ccdd</cd></xml>
        )

  def test3() =
        test(
            text = " ",
            spans = Seq(
                (0, 1, <a/>),
                (0, 0, <b/>),
                (0, 0, <c/>),
                (1, 1, <d/>),
                (1, 1, <e/>)),
            expected = <a><b/><c/> <d/><e/></a>
        )

  def test4() =
      test(
          text = "aabbccdd",
          spans = Seq((0, 8, <xml/>),
                      (0, 4, <ab/>),
                      (4, 8, <cd/>),
                      (4, 6, <ok/>)),
          expected = <xml><ab>aabb</ab><cd><ok>cc</ok>dd</cd></xml>
      )

  def test5() =
      test(
          text = "aabbccdd",
          spans = Seq((0, 8, <xml/>),
                      (0, 4, <ab/>),
                      (2, 4, <b/>),
                      (4, 4, <empty/>),
                      (4, 8, <cd/>),
                      (4, 6, <ok/>)),
          expected = <xml><ab>aa<b>bb<empty/></b></ab><cd><ok>cc</ok>dd</cd></xml>
      )

  def test6() =
      test(
          text = "aabbccdd",
          spans = Seq((0, 8, <xml/>),
                      (0, 4, <ab/>),
                      (2, 4, <b/>),
                      (2, 4, <c/>),
                      (3, 4, <d/>),
                      (4, 8, <cd/>),
                      (4, 6, <ok/>)),
          expected = <xml><ab>aa<b><c>b<d>b</d></c></b></ab><cd><ok>cc</ok>dd</cd></xml>
      )

  def test7() =
      test(
          text = "aabbccdd",
          spans = Seq((0, 8, <xml/>),
                      (0, 4, <ab a="a" b="b"/>),
                      (4, 8, <cd c="c" d="d"/>)),
          expected = <xml><ab a="a" b="b">aabb</ab><cd c="c" d="d">ccdd</cd></xml>
      )

}
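
The == mismatch is probably caused by the shape of the result rather than its content: the Nil case always appends a Text node, even when the remaining text is empty, and the top-level value is a NodeSeq rather than a single Elem. A possible workaround, sketched below with the hypothetical helper normalize, is to drop empty Text nodes and merge adjacent ones before comparing. This is an assumption about the cause, not something the answer confirms.

```scala
import scala.xml.{Elem, Node, Text}

object NormalizeText {
  // Recursively drops empty Text nodes and merges neighbouring Text nodes.
  def normalize(nodes: scala.collection.Seq[Node]): Vector[Node] =
    nodes.foldLeft(Vector.empty[Node]) {
      case (acc, t: Text) if t.text.isEmpty => acc
      case (acc, t: Text) =>
        acc.lastOption match {
          case Some(prev: Text) => acc.init :+ Text(prev.text + t.text)
          case _                => acc :+ t
        }
      case (acc, e: Elem) => acc :+ e.copy(child = normalize(e.child))
      case (acc, other)   => acc :+ other
    }
}
```

Comparing NormalizeText.normalize(res) with NormalizeText.normalize(expected) may then succeed where the raw comparison did not; if it still fails, comparing toString output as the test method does remains the fallback.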

甜味超标? 2024-10-08 07:45:04

You can easily create XML nodes dynamically:

scala> import scala.xml._
import scala.xml._

scala> Elem(null, "AAA",xml.Null,xml.TopScope, Array[Node]():_*)
res2: scala.xml.Elem = <AAA></AAA>

Here is the Elem.apply signature: def apply(prefix: String, label: String, attributes: MetaData, scope: NamespaceBinding, child: Node*): Elem

The only problem I see with this approach is that you will need to build the inner nodes first.

Something to make it easier:

scala> def elem(name:String, children:Node*) = Elem(null, name ,xml.Null,xml.TopScope, children:_*); def elem(name:String):Elem=elem(name, Array[Node]():_*);

scala> elem("A",elem("B"))
res11: scala.xml.Elem = <A><B></B></A>
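
To illustrate the inner-nodes-first order, here is a sketch that rebuilds the question's first expected document by hand. It assumes a newer scala-xml version whose Elem factory takes an extra minimizeEmpty flag before the children; the object name ElemBuilder is hypothetical.

```scala
import scala.xml.{Elem, Node, Null, Text, TopScope}

object ElemBuilder {
  // minimizeEmpty = true renders a childless element as <name/>.
  def elem(name: String, children: Node*): Elem =
    Elem(null, name, Null, TopScope, true, children: _*)

  // Innermost element first, then each enclosing element around it.
  val token: Elem  = elem("token", Text("dog"))
  val phrase: Elem = elem("phrase", token, Text(" chased the cat"))
  val doc: Elem    = elem("xml", Text("The "), phrase, Text("."))
}
```

Here ElemBuilder.doc.toString gives <xml>The <phrase><token>dog</token> chased the cat</phrase>.</xml>, which shows why the spans have to be resolved into a nesting order before any Elem can be constructed.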

夕色琉璃 2024-10-08 07:45:04

Here's a solution that is close to correct using string concatenation and XML.loadString:

import scala.collection.mutable.{Buffer, Map}
import scala.xml.{Elem, Node, XML}

def spansToXML(text: String, spans: Seq[(Int, Int, Elem)]): Node = {
  // arrange items so that at each offset:
  //   closing tags sort before opening tags
  //   with two opening tags, the one with the later closing tag sorts first
  //   with two closing tags, the one with the later opening tag sorts first
  val items = Buffer[(Int, Int, Int, String)]()
  for ((begin, end, elem) <- spans) {
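    // build the tags explicitly instead of splitting elem.toString, which
    // renders an empty element as <tag/> on newer scala-xml versions
    // (namespace prefixes are ignored here)
    val beginTag = "<" + elem.label + elem.attributes + ">"
    val endTag = "</" + elem.label + ">"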
    items += ((begin, +1, -end, beginTag))
    items += ((end, -1, -begin, endTag))
  }
  // group tags to be inserted by index
  val inserts = Map[Int, Buffer[String]]()
  for ((index, _, _, tag) <- items.sorted) {
    inserts.getOrElseUpdate(index, Buffer[String]()) += tag
  }
  // put tags and characters into a buffer
  val result = Buffer[String]()
  for (i <- 0 until text.size + 1) {
    for (tags <- inserts.get(i); tag <- tags) {
      result += tag
    }
    result += text.slice(i, i + 1)
  }
  // create XML from the string buffer
  XML.loadString(result.mkString)
}

This passes both of the first two test cases, but fails on the third.
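
The likely reason the third case fails is that a zero-length span produces an opening and a closing tag at the same offset, and the sort puts the closing tag first. A minimal sketch of one way around this follows, assuming properly nested spans and, as a simplification that is not in the original, plain tag names instead of Elem values (so no attributes and no escaping of the text). The names SpanTags and render are hypothetical.

```scala
object SpanTags {
  // Hypothetical simplification: a span is (begin, end, tagName).
  type Span = (Int, Int, String)

  def render(text: String, spans: Seq[Span]): String = {
    val out = new StringBuilder
    var pos = 0                 // how much of `text` has been emitted so far
    var open = List.empty[Span] // stack of currently open elements

    def emitTextTo(i: Int): Unit = { out ++= text.substring(pos, i); pos = i }
    def closeTop(): Unit = {
      val (_, end, name) = open.head
      emitTextTo(end)
      out ++= s"</$name>"
      open = open.tail
    }

    // sortBy is stable, so spans with identical offsets keep their input order.
    for ((begin, end, name) <- spans.sortBy { case (b, e, _) => (b, -e) }) {
      val isEmpty = begin == end
      // An empty span may sit at the very end of its parent; a non-empty one may not.
      while (open.nonEmpty && (open.head._2 < begin || (!isEmpty && open.head._2 == begin)))
        closeTop()
      emitTextTo(begin)
      if (isEmpty) out ++= s"<$name/>" // emitted as one self-closing tag, never split
      else { out ++= s"<$name>"; open = (begin, end, name) :: open }
    }
    while (open.nonEmpty) closeTop()
    emitTextTo(text.length)
    out.toString
  }
}
```

With the spans of the third test case, SpanTags.render(" ", Seq((0, 1, "a"), (0, 0, "b"), (0, 0, "c"), (1, 1, "d"), (1, 1, "e"))) returns <a><b/><c/> <d/><e/></a>. Where an empty span sits exactly on the boundary between two siblings, the placement rule here is only one possible reading of the specification, and crossing spans will make substring throw rather than produce XML.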
