Unmarshal an ISO-8859-1 XML input in Go

Posted on 2024-11-06 21:09:33

When your XML input isn't encoded in UTF-8, the Unmarshal function of the xml package seems to require a CharsetReader.

Where do you find such a thing?
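
For context, here is a minimal sketch of the failure described above, using only the standard library. The sample document and the `item` type are made up for illustration; the point is only that `xml.Unmarshal` offers no way to set a `CharsetReader`, so a document declaring ISO-8859-1 is expected to be rejected.

```go
package main

import (
    "encoding/xml"
    "fmt"
)

// item is a hypothetical target type used only for this illustration.
type item struct {
    Name string `xml:"name"`
}

// unmarshalLatin1 feeds xml.Unmarshal a document that declares ISO-8859-1.
// No CharsetReader can be supplied through xml.Unmarshal, so an error is expected.
func unmarshalLatin1() error {
    // "caf\xe9" is "café" encoded as ISO-8859-1 (0xE9 is é).
    doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><item><name>caf\xe9</name></item>")
    var it item
    return xml.Unmarshal(doc, &it)
}

func main() {
    fmt.Println("error:", unmarshalLatin1())
}
```

Switching from `xml.Unmarshal` to an `xml.Decoder` is what makes the `CharsetReader` hook available.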

Comments (7)

清醇 2024-11-13 21:09:33

Updated answer for 2015 & beyond:

import (
    "bytes"
    "encoding/xml"

    "golang.org/x/net/html/charset"
)
reader := bytes.NewReader(theXml)
decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)
素手挽清风 2024-11-13 21:09:33

Expanding on @anschel-schaffer-cohen's suggestion and @mjibson's comment,
using the go-charset package as mentioned above allows you to use these three lines

decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReader
err = decoder.Decode(&parsed)

to achieve the required result. Just remember to let charset know where its data files are by setting

charset.CharsetDir = ".../src/code.google.com/p/go-charset/datafiles"

at some point when the app starts up.

EDIT

Instead of setting charset.CharsetDir as above, it's more sensible to just import the data files. They are treated as an embedded resource:

import (
    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data"
    ...
)

go install will just do its thing; this also avoids the deployment headache (where/how do I get data files relative to the executing app?).

Importing a package with an underscore just runs the package's init() function, which loads the required data into memory.
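
To illustrate the underscore-import mechanism itself, here is a small standard-library analogy (not go-charset): a blank import of `crypto/sha256` runs that package's `init()`, which registers SHA-256 with package `crypto`. The helper name `sha256Registered` is made up for this sketch.

```go
package main

import (
    "crypto"
    _ "crypto/sha256" // blank import: only the side effects of init() are wanted
    "fmt"
)

// sha256Registered reports whether SHA-256 has been registered with package crypto.
// The registration happens in crypto/sha256's init(), triggered by the blank import.
func sha256Registered() bool {
    return crypto.SHA256.Available()
}

func main() {
    fmt.Println("SHA-256 registered:", sha256Registered())
}
```

The data package in go-charset relies on the same idea: nothing from it is referenced by name, but importing it makes its contents available to the charset package.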

终陌 2024-11-13 21:09:33

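Here's a sample Go program which uses a CharsetReader function to convert XML input from ISO-8859-1 to UTF-8. The program prints the processing instructions and comments found in the test file. (It targets the pre-Go 1 standard library: os.Error, the "xml" and "utf8" import paths, and xml.NewParser.)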

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
    "strings"
    "utf8"
    "xml"
)

type CharsetISO88591er struct {
    r   io.ByteReader
    buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
    buf := bytes.NewBuffer(make([]byte, 0, utf8.UTFMax))
    return &CharsetISO88591er{r.(io.ByteReader), buf}
}

func (cs *CharsetISO88591er) ReadByte() (b byte, err os.Error) {
    // http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
    // Date: 1999 July 27; Last modified: 27-Feb-2001 05:08
    if cs.buf.Len() <= 0 {
        r, err := cs.r.ReadByte()
        if err != nil {
            return 0, err
        }
        if r < utf8.RuneSelf {
            return r, nil
        }
        cs.buf.WriteRune(int(r))
    }
    return cs.buf.ReadByte()
}

func (cs *CharsetISO88591er) Read(p []byte) (int, os.Error) {
    // Use ReadByte method.
    return 0, os.EINVAL
}

func isCharset(charset string, names []string) bool {
    charset = strings.ToLower(charset)
    for _, n := range names {
        if charset == strings.ToLower(n) {
            return true
        }
    }
    return false
}

func IsCharsetISO88591(charset string) bool {
    // http://www.iana.org/assignments/character-sets
    // (last updated 2010-11-04)
    names := []string{
        // Name
        "ISO_8859-1:1987",
        // Alias (preferred MIME name)
        "ISO-8859-1",
        // Aliases
        "iso-ir-100",
        "ISO_8859-1",
        "latin1",
        "l1",
        "IBM819",
        "CP819",
        "csISOLatin1",
    }
    return isCharset(charset, names)
}

func IsCharsetUTF8(charset string) bool {
    names := []string{
        "UTF-8",
        // Default
        "",
    }
    return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, os.Error) {
    switch {
    case IsCharsetUTF8(charset):
        return input, nil
    case IsCharsetISO88591(charset):
        return NewCharsetISO88591(input), nil
    }
    return nil, os.NewError("CharsetReader: unexpected charset: " + charset)
}

func main() {
    // Print the XML comments from the test file, which should
    // contain most of the printable ISO-8859-1 characters.
    r, err := os.Open("ISO88591.xml")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer r.Close()
    fmt.Println("file:", r.Name())
    p := xml.NewParser(r)
    p.CharsetReader = CharsetReader
    for t, err := p.Token(); t != nil && err == nil; t, err = p.Token() {
        switch t := t.(type) {
        case xml.ProcInst:
            fmt.Println(t.Target, string(t.Inst))
        case xml.Comment:
            fmt.Println(string([]byte(t)))
        }
    }
}

To unmarshal XML with encoding="ISO-8859-1" from an io.Reader r into a structure result, while using the CharsetReader function from the program to translate from ISO-8859-1 to UTF-8, write:

p := xml.NewParser(r)
p.CharsetReader = CharsetReader
err := p.Unmarshal(&result, nil)
三寸金莲 2024-11-13 21:09:33

There appears to be an external library which handles this: go-charset. I haven't tried it myself; does it work for you?

油饼 2024-11-13 21:09:33

Edit: do not use this, use the go-charset answer.

Here's an updated version of @peterSO's code that works with go1:

package main

import (
    "bytes"
    "io"
    "strings"
)

type CharsetISO88591er struct {
    r   io.ByteReader
    buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
    buf := bytes.Buffer{}
    return &CharsetISO88591er{r.(io.ByteReader), &buf}
}

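func (cs *CharsetISO88591er) Read(p []byte) (n int, err error) {
    // Refill only when the buffer is empty, so it cannot grow without bound.
    if cs.buf.Len() == 0 {
        for range p {
            b, err := cs.r.ReadByte()
            if err != nil {
                if cs.buf.Len() == 0 {
                    // Nothing converted yet: report the error instead of swallowing it.
                    return 0, err
                }
                break // return what we have; a persistent error such as io.EOF comes back on a later call
            }
            // Each ISO-8859-1 byte maps to the Unicode code point of the same value.
            cs.buf.WriteRune(rune(b))
        }
    }
    return cs.buf.Read(p)
}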

func isCharset(charset string, names []string) bool {
    charset = strings.ToLower(charset)
    for _, n := range names {
        if charset == strings.ToLower(n) {
            return true
        }
    }
    return false
}

func IsCharsetISO88591(charset string) bool {
    // http://www.iana.org/assignments/character-sets
    // (last updated 2010-11-04)
    names := []string{
        // Name
        "ISO_8859-1:1987",
        // Alias (preferred MIME name)
        "ISO-8859-1",
        // Aliases
        "iso-ir-100",
        "ISO_8859-1",
        "latin1",
        "l1",
        "IBM819",
        "CP819",
        "csISOLatin1",
    }
    return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, error) {
    if IsCharsetISO88591(charset) {
        return NewCharsetISO88591(input), nil
    }
    return input, nil
}

Called with:

d := xml.NewDecoder(reader)
d.CharsetReader = CharsetReader
err := d.Decode(&dst)
倾城花音 2024-11-13 21:09:33

There aren't any provided in the Go distribution at the moment, or anywhere else I can find. That's not surprising, as the hook is less than a month old at the time of writing.

Since a CharsetReader is defined as CharsetReader func(charset string, input io.Reader) (io.Reader, os.Error), you could make your own.
There's one example in the tests, but that might not be exactly useful to you.
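
As a rough sketch of "making your own" with today's standard library only: in Go 1 the hook returns `error` rather than `os.Error`, and since every ISO-8859-1 byte maps to the Unicode code point of the same value, the conversion is small. The names `latin1Reader`, `latin1CharsetReader`, `note` and `decodeNote`, the short alias list, and the sample document are all assumptions made for illustration.

```go
package main

import (
    "bufio"
    "bytes"
    "encoding/xml"
    "fmt"
    "io"
    "strings"
    "unicode/utf8"
)

// latin1Reader converts ISO-8859-1 bytes to UTF-8 as they are read.
type latin1Reader struct {
    src     io.ByteReader
    pending []byte // UTF-8 bytes produced but not yet handed out
}

func (l *latin1Reader) Read(p []byte) (int, error) {
    n := 0
    for n < len(p) {
        if len(l.pending) == 0 {
            b, err := l.src.ReadByte()
            if err != nil {
                return n, err // io.Reader allows n > 0 together with an error
            }
            // A Latin-1 byte is the Unicode code point of the same value.
            l.pending = utf8.AppendRune(l.pending[:0], rune(b))
        }
        c := copy(p[n:], l.pending)
        l.pending = l.pending[c:]
        n += c
    }
    return n, nil
}

// latin1CharsetReader has the signature expected by xml.Decoder.CharsetReader.
// Only a few common labels are recognised here; the list is not exhaustive.
func latin1CharsetReader(label string, input io.Reader) (io.Reader, error) {
    switch strings.ToLower(strings.TrimSpace(label)) {
    case "", "utf-8":
        return input, nil
    case "iso-8859-1", "iso_8859-1", "latin1", "l1":
        br, ok := input.(io.ByteReader)
        if !ok {
            br = bufio.NewReader(input)
        }
        return &latin1Reader{src: br}, nil
    }
    return nil, fmt.Errorf("unsupported charset %q", label)
}

// note is a hypothetical target type.
type note struct {
    Body string `xml:"body"`
}

func decodeNote(data []byte) (note, error) {
    var n note
    d := xml.NewDecoder(bytes.NewReader(data))
    d.CharsetReader = latin1CharsetReader
    err := d.Decode(&n)
    return n, err
}

func main() {
    doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note><body>caf\xe9</body></note>")
    n, err := decodeNote(doc)
    fmt.Println(n.Body, err)
}
```

Returning an error for unknown labels, rather than passing the input through, avoids silently misreading other encodings as UTF-8.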

逆光飞翔i 2024-11-13 21:09:33

The accepted answer didn't work for me (probably because the XML I was receiving did not have the encoding="ISO-8859-1" attribute set as it should).

So instead I found a library that just converts from ISO-8859-1 to UTF-8. My code ended up looking something like this:

import (
    "encoding/xml"
    "io"

    "golang.org/x/text/encoding/charmap"
)

// ...

func someFunc(r io.Reader) (MyDTO, error) {
    // Convert the ISO-8859-1 input to UTF-8 before the XML decoder sees it.
    decoder := xml.NewDecoder(charmap.ISO8859_1.NewDecoder().Reader(r))
    var myDTO MyDTO
    err := decoder.Decode(&myDTO)
    return myDTO, err
}
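
One caveat with converting up front, sketched here with the standard library only (so without charmap): if the document does carry an encoding="ISO-8859-1" declaration, the decoder will still ask for a CharsetReader even though the bytes are already UTF-8, so a pass-through hook is useful. The names `latin1ToUTF8`, `dto` and `decodeAssumingLatin1` are hypothetical, and the sketch assumes the input really is ISO-8859-1.

```go
package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
    "io"
)

// latin1ToUTF8 converts a whole ISO-8859-1 buffer to UTF-8.
// Each input byte becomes the Unicode code point of the same value.
func latin1ToUTF8(in []byte) []byte {
    runes := make([]rune, len(in))
    for i, b := range in {
        runes[i] = rune(b)
    }
    return []byte(string(runes))
}

// dto is a hypothetical target type.
type dto struct {
    Value string `xml:"value"`
}

// decodeAssumingLatin1 treats raw as ISO-8859-1 whether or not it says so.
func decodeAssumingLatin1(raw []byte) (dto, error) {
    var out dto
    d := xml.NewDecoder(bytes.NewReader(latin1ToUTF8(raw)))
    // The bytes are already UTF-8, so any declared encoding is ignored.
    d.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
        return input, nil
    }
    err := d.Decode(&out)
    return out, err
}

func main() {
    // No encoding declaration at all, as in the situation described above.
    out, err := decodeAssumingLatin1([]byte("<dto><value>na\xefve</value></dto>"))
    fmt.Println(out.Value, err)
}
```

The pass-through hook is only safe because the conversion has already happened; on unconverted input it would hide encoding problems.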