
Character Encoding in Golang

Published 2024-10-12 12:11:03

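Unlike C++ or Java, which have to deal with a wide variety of character encodings, Golang follows the principle that the greatest truths are the simplest: UTF-8 everywhere. Go source files are UTF-8 and string literals are stored as UTF-8 bytes, so Go programmers rarely run into garbled text inside their own code (data arriving from outside in another encoding still has to be converted). You can even use Chinese characters in identifiers and emoji in string literals, and converting between string and []byte copies the bytes as they are, with no transcoding.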

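package temp // the package name is arbitrary

import (
    "encoding/hex"
    "fmt"
    "testing"
)

func TestTemp(t *testing.T) {
    来自打工人的问候()
}

// 来自打工人的问候 ("a greeting from a worker") shows that identifiers may contain Chinese characters.
func 来自打工人的问候() {
    问候语 := "早安，打工人😁" // "Good morning, workers" followed by an emoji
    fmt.Println(问候语)
    bytes := []byte(问候语) // copies the UTF-8 bytes; no transcoding
    fmt.Println(hex.EncodeToString(bytes))
}

Output:

早安，打工人😁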
e697a9e5ae89efbc8ce68993e5b7a5e4babaf09f9881
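
To make "direct conversion" concrete, here is a minimal sketch. The helper name roundTrip is hypothetical and exists only for illustration. It suggests that converting to []byte and back leaves the bytes untouched, and that the conversion neither validates nor repairs bytes that are not valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip is a hypothetical helper: it converts a string to []byte and back.
// Both conversions copy the bytes but never transcode them.
func roundTrip(s string) string {
	b := []byte(s) // a copy of the string's bytes, unchanged
	return string(b)
}

func main() {
	greeting := "早安，打工人😁"
	fmt.Println(roundTrip(greeting) == greeting) // true

	// A string can hold arbitrary bytes; nothing checks that they are UTF-8.
	invalid := string([]byte{0xff, 0xfe})
	fmt.Println(utf8.ValidString(greeting)) // true
	fmt.Println(utf8.ValidString(invalid))  // false
	fmt.Println(len(roundTrip(invalid)))    // 2
}
```

So text that enters the program in some other encoding stays in that encoding until it is converted explicitly.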

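It is worth noting that a string in Golang is, underneath, a read-only sequence of bytes, which is why converting to or from []byte requires no encoding or decoding. For the same reason, len of a string is the number of bytes, not the number of characters. To count characters, convert the string to a rune slice first (or call utf8.RuneCountInString, which avoids the allocation). The character type, rune, is an alias for int32: it holds a Unicode code point, which is numerically the same as the character's UTF-32 code unit.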

func TestTemp(t *testing.T) {
    fmt.Println(len("早"))         // 3: bytes in the string
    fmt.Println(len([]byte("早"))) // 3: the same bytes
    fmt.Println(len([]rune("早"))) // 1: a single code point
}

// From Go's builtin package:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
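
The byte/rune distinction also shows up when iterating. The sketch below is illustrative only, and countRunes is a hypothetical wrapper name. It counts code points without building a []rune and uses range, which decodes one rune per iteration and yields the byte offset at which that rune starts.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical helper that counts code points without
// allocating a []rune.
func countRunes(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "早安😁"
	fmt.Println(len(s))        // 10 bytes: 3 + 3 + 4
	fmt.Println(countRunes(s)) // 3 code points

	// range decodes one rune per iteration; i is the byte offset of that rune,
	// so the offsets printed here are 0, 3 and 6.
	for i, r := range s {
		fmt.Printf("offset %d: %c (U+%04X, %d bytes)\n", i, r, r, utf8.RuneLen(r))
	}
}
```

Indexing with s[i], by contrast, returns a single byte, which is why slicing a string at an arbitrary offset can cut a character in half.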

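Now let's look at how Go implements UTF-8 encoding. It takes the character's code point, determines the number of bytes from the range the value falls in, and fills in the bit pattern for that length. If the code point is invalid (negative or greater than MaxRune) or lies in the surrogate range, U+FFFD is encoded in its place. The decoding process is not covered here.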

// EncodeRune writes into p (which must be large enough) the UTF-8 encoding of the rune.
// It returns the number of bytes written.
func EncodeRune(p []byte, r rune) int {
    // Negative values are erroneous. Making it unsigned addresses the problem.
    switch i := uint32(r); {
    case i <= rune1Max:
        p[0] = byte(r)
        return 1
    case i <= rune2Max:
        _ = p[1] // eliminate bounds checks
        p[0] = t2 | byte(r>>6)
        p[1] = tx | byte(r)&maskx
        return 2
    case i > MaxRune, surrogateMin <= i && i <= surrogateMax:
        r = RuneError
        fallthrough
    case i <= rune3Max:
        _ = p[2] // eliminate bounds checks
        p[0] = t3 | byte(r>>12)
        p[1] = tx | byte(r>>6)&maskx
        p[2] = tx | byte(r)&maskx
        return 3
    default:
        _ = p[3] // eliminate bounds checks
        p[0] = t4 | byte(r>>18)
        p[1] = tx | byte(r>>12)&maskx
        p[2] = tx | byte(r>>6)&maskx
        p[3] = tx | byte(r)&maskx
        return 4
    }
}

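const (
    t1 = 0b00000000 // 0xxxxxxx: single-byte (ASCII) form
    tx = 0b10000000 // 10xxxxxx: continuation byte
    t2 = 0b11000000 // 110xxxxx: lead byte of a 2-byte sequence
    t3 = 0b11100000 // 1110xxxx: lead byte of a 3-byte sequence
    t4 = 0b11110000 // 11110xxx: lead byte of a 4-byte sequence
    t5 = 0b11111000

    maskx = 0b00111111 // low 6 bits carried by a continuation byte
    mask2 = 0b00011111
    mask3 = 0b00001111
    mask4 = 0b00000111

    rune1Max = 1<<7 - 1
    rune2Max = 1<<11 - 1
    rune3Max = 1<<16 - 1

    RuneError = '\uFFFD'     // the "error" Rune or "Unicode replacement character"
    MaxRune   = '\U0010FFFF' // Maximum valid Unicode code point.
)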

// Code points in the surrogate range are not valid for UTF-8.
const (
    surrogateMin = 0xD800
    surrogateMax = 0xDFFF
)
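
The branches of EncodeRune can be observed from the outside through the public unicode/utf8 API. The following is a minimal sketch, not part of the standard library: encodeHex is a hypothetical helper introduced here, and the sample characters are arbitrary choices, one for each encoded length plus two invalid values.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf8"
)

// encodeHex is a hypothetical helper: it encodes r with utf8.EncodeRune and
// returns the bytes that were written as a hex string.
func encodeHex(r rune) string {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)
	return hex.EncodeToString(buf[:n])
}

func main() {
	fmt.Println(encodeHex('A'))    // 41       (1 byte)
	fmt.Println(encodeHex('é'))    // c3a9     (2 bytes)
	fmt.Println(encodeHex('早'))    // e697a9   (3 bytes)
	fmt.Println(encodeHex('😁'))    // f09f9881 (4 bytes)
	fmt.Println(encodeHex(0xD800)) // efbfbd   (surrogate, encoded as U+FFFD)
	fmt.Println(encodeHex(-1))     // efbfbd   (out of range, encoded as U+FFFD)

	// Decoding goes the other way and reports how many bytes were consumed.
	r, size := utf8.DecodeRuneInString("早安")
	fmt.Printf("%c %d\n", r, size) // 早 3
}
```

The two invalid inputs both produce efbfbd, the three-byte encoding of U+FFFD, which corresponds to the fallthrough into the 3-byte case in the listing above.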
