Aspell decodes the dictionary file as Latin-1 even though both the environment and the aspell configuration specify UTF-8
Update
Apparently the solution is to set the encoding with yet another configuration parameter: pass --encoding=UTF-8 on the command line.
For example:
zby@tvm1:/home/xpapers$ aspell --lang=en create master ./dictionary.local < w
Warning: The word "Pérez" is invalid. The character '©' (U+A9) may not appear in the middle of a word. Skipping word.
The file w contains only one word:
zby@tvm1:/home/xpapers$ cat w
Pérez
That is, the second letter is an e with an acute accent. The hexdump:
zby@tvm1:/home/xpapers$ hexdump w
0000000 c350 72a9 7a65 000a
0000007
hexdump prints little-endian 16-bit words, so you need to swap the bytes in each pair, but the content looks like correct UTF-8 (50 is P, then c3 a9 is e with an acute accent), and it displays fine in my console.
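To make the byte-order point concrete, here is a minimal Python sketch (not part of the original session) that rebuilds the file's bytes from the 16-bit words hexdump printed, then decodes them both ways. The variable names are illustrative. The Latin-1 decoding yields the '©' (U+A9) that the aspell warning complains about.

```python
import struct

# 16-bit words exactly as printed by `hexdump w`
words = [0xC350, 0x72A9, 0x7A65, 0x000A]
size = 7  # final offset printed by hexdump (0000007)

# hexdump shows each byte pair as a little-endian word; packing the
# words back as little-endian restores the on-disk byte order.
raw = b"".join(struct.pack("<H", w) for w in words)[:size]

print(raw.hex(" "))                  # 50 c3 a9 72 65 7a 0a
print(repr(raw.decode("utf-8")))     # 'Pérez\n'
print(repr(raw.decode("latin-1")))   # 'PÃ©rez\n'
```

The bytes c3 a9 are a single character in UTF-8 but two characters (Ã and ©) in Latin-1, which matches the mangled word in the warning.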
In the env I have:
zby@tvm1:/home/xpapers$ set | grep LANG
LANG=en_US.UTF-8
The aspell config (as dumped by aspell dump config) is attached below; I think the only relevant part is:
# encoding (string)
# encoding to expect data to be in
# default: !encoding = UTF-8
So it seems that everything is set up for UTF-8, but aspell still seems to try Latin-1.
This is on Ubuntu Karmic Koala:
zby@tvm1:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.10
DISTRIB_CODENAME=karmic
DISTRIB_DESCRIPTION="Ubuntu 9.10"
And Aspell is:
zby@tvm1:~$ aspell -v
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.6)
=============================================
zby@tvm1:/home/xpapers$ aspell dump config
# conf (string)
# main configuration file
# default: aspell.conf
# conf-dir (string)
# location of main configuration file
# default: /etc
# data-dir (string)
# location of language data files
# default: <prefix:lib/aspell> = /usr/lib/aspell
# dict-alias (list)
# create dictionary aliases
# dict-dir (string)
# location of the main word list
# default: <data-dir> = /usr/lib/aspell
# encoding (string)
# encoding to expect data to be in
# default: !encoding = UTF-8
# filter (list)
# add or removes a filter
# filter-path (list)
# path(s) aspell looks for filters
# mode (string)
# filter mode
# default: url
# extra-dicts (list)
# extra dictionaries to use
# home-dir (string)
# location for personal files
# default: <$HOME|./> = /home/zby
# ignore (integer)
# ignore words <= n chars
# default: 1
# ignore-case (boolean)
# ignore case when checking words
# default: false
# ignore-repl (boolean)
# ignore commands to store replacement pairs
# default: false
# keyboard (string)
# keyboard definition to use for typo analysis
# default: standard
# lang (string)
# language code
# default: <language-tag> = en_US
# local-data-dir (string)
# location of local language data files
# default: <actual-dict-dir> = /usr/lib/aspell/
# master (string)
# base name of the main dictionary to use
# default: <lang> = en_US
# normalize (boolean)
# enable Unicode normalization
# default: true
# norm-required (boolean)
# Unicode normalization required for current lang
# default: false
# norm-form (string)
# Unicode normalization form: none, nfd, nfc, comp
# default: nfc
# norm-strict (boolean)
# avoid lossy conversions when normalization
# default: false
# per-conf (string)
# personal configuration file
# default: .aspell.conf
# personal (string)
# personal dictionary file name
# default: .aspell.<lang>.pws = .aspell.en_US.pws
# prefix (string)
# prefix directory
# default: /usr
# repl (string)
# replacements list file name
# default: .aspell.<lang>.prepl = .aspell.en_US.prepl
# run-together (boolean)
# consider run-together words legal
# default: false
# run-together-limit (integer)
# maximum number that can be strung together
# default: 2
# run-together-min (integer)
# minimal length of interior words
# default: 3
# save-repl (boolean)
# save replacement pairs on save all
# default: true
# set-prefix (boolean)
# set the prefix based on executable location
# default: true
# size (string)
# size of the word list
# default: +60
# sug-mode (string)
# suggestion mode
# default: normal
# sug-edit-dist (integer)
# edit distance to use, override sug-mode default
# default: 1
# sug-typo-analysis (boolean)
# use typo analysis, override sug-mode default
# default: true
# sug-repl-table (boolean)
# use replacement tables, override sug-mode default
# default: true
# sug-split-char (list)
# characters to insert when a word is split
# use-other-dicts (boolean)
# use personal, replacement & session dictionaries
# default: true
# variety (list)
# extra information for the word list
# warn (boolean)
# enable warnings
# default: true
# affix-compress (boolean)
# use affix compression when creating dictionaries
# default: false
# clean-affixes (boolean)
# remove invalid affix flags
# default: true
# clean-words (boolean)
# attempts to clean words so that they are valid
# default: false
# invisible-soundslike (boolean)
# compute soundslike on demand rather than storing
# default: false
# partially-expand (boolean)
# partially expand affixes for better suggestions
# default: false
# skip-invalid-words (boolean)
# skip invalid words
# default: true
# validate-affixes (boolean)
# check if affix flags are valid
# default: true
# validate-words (boolean)
# check if words are valid
# default: true
# backup (boolean)
# create a backup file by appending ".bak"
# default: true
# byte-offsets (boolean)
# use byte offsets instead of character offsets
# default: false
# guess (boolean)
# create missing root/affix combinations
# default: false
# keymapping (string)
# keymapping for check mode: "aspell" or "ispell"
# default: aspell
# reverse (boolean)
# reverse the order of the suggest list
# default: false
# suggest (boolean)
# suggest possible replacements
# default: true
# time (boolean)
# time load time and suggest time in pipe mode
# default: false
#######################################################################
#
# Filter: email
# filter for skipping quoted text in email messages
#
# configured as follows:
# f-email-quote (list)
# email quote characters
# f-email-margin (integer)
# num chars that can appear before the quote char
# default: 10
#######################################################################
#
# Filter: html
# filter for dealing with HTML documents
#
# configured as follows:
# f-html-check (list)
# HTML attributes to always check
# f-html-skip (list)
# HTML tags to always skip the contents of
#######################################################################
#
# Filter: tex
# filter for dealing with TeX/LaTeX documents
#
# configured as follows:
# f-tex-check-comments (boolean)
# check TeX comments
# default: false
# f-tex-command (list)
# TeX commands
#######################################################################
#
# Filter: sgml
# filter for dealing with generic SGML/XML documents
#
# configured as follows:
# f-sgml-check (list)
# SGML attributes to always check
# f-sgml-skip (list)
# SGML tags to always skip the contents of
#######################################################################
#
# Filter: texinfo
# filter for dealing with Texinfo documents
#
# configured as follows:
# f-texinfo-ignore (list)
# Texinfo commands to ignore the parameters of
# f-texinfo-ignore-env (list)
# Texinfo environments to ignore
#######################################################################
#
# Filter: context
# experimental filter for hiding delimited contexts
#
# configured as follows:
# f-context-delimiters (list)
# context delimiters (separated by spaces)
# f-context-visible-first (boolean)
# swaps visible and invisible text
# default: false
Comments (1)
When creating a dictionary with --lang=en, Aspell looks for the en language file. On my Ubuntu system that looks like:
So Aspell uses that charset. To override that setting, use the --encoding=utf-8 option.
Then for input (and suggested words) set the encoding option.
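As a rough sketch of the override, the command from the question can be rebuilt with the encoding option added. The helper name build_create_cmd and the temporary output path are illustrative assumptions; the aspell options themselves are the ones quoted in this thread. The run is skipped when aspell is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_create_cmd(lang, out_path, encoding="utf-8"):
    """Return the argv for `aspell create master` with an explicit encoding."""
    return [
        "aspell",
        f"--lang={lang}",
        f"--encoding={encoding}",  # overrides the charset from the language file
        "create",
        "master",
        str(out_path),
    ]


cmd = build_create_cmd("en", "./dictionary.local")
print(" ".join(cmd))

# Only try to run aspell if the binary is actually available.
if shutil.which("aspell"):
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "dictionary.local"
        result = subprocess.run(
            build_create_cmd("en", out),
            input="P\u00e9rez\n".encode("utf-8"),  # word list on stdin, as UTF-8
            capture_output=True,
            check=False,
        )
        # Any warnings from aspell end up on stderr.
        print(result.stderr.decode("utf-8", errors="replace"))
```

Building the argument list separately keeps the option order visible and avoids shell quoting issues when the word list contains non-ASCII characters.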