使用 ispell/aspell 对驼峰式单词进行拼写检查

发布于 2024-09-15 17:06:40 字数 353 浏览 19 评论 0原文

我需要对包含许多驼峰式单词的大型文档进行拼写检查。我想要 ispell 或 aspell 检查各个单词是否拼写正确。

所以，对于这个词：

ScientificProgrezGoesBoink

我很想让它建议这样：

科学进步引起轰动

有什么办法可以做到这一点吗？（我的意思是，在 Emacs 缓冲区上运行它时。）请注意，我不一定希望它建议完整的替代方案。然而，如果它知道 Progrez 不被识别，我希望至少能够替换该部分，或者将该单词添加到我的私人词典中，而不是将每个驼峰式单词都包含到词典中。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

自此以后，行同陌路 2024-09-22 17:06:43

我采纳了 @phils 的建议并进行了更深入的研究。事实证明，如果您获得 camelCase-mode 并重新配置一些 ispell，如下所示：

(defun ispell-get-word (following)
  (when following
    (camelCase-forward-word 1))
  (let* ((start (progn (camelCase-backward-word 1)
                       (point)))
         (end (progn (camelCase-forward-word 1)
                     (point))))
    (list (buffer-substring-no-properties start end)
          start end)))

然后，在在这种情况下，单个驼峰式单词 suchAsThisOne 实际上会被正确地进行拼写检查。（除非你在文档的开头——我刚刚发现。）

所以这显然不是完整的解决方案，但至少是这样。

I took @phils suggestions and dug around a little deeper. It turns out that if you get camelCase-mode and reconfigure some of ispell like this:

(defun ispell-get-word (following)
  (when following
    (camelCase-forward-word 1))
  (let* ((start (progn (camelCase-backward-word 1)
                       (point)))
         (end (progn (camelCase-forward-word 1)
                     (point))))
    (list (buffer-substring-no-properties start end)
          start end)))

then, in that case, individual camel cased words suchAsThisOne will actually be spell-checked correctly. (Unless you're at the beginning of a document -- I just found out.)

So this clearly isn't the fullblown solution, but at least it's something.

回复收藏 0 原文

暖风昔人 2024-09-22 17:06:43

aspell 中有“--run-together”选项。 Hunspell 无法检查驼峰式单词。

如果你读过aspell的代码，你会发现它的算法实际上并没有将camelcase单词拆分成子单词列表。也许这个算法更快，但它会错误地将包含两个字符子词的单词报告为拼写错误。不要浪费时间调整其他拼写选项。我尝试过，但他们没有用。

因此，我们遇到了两个问题：

aspell 将某些驼峰式单词报告为拼写错误
hunspell 将所有驼峰式单词报告为拼写错误

错误解决这两个问题的方法是在 Emacs Lisp 中编写我们自己的谓词。

下面是为 javascript 编写的示例谓词：

(defun split-camel-case (word)
  "Split camel case WORD into a list of strings.
Ported from 'https://github.com/fatih/camelcase/blob/master/camelcase.go'."
  (let* ((case-fold-search nil)
         (len (length word))
         ;; ten sub-words is enough
         (runes [nil nil nil nil nil nil nil nil nil nil])
         (runes-length 0)
         (i 0)
         ch
         (last-class 0)
         (class 0)
         rlt)

    ;; split into fields based on class of character
    (while (< i len)
      (setq ch (elt word i))
      (cond
       ;; lower case
       ((and (>= ch ?a) (<= ch ?z))
        (setq class 1))
       ;; upper case
       ((and (>= ch ?A) (<= ch ?Z))
        (setq class 2))
       ((and (>= ch ?0) (<= ch ?9))
        (setq class 3))
       (t
        (setq class 4)))

      (cond
       ((= class last-class)
        (aset runes
              (1- runes-length)
              (concat (aref runes (1- runes-length)) (char-to-string ch))))
       (t
        (aset runes runes-length (char-to-string ch))
        (setq runes-length (1+ runes-length))))
      (setq last-class class)
      ;; end of while
      (setq i (1+ i)))

    ;; handle upper case -> lower case sequences, e.g.
    ;;     "PDFL", "oader" -> "PDF", "Loader"
    (setq i 0)
    (while (< i (1- runes-length))
      (let* ((ch-first (aref (aref runes i) 0))
             (ch-second (aref (aref runes (1+ i)) 0)))
        (when (and (and (>= ch-first ?A) (<= ch-first ?Z))
                   (and (>= ch-second ?a) (<= ch-second ?z)))
          (aset runes (1+ i) (concat (substring (aref runes i) -1) (aref runes (1+ i))))
          (aset runes i (substring (aref runes i) 0 -1))))
      (setq i (1+ i)))

    ;; construct final result
    (setq i 0)
    (while (< i runes-length)
      (when (> (length (aref runes i)) 0)
        (setq rlt (add-to-list 'rlt (aref runes i) t)))
      (setq i (1+ i)))
     rlt))

(defun flyspell-detect-ispell-args (&optional run-together)
  "If RUN-TOGETHER is true, spell check the CamelCase words.
Please note RUN-TOGETHER will make aspell less capable. So it should only be used in prog-mode-hook."
  ;; force the English dictionary, support Camel Case spelling check (tested with aspell 0.6)
  (let* ((args (list "--sug-mode=ultra" "--lang=en_US"))args)
    (if run-together
        (setq args (append args '("--run-together" "--run-together-limit=16"))))
    args))

;; {{ for aspell only, hunspell does not need setup `ispell-extra-args'
(setq ispell-program-name "aspell")
(setq-default ispell-extra-args (flyspell-detect-ispell-args t))
;; }}

;; ;; {{ hunspell setup, please note we use dictionary "en_US" here
;; (setq ispell-program-name "hunspell")
;; (setq ispell-local-dictionary "en_US")
;; (setq ispell-local-dictionary-alist
;;       '(("en_US" "[[:alpha:]]" "[^[:alpha:]]" "[']" nil ("-d" "en_US") nil utf-8)))
;; ;; }}

(defvar extra-flyspell-predicate '(lambda (word) t)
  "A callback to check WORD.  Return t if WORD is typo.")

(defun my-flyspell-predicate (word)
  "Use aspell to check WORD.  If it's typo return t."
  (let* ((cmd (cond
               ;; aspell: `echo "helle world" | aspell pipe`
               ((string-match-p "aspell$" ispell-program-name)
                (format "echo \"%s\" | %s pipe"
                        word
                        ispell-program-name))
               ;; hunspell: `echo "helle world" | hunspell -a -d en_US`
               (t
                (format "echo \"%s\" | %s -a -d en_US"
                        word
                        ispell-program-name))))
         (cmd-output (shell-command-to-string cmd))
         rlt)
    ;; (message "word=%s cmd=%s" word cmd)
    ;; (message "cmd-output=%s" cmd-output)
    (cond
     ((string-match-p "^&" cmd-output)
      ;; it's a typo because at least one sub-word is typo
      (setq rlt t))
     (t
      ;; not a typo
      (setq rlt nil)))
    rlt))

(defun js-flyspell-verify ()
  (let* ((case-fold-search nil)
         (font-matched (memq (get-text-property (- (point) 1) 'face)
                             '(js2-function-call
                               js2-function-param
                               js2-object-property
                               js2-object-property-access
                               font-lock-variable-name-face
                               font-lock-string-face
                               font-lock-function-name-face
                               font-lock-builtin-face
                               rjsx-text
                               rjsx-tag
                               rjsx-attr)))
         subwords
         word
         (rlt t))
    (cond
     ((not font-matched)
      (setq rlt nil))
     ;; ignore two character word
     ((< (length (setq word (thing-at-point 'word))) 2)
      (setq rlt nil))
     ;; handle camel case word
     ((and (setq subwords (split-camel-case word)) (> (length subwords) 1))
      (let* ((s (mapconcat (lambda (w)
                             (cond
                              ;; sub-word wholse length is less than three
                              ((< (length w) 3)
                               "")
                               ;; special characters
                              ((not (string-match-p "^[a-zA-Z]*$" w))
                               "")
                              (t
                               w))) subwords " ")))
        (setq rlt (my-flyspell-predicate s))))
     (t
      (setq rlt (funcall extra-flyspell-predicate word))))
    rlt))

(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)

或者只使用我的新 pacakge https://github.com/redguardtoo/wucuo

There is "--run-together" option in aspell. Hunspell can't check camelcased word.

If you read the code of aspell, you will find its algorithm actually does not split camelcase word into a list of sub-words. Maybe this algorithm is faster, but it will wrongly report word containing two character sub-word as typo. Don't waste time to tweak other aspell options. I tried and they didn't work.

So we got two problems:

aspell reports SOME camelcased words as typos
hunspell reports ALL camelcased words as typos

Solution to solve BOTH problems is to write our own predicate in Emacs Lisp.

Here is a sample predicate written for javascript:

(defun split-camel-case (word)
  "Split camel case WORD into a list of strings.
Ported from 'https://github.com/fatih/camelcase/blob/master/camelcase.go'."
  (let* ((case-fold-search nil)
         (len (length word))
         ;; ten sub-words is enough
         (runes [nil nil nil nil nil nil nil nil nil nil])
         (runes-length 0)
         (i 0)
         ch
         (last-class 0)
         (class 0)
         rlt)

    ;; split into fields based on class of character
    (while (< i len)
      (setq ch (elt word i))
      (cond
       ;; lower case
       ((and (>= ch ?a) (<= ch ?z))
        (setq class 1))
       ;; upper case
       ((and (>= ch ?A) (<= ch ?Z))
        (setq class 2))
       ((and (>= ch ?0) (<= ch ?9))
        (setq class 3))
       (t
        (setq class 4)))

      (cond
       ((= class last-class)
        (aset runes
              (1- runes-length)
              (concat (aref runes (1- runes-length)) (char-to-string ch))))
       (t
        (aset runes runes-length (char-to-string ch))
        (setq runes-length (1+ runes-length))))
      (setq last-class class)
      ;; end of while
      (setq i (1+ i)))

    ;; handle upper case -> lower case sequences, e.g.
    ;;     "PDFL", "oader" -> "PDF", "Loader"
    (setq i 0)
    (while (< i (1- runes-length))
      (let* ((ch-first (aref (aref runes i) 0))
             (ch-second (aref (aref runes (1+ i)) 0)))
        (when (and (and (>= ch-first ?A) (<= ch-first ?Z))
                   (and (>= ch-second ?a) (<= ch-second ?z)))
          (aset runes (1+ i) (concat (substring (aref runes i) -1) (aref runes (1+ i))))
          (aset runes i (substring (aref runes i) 0 -1))))
      (setq i (1+ i)))

    ;; construct final result
    (setq i 0)
    (while (< i runes-length)
      (when (> (length (aref runes i)) 0)
        (setq rlt (add-to-list 'rlt (aref runes i) t)))
      (setq i (1+ i)))
     rlt))

(defun flyspell-detect-ispell-args (&optional run-together)
  "If RUN-TOGETHER is true, spell check the CamelCase words.
Please note RUN-TOGETHER will make aspell less capable. So it should only be used in prog-mode-hook."
  ;; force the English dictionary, support Camel Case spelling check (tested with aspell 0.6)
  (let* ((args (list "--sug-mode=ultra" "--lang=en_US"))args)
    (if run-together
        (setq args (append args '("--run-together" "--run-together-limit=16"))))
    args))

;; {{ for aspell only, hunspell does not need setup `ispell-extra-args'
(setq ispell-program-name "aspell")
(setq-default ispell-extra-args (flyspell-detect-ispell-args t))
;; }}

;; ;; {{ hunspell setup, please note we use dictionary "en_US" here
;; (setq ispell-program-name "hunspell")
;; (setq ispell-local-dictionary "en_US")
;; (setq ispell-local-dictionary-alist
;;       '(("en_US" "[[:alpha:]]" "[^[:alpha:]]" "[']" nil ("-d" "en_US") nil utf-8)))
;; ;; }}

(defvar extra-flyspell-predicate '(lambda (word) t)
  "A callback to check WORD.  Return t if WORD is typo.")

(defun my-flyspell-predicate (word)
  "Use aspell to check WORD.  If it's typo return t."
  (let* ((cmd (cond
               ;; aspell: `echo "helle world" | aspell pipe`
               ((string-match-p "aspell$" ispell-program-name)
                (format "echo \"%s\" | %s pipe"
                        word
                        ispell-program-name))
               ;; hunspell: `echo "helle world" | hunspell -a -d en_US`
               (t
                (format "echo \"%s\" | %s -a -d en_US"
                        word
                        ispell-program-name))))
         (cmd-output (shell-command-to-string cmd))
         rlt)
    ;; (message "word=%s cmd=%s" word cmd)
    ;; (message "cmd-output=%s" cmd-output)
    (cond
     ((string-match-p "^&" cmd-output)
      ;; it's a typo because at least one sub-word is typo
      (setq rlt t))
     (t
      ;; not a typo
      (setq rlt nil)))
    rlt))

(defun js-flyspell-verify ()
  (let* ((case-fold-search nil)
         (font-matched (memq (get-text-property (- (point) 1) 'face)
                             '(js2-function-call
                               js2-function-param
                               js2-object-property
                               js2-object-property-access
                               font-lock-variable-name-face
                               font-lock-string-face
                               font-lock-function-name-face
                               font-lock-builtin-face
                               rjsx-text
                               rjsx-tag
                               rjsx-attr)))
         subwords
         word
         (rlt t))
    (cond
     ((not font-matched)
      (setq rlt nil))
     ;; ignore two character word
     ((< (length (setq word (thing-at-point 'word))) 2)
      (setq rlt nil))
     ;; handle camel case word
     ((and (setq subwords (split-camel-case word)) (> (length subwords) 1))
      (let* ((s (mapconcat (lambda (w)
                             (cond
                              ;; sub-word wholse length is less than three
                              ((< (length w) 3)
                               "")
                               ;; special characters
                              ((not (string-match-p "^[a-zA-Z]*$" w))
                               "")
                              (t
                               w))) subwords " ")))
        (setq rlt (my-flyspell-predicate s))))
     (t
      (setq rlt (funcall extra-flyspell-predicate word))))
    rlt))

(put 'js2-mode 'flyspell-mode-predicate 'js-flyspell-verify)

Or just use my new pacakge https://github.com/redguardtoo/wucuo

回复收藏 0 原文