How do I spell-check HTML and XML?

Posted on 2024-10-30 12:52:33

I have to spell-check a large number of big HTML and XML documents (more than 30,000). I also need a custom dictionary and sophisticated checking algorithms. I am trying to use Bash and Linux utilities (sed, grep, ...) together with hunspell. Hunspell has an option, -H, that forces it to check the document as HTML (the option also works for XML). But there is one problem: it outputs offsets rather than line numbers, and I can't feed it the document line by line, because then it looks inside tags (it can't find the closing tag).

So what is the right way to do this task?
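One way to bridge the offset/line-number gap described above is to convert each reported offset back into a line number after the fact. The following is only a sketch: `offset_to_line` is a hypothetical helper name, and it assumes the offset is a 0-based byte count from the start of the file, which the question does not confirm for hunspell's output.

```shell
#!/usr/bin/env bash
# offset_to_line FILE OFFSET
# Prints the 1-based number of the line that contains the given offset.
# ASSUMPTION: OFFSET is a 0-based byte offset from the start of FILE.
offset_to_line() {
    local file=$1 offset=$2 newlines
    # Offset 0 is always on line 1 (also avoids "head -c 0" on systems
    # that reject a zero byte count).
    if [ "$offset" -eq 0 ]; then
        echo 1
        return
    fi
    # Count the newline characters that occur before the offset.
    newlines=$(head -c "$offset" "$file" | wc -l)
    # Each newline before the offset pushes the position one line down.
    echo $((newlines + 1))
}
```

For example, `offset_to_line my-file.html 1234` prints the line that contains byte 1234. If the offsets turn out to be character counts or 1-based, the arithmetic needs adjusting accordingly.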


Comments (2)

落花随流水 2024-11-06 12:52:33

I just had a similar problem. You should be able to get good output by using undocumented switches such as -u or -U. But be careful: these features seem to be experimental right now, and I only found out about their existence by reading the hunspell source.

So essentially:

hunspell -H -u my-file.html

should do it.

Alternatively, there are also the switches -u1, -u2 and -u3 that you can play around with.
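Since the question involves more than 30,000 files, the command from this answer could be wrapped in a loop along these lines. This is a rough sketch, not a tested recipe: `check_tree` is a hypothetical helper name, the output format of -u is not described in the answer, and the sketch simply prefixes whatever hunspell prints with the file name.

```shell
#!/usr/bin/env bash
# check_tree DIR
# Runs "hunspell -H -u" on every .html and .xml file under DIR and
# prefixes each line of its output with the name of the file checked.
# ASSUMPTION: hunspell writes its report to standard output.
check_tree() {
    local dir=$1 file line
    # -print0 / read -d '' keeps file names with spaces or newlines intact.
    find "$dir" -type f \( -name '*.html' -o -name '*.xml' \) -print0 |
    while IFS= read -r -d '' file; do
        # Redirect stdin so hunspell cannot consume the file list from find.
        hunspell -H -u "$file" < /dev/null |
        while IFS= read -r line; do
            printf '%s: %s\n' "$file" "$line"
        done
    done
}
```

A call such as `check_tree ./docs > report.txt` would collect everything in one report. Swapping -u for -u1, -u2 or -u3 in the loop is one way to compare what the different switches produce.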

花辞树 2024-11-06 12:52:33

Have you tried using tidy?

I have not used it on that many files, but it worked fine for finding issues in 100+ HTML pages. You can also use it on XML files, and it accepts a configuration file with many options, which I have not yet explored.
