Justified plain text from HTML

Posted 2024-08-09 15:05:36

I need a plain text representation of an arbitrary HTML file (e.g., a blog post). So far that's not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (read "p elements") should be justified in the plain text view (to a certain number of columns) and, if possible, hyphenated to give a more readable result. Also, the resulting text file must be UTF-8 or UTF-16.

Simple plain text conversion I can do with XSLT; that's nearly trivial. But justifying text is beyond its capabilities (not quite true, because XSLT is Turing complete, but close enough to reality).

FOP and XSL-FO don't work either. They do what is requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).

I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since there are characters like ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to give groff, from within the XSLT stylesheet, the escape sequences for dozens of Unicode characters.

Another way could be conversion to TeX and output as plain text, but I have never tried this before with (La)TeX.

Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way: a solution should preferably be installable without root rights and work with PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.

Comments (3)

别把无礼当个性 2024-08-16 15:05:36

Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.

There are two features missing, though. To justify the text, you'll need to add spaces to each line but that shouldn't be a big issue (see this code example).

For hyphenation, try this project.
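As a rough illustration of the wrap-then-pad idea described above, here is a minimal sketch that uses only the Python standard library. It swaps in `html.parser` for BeautifulSoup so that it runs without extra packages; the names `ParagraphCollector`, `justify_line`, `justify_paragraph` and `html_to_justified_text` are made up for this example. It does no hyphenation, and it measures width with `len()`, so wide or combining characters would throw the alignment off.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace, as a browser would.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def justify_line(line, width):
    """Widen the gaps between words until the line is `width` characters."""
    words = line.split()
    if len(words) < 2:
        return line  # nothing to stretch
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    base, extra = divmod(spaces, gaps)
    # The leftmost `extra` gaps receive one additional space.
    parts = [w + " " * (base + (1 if i < extra else 0))
             for i, w in enumerate(words[:-1])]
    return "".join(parts) + words[-1]


def justify_paragraph(text, width=72):
    lines = textwrap.wrap(text, width=width)
    # The last line of a paragraph is conventionally left ragged.
    return "\n".join([justify_line(l, width) for l in lines[:-1]] + lines[-1:])


def html_to_justified_text(html, width=72):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(justify_paragraph(p, width) for p in parser.paragraphs)
```

Writing the result with `open(path, "w", encoding="utf-8")` (or `encoding="utf-16"`) would take care of the encoding requirement from the question.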

驱逐舰岛风号 2024-08-16 15:05:36

If you are familiar with Emacs, you may open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it.

You can even write a small function to do the job:

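(require 'w3m)  ; provided by the emacs-w3m package

(defun my-html-to-justified-text (html-file text-file)
  "Render HTML-FILE with emacs-w3m and save it, fully justified, as TEXT-FILE."
  (find-file html-file)
  ;; Render the HTML in the current buffer to text.
  (w3m-rendering-buffer)
  ;; Justify the whole buffer; the line width comes from `fill-column'.
  (set-justification-full (point-min) (point-max))
  (write-file text-file))

(my-html-to-justified-text "~/tmp/2.html" "~/tmp/2.txt")

一片旧的回忆 2024-08-16 15:05:36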

Links or lynx might be worth a try; see the -dump switch. You can then easily handle the encoding part separately, using iconv or something similar.
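If the dump step needs to be scripted, a minimal Python sketch could look like the following. The function names `dump_with_lynx` and `transcode` are invented for this example, and it assumes `lynx` is on the PATH. The encoding of lynx's output depends on its configuration, so the source encoding passed to `transcode` is an assumption you would have to check yourself.

```python
import subprocess


def dump_with_lynx(html_path, width=72):
    """Render an HTML file to plain text bytes via `lynx -dump`.

    -nolist suppresses the numbered link list that lynx appends;
    -width sets the output line width.
    """
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", f"-width={width}", html_path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Re-encode bytes, roughly what `iconv -f SRC -t DST` does."""
    return raw.decode(source_encoding).encode(target_encoding)
```

For instance, `transcode(dump_with_lynx("post.html"), "iso-8859-1")` would produce UTF-8 bytes, provided lynx really emitted ISO-8859-1. Note that this route only wraps the text; justification would still have to be applied afterwards.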
