Justified plain text from HTML
I need a plain text representation of an arbitrary HTML file (e.g., a blog post). So far that's not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (read: "p elements") should be justified in the plain text view (to a certain number of columns) and, if possible, hyphenated to give a more readable result. Also, the resulting text file must be UTF-8 or UTF-16.
Simple plain text conversion I can do with XSLT; that's close to trivial. But justifying the text is beyond its possibilities (not quite true, because XSLT is Turing complete, but close enough to reality).
FOP and XSL-FO don't work either. They do as requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).
I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since there are characters like ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to tell groff in the XSLT stylesheet the escape sequences for dozens of Unicode characters.
Another way could be conversion to TeX and output as plain text, but I have never tried this before with (La)TeX.
Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way: a solution should preferably work without root rights to install, with PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.
Comments (3)
Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.
There are two features missing, though. To justify the text, you'll need to add spaces to each line but that shouldn't be a big issue (see this code example).
For hyphenation, try this project.
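A minimal sketch of the wrap-then-pad idea described above, not a definitive implementation. To stay self-contained it uses the standard library's `html.parser` as a stand-in for BeautifulSoup; the names `ParagraphCollector`, `justify_line`, `justify_paragraph` and `html_to_justified_text` are made up for this illustration. Hyphenation is left out, and widths are counted in code points, so wide or combining characters would need extra care.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalised text of every <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def justify_line(line, width):
    """Pad the gaps between words so the line is exactly `width` wide."""
    words = line.split()
    gaps = len(words) - 1
    if gaps < 1:
        return line  # nothing to stretch
    spaces = width - sum(len(w) for w in words)
    base, extra = divmod(spaces, gaps)
    # The leftmost `extra` gaps receive one additional space.
    parts = [w + " " * (base + (1 if i < extra else 0))
             for i, w in enumerate(words[:-1])]
    return "".join(parts) + words[-1]


def justify_paragraph(text, width=72):
    """Wrap with textwrap, then justify every line except the last."""
    lines = textwrap.wrap(text, width=width)
    return "\n".join([justify_line(l, width) for l in lines[:-1]] + lines[-1:])


def html_to_justified_text(html, width=72):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(justify_paragraph(p, width) for p in parser.paragraphs)
```

Writing the result with `open(path, "w", encoding="utf-8")` (or `"utf-16"`) would cover the encoding requirement from the question.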
If you are familiar with Emacs, you may open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it. You can even write a small function to do the job.
Links or lynx might be worth a try; see the -dump switch. You can then easily handle the encoding part separately using iconv or something similar.
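A rough sketch of how the dump-then-re-encode route might be scripted from Python, assuming lynx is installed and on the PATH. Only `-dump` comes from the answer; the extra `-nolist`, `-width` and `-display_charset` options and the helper names `lynx_dump_command`, `html_to_text` and `reencode` are assumptions made for this illustration. The re-encoding is done in Python here rather than with iconv.

```python
import shutil
import subprocess


def lynx_dump_command(html_path, width=72):
    """Build the lynx command line for a plain text dump of a local file."""
    return [
        "lynx",
        "-dump",                   # render the page to stdout and exit
        "-nolist",                 # assumption: omit the trailing link list
        f"-width={width}",         # assumption: wrap at this many columns
        "-display_charset=utf-8",  # assumption: ask lynx for UTF-8 output
        html_path,
    ]


def html_to_text(html_path, width=72):
    """Run lynx and return the rendered page as a Python string."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx was not found on PATH")
    result = subprocess.run(
        lynx_dump_command(html_path, width),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def reencode(text, encoding="utf-16"):
    """Encode the dumped text for writing to disk (stand-in for iconv)."""
    return text.encode(encoding)
```

This only covers the dump and the encoding; whether the dumped text comes out justified depends on the browser and its configuration.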