Python Markdown module chokes on Unicode conversion (UTF-8)
I'm using the markdown module from web2py to handle marked-up text. The problem is that people are submitting text with smart quotes, special characters, etc., and I need to replace those with their plain equivalents.
I have text like this: '\n\r\nThe Colonels face paled a bit. \xe2\x80\x9cBut, then \xe2\x80" excuse my boldness, sir \xe2\x80" our going to Uvar now'
How do I ensure that calling unicode(txt, 'utf-8') like it does on the text internally inside markdown will not throw an error? The fancy special quotes that word processing programs insert are the normal cause, but there seem to be many characters which are an issue.
Comments (1)
The `\xe2\x80\x9c` is U+201C LEFT DOUBLE QUOTATION MARK (a "smart quote") when decoded as UTF-8. The two occurrences of `\xe2\x80"` are not valid UTF-8 sequences, and the presence there of a `"` (a "dumb" quote) is suspicious. You appear to have a mangling problem or an encoding problem, or both. We need to sort that out before we get to the task of replacing e.g. smart quotes by dumb quotes.

Exactly how are "people submitting stuff"? What transformations has the text gone through before markdown does `unicode(txt, 'utf-8')`?
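To see the difference between the valid and the invalid byte sequences, a small check along these lines may help. This is only a sketch, written for Python 3, where `raw.decode('utf-8')` plays the role of Python 2's `unicode(txt, 'utf-8')`. The helper name `try_decode` and the variables `good`, `bad`, and `lossy` are made up for illustration and are not part of web2py or the markdown module.

```python
# Python 3 sketch; in Python 2, unicode(txt, 'utf-8') corresponds to txt.decode('utf-8').

# Byte strings taken from the sample text in the question.
good = b'\xe2\x80\x9cBut, then'
bad = b'\xe2\x80" excuse my boldness, sir \xe2\x80" our going to Uvar now'


def try_decode(raw):
    """Hypothetical helper: return (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_decode(good)
print(repr(text))  # the first character is U+201C

text, err = try_decode(bad)
# The exception records where decoding failed, which helps locate the mangled bytes.
print(err.start, err.reason)

# Decoding with errors='replace' never raises, but it only hides the damage:
# each undecodable sequence becomes U+FFFD and the original character is lost.
lossy = bad.decode('utf-8', errors='replace')
print(repr(lossy))
```

Using `errors='replace'` keeps markdown from raising, but it does not recover the original characters, so it is still worth finding out where the third byte of each sequence was turned into `"`.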