Programmatically converting plain text to LaTeX code
I'd like to take some user input text and quickly parse it to produce some LaTeX code. At the moment, I'm replacing % with \% and \n with \n\n, but I'm wondering if there are other replacements I should be making to convert plain text to LaTeX.
I'm not super worried about safety here (can you even write malicious LaTeX code?), as this should only be used by users to convert their own text into LaTeX, so they should probably be allowed to use their own LaTeX markup in the pre-converted text. Still, I'd like to make sure the output doesn't include accidental LaTeX commands if possible. If there's a good library for such a conversion, I'd take a look.
Apparently, the following characters
# $ % & ~ _ ^ \ { }
are special in LaTeX, so you should make sure to escape them (prefixing with a backslash will do for some of them; see Thomas' answer for the special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).
Some additional pitfalls:
Non-ASCII characters may need converting (e.g. ä -> \"a).
EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.
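The ä -> \"a pitfall can be handled mechanically for the most common accents. The following is only a sketch: the function name accents_to_latex and the ACCENTS table are made up for illustration, the table covers just five combining marks, and it emits the braced form \"{a} rather than \"a (the two are equivalent in LaTeX). Anything it does not recognise is passed through unchanged.

```python
import unicodedata

# Hypothetical mapping from Unicode combining marks to LaTeX accent commands.
# Only a handful of common accents are covered; extend as needed.
ACCENTS = {
    "\u0308": '\\"',  # diaeresis / umlaut
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
    "\u0302": "\\^",  # circumflex
    "\u0303": "\\~",  # tilde
}


def accents_to_latex(text: str) -> str:
    """Rewrite simple accented letters as LaTeX accent commands."""
    out = []
    for ch in text:
        # NFD splits e.g. 'ä' into 'a' followed by the combining diaeresis.
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in ACCENTS:
            out.append(ACCENTS[decomposed[1]] + "{" + decomposed[0] + "}")
        else:
            # Unknown or non-accented characters are left as they are.
            out.append(ch)
    return "".join(out)


print(accents_to_latex("ä"))
```

Because this step introduces backslashes and braces, it would have to run after any escaping of the special characters, not before.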
As Heinzi said, the special characters need attention. Most can be escaped with a backslash, but \ becomes \textbackslash and ~ becomes \textasciitilde.
I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even Stack Overflow itself works that way.
(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX. Unless you explicitly enable write18 when running latex, but it's disabled by default.)
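The escaping rules from this answer could be sketched in Python roughly as follows. The names escape_latex and LATEX_SPECIALS are invented for the example. Two details go beyond what the answer states and are assumptions: ^ is mapped to \textasciicircum (since \^ is an accent command in LaTeX), and an empty group {} is appended to the \text... commands so that they do not swallow a following space. Line breaks are deliberately left alone.

```python
import re

# Replacement table (illustrative). Backslash and tilde follow the answer;
# the caret mapping and the trailing {} are assumptions of this sketch.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "#": r"\#",
    "$": r"\$",
    "%": r"\%",
    "&": r"\&",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
}

# One character class matching each of the ten special characters.
_SPECIAL_RE = re.compile(r"[\\~^#$%&_{}]")


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in a single pass.

    A single pass matters: replacing characters one after another would
    re-escape the backslashes and braces produced by earlier replacements.
    """
    return _SPECIAL_RE.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)


print(escape_latex("50% of $10 & more_stuff"))
```

Note that this escapes everything, so it only fits the case where users are not supposed to type their own LaTeX markup.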
Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is ensuring that the quoting comes out right.
Text in straight plain-text quotes needs to be converted to LaTeX's quoting (back ticks to open, single quotes to close), which looks easy in a trivial case but is full of gotchas that require careful handling. For modest-size texts, I generally use a naive substitution generated in sed and diddle the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.
Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two, as appropriate) back ticks, and all others are replaced by (one or two) single quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
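A Python approximation of the naive quote substitution described above might look like this. It is not the sed script the answer refers to, and the name naive_quotes is made up. One assumption is added on top of the stated rule: a quote only counts as opening if it is also not preceded by a word character, so that apostrophes such as the one in "What's" are not turned into back ticks.

```python
import re

# First alternative: an opening quote (not preceded by a word character,
# followed by one). Second alternative: any remaining double quote.
_QUOTE_RE = re.compile(r"""(?<!\w)(["'])(?=\w)|(")""")


def _replace_quote(match: re.Match) -> str:
    opening = match.group(1)
    if opening == '"':
        return "``"  # opening double quote -> two back ticks
    if opening == "'":
        return "`"   # opening single quote -> one back tick
    return "''"      # any other double quote -> two single quotes


def naive_quotes(text: str) -> str:
    """Naively convert straight quotes to LaTeX-style quotes.

    Other single quotes (closing quotes, apostrophes) are already the
    right character and are left untouched.
    """
    return _QUOTE_RE.sub(_replace_quote, text)


print(naive_quotes('He said "hello" to me'))
```

Nested cases such as a single quote directly inside a double quote still come out wrong and need the manual clean-up the answer mentions.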
Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...
You can do this by changing the catcode of the character; the TeX Wikibook has more on this. Setting the catcode of $ to 12 ("other"), for example, will turn $ into an ordinary character. However, some characters don't come out as you'd expect (probably because the default OT1-encoded fonts hold other glyphs in those slots): \ becomes a double open quote, { becomes a dash... and redefining } inside a group ({...}) makes TeX choke entirely.
Long story short: only recommended if you know what you're doing.