Designing a Translation API - How to Handle Whitespace
My application consumes an external Translation API (no option to use other translation engines). I'm seeing the following unexpected behavior when I call the translation engine.
Input: <b1> Hello World. </b1>
Expected output: <b1> Hola a todos. </b1>
Actual output: <b1>Hola a todos.</b1>
Is it proper for the API to be trimming the spaces? This feels wrong to me.
Note: the API is documented to replace non-HTML tags with <b1></b1> tag pairs (the numbers increment to keep each tag pair unique).
Update: In the end I had to hack around the issue by encoding the spaces before calling the translation API. I don't like it, but I was not able to convince the API owner to change it to GIGO (Garbage In, Garbage Out).
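For readers facing the same trimming, here is a minimal sketch of one possible workaround. The encoding actually used in the update is not stated, so this swaps in a different technique: it records the whitespace just inside each <bN> tag pair before the call and puts it back afterwards. The names translate_preserving_edges and fake_translate are hypothetical, fake_translate only imitates the trimming described above, and the sketch assumes tag pairs are not nested.

```python
import re
from typing import Callable, Dict, Tuple

# One <bN>...</bN> pair, with the whitespace just inside each tag captured separately.
# Assumption: tag pairs are not nested inside one another.
_TAG_PAIR = re.compile(r"(<b(\d+)>)(\s*)(.*?)(\s*)(</b\2>)", re.DOTALL)


def translate_preserving_edges(text: str, translate: Callable[[str], str]) -> str:
    """Call `translate`, then restore the edge whitespace of each tag pair."""
    # Remember the leading/trailing whitespace of every pair, keyed by tag number.
    edges: Dict[str, Tuple[str, str]] = {
        m.group(2): (m.group(3), m.group(5)) for m in _TAG_PAIR.finditer(text)
    }

    translated = translate(text)

    def restore(m: "re.Match[str]") -> str:
        lead, trail = edges.get(m.group(2), ("", ""))
        return f"{m.group(1)}{lead}{m.group(4)}{trail}{m.group(6)}"

    return _TAG_PAIR.sub(restore, translated)


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for the external API: trims whitespace next to tags."""
    trimmed = re.sub(r">\s+|\s+<", lambda m: m.group().strip(), text)
    return trimmed.replace("Hello World.", "Hola a todos.")


if __name__ == "__main__":
    print(translate_preserving_edges("<b1> Hello World. </b1>", fake_translate))
```

Restoring by tag number rather than by position relies on the API keeping the numbered tags intact, which matches the documented behavior but is still an assumption about the real service.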
Comments (2)
Well, in general whitespace is not considered part of a word, so it is not really surprising that the API is doing that. Whether or not this behaviour is OK is probably debatable (at the least it should be documented), but you should follow the rule "be liberal in what you accept and strict in what you produce". As you produce the tokens, you should be more strict.
As far as I know, whitespace in HTML is not particularly significant: multiple spaces are collapsed to a single space, newlines are ignored, etc. So it's not much of a surprise that the leading and trailing spaces in that string are being dropped. From the browser's point of view, they're equivalent.
So the question then becomes: is there an option in the API to preserve spaces, or to treat the incoming text as "plain text" and not HTML?