UNICODE URI encoding with Apache HttpClient 4
I am using Apache HttpClient 4 for all of my web access.
This means that every request I make has to pass the URI syntax checks.
One of the sites that I am trying to access uses UNICODE as the encoding for its URL GET parameters, i.e.:
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support UNICODE encoding (it only supports UTF-8).
The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...",URLEncoder.encode( "ניב" , "UTF8"))
results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding that it expects.
By the way, a URL object can be created and even used to connect to the web site using the non-converted URL.
Is there any way of creating a URI in a non-UTF-8 encoding?
Is there any way of working with Apache HttpClient 4 with a regular URL (and not a URI)?
Thanks,
Niv
Comments (1)
The site doesn't really use UNICODE as an encoding. That's not URL-encoding, and the sequence %u is invalid in a URL. %u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב = %D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
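To make the difference between the two escape styles concrete, here is a minimal Java sketch. The class name EscapeDemo, the helper jsEscape (a hand-written approximation of JavaScript's escape, not a library call) and the host example.com are assumptions made for illustration only. It also shows that java.net.URI rejects the %u#### form, which is probably the syntax check the question runs into.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class EscapeDemo {
    // Hypothetical helper: approximates JavaScript's escape().
    // Letters, digits and @*_+-./ pass through; other code units below 256
    // become %XX; everything else becomes the non-standard %uXXXX form.
    static String jsEscape(String s) {
        final String safe = "@*_+-./";
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alnum = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (alnum || safe.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Standard percent-encoding of the UTF-8 bytes of the value.
    static String urlEncodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // True if java.net.URI accepts the string.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String term = "\u05E0\u05D9\u05D1"; // the Hebrew word from the question
        String base = "http://example.com/search?srh_txt="; // placeholder host

        System.out.println(jsEscape(term));      // %u05E0%u05D9%u05D1
        System.out.println(urlEncodeUtf8(term)); // %D7%A0%D7%99%D7%91

        System.out.println(isValidUri(base + jsEscape(term)));      // false
        System.out.println(isValidUri(base + urlEncodeUtf8(term))); // true
    }
}
```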
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
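As a rough sketch of that conversion in Java: URLEncoder.encode accepts a charset name, so the same call from the question can be pointed at code page 1255 instead of UTF-8. The class name Cp1255Demo, the helper names and the host example.com are made up for illustration, and the sketch assumes the JRE ships the windows-1255 charset (it is an extended charset, so this is not guaranteed on every runtime). Note that URLEncoder applies form encoding, so a space would become +.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.Charset;

class Cp1255Demo {
    // Placeholder endpoint; the real site URL is not given in the question.
    static final String BASE = "http://example.com/search?srh_txt=";

    // Percent-encode the bytes of `value` in the named charset.
    static String percentEncode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    // Build a java.net.URI whose query carries windows-1255 bytes.
    // Percent-encoded bytes are legal URI syntax whatever charset they came from.
    static URI buildUri(String term) {
        return URI.create(BASE + percentEncode(term, "windows-1255"));
    }

    public static void main(String[] args) {
        String term = "\u05E0\u05D9\u05D1"; // the Hebrew word from the question

        if (!Charset.isSupported("windows-1255")) {
            System.out.println("windows-1255 is not available on this JRE");
            return;
        }

        System.out.println(percentEncode(term, "windows-1255")); // %F0%E9%E1

        URI uri = buildUri(term);
        System.out.println(uri.getRawQuery()); // srh_txt=%F0%E9%E1

        // With HttpClient 4 on the classpath, the same string could be passed
        // to new HttpGet(uri.toString()); that step is left out here so the
        // sketch runs with the JDK alone.
    }
}
```

Whether the site in question actually accepts these bytes can only be confirmed by trying it; the sketch only shows that such a URI passes the syntax check.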