En dash gets garbled during HTTP response handling or text manipulation
I'm writing code to work with text from Wikipedia and am having issues with en dashes being garbled. I haven't worked with en dashes or other non-standard characters before (non-standard to me being characters that don't appear on my keyboard ;), so I'm not sure which part of what I'm doing is wrong. Here's what is happening, along with code snippets.
I send a request to Wikipedia (I'm using the Apache HttpComponents client API for communicating with Wikipedia) for the contents of an article and save it in a String:
DefaultHttpClient client = new DefaultHttpClient();
HttpGet queryRequest = new HttpGet(query); // query is the URL for retrieving the article contents.
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = client.execute(queryRequest, responseHandler);
At this point, if I print "responseBody" to System.out, the en dashes are displayed in my Eclipse console as '?'. This might just be an Eclipse console display issue, so I'll move on.
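One way to tell a console display problem from a genuinely damaged string is to print the string's code points instead of its characters, since that output is pure ASCII and survives any console encoding. The following is a minimal sketch; the class name `CodePointDump`, the method `dump`, and the sample string are hypothetical and stand in for `responseBody`.

```java
public class CodePointDump {
    // Returns the code points of s as space-separated "U+XXXX" tokens.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for responseBody: 'A', en dash (U+2013), 'Z'.
        String sample = "A\u2013Z";
        System.out.println(dump(sample)); // prints "U+0041 U+2013 U+005A"
    }
}
```

If the dump shows U+2013 where the en dash should be, the string is intact and only the console rendering is off. If it shows U+003F (a literal question mark) or U+FFFD (the Unicode replacement character), the damage happened before printing.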
I manipulate the text, ignoring the en dashes, and then send the text back to Wikipedia.
List<NameValuePair> postParams = new ArrayList<NameValuePair>();
postParams.add(new BasicNameValuePair("text", content)); // content is a String with the article text
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(postParams, "UTF-8");
HttpPost postRequest = new HttpPost(url); // url is the basic URL for the Wikipedia api
postRequest.setEntity(entity);
postRequest.addHeader("Content-Type", "application/x-www-form-urlencoded");
ResponseHandler<String> postResponseHandler = new BasicResponseHandler();
String postResponseBody = client.execute(postRequest, postResponseHandler);
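To check whether the form-encoding step can carry an en dash at all, it may help to look at what UTF-8 percent-encoding produces for one. This sketch uses the JDK's `java.net.URLEncoder` as a stand-in; it is an assumption that the HttpComponents entity produces equivalent output for this character, and the class name `FormEncodingCheck` and method `encodeUtf8` are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingCheck {
    // Percent-encodes value as application/x-www-form-urlencoded using UTF-8.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An intact en dash (U+2013) becomes three percent-encoded bytes.
        System.out.println(encodeUtf8("A\u2013Z")); // prints "A%E2%80%93Z"
        // A string that already holds a literal '?' encodes differently.
        System.out.println(encodeUtf8("A?Z")); // prints "A%3FZ"
    }
}
```

If the text being posted encodes to `%3F` where the en dash should be, the character was already lost before the POST was built, which points at the earlier read or processing steps rather than the upload.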
When the text, now uploaded to Wikipedia, is displayed in a web browser, the former en dashes show up as a '?' in a box (an unknown-character glyph?). So somewhere I am inadvertently changing or mis-encoding the en dashes, but I'm not sure exactly where.
Can someone point me in the right direction?
Comments (2)
Now for the real answer. The problem with the non-English characters getting mangled had nothing to do with Apache HttpComponents or with any Java string handling or manipulation. The problem was with the Eclipse IDE running on Windows.
By default, an Eclipse run configuration uses the system's default encoding, which is Cp1252 on Windows. Cp1252 cannot represent every character that UTF-8 can, so characters outside it get mangled. I found the solution here. In Eclipse, open Run Configurations, select the configuration for the project you are running, and go to the 'Common' tab. In the encoding section, change the setting from "Default" to "Other" and set the encoding to UTF-8.
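The effect of the encoding choice can be seen by converting an en dash to bytes under different charsets. This is a minimal sketch, not a reproduction of what Eclipse does internally; US-ASCII is used here only because it is guaranteed to lack the en dash, so it is an illustration of a charset that cannot represent a character and not a claim about Cp1252 specifically. The class name `EncodingDemo` and the helper `hex` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Formats bytes as space-separated two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013";
        // The charset the JVM falls back to when none is given explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());
        // UTF-8 represents the en dash as three bytes.
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_8))); // prints "E2 80 93"
        // A charset without the character substitutes '?' (0x3F).
        System.out.println(hex(enDash.getBytes(StandardCharsets.US_ASCII))); // prints "3F"
    }
}
```

Running this once with the run configuration's encoding left at "Default" and once with it set to UTF-8 should show whether the first line of output changes, which is a quick way to confirm the setting took effect.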
All is now well.
I have yet to figure out why the en dash is getting mangled. I do have a (possibly kludgy) fix in the meantime.
I'm basically replacing all instances of the 'unknown' character with the en dash character. This works as long as the original content doesn't contain any other characters that are also being converted into the 'unknown' character.
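The workaround described above could look roughly like the sketch below. It assumes the 'unknown' character is the Unicode replacement character U+FFFD, which the answer does not confirm; if the damaged text actually holds a literal '?', the same replacement would also rewrite genuine question marks. The class name `ReplacementFix` and the method `restoreEnDashes` are hypothetical.

```java
public class ReplacementFix {
    // Assumption: the "unknown" character is the replacement character U+FFFD.
    static final char REPLACEMENT = '\uFFFD';
    static final char EN_DASH = '\u2013';

    // Replaces every replacement character with an en dash.
    // Lossy: any other character that was mangled to U+FFFD also becomes an en dash.
    static String restoreEnDashes(String text) {
        return text.replace(REPLACEMENT, EN_DASH);
    }

    public static void main(String[] args) {
        String damaged = "A\uFFFDZ";
        String repaired = restoreEnDashes(damaged);
        // Compare by code point so the check does not depend on console encoding.
        System.out.println(repaired.charAt(1) == EN_DASH); // prints "true"
    }
}
```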