JMeter CSV Data Set is mangling Japanese strings stored as proper UTF-8; I'm getting question marks

Posted on 2024-10-09 04:27:11

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.

What I'm seeing:

Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????

This is in the POST field of my HTTP Request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.

About the data:

- The CSV file is very simple in structure.
- There's only one field (one column), which I name TERM and later use as ${TERM}.
- I don't really need full CSV, because it's only one string per line.
- There are no commas or quotes.
- It's UTF-8: when I run the Unix "file" command on the file, it says UTF-8 text.
- I've also verified UTF-8 in command-line and graphical mode on two machines.

Interesting note:
I noticed a coincidence: if there are 15 Japanese characters, then I get 15 question marks, so at some point the text is being seen as full characters and not just bytes.

JMeter CSV Data Set Config:

Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads

A few things I've tried:
- Tried Allow Quoted Data. The output changed to other strange characters.
- Added -Dfile.encoding=UTF-8.
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks.

And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.

If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.

Edited Dec 24:

Tweaks:

- Changed formatting and added bullet points for more clarity.
- Clarified that the file is UTF-8, and that I have verified this.

A new theory:

- Is it possible that the Japanese characters are making it through, and that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all just have a display issue in the UI?
- Is there a way in JMeter to see the numeric value of a character or string, that is, to tell JMeter to display the list of Unicode code points?
- I'll look at my last log files, although I suppose even the server logs could mis-map the characters.
- Also, perhaps the mapping to question marks happens during variable expansion inside the text field that I POST, where I reference ${TERM}, so the corruption occurs at that later point rather than when the CSV is read. If that happened, AND it was also mis-displayed in the UI, it might lead to a false conclusion.
- What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
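On the question of seeing numeric values: here is a minimal sketch in plain Java, outside JMeter, that lists the Unicode code points and UTF-8 bytes of a string. The class and method names (`CodePointDump`, `codePoints`, `utf8Hex`) are made up for illustration and are not part of JMeter. A real "?" shows up as U+003F, so this kind of dump distinguishes a display-only problem from actual corruption. The sample text uses `\u` escapes so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of JMeter.
class CodePointDump {
    // Returns the string's code points as "U+XXXX" tokens separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Returns the string's UTF-8 bytes as two-digit hex tokens separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String term = "\u3092\u30A4"; // two characters from the sample input
        System.out.println(codePoints(term)); // U+3092 U+30A4
        System.out.println(codePoints("??")); // U+003F U+003F
        System.out.println(utf8Hex(term));    // E3 82 92 E3 82 A4
    }
}
```

Inside a test plan, the same logic could presumably be applied to the value returned by `vars.get("TERM")` in a JSR223 element and written to the log; that wiring is an assumption and is not shown here.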

赠意 2024-10-16 04:27:11

Found the issue, there was another place the UTF-8 had to be specified.

In the HTTP Request, to the right of the Method, you also have to set Content Encoding to UTF-8.

Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:

1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.

2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not good practice to assume that: maybe HTTP defaults to ISO-Latin1 or something, and even if there's a spec, maybe folks don't follow it.

3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!

(OK, so points 1, 2 and 3 overlap a bit)

4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
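The two-step suspicion in point 4 can be reproduced with the JDK's own `java.net.URLEncoder`, which first converts characters to bytes in the named charset and then writes each byte as `%XX`. This is only a sketch of the mechanism: it is not JMeter's code, and treating ISO-8859-1 as the charset used when Content Encoding is left blank is an assumption. `PercentEncodingDemo` is a made-up name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; it uses the JDK's URLEncoder, not JMeter internals.
class PercentEncodingDemo {
    // URLEncoder.encode does both steps: characters -> bytes in charsetName,
    // then bytes -> %XX escapes.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String term = "\u3092\u30A4"; // two characters from the sample input
        System.out.println(encode(term, "UTF-8"));      // %E3%82%92%E3%82%A4
        System.out.println(encode(term, "ISO-8859-1")); // %3F%3F
    }
}
```

`%3F` is the percent escape for "?", so the second line shows the same symptom as the Encode checkbox did: the loss happens in the character-to-byte step, before any percent-encoding.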

5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.

And the "?" is Java's way of reporting a problem BY default; this started in the Java 1.4x timeframe. In my Java code I prefer to have encoding errors reported as exceptions, but again, that's not the default, and not what JMeter does.
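For anyone who wants the exception behavior in their own Java code, here is a minimal sketch using `CharsetEncoder` with `CodingErrorAction.REPORT`. The class and method names (`StrictEncoding`, `encodeStrict`, `canEncode`) are hypothetical, and this shows plain JDK usage, not anything JMeter exposes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
class StrictEncoding {
    // Encodes text and throws instead of silently substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer out = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    // Returns true if every character of text can be represented in charset.
    static boolean canEncode(String text, Charset charset) {
        try {
            encodeStrict(text, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String term = "\u3092\u30A4"; // two characters from the sample input
        // Lenient String.getBytes: unmappable characters quietly become '?'.
        byte[] lenient = term.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(lenient, StandardCharsets.ISO_8859_1)); // ??
        // Strict encoder: the same input is rejected.
        System.out.println(canEncode(term, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(term, StandardCharsets.UTF_8));      // true
    }
}
```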

So I learned my lesson.

The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
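The counting hint can be checked with a small sketch that simulates both failure modes in plain Java. The names (`QuestionMarkCount`, `lostAtOutput`, `lostAtInput`) are invented, and US-ASCII and ISO-8859-1 merely stand in for "the wrong charset"; the charsets involved in a real pipeline may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the "count the question marks" hint.
class QuestionMarkCount {
    // Case 1: Java holds real characters, and they are lost only when written
    // out in a charset that cannot represent them. One '?' per character.
    static String lostAtOutput(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // Case 2: the UTF-8 bytes are misread on the way in, so Java never sees the
    // real characters. Each non-ASCII byte becomes U+FFFD, which then turns into
    // '?' on output. One '?' per byte, so three per character for this sample.
    static String lostAtInput(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.US_ASCII);
        return lostAtOutput(misread);
    }

    public static void main(String[] args) {
        String term = "\u3092\u30A4"; // two characters, three UTF-8 bytes each
        System.out.println(lostAtOutput(term)); // ??
        System.out.println(lostAtInput(term));  // ??????
    }
}
```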

再见回来 2024-10-16 04:27:11

I came across this topic while searching for a way to use parameters from a CSV file that contains some columns written in Hebrew.

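I used Excel 2007 to create 1000 rows of user registration data. The first and last names had to be in Hebrew.
I exported the file as a "Unicode Text" file. It came out tab-delimited.
"Unicode Text" is saved as UTF-16 LE (Little Endian), not as UTF-8. This is important.
I opened the result in Notepad++ and could see the Hebrew letters correctly. Notepad++ has an "Encoding" menu item where you can check or change the encoding, so I changed it from Little Endian to UTF-8.
Then I replaced the tabs with commas (just select a tab character and paste it into the "Find" box).
The parameters were substituted fine, but after running the script I saw the following:
In the "View Results Tree" listener, I opened the "Results" tab of the "Http Request".
The parameters had been substituted, but the request's HTTP view tab (at the bottom) showed me some gibberish.
But when I looked at the Raw view, I found that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94, which, taken in pairs (%D7%A9), correspond correctly to Hebrew letters.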
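The Notepad++ conversion steps above (UTF-16 LE to UTF-8, tabs to commas) could also be scripted. This is a rough sketch with a made-up class name (`UnicodeTextToCsv`); it assumes that no field contains a comma or quote of its own, and it drops a leading byte order mark if one is present.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical converter for a tab-delimited UTF-16 LE export.
class UnicodeTextToCsv {
    // Decodes UTF-16 LE bytes, drops a leading BOM if present, swaps tabs for
    // commas, and returns the result as UTF-8 bytes.
    // Assumption: fields contain no commas or quotes, so no CSV quoting is done.
    static byte[] toUtf8Csv(byte[] utf16le) {
        String text = new String(utf16le, StandardCharsets.UTF_16LE);
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        return text.replace('\t', ',').getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two short Hebrew fields separated by a tab, preceded by a BOM.
        byte[] exported = "\uFEFF\u05E9\u05E8\t\u05DE\u05D5\r\n"
                .getBytes(StandardCharsets.UTF_16LE);
        byte[] csv = toUtf8Csv(exported);
        System.out.println(csv.length); // 11: four 2-byte letters, comma, CR, LF
    }
}
```

For real files, the input bytes would come from something like `Files.readAllBytes` and the result would go to `Files.write`; the file names are left out here because the post does not give them.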

It seems to me that JMeter has a bug and cannot display Unicode characters correctly, but it does send (POST) them fine.
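One way to double-check that the Raw view really holds valid UTF-8 is to decode the percent string outside JMeter and look at the code points. A minimal sketch using the JDK's `java.net.URLDecoder`; the class name `PercentDecodingDemo` is invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.stream.Collectors;

// Hypothetical checker for percent-encoded request parameters.
class PercentDecodingDemo {
    // Decodes %XX escapes as UTF-8 and lists the resulting code points.
    static String decodeToCodePoints(String percentEncoded) {
        try {
            String decoded = URLDecoder.decode(percentEncoded, "UTF-8");
            return decoded.codePoints()
                    .mapToObj(cp -> String.format("U+%04X", cp))
                    .collect(Collectors.joining(" "));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String raw = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94"; // string from the Raw view
        System.out.println(decodeToCodePoints(raw));
        // U+05E9 U+05E8 U+05DE U+05D5 U+05D8 U+05D4
    }
}
```

All six code points fall in the Hebrew block (U+0590 to U+05FF), which is consistent with the request being sent correctly and only the HTTP view displaying it wrongly.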

Hope I'm right, and hope it helps someone.