检索 TEXT 值后出现奇怪的字符串比较结果
我正在从多个页面抓取数据并将其插入到我的 MySQL 数据库中。可能有重复;我只想存储唯一的条目。为了以防万一我的主键不够,我进行了一个测试,当我收到 MySQL 1062 错误*(主键上有重复条目**)时,会检查该测试。该测试检查要插入的元组的所有部分是否与存储的元组相同。我发现,当我收到 1062 错误时,存储的元组和抓取的元组仅由一个元素/字段(一个 TEXT
字段)不同。
首先,我检索已存储的条目并将它们都传递到 htmlspecialchars()
中以直观地比较输出;他们看起来一模一样。
根据 strlen()
,从数据库检索到的字符串长度为 304 个字符,但新抓取的字符串长度为 305。similar_text()
通过返回 304** 来支持这一点*。
然后我循环遍历一个字符串,将一个字符与另一个字符串进行比较,当出现不匹配时停止。问题出在第一个角色上。在来自数据库的字符串中,它是 N
但两个字符串似乎都以 I
开头(即使在 htmlspecialchars()
的输出中也是如此)。另外,数据库字符串应该短一个字符,而不是更长。
然后,我再次检查了输出(打印 htmlspecialchars()
)和 strlen()
,但这一次在原始字符串(最终出现在数据库中的字符串)之前是插入,并在插入重复项之前。它们看起来和以前一样,并且 strlen()
都返回 305。
所以这让我觉得它们一定是我的 PHP 和 MySQL 之间发生的事情。因此,我不是将新抓取的字符串与数据库中具有相同主键(ID)的字符串进行比较,而是尝试检索一个元组,其中每个字段都等于新抓取的部分中各自的部分,例如 SELECT * FROM table WHERE value1='{$MYSQL_ESCAPED['value1']}' .... AND valueN='{$MYSQL_ESCAPED['valueN']}'; 并且返回元组。因此,它们在各个方面都是相同的,包括有问题的 TEXT
字段。
这是怎么回事?
当我在字符串前面看到 N
时,我立即想到 MSSQL 中的 NVARCHAR
等,但据我所知,这不是 MySQL 的一部分,但是......
可以吗?与 " 每个 TEXT 值这一事实有关使用两个字节长度存储指示值中字节数的前缀。"?
或者这只是表明字符编码问题?
编辑:
- 数据库中没有存储多字节字符。
mb_strlen()
返回与上面提到的strlen()
相同的结果。- 在插入数据库之前使用
utf8_encode()
或mb_convert_encoding()
没有什么区别;不可见的N
仍然是从数据库检索到的字符串的前缀。
注释:
- 在将任何字符串插入数据库之前,我将其传递给
mysql_real_escape_string(trim(preg_replace('/\s\s+/', ' ', $str)))
它将双空格替换为单空格,删除前导和后缀。尾部空格并转义以进行 MySQL 插入。 - 我打印输出的页面&测试为UTF-8。
- 创建后,我的数据库将其字符集设置为
utf8
,将其排序规则设置为utf8_general_ci
,并且我使用SET NAMES 'utf8' COLLATE 'utf8_general_ci';
code> 命令也是如此,作为预防措施。 - 脚注:
*
我也强制退出抓取。**
主键只是我从页面中抓取的 ID (VARCHAR(10)
)。***
常用字符数
I'm scraping data from multiple pages and inserting to my MySQL database. There could be duplicates; I only want to store unique entries. Just in case my primary key isn't sufficient, I put in a test which is checked when I get a MySQL 1062 error* (duplicate entry on primary key**). The test checks that all of the pieces of the tuple to be inserted are identical to the stored tuple. What I found is that the when I get the 1062 error that the stored tuple and the scraped tuple are only different by one element/field, a TEXT
field.
First, I retrieved the already stored entry and passed them both into htmlspecialchars()
to compare the output visually; they looked identical.
According to strlen()
, the string retrieved from the DB was 304 characters in length but the newly scraped string was 305. similar_text()
backed that up by returning 304***.
So then I looped through one string comparing character for character with the other string, stopping when there was a mismatch. The problem was the first character. In the string coming from the DB it was N
yet both strings appear to start with I
(even in their output from htmlspecialchars()
). Plus the DB string was supposedly one character shorter, not longer.
I then checked the output (printing htmlspecialchars()
) and the strlen()
again, but this time before the original string (the one that ends up in the DB) is inserted, and before the duplicated is inserted. They looked the same as before and strlen()
returned 305 for both.
So this made me think their must be something happening between my PHP and my MySQL. So instead of comparing the newly scraped string to the string in the database with the same primary key (the ID), I try to retrieve a tuple where every single field is equal to their respective parts in newly scraped section like SELECT * FROM table WHERE value1='{$MYSQL_ESCAPED['value1']}' .... AND valueN='{$MYSQL_ESCAPED['valueN']}';
and the tuple is returned. Therefore they are identical in every way including that problematic TEXT
field.
What's going on here?
Straight away when I see N
in front of string I think of NVARCHAR
, etc. from MSSQL but as I know that's not a part of MySQL, but...
Could it have anything to do with the fact that "Each TEXT value is stored using a two-byte length prefix that indicates the number of bytes in the value."?
Or does this just point to a character encoding problem?
Edit:
- There are no multi-byte characters stored in the database.
mb_strlen()
returns the same results asstrlen()
where mentioned above.- Using
utf8_encode()
ormb_convert_encoding()
before inserting to the DB makes no difference; an invisibleN
is still prefixing the string retrieved from the DB.
Notes:
- Before inserting any string into my database I pass it through
mysql_real_escape_string(trim(preg_replace('/\s\s+/', ' ', $str)))
which replaces double spaces with single spaces, removes leading & tailing spaces and escapes it for MySQL insertion. - The page I print the output & testing to is UTF-8.
- Upon creation, my DB has its character set set to
utf8
, its collation toutf8_general_ci
and I use theSET NAMES 'utf8' COLLATE 'utf8_general_ci';
command too, as a precaution. - Foot notes:
*
I force an exit from the scraping then also.**
The primary key is just a ID (VARCHAR(10)
) which I scrape from the pages.***
Number of common characters
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
当 MySQL 认为合适时,TEXT 字段会进行字符集转换。但是,MySQL不会无缘无故地随机添加/删除数据。虽然文本字段确实将数据长度存储为包含文本字段数据的磁盘数据 blob 头部的 2 个额外字节,但这 2 个字节永远不会暴露给最终用户。假设整个客户端 -> 数据库 -> 磁盘上 -> 数据库 -> 客户端管道的字符集设置都是相同的,任何地方的字符串长度都不应该发生变化。
TEXT fields are subject to character set conversion as/when MySQL sees fit. However, MySQL will not randomly add/remove data without a reason. While text fields DO store the length of the data as 2 extra bytes at the head of the on-disk data blob containing the text field data, those 2 bytes are NEVER exposed to the end user. Assuming character set settings are the same throughout the client->database->on-disk->database->client pipeline, there should never be a change in string length anywhere.