I have a database with a full-text indexed table. Let's call this table test. It has one field called testfield. Now, let's insert a single record as follows:
insert into test values ('word' + Char(13) + 'test')
This query inserts a value containing a line break.
Now, let's query the table using full-text search:
select * from test where contains(testfield, '"word test"')
This query returns nothing at all.
Also:
select * from test where contains(testfield, '"wordtest"')
returns nothing (which is expected).
Now, let's query the table again with a modified search term:
select * from test where contains(testfield, '"word' + Char(13) + 'test"')
In this case the query returns the correct row.
The question is: why does this happen? I mean, SQL Server should ignore line breaks when searching for a word, and it does not. I do not believe that this is the default behavior of a full-text engine. At the very least, it is not acceptable. What happens if my users insert text with line breaks into my table (which is common, since the field is a memo field and they can write anything they want)?
Is there any way of correcting this?
EDIT
It only happens when I choose Brazilian as the language for the full-text index. If I choose English, none of the problems I mentioned occur.
EDIT
On SQL Server Denali CTP3, neither English nor Portuguese works. Maybe this is a bug in the English stemmer.
EDIT
Screenshot showing the fts_parser output
Comments (2)
Try:
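The query that originally followed is not preserved here. A minimal sketch of what a proximity search could look like, assuming the generic NEAR operator of CONTAINS and the table and column names from the question. Whether it matches across the Char(13) with the Brazilian word breaker has not been verified:

```sql
-- Proximity search: asks for rows where "word" and "test" occur near each other,
-- rather than as an exact phrase.
-- Assumption: table "test" and column "testfield" are the ones from the question.
SELECT *
FROM test
WHERE CONTAINS(testfield, 'word NEAR test');
```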
Looking at the CONTAINS documentation, it seems that you'd need an additional proximity term to find the match past the Char(13).
Edit: If using a proximity term isn't enough, try adding the character to the "noise words" list so that it gets ignored in indexing. You can find the noise word files in $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. The English file is named noiseENG.txt. You can find more about this in Robert Sheldon's article about full-text indexing. (Noise-word files apply to SQL Server 2005; SQL Server 2008 and later use stoplists instead.)
Edit: I am also doing some research on the difference between CHAR(13) and NCHAR(13), as there might be a Unicode vs. non-Unicode problem in your original testing.
Edit: Additionally, in 2008 Portuguese got a new word breaker. Word breakers are language-specific, which is likely why you see this difference. The language chosen for the indexed column also matters.
Edit: To check how words are being broken, try running the text through the full-text parser, sys.dm_fts_parser (if you have access to master).
Edit: This breaks the phrase "hello[13]world" in English and in Portuguese. Below is the screenshot of the results, which are as I would expect. The CHAR(13) is being treated slightly differently.
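The parser check described in this answer can be sketched roughly as follows. This is an illustration, not the answerer's original query: the variable name is hypothetical, and the LCIDs are assumptions about which languages were compared (1033 for English, 1046 for Brazilian Portuguese). sys.dm_fts_parser typically requires sysadmin rights.

```sql
-- Build the test phrase with an embedded carriage return.
-- @phrase is a hypothetical name used only for this sketch.
DECLARE @phrase nvarchar(100) = N'"hello' + NCHAR(13) + N'world"';

-- Arguments: query string, LCID, stoplist id (0 = system stoplist),
-- accent sensitivity (0 = insensitive).
-- 1033 = English (United States)
SELECT occurrence, special_term, display_term
FROM sys.dm_fts_parser(@phrase, 1033, 0, 0);

-- 1046 = Portuguese (Brazil)
SELECT occurrence, special_term, display_term
FROM sys.dm_fts_parser(@phrase, 1046, 0, 0);
```

Comparing the display_term and occurrence columns of the two result sets shows whether each word breaker splits the text at the carriage return.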
I have tried to reproduce your so-called "bug" on my SQL Server 2008 (please see code below). It works exactly as you expected. I am surprised you did not provide the complete code in the first place.
UPDATE: I changed the language of the full-text index to Portuguese and got the same result as in the question. It shows "bug-like behaviour".
UPDATE: MS introduced new word breakers in SQL Server 2008, and Portuguese is one of them.
UPDATE: It looks like this behaviour can be seen in all new breakers. It is up to MS to decide if this is acceptable. You can submit a ticket with support services, though.
WYSIWYG
Maybe you will need to install your own word breaker or downgrade to SQL Server 2005.
Good luck anyway!
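The code this answer refers to is not included above. A minimal repro sketch under stated assumptions: the key column, the constraint name PK_test, the catalog name ft_test_catalog, and the variable @term are all hypothetical, and LCID 1046 (Brazilian Portuguese) is assumed to be the language the question used. Swapping in 1033 would test English instead.

```sql
-- A full-text index needs a unique, single-column, non-nullable key index,
-- so an id column is added here (assumption; the question shows one field only).
CREATE TABLE test (
    id        INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_test PRIMARY KEY,
    testfield NVARCHAR(MAX) NOT NULL
);
GO  -- batch separator understood by SSMS and sqlcmd

CREATE FULLTEXT CATALOG ft_test_catalog;
GO

-- Index testfield with the Brazilian Portuguese word breaker (LCID 1046).
CREATE FULLTEXT INDEX ON test (testfield LANGUAGE 1046)
    KEY INDEX PK_test
    ON ft_test_catalog
    WITH CHANGE_TRACKING AUTO;
GO

INSERT INTO test (testfield) VALUES ('word' + CHAR(13) + 'test');
GO

-- Population is asynchronous; give the index a moment before querying.
WAITFOR DELAY '00:00:05';

-- Phrase search with a space between the words.
SELECT * FROM test WHERE CONTAINS(testfield, '"word test"');

-- Phrase search with the carriage return embedded; the search condition is
-- built in a variable because CONTAINS takes a literal or a variable.
DECLARE @term nvarchar(100) = N'"word' + NCHAR(13) + N'test"';
SELECT * FROM test WHERE CONTAINS(testfield, @term);
```

Running the two SELECT statements against indexes built with different LANGUAGE values is one way to compare the word breakers side by side.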