sql-server-2008 line-breaks full-text-search

SQL Server 2008 - Line breaks and full-text queries

Posted on 2024-11-29 19:04:30

I have a database with a full-text indexed table. Let's call this table test. It has one field called testfield. Now, let's insert a single record as follows:

insert into test values ('word' + Char(13) + 'test')

This query inserts two words separated by a line break (a carriage return).

Now, let's query this table using full-text search:

select * from test where contains(testfield, '"word test"')

In that case, this query returns nothing at all.

Also:

select * from test where contains(testfield, '"wordtest"')

Returns nothing (which is what I expected).

Now, let's query the table again, modifying the search term:

select * from test where contains(testfield, '"word' + Char(13) + 'test"')

In that case the query returns the correct row.

The question is: why does this happen? I mean, SQL Server should ignore line breaks when searching for a word, and it does not. I do not believe that this is the default behavior of a full-text engine; at the very least, it is not acceptable. What happens if my users insert rows containing line breaks into my table (which is common, since the field is a memo field and they can write anything they want)?

Is there any way of correcting this?

EDIT

It only happens when I choose Brazilian as the language for the full-text index. If I choose English, none of the problems I mentioned happen.

EDIT

On SQL Server Denali CTP3, neither English nor Portuguese works. Maybe this is a bug in the English stemmer.

EDIT

Screenshot for usage of fts_parser

碍人泪离人颜 2024-12-06 19:04:30

Try:

select * from test where contains(testfield, 'word NEAR test')

Looking at the CONTAINS documentation, it seems that you'd need the additional proximity term to find the match past the Char(13).

Edit: If using a proximity term isn't enough, try adding the character to the "noise words" list so that it gets ignored in indexing. You can find the noise word files in $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. The English file is named noiseENG.txt. You can find more about this detail in Robert Sheldon's article about full-text indexing. Note that noise word files are the SQL Server 2005 mechanism; SQL Server 2008 replaces them with stoplists.

Edit: I am also doing some research on the difference between CHAR(13) and NCHAR(13), as there might be a Unicode vs. non-Unicode problem in your original testing.

Edit: Additionally, in 2008 Portuguese got a new word breaker. Word breakers are language specific, which is likely why you see this difference. It also matters which language was chosen for the indexed column.
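
Since the language chosen for the indexed column matters, it may help to confirm which language the full-text index actually uses. The following is a minimal sketch, assuming the table is dbo.test as in the question; it reads the catalog views sys.fulltext_index_columns and sys.fulltext_languages.

```sql
-- Assumption: the full-text indexed table is dbo.test (name taken from the question).
-- Lists each full-text indexed column with the LCID and name of its word-breaker language.
SELECT c.name       AS column_name,
       ic.language_id,
       l.name       AS language_name
FROM sys.fulltext_index_columns AS ic
INNER JOIN sys.columns AS c
    ON c.object_id = ic.object_id
   AND c.column_id = ic.column_id
LEFT JOIN sys.fulltext_languages AS l
    ON l.lcid = ic.language_id
WHERE ic.object_id = OBJECT_ID('dbo.test');
```

If the language_id shown is not the one you expected, the difference in word breaking may come from the index definition rather than from the query.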

Edit: To check how words are being broken, try running this (if you have access to master):

declare @english nvarchar(20), @portugese nvarchar(20)
set @english = N'"hello' + NCHAR(13) + N'world"'
set @portugese = N'"Olá' + NCHAR(13) + N'mundo"'

select * from sys.dm_fts_parser (@english, 1033, 0, 0)
select * from sys.dm_fts_parser (@portugese, 2070, 0, 0)

Edit: This breaks the phrase "hello[13]world" into words using both the English and the Portuguese word breakers. Below is a screenshot of the results, which are what I would expect. The CHAR(13) is being treated slightly differently.

Screenshot
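
The script above uses LCID 2070 (Portuguese), while the question mentions Brazilian. As a hedged variation, the sketch below runs the same check with LCID 1046, which is the Brazilian entry in sys.fulltext_languages, and compares a carriage return against a plain space. The variable names are made up for illustration.

```sql
-- Assumption: LCID 1046 (Brazilian) is the language used by the question's index.
DECLARE @with_cr nvarchar(20), @with_space nvarchar(20);
SET @with_cr    = N'"word' + NCHAR(13) + N'test"';
SET @with_space = N'"word test"';

-- If the carriage return is treated as a separator, both queries
-- should list the same terms at the same occurrence positions.
SELECT occurrence, display_term
FROM sys.dm_fts_parser(@with_cr, 1046, 0, 0);

SELECT occurrence, display_term
FROM sys.dm_fts_parser(@with_space, 1046, 0, 0);
```

Comparing the two result sets side by side shows directly whether the word breaker splits on the carriage return.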

贵在坚持 2024-12-06 19:04:30

I tried to reproduce your so-called "bug" on my SQL Server 2008 (please see the setup script further down). At first it worked exactly as you expected. I am surprised you did not provide the complete code in the first place.

UPDATE: I changed the language of the full-text index to Portuguese and got the same result as you. It shows "bug-like behaviour".

"Why does this happen?"

MS introduced new word breakers in SQL Server 2008, and the Portuguese one is among them.

"I mean, SQL should ignore line breaks when searching for a word and it does not. I do not believe that this is the default behavior of a fulltext engine. At least it is not acceptable."

UPDATE: This behaviour can be seen in all of the new word breakers. The following script lists each language with its word breaker version and shows how that word breaker parses the test phrase:

-- Create auxiliary table to test languages
IF OBJECTPROPERTY(object_id('test_languages'), 'IsUserTable') IS NOT NULL
    DROP TABLE test_languages;
GO

CREATE TABLE test_languages
(
    componenttype varchar(30),
    componentname int,
    clsid uniqueidentifier,
    fullpath varchar(2000),
    version_no varchar(50),
    manufacturer varchar(50)
);

-- Populate auxiliary table with the installed word breakers
INSERT INTO test_languages
EXEC ('exec sp_help_fulltext_system_components ''wordbreaker''');

-- Create cursor and check how each language works with sys.dm_fts_parser
DECLARE MY_CURSOR CURSOR FOR
    SELECT componentname
    FROM test_languages
    INNER JOIN sys.fulltext_languages
        ON sys.fulltext_languages.lcid = test_languages.componentname
    ORDER BY name;

DECLARE @test_var NVARCHAR(20);
SET @test_var = '"word' + CHAR(13) + 'test"';
DECLARE @componentname int;

OPEN MY_CURSOR;

FETCH NEXT FROM MY_CURSOR
    INTO @componentname;

WHILE (@@FETCH_STATUS = 0)
BEGIN
    -- Language name, labelled by word breaker version
    SELECT name + ' - ' +
        CASE test_languages.version_no
            WHEN '6.0.6001.18000' THEN 'Updated Language 6.0.6001.18000'
            WHEN '12.0.9735.0'    THEN 'Old Language 12.0.9735.0'
            WHEN '12.0.6828.0'    THEN 'Old Language 12.0.6828.0'
        END AS [Language_Type]
    FROM test_languages
    INNER JOIN sys.fulltext_languages
        ON sys.fulltext_languages.lcid = test_languages.componentname
    WHERE lcid = @componentname;

    -- Terms this language's word breaker produces for the test phrase
    SELECT display_term
    FROM sys.dm_fts_parser (@test_var, @componentname, 0, 0);

    FETCH NEXT FROM MY_CURSOR
        INTO @componentname;
END;

-- Clean up the mess
CLOSE MY_CURSOR;
DEALLOCATE MY_CURSOR;
DROP TABLE test_languages;

It is up to MS to decide whether this is acceptable. You can submit a ticket with support services, though.

"What happens if my users insert lines with line breaks on my table (and this is usual, since they can write anything they want because the field is a memo field)?"

WYSIWYG

"Is there any way of correcting this?"

Maybe you will need to install your own word breaker or downgrade to SQL Server 2005.
Good luck anyway!

        USE master
        GO

        CREATE DATABASE Test_Brazil -- Portuguese (Brazil)
        COLLATE Latin1_General_100_CI_AI
        GO

        USE Test_Brazil
        GO

        CREATE TABLE [dbo].[test](
            [test] [varchar] (100) NOT NULL
        )
        ON [PRIMARY]

        -- Unique index required as the key index of the full-text index
        CREATE UNIQUE NONCLUSTERED INDEX [test] ON [dbo].[test]
        (
            [test] ASC
        )
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
        GO

        CREATE FULLTEXT CATALOG ft AS DEFAULT;

        -- No language given, so the default full-text language is used
        CREATE FULLTEXT INDEX ON test(test)
        KEY INDEX test
        WITH STOPLIST = SYSTEM;

        -- EXEC is needed because this is not the first statement in the batch
        EXEC sp_fulltext_database 'enable'
        GO

        insert into test values ('word' + Char(9) + 'test')  -- Tab
        insert into test values ('word' + Char(10) + 'test') -- Line feed
        insert into test values ('word' + Char(13) + 'test') -- Carriage return
        insert into test values ('word test')

        -- wait 3 sec for the full-text population to finish
        WAITFOR DELAY '00:00:03'

        -- 4 rows returned
        select test from test where contains(test, '"word test"')

        -- Recreate the full-text index with the Portuguese word breaker
        DROP FULLTEXT INDEX ON dbo.test
        CREATE FULLTEXT INDEX ON dbo.test
        (
            test Language 2070 -- Portuguese
        )
        KEY INDEX test;

        -- wait again for the new index to be populated
        WAITFOR DELAY '00:00:03'

        -- 2 rows returned
        select * from test where contains(test, '"word test"')
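
As for "Is there any way of correcting this?", one possible workaround, short of replacing the word breaker, is to normalize line breaks to spaces in the stored text so that every word breaker sees an ordinary separator. This is only a sketch under assumptions: it uses the table test and column testfield from the question, assumes the column is varchar or nvarchar (REPLACE does not accept text or ntext), and assumes that altering the stored text is acceptable for the application.

```sql
-- Assumption: table test / column testfield as in the question, column type varchar or nvarchar.
-- Replaces CR+LF pairs first, then any remaining lone CR or LF, with a single space.
UPDATE test
SET testfield = REPLACE(
                    REPLACE(
                        REPLACE(testfield, CHAR(13) + CHAR(10), ' '),
                        CHAR(13), ' '),
                    CHAR(10), ' ')
WHERE testfield LIKE '%' + CHAR(13) + '%'
   OR testfield LIKE '%' + CHAR(10) + '%';
```

The same REPLACE pattern could be applied to CHAR(9) if tabs turn out to be affected as well. Applying the normalization at insert time, in the application or in a trigger, would avoid having to re-run the update; whether that is appropriate depends on whether the original line breaks need to be preserved for display.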
