Find all duplicate documents in SharePoint 2010

Posted on 2024-11-08 21:20:49

When we run searches on our SharePoint instance, a few files in the search results show a "View Duplicates" link.

Is there a way to report on all of these duplicates?

I've seen this SQL, which finds duplicates based on their md5 hash: http://social.technet.microsoft.com/forums/en-US/sharepointsearch/thread/8a8b25d9-a3ac-45df-86de-2a3a7838a534 I have corrected it for SharePoint 2010 compatibility:

-- Step 1: get all files with short names, md5 signatures, and size
SELECT  md5 ,
        RIGHT(DisplayURL, CHARINDEX('/', REVERSE(DisplayURL)) - 1) AS ShortFileName ,
        DisplayURL AS Url ,
        llVal / 1024 AS FileSizeKb
INTO    #listingFilesMd5Size
FROM    SearchServiceApplication_CrawlStore.dbo.MSSCrawlURL y
        INNER JOIN SearchServiceApplication_PropertyStore.dbo.MSSDocProps dp ON ( y.DocID = dp.DocID )
WHERE   dp.pid = 58 -- File size
        AND llVal > 1024 * 10 -- 10 Kb minimum in size
        AND md5 <> 0
        AND CHARINDEX('/', REVERSE(DisplayURL)) > 1

-- Step 2: Filter duplicated items

SELECT  COUNT(*) AS NbDuplicates ,
        md5 ,
        ShortFileName ,
        FileSizeKb
INTO    #duplicates
FROM    #listingFilesMd5Size
GROUP BY md5 ,
        ShortFileName ,
        FileSizeKb
HAVING  COUNT(*) > 1 -- keep only groups that occur more than once

DROP TABLE #listingFilesMd5Size

-- Step 3: show the report with search URLs

SELECT  *,
        NbDuplicates * FileSizeKb AS TotalSpaceKb ,
        'http://srv-moss/SearchCenter/Pages/results.aspx?k=' + ShortFileName AS SearchUrl
FROM    #duplicates
--ORDER BY NbDuplicates * FileSizeKb DESC

DROP TABLE #duplicates

But this only matches exact duplicates, whereas I'm interested in the documents that SharePoint itself treats as duplicates, that is, the ones flagged with the "View Duplicates" link in the search results.

I've seen that there is a managed property "DuplicateHash", but it is not documented anywhere and I cannot find a way of accessing it through the object model.

Thanks


Comments (2)

千紇 2024-11-15 21:20:50

You should not query the database directly; you may put yourself in an unsupported state.

About the duplicates: "search duplicates" have nothing to do with hashes. They are handled by the search engine index, which compares document vectors (mostly terms and number of terms).

You may try to find an FQL query (if using FAST, otherwise a Search QL query) that gives you the result, but I'm not sure this is possible.
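
To make the distinction between hash matching and document-vector comparison concrete, here is a minimal Python sketch. It is not SharePoint's actual algorithm, which is not documented in this thread. The tokenizer, the cosine-similarity measure, the 0.8 threshold, and all function names are assumptions chosen for illustration only.

```python
import hashlib
import math
import re
from collections import Counter


def md5_hex(text: str) -> str:
    """Exact-match fingerprint: any change in the text changes the hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def term_vector(text: str) -> Counter:
    """Bag of lowercase alphanumeric terms with their counts (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors, in [0, 1]."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(x: str, y: str, threshold: float = 0.8) -> bool:
    """Hypothetical near-duplicate test; the threshold is an arbitrary assumption."""
    return cosine_similarity(term_vector(x), term_vector(y)) >= threshold


doc_a = "Quarterly budget report for the finance team, draft version"
doc_b = "Quarterly budget report for the finance team, final version"

print(md5_hex(doc_a) == md5_hex(doc_b))   # hashes differ for a one-word change
print(cosine_similarity(term_vector(doc_a), term_vector(doc_b)))
print(is_near_duplicate(doc_a, doc_b))
```

The two sample documents differ by a single word, so an md5-based query like the one in the question would not pair them, while a term-vector comparison still scores them as highly similar. That is one plausible reason the SQL report and the "View Duplicates" links disagree.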
