Indexing Word/PDF documents from the file system into SQL Server

Posted on 2024-11-28 23:03:00

I'm trying to come up with a simple solution to a problem I have because all of those I have found so far just seem too complicated!

The situation is that we use a proprietary application to manage most aspects of our business. It has a SQL Server 2005 backend database, which is quite large. The application also lets us attach Word and PDF documents to records, a feature we use extensively. The documents are stored in the file system on the server, with the filenames referenced in the database. Unfortunately the search facilities in the application are poor, so I'm trying to build my own version.

So far I've got a neat ASP.NET page with a search box that lets users enter words to search for, as well as filter their results on other fields, such as department, date, etc. The stored procedure I've written looks for the search words in several different fields in the database. What I'm really aiming for is a Google-style 'one search to rule them all' effect, where users don't have to specify where they expect to find the word they're looking for; they just get hits anywhere it appears in the database. And this is working.

What I want to add now is the ability for the search to include the text of the documents which are 'attached' to records. They are all either .doc or .pdf files but if I couldn't search the .pdf files it wouldn't be the end of the world.

In my ideal world I'd find some software that would index the folder containing the documents (currently around 100,000 of them, averaging about 100 KB each) and populate a table in my existing database with this index, so that I could then just include that table in my search. I'd love it to contain one record for each unique word it indexed, plus a join table linking each word to the documents in the file system that contain it.
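To make the word table and join table idea concrete, here is a minimal sketch in Python. It rests on several assumptions: an in-memory SQLite database stands in for SQL Server, the table names (`words`, `documents`, `word_documents`) and both function names are hypothetical, and the text is assumed to have already been extracted from the .doc and .pdf files by some other tool.

```python
import re
import sqlite3


def build_index(conn, documents):
    """Index `documents`, a mapping of file path -> already-extracted plain text."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS words (
            id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        CREATE TABLE IF NOT EXISTS word_documents (
            word_id INTEGER NOT NULL REFERENCES words(id),
            document_id INTEGER NOT NULL REFERENCES documents(id),
            PRIMARY KEY (word_id, document_id));
    """)
    for path, text in documents.items():
        conn.execute("INSERT OR IGNORE INTO documents (path) VALUES (?)", (path,))
        doc_id = conn.execute(
            "SELECT id FROM documents WHERE path = ?", (path,)).fetchone()[0]
        # One row per unique word, and one join row per (word, document) pair.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO word_documents VALUES (?, ?)",
                (word_id, doc_id))
    conn.commit()


def find_documents(conn, word):
    """Return the paths of all documents containing `word`, sorted by path."""
    rows = conn.execute("""
        SELECT d.path
        FROM documents d
        JOIN word_documents wd ON wd.document_id = d.id
        JOIN words w ON w.id = wd.word_id
        WHERE w.word = ?
        ORDER BY d.path
    """, (word.lower(),))
    return [row[0] for row in rows]
```

A search for a single word then becomes a join across the three tables, which could in principle be combined with the existing field searches in one query. Multi-word queries, stemming and ranking are left out of this sketch.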

Given that this seems fanciful, and that as far as I can see there isn't any software that will do this or anything close to it, what solution would you recommend? The server already has dtSearch running on it, indexing the very files I'm interested in. However, whilst I could wade through the documentation trying to figure out how to search this index from my own webpage (which I've started to do, and found heavy going), that would have to be a separate search from the one against the SQL database. I couldn't return results from the file index and the database in a unified way.

So, starting from the ultimate wish of having the indexed words stored in the database, with a view to implementing full-text searching on that, what would anyone suggest?

Comments (1)

时光无声 2024-12-05 23:03:00

SQL Server has full-text search (http://msdn.microsoft.com/en-us/library/ms142571.aspx); this supports both PDF and Word files (though with some wrinkles - installation can be a bit tricky). The link is to the SQL Server 2008 documentation, but the feature has been present since SQL Server 2000.

So, super simplistically - your solution would require you to load the documents into SQL Server, and amend your stored proc to query them using the built-in free text search features.

Keeping the file system and database versions of the document synchronized could be a challenge, but other than that, I think the solution should be fairly straightforward.
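As a rough illustration of the synchronization step, the Python sketch below compares the documents folder with a record of what was loaded last time and works out what needs reloading or deleting. The function name `plan_sync` and the idea of keeping each file's modification time alongside its database copy are assumptions made for this example, not features of SQL Server or of the application.

```python
from pathlib import Path


def plan_sync(folder, indexed):
    """Compare a folder with `indexed`, a mapping of file name -> modification
    time recorded when the file was last loaded into the database.

    Returns (to_load, to_delete): names of new or changed files to (re)load,
    and names that are in the database but no longer on disk.
    """
    on_disk = {
        p.name: p.stat().st_mtime
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".doc", ".pdf"}
    }
    # A file needs loading if it is new or its modification time has changed.
    to_load = sorted(
        name for name, mtime in on_disk.items() if indexed.get(name) != mtime)
    # A database copy is stale if its file has disappeared from the folder.
    to_delete = sorted(name for name in indexed if name not in on_disk)
    return to_load, to_delete
```

A scheduled job could call this periodically, load the files in `to_load` into the table, and delete the rows named in `to_delete`. Comparing modification times is cheap but can miss edits that preserve the timestamp; hashing file contents would be a stricter alternative.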
