How to implement a web-based file-lookup database with text search

Posted 2024-08-29 08:10:10


I have a series of files like this:

foo1.txt.gz
foo2.txt.gz
bar1.txt.gz
..etc..

and a tabular-format file that describes those files:

foo1 - Explain foo1
foo2 - Explain foo2
bar1 - Explain bar1
..etc..

What I want to do is to have a website with a simple search bar that lets people type
foo1, or just foo, and returns the matching gzipped file(s) together with the related explanation of each file.

What's the best way to implement this, and what kind of tools should I use?
Sorry, I am totally new to this area.

Update:
Specifically, I want to return a list of URLs linking to the matched files, so that
people can later choose which one to download.

Comments (2)

梨涡少年 2024-09-05 08:10:10

  1. You build an HTML search form.

    • The form has a text input element

    • On submission, the form sends the value of the search string to a back-end script (for example, a Perl CGI script implemented using CGI.pm for simplicity, though these days you would use a more modern web framework such as Perl's Catalyst, or a templating framework such as EmbPerl)

  2. The back-end script searches for the matching files (a full sketch follows after this list):

    • Get the list of matching files in Perl using glob("*$search*.txt.gz"), or the File::Find module if the files are in sub-directories.

    • Open, read and parse the descriptions file into a hash mapping the file base name, e.g. "foo1", to its description.

    • Run grep over the file names to find those that match the search string (using a regular expression)

    • Print an HTML report page with a table listing the found file names and their descriptions; that page is sent back to the browser.

    • Each file name is a link (see below) for downloading the file. The easiest approach is to put the files in a directory inside the "htdocs" tree, i.e. somewhere within the directory tree where the web server looks for documents. Then you can reference them by URL. For example, if your home page is /home/webpages/main/index.html (with a URL of http://mysite.com/index.html), you can put your files at /home/webpages/main/foofiles/foo1.txt.gz and the URL would be http://mysite.com/foofiles/foo1.txt.gz.

    You must make sure that your web server sends these files with an appropriate Content-Type header (e.g. does not serve them as text/html).
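A minimal sketch of such a CGI script, assuming the files live in /home/webpages/main/foofiles, the description table is descriptions.txt in that same directory, and the form submits its text input as the q parameter (all of these names are placeholders for your own setup):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI qw(param header escapeHTML);

    my $file_dir   = '/home/webpages/main/foofiles';  # placeholder path
    my $desc_file  = "$file_dir/descriptions.txt";    # the "foo1 - Explain foo1" table
    my $url_prefix = 'http://mysite.com/foofiles';    # URL mapped to $file_dir

    my $search = param('q') // '';
    $search =~ s/[^\w.-]//g;   # crude sanitising: keep only word chars, dots, dashes

    # Parse the descriptions file into a hash: "foo1" => "Explain foo1".
    my %desc;
    open my $dh, '<', $desc_file or die "Cannot open $desc_file: $!";
    while (my $line = <$dh>) {
        chomp $line;
        my ($name, $text) = split /\s+-\s+/, $line, 2;
        $desc{$name} = $text if defined $text;
    }
    close $dh;

    # Report each matching file as a download link plus its description.
    print header('text/html'), "<html><body><table>\n";
    for my $path (glob "$file_dir/*$search*.txt.gz") {
        my ($base) = $path =~ m{([^/]+)\.txt\.gz$} or next;
        printf qq{<tr><td><a href="%s/%s.txt.gz">%s</a></td><td>%s</td></tr>\n},
            $url_prefix, $base, escapeHTML($base), escapeHTML($desc{$base} // '');
    }
    print "</table></body></html>\n";

With plain CGI this script is re-run on every request; under a framework like Catalyst the same logic would live in a controller action.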

始终不够爱げ你 2024-09-05 08:10:10


For performance reasons, what you'll likely want to do is have a periodic process build an index. There are very sophisticated ways to do this, but it's also possible to make something reasonably useful in a very simple way.

At heart, an "index" is the very same sort of thing you'd find at the end of a textbook, with that idea translated into the computer world. You'll want to scan through your table of descriptions and build a key/value "dictionary", "hash", or whatever your language's equivalent structure is called. The keys will be the words you find in the descriptions. The values will be an array (or list, or whatever your language calls it) of URLs in which that word can be found.
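A minimal sketch of such an indexer in Perl, reusing the hypothetical descriptions.txt table and URL prefix from the first answer:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $desc_file  = 'descriptions.txt';            # placeholder
    my $url_prefix = 'http://mysite.com/foofiles';  # placeholder

    # Inverted index: word => { url => 1, ... } (a hash of sets).
    my %index;
    open my $dh, '<', $desc_file or die "Cannot open $desc_file: $!";
    while (my $line = <$dh>) {
        chomp $line;
        my ($name, $text) = split /\s+-\s+/, $line, 2;
        next unless defined $text;
        my $url = "$url_prefix/$name.txt.gz";
        # Index every word of the description, plus the file's base name.
        $index{ lc $_ }{$url} = 1 for $name, $text =~ /(\w+)/g;
    }
    close $dh;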

When you process a query, you break apart the words in the query and look each one up in your dictionary. Each URL then gets a point for every query word that URL contains, and you rank your results by how many points each URL has. Alternatively, you can return only results that contain all the words, by taking the set intersection of the URL arrays you find by looking up each word.
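One way to express that scoring in Perl, given an %index hash shaped like the sketch above (the sample entries here are placeholders):

    # Give each URL one point per query word it contains, then rank by points.
    my %index = (
        explain => { 'http://mysite.com/foofiles/foo1.txt.gz' => 1 },
        foo1    => { 'http://mysite.com/foofiles/foo1.txt.gz' => 1 },
    );

    sub search_index {
        my ($index, $query) = @_;
        my %score;
        for my $word (map { lc } $query =~ /(\w+)/g) {
            $score{$_}++ for keys %{ $index->{$word} || {} };
        }
        # For AND semantics, instead keep only URLs whose score equals
        # the number of distinct query words (a set intersection).
        return sort { $score{$b} <=> $score{$a} } keys %score;
    }

    my @ranked = search_index(\%index, 'explain foo1');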

Depending on what you are trying to achieve, you can get more sophisticated about how you construct your index, such as using phonetic representations of words as keys instead of the raw words themselves. When you do a search, break the search terms into their phonetic representations as well; this way you can eliminate problems caused by common misspellings.
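For example, with the CPAN module Text::Soundex, one possible phonetic encoding (the URL below is a placeholder):

    use strict;
    use warnings;
    use Text::Soundex;   # CPAN module implementing the Soundex encoding

    my %index;
    my $url = 'http://mysite.com/foofiles/foo1.txt.gz';   # placeholder
    for my $word (qw(explain explane)) {
        # Both spellings encode to "E214", so they share one index key.
        my $key = soundex($word) // next;   # soundex() is undef for unencodable input
        $index{$key}{$url} = 1;
    }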

Alternatively, you can address common misspellings directly by making duplicate keys for each word.

Alternatively, you can also index letter triplets rather than whole words, to catch alternative forms of words with different tenses and conjugations.
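A sketch of that triplet ("trigram") split; indexing and querying then work on these keys instead of whole words:

    # Overlapping letter triplets: "explained" and "explaining"
    # share most of their trigrams, so either form matches the other.
    sub trigrams {
        my $word = lc shift;
        return map { substr $word, $_, 3 } 0 .. length($word) - 3;
    }

    # trigrams('explain') returns: exp xpl pla lai ain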

etc. etc.

You'll probably want to avoid constructing this index on every query (otherwise, what's the point?), so you'll want to be able to save it to disk and load it (or parts of it) into memory during a query. Whether you use a database or something else for this, I leave up to you.
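One simple sketch of that persistence step in Perl uses the core Storable module (the index contents and file path here are placeholders):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my %index = ( foo => { 'http://mysite.com/foofiles/foo1.txt.gz' => 1 } );

    # The periodic indexer serialises the finished index to disk once...
    store \%index, '/var/tmp/search.idx';

    # ...and the query script deserialises it per request (or caches it in memory).
    my $loaded = retrieve '/var/tmp/search.idx';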
