A good whitelist for search terms

Posted on 2024-10-01 00:33:20

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got the following regex:

preg_replace('/[^a-z0-9 -]/i', '', $s);

So, I'm removing anything that's not alphanumeric or a space or a hyphen.

Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.

Comments (3)

已下线请稍等 2024-10-08 00:33:20

What about 2010 (A space odyssey)? What about Giscard d'Estaing's autobiography? ... This is really impossible to answer in general; it will depend on your application and data structures.

You want to look into the full-text search functions of the database of your choice, or even a specialized search engine like Sphinx.

First decide which engine you will use to actually perform the search, and the rules on what you need to strip out will become much clearer.

瞄了个咪的 2024-10-08 00:33:20

Google has some pretty advanced rules for searches, but their basic rule is this:

Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.

However, Google makes exceptions for common search terms, like C++, C#, or $100.

If you want a search as sophisticated as Google's, you can write rules that strip the punctuation above and add a few exceptions. However, for a simple search, just ignore the characters that Google generally ignores.
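
A rough sketch of that "ignore punctuation, with exceptions" idea in PHP might look like the following. The function name normalizeQuery, the EXCEPTIONS list, and the dollar-amount rule are made up for illustration; this is not how Google actually does it, only one way to mimic the behaviour described above.

```php
<?php
// Hypothetical exception list: terms whose punctuation carries meaning.
const EXCEPTIONS = ['c++', 'c#'];

function normalizeQuery(string $query): string
{
    $kept = [];
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Keep listed terms and dollar amounts such as $100 exactly as typed.
        $isException = in_array(strtolower($token), EXCEPTIONS, true)
            || preg_match('/^\$\d+$/', $token) === 1;
        if (!$isException) {
            // Replace the characters @#$%^&*()=+[]\ with a space.
            $token = preg_replace('/[@#$%^&*()=+\[\]\\\\]+/', ' ', $token);
        }
        $kept[] = $token;
    }
    // Collapse the whitespace left behind by the replacements.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $kept)));
}
```

Note that a token only counts as an exception when it matches exactly, so "C++," with a trailing comma would not be recognised; a real implementation would need to decide how to handle that.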

秋意浓 2024-10-08 00:33:20

There's not a generic regular expression to solve this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of all of the titles in your database, you should be able to write a script that will construct a list of all characters found in all of your titles. If your regular expression strips out any of those characters, then you risk having problems (although passing this test doesn't mean that you won't run into problems).
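
The character-inventory script could be sketched roughly like this. It assumes the titles have already been loaded into a PHP array of UTF-8 strings (how you fetch them from your database is up to you), and strippedCharacters is a made-up name. It reports every distinct character that the regex from the question would remove.

```php
<?php
// Returns the distinct characters in $titles that the question's regex strips.
// Assumes every title is valid UTF-8.
function strippedCharacters(array $titles): array
{
    $seen = [];
    foreach ($titles as $title) {
        // '//u' splits the string into individual UTF-8 characters.
        foreach (preg_split('//u', $title, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $seen[$char] = true;
        }
    }

    $stripped = [];
    foreach (array_keys($seen) as $char) {
        $char = (string) $char; // digit keys come back as integers
        if (preg_replace('/[^a-z0-9 -]/i', '', $char) === '') {
            $stripped[] = $char;
        }
    }
    sort($stripped, SORT_STRING);
    return $stripped;
}
```

Running it over your full title list gives you a concrete set of characters to decide on, rather than guessing at a whitelist up front.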

Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant search results. In this case, you would want your expression to allow non-English characters (since you don't want to split a word) but you might be able to remove all punctuation marks that aren't inside of a quote-delimited phrase. For example, searching for red haired should give you all of the results you would get from searching for red-haired plus a few extra.
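
As an illustration of that approach, here is a minimal sketch that keeps letters and digits from any script and drops punctuation only outside double-quoted phrases. The function name sanitizeQuery is hypothetical, and it assumes UTF-8 input and phrases delimited by plain " characters.

```php
<?php
function sanitizeQuery(string $query): string
{
    // Split into quoted phrases and the text between them, keeping the phrases.
    $parts = preg_split(
        '/("[^"]*")/u',
        $query,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $out = [];
    foreach ($parts as $part) {
        if (preg_match('/^"[^"]*"$/u', $part) === 1) {
            $out[] = $part; // quoted phrase: leave untouched
        } else {
            // Outside quotes: keep Unicode letters, marks, digits and whitespace.
            $out[] = preg_replace('/[^\p{L}\p{M}\p{N}\s]+/u', ' ', $part);
        }
    }
    return trim(preg_replace('/\s+/u', ' ', implode(' ', $out)));
}
```

With this, an unquoted red-haired becomes the two words red haired, while "red-haired" in quotes is passed through as typed. Whether the quoted form is then treated as an exact phrase depends on the search engine behind it.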
