Pattern matching for websites

Posted on 2024-11-25 18:18:34

I maintain a global repository of sites in a table.

website:
id, name,  url 
1   google http://www.google.com/
2   CNN    http://www.cnn.com/
3   SO     http://www.stackoverflow.com/

I also maintain a reference table, which stores the IDs of the websites each user has saved.

userwebsite
userid, websiteid
[attributes of the table]

Say a user wants to save Microsoft in his collection, and he enters

www.microsoft.com

As the website doesn't exist in the global repository, it is first added to the repository and then to his collection. The contents of both tables now look something like this:

website:
id, name,  url 
1   google http://www.google.com/
2   CNN    http://www.cnn.com/
3   SO     http://www.stackoverflow.com/
4   msft   http://www.microsoft.com

userwebsite:
userid, websiteid
1       4

Now say a user wants to save Google in his collection, and he enters

www.google.com

As the website already exists in the global repository, it is not added again; only a reference to it is added to the user's collection.

Here is where I am stuck:

www.google.com and http://www.google.com/

Semantically, both point to the same site, but when you try to match them they are two distinct strings. How should I go about matching the strings in such cases?

One solution I can think of: when a site is entered, first check whether its domain exists in the collection of websites (a PATINDEX would probably do well here). That gives you a list of sites that have the same domain name. Then check whether the path exists in any of the resulting websites. Is this a good idea?

Is there an established solution to this problem? Are there better ways to go about it?
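
The domain-then-path check described above can be sketched roughly as follows. This is only an illustration in Python rather than T-SQL; the helper names `split_site` and `find_site` are hypothetical, and treating input without a scheme as `http://` is an assumption, not something the question specifies.

```python
from urllib.parse import urlsplit


def split_site(raw):
    """Split a user-entered address into (host, path)."""
    text = raw.strip()
    # urlsplit only recognises the host when a scheme is present, so
    # bare input such as "www.google.com" is assumed to mean http://.
    if "://" not in text:
        text = "http://" + text
    parts = urlsplit(text)
    host = parts.hostname or ""          # hostname is returned lower-cased
    path = parts.path.rstrip("/") or "/"  # "" and "/" both become "/"
    return host, path


def find_site(raw, rows):
    """Return the id of the stored site matching raw, or None."""
    host, path = split_site(raw)
    # Step 1: keep only the stored sites that share the domain.
    same_host = [row for row in rows if split_site(row[2])[0] == host]
    # Step 2: among those, look for the same path.
    for site_id, _name, url in same_host:
        if split_site(url)[1] == path:
            return site_id
    return None


# Sample rows copied from the website table in the question.
rows = [
    (1, "google", "http://www.google.com/"),
    (2, "CNN", "http://www.cnn.com/"),
    (3, "SO", "http://www.stackoverflow.com/"),
]
```

With these rows, `find_site("www.google.com", rows)` finds the existing row with id 1, while `find_site("www.microsoft.com", rows)` returns `None`, which would be the signal to insert a new row.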

Comments (1)

陪我终i 2024-12-02 18:18:34

You don't need pattern matching in this case. What you are really asking for (to continue from what Matteo commented) is a way of validating web addresses and storing them in a consistent form. If you want a regular expression that at least determines whether an address is valid, have a look here: http://www.shauninman.com/archive/2006/05/08/validating_domain_names

Or use JavaScript to validate it, although you don't say what language you are using outside of SQL Server.

You would almost need to send the domain name to a Domain Name Server to resolve it before storing it in your table. It may be better to ignore the fact that they are web addresses and just think of them as strings. For example, how would you ensure that people's names are compared correctly in a database? The first step is usually to enforce a consistent case (upper or lower); from then on it becomes more difficult, for example handling middle names or initials that may be omitted.
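
As a rough illustration of "storing them in a consistent way", the sketch below reduces every address to one canonical string before it is compared or saved. It uses Python with an in-memory SQLite database purely so the example is self-contained; the question itself targets SQL Server. The function names `canonical_url` and `save_site`, the default `http://` scheme, and the choice to drop port, query string, and fragment are all assumptions made for the example.

```python
import sqlite3
from urllib.parse import urlsplit


def canonical_url(raw):
    """Reduce a user-entered address to one consistent string form."""
    text = raw.strip()
    if "://" not in text:
        text = "http://" + text      # assumption: bare input means http
    parts = urlsplit(text)
    host = parts.hostname or ""      # lower-cased by urlsplit
    path = parts.path or "/"         # an empty path becomes "/"
    # Assumption: port, query string and fragment are ignored here.
    return f"{parts.scheme}://{host}{path}"


def save_site(conn, user_id, name, raw):
    """Reuse the stored site if present, else insert it; link it to the user."""
    url = canonical_url(raw)
    row = conn.execute("SELECT id FROM website WHERE url = ?", (url,)).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO website (name, url) VALUES (?, ?)", (name, url)
        )
        website_id = cur.lastrowid
    else:
        website_id = row[0]
    conn.execute(
        "INSERT OR IGNORE INTO userwebsite (userid, websiteid) VALUES (?, ?)",
        (user_id, website_id),
    )
    return website_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE website (id INTEGER PRIMARY KEY, name TEXT, url TEXT UNIQUE)"
)
conn.execute(
    "CREATE TABLE userwebsite (userid INTEGER, websiteid INTEGER, "
    "PRIMARY KEY (userid, websiteid))"
)
# Seed the repository with the rows from the question, already canonicalised.
for name, url in [
    ("google", "http://www.google.com/"),
    ("CNN", "http://www.cnn.com/"),
    ("SO", "http://www.stackoverflow.com/"),
]:
    conn.execute(
        "INSERT INTO website (name, url) VALUES (?, ?)", (name, canonical_url(url))
    )

msft_id = save_site(conn, 1, "msft", "www.microsoft.com")   # new row
google_id = save_site(conn, 1, "google", "www.google.com")  # existing row reused
```

Because every value passes through the same function before it reaches the table, the lookup becomes a plain equality test. Note that this sketch still treats `http://` and `https://` addresses, and hosts with and without `www.`, as different sites; whether those should be merged is a policy decision the original discussion leaves open.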
