URL shortening algorithm
Now, this is not strictly about URL shortening, but that is my purpose anyway, so let's view it like that. Of course, the steps to URL shortening are:
- Take the full URL
- Generate a unique short string to be the key for the URL
- Store the URL and the key in a database (a key-value store would be a perfect match here)
Now, about the second point. Here's what I've come up with:
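// Write the 64 most significant bits of a random UUID into a byte buffer
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
UUID uuid = UUID.randomUUID();
dos.writeLong(uuid.getMostSignificantBits());
// Base64-encode the 8 bytes (Base64 here looks like the Apache Commons Codec class)
String encoded = new String(Base64.encodeBase64(baos.toByteArray()), "ISO-8859-1");
// Keep the leftmost 6 characters (StringUtils here looks like the Apache Commons Lang class)
String shortUrlKey = StringUtils.left(encoded, 6);
// check if exists in database, repeat until it does not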
Is this good enough?
For a file upload application I wrote, I needed this functionality too. Having read this SO article, I decided to stick with just some random numbers and check whether they exist in the DB.
So your approach is similar to what I did.
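To make the "random key, then check the DB" idea concrete, here is a minimal sketch. It is not the code from either post: the class name RandomKeyStore, the 62-character alphabet, and the in-memory map standing in for the database are all assumptions made for illustration.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: draws random keys and retries on collision.
final class RandomKeyStore {
    // Assumed alphabet: letters and digits only, so every key is URL-safe.
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final int KEY_LENGTH = 6;

    private final SecureRandom random = new SecureRandom();
    // Stand-in for the real key-value store.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    /** Draws a random key of KEY_LENGTH characters from ALPHABET. */
    String randomKey() {
        StringBuilder sb = new StringBuilder(KEY_LENGTH);
        for (int i = 0; i < KEY_LENGTH; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /** Stores the URL under a fresh key, retrying while the key is taken. */
    String shorten(String url) {
        while (true) {
            String key = randomKey();
            // putIfAbsent is atomic: it returns null only if the key was free.
            if (store.putIfAbsent(key, url) == null) {
                return key;
            }
        }
    }

    /** Returns the stored URL, or null if the key is unknown. */
    String resolve(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        RandomKeyStore s = new RandomKeyStore();
        String key = s.shorten("http://example.com/some/long/path");
        System.out.println(key + " -> " + s.resolve(key));
    }
}
```

With a real database, the check and the insert would need to be one atomic step as well (for example a unique constraint on the key column), otherwise two requests could claim the same key at once.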
Well, what do you mean by URL shortening?
There are very different techniques. Most websites, AFAIK, just put the database primary key (maybe in some encoded form) in the URL at some position where it can be parsed by a regular expression, and enhance the rest with keywords.
Example from Amazon:
http://www.amazon.de/Bauknecht-WA-PLUS-614-Waschmaschine/dp/B003V1JDU8/
You can enter anything in place of the product name; only the id at the end is important.
However, you may want to keep your links clean: check whether the URL is correct and, if a wrong URL turns up, do a 301 redirect to the real URL or specify a canonical URL.
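As a rough sketch of this "id in the URL, keywords around it" scheme: the code below pulls the id out of a path with a regular expression and decides whether a redirect to the canonical path is needed. The path layout /<slug>/dp/<id>/ is only modeled on the example URL above and is not a claim about how Amazon actually routes requests; SlugUrlParser and both method names are hypothetical, and the canonical slug is passed in where a real application would look it up by id.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for paths of the assumed form /<any-slug>/dp/<id>/
final class SlugUrlParser {
    // Group 1 is the free-form slug, group 2 the id; only the id is significant.
    private static final Pattern PATH =
            Pattern.compile("^/([^/]*)/dp/([A-Za-z0-9]+)/?$");

    /** Returns the id embedded in the path, or empty if the path does not match. */
    static Optional<String> extractId(String path) {
        Matcher m = PATH.matcher(path);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    /**
     * Returns the canonical path to redirect to (e.g. with a 301), or empty
     * if the path is already canonical or cannot be parsed at all.
     */
    static Optional<String> redirectTarget(String path, String canonicalSlug) {
        Optional<String> id = extractId(path);
        if (!id.isPresent()) {
            return Optional.empty();
        }
        String canonical = "/" + canonicalSlug + "/dp/" + id.get() + "/";
        return canonical.equals(path) ? Optional.empty() : Optional.of(canonical);
    }

    public static void main(String[] args) {
        String slug = "Bauknecht-WA-PLUS-614-Waschmaschine";
        System.out.println(extractId("/anything-at-all/dp/B003V1JDU8/"));
        System.out.println(redirectTarget("/anything-at-all/dp/B003V1JDU8/", slug));
    }
}
```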
However:
If you want to do something like TinyURL, my answer is a definite no.
It's not good enough.
Well, it depends.
It's not "secure": it would be pretty easy to guess URLs. A better approach would be to use some cryptographic function like SHA-1/MD5.
When it comes to collisions, I can't really tell. GUIDs were designed to have no collisions, but you are only using the first 6 characters, and I don't know exactly what they represent in the algorithm. It's definitely not optimal.
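For a rough feel of the collision question: 6 Base64 characters carry 36 bits, and in a UUID from UUID.randomUUID() the first 36 bits of the most significant long come before the version field, so they are random. Treating the key as uniform over 64^6 values is therefore a reasonable assumption, and the usual birthday approximation gives an estimate. The class and method names below are made up for this sketch.

```java
// Hypothetical helper: birthday-bound estimate for random fixed-length keys.
final class CollisionEstimate {
    /**
     * Approximate probability that at least two of n uniformly random keys
     * collide in a space of keySpace values: 1 - exp(-n(n-1) / (2 * keySpace)).
     */
    static double collisionProbability(long n, double keySpace) {
        // expm1 keeps precision when the probability is very small.
        return -Math.expm1(-((double) n * (n - 1)) / (2.0 * keySpace));
    }

    public static void main(String[] args) {
        double keySpace = Math.pow(64, 6); // 6 Base64 characters = 2^36 keys
        for (long n : new long[] {1_000L, 100_000L, 1_000_000L}) {
            System.out.printf("%,d keys: %.4f%n", n, collisionProbability(n, keySpace));
        }
    }
}
```

The estimate says how soon the first collision shows up, which is why the retry loop in the question is needed; it does not by itself say the scheme stops working.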
Why don't you just use the database's auto-incrementing primary key, however? If security is important, you also definitely have to go with more than 6 characters.
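One common way to turn an auto-incrementing key into a short string is to write it in base 62. This is an illustration, not something the post specifies: the alphabet order and the class name Base62Keys are assumptions. Keys produced this way are sequential and therefore easy to enumerate, which is the security caveat mentioned above.

```java
// Hypothetical base-62 codec for non-negative database ids.
final class Base62Keys {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int BASE = ALPHABET.length(); // 62

    /** Encodes a non-negative id as a base-62 string, most significant digit first. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        return sb.reverse().toString();
    }

    /** Decodes a base-62 string back to the id; overflow on very long keys is not handled. */
    static long decode(String key) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("key must not be empty");
        }
        long id = 0;
        for (int i = 0; i < key.length(); i++) {
            int digit = ALPHABET.indexOf(key.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + key.charAt(i));
            }
            id = id * BASE + digit;
        }
        return id;
    }

    public static void main(String[] args) {
        long id = 123456789L;
        String key = encode(id);
        System.out.println(id + " -> " + key + " -> " + decode(key));
    }
}
```

Since the key is derived from the primary key, no existence check is needed, and a lookup is a decode followed by a primary-key query.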
On a project I did, I used something like:
/database-primary-key/hash-of-primary-key-with-some-token-or-client-information/
This way I could look up the primary key directly in the database, which was the fastest possible way, but could also use the hash to verify that the link had not been found by brute force. In my case the hash was the SHA-1 sum of the client's secret token and the primary key.
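A minimal sketch of such a link, under assumptions the post does not spell out: the token and key are joined as "<token>:<id>", the digest is hex-encoded, and the names SignedLink, buildPath and verify are invented for illustration. A keyed construction such as HMAC (javax.crypto.Mac with "HmacSHA1") would be a common alternative to hashing the concatenation directly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.OptionalLong;

// Hypothetical helper for links of the form /<primary-key>/<hash>/
final class SignedLink {
    /** Hex-encoded SHA-1 of "<token>:<id>"; separator and encoding are choices of this sketch. */
    static String hash(String secretToken, long id) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                    (secretToken + ":" + id).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    static String buildPath(String secretToken, long id) {
        return "/" + id + "/" + hash(secretToken, id) + "/";
    }

    /** Returns the id if the path is well-formed and its hash matches, otherwise empty. */
    static OptionalLong verify(String secretToken, String path) {
        String[] parts = path.split("/"); // "/42/abc/" -> ["", "42", "abc"]
        if (parts.length != 3 || !parts[0].isEmpty()) {
            return OptionalLong.empty();
        }
        long id;
        try {
            id = Long.parseLong(parts[1]);
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
        byte[] expected = hash(secretToken, id).getBytes(StandardCharsets.US_ASCII);
        byte[] actual = parts[2].getBytes(StandardCharsets.US_ASCII);
        // MessageDigest.isEqual compares the two hashes byte by byte.
        return MessageDigest.isEqual(expected, actual)
                ? OptionalLong.of(id) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String path = buildPath("client-secret", 42L);
        System.out.println(path + " -> " + verify("client-secret", path));
    }
}
```

The lookup still goes straight to the primary key; the hash only has to be recomputed and compared, so a guessed id without the matching hash is rejected before the row is served.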