Split a domain name into its constituent words (if possible)?
I want to break a domain name into its constituent words and numbers, e.g.
iamadomain11.com = ['i', 'am', 'a', 'domain', '11']
How do I do this? I am aware that several segmentations may be possible; for now, I am fine with getting just one of them.
Comments (3)
This is actually solved in the O'Reilly Media book Beautiful Data. In chapter 14, "Natural Language Corpus Data", Peter Norvig builds a splitter in Python that does exactly what you want, using a very large, freely available token-frequency data set.
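To give a rough idea of what a frequency-based splitter looks like, here is a minimal sketch. It is not the code from the book: the word counts are made up for illustration, the unknown-word penalty is an arbitrary assumption, and the names `COUNTS`, `word_logprob` and `segment` are hypothetical. It also assumes the TLD (".com") has already been stripped.

```python
import math
from functools import lru_cache

# Toy counts invented for this example; a real splitter would load a
# large corpus-derived frequency table instead.
COUNTS = {"i": 500, "am": 300, "a": 800, "domain": 50, "do": 200,
          "main": 60, "in": 400, "ad": 20, "11": 5}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a single token under a unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty for unknown tokens: gets harsher as the token grows,
    # so long unknown strings are not preferred over known words.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        score = word_logprob(head) + tail_score
        if best is None or score > best[0]:
            best = (score, (head,) + tail_words)
    return best


score, words = segment("iamadomain11")
print(list(words))
```

The memoisation (`lru_cache`) keeps the search from re-scoring the same suffix many times. The quality of the result depends almost entirely on the frequency table, so the toy counts above should not be read as representative.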
This is a fun problem! First you need a dictionary. For performance, store it in a hash set (Python's built-in set, or a dict, will do). You can then iterate over every possible substring ("i", "ia", "iam", ..., "n11", "1", "11", "1") and check each one against the dictionary. After that, it is a matter of searching through the matches until you have a contiguous sequence that covers the whole string with no overlaps.
This is a quick and dirty approach. There are probably faster ways to do it.
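A minimal sketch of that idea, under a few assumptions: the dictionary is a tiny hand-made set, runs of digits are accepted as tokens even though they are not in the dictionary, and the TLD has already been removed. The names `WORDS`, `find_matches` and `first_cover` are made up for this example.

```python
# Toy dictionary for illustration only; use a real word list in practice.
WORDS = {"i", "am", "a", "domain", "do", "main", "in", "ad"}


def find_matches(text, words):
    """Return (start, end) spans of every substring that is a known word
    or consists only of digits (assumption: digit runs count as tokens)."""
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words or piece.isdigit():
                matches.append((start, end))
    return matches


def first_cover(text, words):
    """Return one list of tokens covering text end to end, or None."""
    by_start = {}
    for start, end in find_matches(text, words):
        by_start.setdefault(start, []).append(end)

    dead_ends = set()  # positions from which no full cover exists

    def walk(pos):
        if pos == len(text):
            return []
        if pos in dead_ends:
            return None
        # Try longer matches first, back off to shorter ones on failure.
        for end in sorted(by_start.get(pos, []), reverse=True):
            rest = walk(end)
            if rest is not None:
                return [text[pos:end]] + rest
        dead_ends.add(pos)
        return None

    return walk(0)


print(first_cover("iamadomain11", WORDS))
```

Because the walk backtracks, it recovers from a bad early choice such as taking "ad" at position 3. It returns None when the dictionary cannot cover the whole string.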
This sounds similar to the problem of tokenising Chinese, where there are no spaces in between words. This paragraph is taken from 'Introduction to Information Retrieval' by Manning, Raghavan & Schütze, available online here:
I would suggest greedy dictionary matching as a first step, then adding heuristics to handle the most common failure cases.
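As a rough illustration of greedy (longest-match-first) dictionary matching, here is a small sketch. The word sets are toy examples, and two choices are my own assumptions rather than anything from the book: digit runs are kept together as one token, and an unmatched character is emitted on its own. The name `greedy_split` is hypothetical, and the input is assumed to have its TLD stripped already.

```python
import re

DIGITS = re.compile(r"\d+")


def greedy_split(text, words):
    """Split text by always taking the longest dictionary word available."""
    max_len = max(len(w) for w in words)
    tokens = []
    pos = 0
    while pos < len(text):
        # Assumed heuristic: keep a run of digits together as one token.
        digits = DIGITS.match(text, pos)
        if digits:
            tokens.append(digits.group())
            pos = digits.end()
            continue
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in words:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # Nothing matched: emit a single character and move on.
            tokens.append(text[pos])
            pos += 1
    return tokens


small = {"i", "am", "a", "domain", "do", "main", "in"}
print(greedy_split("iamadomain11", small))
# Adding "ad" makes the greedy choice at "adomain" go wrong,
# which is the kind of failure the heuristics would need to handle.
print(greedy_split("iamadomain11", small | {"ad"}))
```

With the second dictionary the greedy pass commits to "ad" and is left with a stray "o", since it never backtracks. That is the typical failure mode to target with extra heuristics.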