Breaking a string into individual words in Python

Posted 2024-11-27 12:55:59


I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.

The problem I have is the list is formatted as domain names, for example:

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

+5996

Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between the whole words and then run a word count.

For my sanity I would prefer to script this.

I know (very) little Python 2.7, but I am open to any recommendations on approaching this; code examples would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement one in Python.
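Since the question asks about a trie, here is a minimal dict-of-dicts sketch (purely illustrative; the names _END, trie_insert and trie_prefixes are made up for this example, not a library API):

_END = object()  # sentinel key marking the end of a complete word

def trie_insert(trie, word):
    # Walk/create one nested dict per character.
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = True

def trie_prefixes(trie, s):
    # Yield every prefix of s that is a complete word in the trie.
    node = trie
    for i, ch in enumerate(s):
        if ch not in node:
            return
        node = node[ch]
        if _END in node:
            yield s[:i + 1]

trie = {}
for w in ("car", "cart", "trading"):
    trie_insert(trie, w)

print list(trie_prefixes("cartrading"))  # ['car', 'cart']

A splitter could call trie_prefixes repeatedly, recursing on the remainder after each matched prefix, much as the first answer below does with a plain set.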


Comments (3)

烟沫凡尘 2024-12-04 12:55:59


We try to split the domain name s into any number of words (not just two) drawn from a set of known words, words. Recursion ftw!

def substrings_in_set(s, words):
    # Yield every way to segment s entirely into words from the set.
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

This iterator function first yields the string it was called with, if that string is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (of which there may be none, as in ["example", "cart", ...]).
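For a quick feel of what the generator yields, here is a toy run (a made-up word set, not the full dictionary):

words = set(["example", "car", "cart", "trading", "rading"])
for split in substrings_in_set("examplecartrading", words):
    print split
# ['example', 'car', 'trading']
# ['example', 'cart', 'rading']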

Then we build the English dictionary:

# Assuming Linux. Word list may also be at /usr/dict/words. 
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast")) 

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Using a set to hold the English dictionary makes for fast membership checks. -= removes items from the set, |= adds them.

As a side note, using the all function together with a generator expression can improve efficiency, since all returns as soon as it encounters the first False.

Some substrings may be a valid word either as a whole or when split, such as "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the code example above. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens we need to prevent the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then counting the words once all have been found.
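To see that deduplication at work with a toy dictionary (an illustration of mine, not from the original answer):

# Both segmentations of "examplepensions" are found...
splits = list(substrings_in_set("examplepensions",
                                set(["example", "pensions", "pens", "ions"])))
# splits == [['example', 'pensions'], ['example', 'pens', 'ions']]

# ...but collecting them into one set counts "example" only once.
found = set()
for split in splits:
    found |= set(split)
print sorted(found)  # ['example', 'ions', 'pens', 'pensions']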

EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.

NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
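If you need to push scaling further still, one option (a sketch, not part of the original answer) is to memoize the segmentations of each suffix, so a suffix shared by many segmentations is only solved once:

def substrings_in_set_memo(s, words, cache=None):
    # Sketch: same results as substrings_in_set, but returns a list and
    # caches the segmentations of every suffix it has already solved.
    if cache is None:
        cache = {}
    if s in cache:
        return cache[s]
    results = []
    if s in words:
        results.append([s])
    for i in range(1, len(s)):
        head = s[:i]
        if head in words:
            for rest in substrings_in_set_memo(s[i:], words, cache):
                results.append([head] + rest)
    cache[s] = results
    return results

Note that the number of distinct segmentations can itself grow quickly, so if you only need the set of words found, caching that set per suffix is cheaper than caching whole segmentations.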

水中月 2024-12-04 12:55:59


Assuming you only have a few thousand standard domains, you should be able to do this all in memory.

from collections import Counter

def all_sub_strings(s):
    # Every contiguous substring of s (the original left this undefined).
    return (s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

# DictionaryFileOfEnglishLanguage and domainfile are placeholders for
# your own open word-list file and domain-list path.
dictionary = set(w.strip() for w in DictionaryFileOfEnglishLanguage)
found = []
for domain in open(domainfile):
    for substring in all_sub_strings(domain.strip()):
        if substring in dictionary:
            found.append(substring)

c = Counter(found)  # this is what you want
print c
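Counter also has a most_common method, which directly answers the "which words trend the highest" part of the question:

# Print the 20 most frequent words across the portfolio.
for word, n in c.most_common(20):
    print '%s: %d' % (word, n)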
夏了南城 2024-12-04 12:55:59
with open('/usr/share/dict/words') as f:
  words = set(w.strip() for w in f)  # a set makes the 'in' checks fast

def guess_split(word):
  # Try every split point; keep the last one where both halves are words.
  result = []
  for n in xrange(len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f:
    for word in line.strip().split('.'):
      if len(word) > 3:
        # junks the com, org, co, uk parts
        for x in guess_split(word):
          word_counts[x] += 1

for w, count in word_counts.items():
  print '{word}: {count}'.format(word=w, count=count)


Here's a brute-force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits (see the sketch after the output below), but it will probably not scale well with the number of splits unless you are clever. Fortunately, I guess you'll only need 3 or 4 splits max.

output:

deals: 1
example: 2
pensions: 1
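One hedged sketch of that extension (not the answer author's code; it reuses the words set from above) recurses on the remainder with a cap on the number of parts:

def guess_split_multi(word, max_parts=4):
  # Like guess_split, but allows up to max_parts words by recursing on
  # the remainder. Returns the first segmentation found, or [].
  if word in words:
    return [word]
  if max_parts < 2:
    return []
  for n in xrange(1, len(word)):
    if word[:n] in words:
      rest = guess_split_multi(word[n:], max_parts - 1)
      if rest:
        return [word[:n]] + rest
  return []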