如何检测字符串内相同的部分？

发布于 2024-08-29 22:05:51 字数 690 浏览 10 评论 0原文

我尝试将解码算法需求问题分解为更小的问题。这是第一部分。

问题：

两个字符串：s1 和 s2
s1 的部分与 s2 的部分相同
空格是分隔符
如何提取相同的部分？

示例 1：

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"

the identical parts are "2010", "-", "1" and "visitor"

示例 2：

s1 = "Welcome, John!"
s2 = "Welcome, Peter!"

the identical parts are "Welcome," and "!"

示例 3：（澄清“！”示例）

s1 = "Welcome, Sam!"
s2 = "Welcome, Tom!"

the identical parts are "Welcome," and "m!"

首选 Python 和 Ruby。谢谢

原文

I try to break down the decoding algorithm wanted question into smaller questions. This is Part I.

Question:

two strings: s1 and s2
part of s1 is identical to part of s2
space is separator
how to extract the identical part(s)?

example 1:

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"

the identical parts are "2010", "-", "1" and "visitor"

example 2:

s1 = "Welcome, John!"
s2 = "Welcome, Peter!"

the identical parts are "Welcome," and "!"

example 3: (to clarify the "!" example)

s1 = "Welcome, Sam!"
s2 = "Welcome, Tom!"

the identical parts are "Welcome," and "m!"

Python and Ruby preferred. Thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

a√萤火虫的光℡ 2024-09-05 22:05:51

编辑：更新了此示例以与所有示例一起使用，包括#1：

def scan(s1, s2):
    # Find the longest match where s1 starts with s2
    # Returns None if no matches
    l = len(s1)
    while 1:
        if not l:
            return None
        elif s1[:l] == s2[:l]:
            return s1[:l]
        else:
            l -= 1

def contains(s1, s2):
    D = {} # Remove duplicates using a dict
    L1 = s1.split(' ')
    L2 = s2.split(' ')

    # Don't add results which have already 
    # been processed to satisfy example #1!
    DProcessed = {}

    for x in L1:
        yy = 0
        for y in L2:
            if yy in DProcessed:
                yy += 1
                continue

            # Scan from the start to the end of the words
            a = scan(x, y)
            if a: 
                DProcessed[yy] = None
                D[a] = None
                break

            # Scan from the end to the start of the words
            a = scan(x[::-1], y[::-1])
            if a: 
                DProcessed[yy] = None
                D[a[::-1]] = None
                break
            yy += 1

    return list(D.keys())

print contains("12 November 2010 - 1 visitor",
               "6 July 2010 - 100 visitors")
print contains("Welcome, John!",
               "Welcome, Peter!")
print contains("Welcome, Sam!",
               "Welcome, Tom!")

哪些输出：

['1', 'visitor', '-', '2010']
['Welcome,', '!']
['Welcome,', 'm!']

EDIT: Updated this example to work with all the examples, including #1:

def scan(s1, s2):
    # Find the longest match where s1 starts with s2
    # Returns None if no matches
    l = len(s1)
    while 1:
        if not l:
            return None
        elif s1[:l] == s2[:l]:
            return s1[:l]
        else:
            l -= 1

def contains(s1, s2):
    D = {} # Remove duplicates using a dict
    L1 = s1.split(' ')
    L2 = s2.split(' ')

    # Don't add results which have already 
    # been processed to satisfy example #1!
    DProcessed = {}

    for x in L1:
        yy = 0
        for y in L2:
            if yy in DProcessed:
                yy += 1
                continue

            # Scan from the start to the end of the words
            a = scan(x, y)
            if a: 
                DProcessed[yy] = None
                D[a] = None
                break

            # Scan from the end to the start of the words
            a = scan(x[::-1], y[::-1])
            if a: 
                DProcessed[yy] = None
                D[a[::-1]] = None
                break
            yy += 1

    return list(D.keys())

print contains("12 November 2010 - 1 visitor",
               "6 July 2010 - 100 visitors")
print contains("Welcome, John!",
               "Welcome, Peter!")
print contains("Welcome, Sam!",
               "Welcome, Tom!")

Which outputs:

['1', 'visitor', '-', '2010']
['Welcome,', '!']
['Welcome,', 'm!']

回复收藏 0 原文

潦草背影 2024-09-05 22:05:51

例如 1

>>> s1 = 'November 2010 - 1 visitor'
>>> s2 = '6 July 2010 - 100 visitors'
>>> 
>>> [i for i in s1.split() if any(j for j in s2.split() if i in j)]
['2010', '-', '1', 'visitor']
>>>

对于两者

>>> s1 = "Welcome, John!"
>>> s2 = "Welcome, Peter!"
>>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)]
['Welcome,', '!']
>>>

注意：以上代码对于示例 3 不起作用，这是刚刚添加的 OP

For example 1

>>> s1 = 'November 2010 - 1 visitor'
>>> s2 = '6 July 2010 - 100 visitors'
>>> 
>>> [i for i in s1.split() if any(j for j in s2.split() if i in j)]
['2010', '-', '1', 'visitor']
>>>

For both

>>> s1 = "Welcome, John!"
>>> s2 = "Welcome, Peter!"
>>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)]
['Welcome,', '!']
>>>

Note: Above codes will not work for example 3, which is OP just added

回复收藏 0 原文

書生途 2024-09-05 22:05:51

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"
l1 = s1.split()
for item in l1:
   if item in s2:
      print item

这会在空白处分裂。

同样在单词边界上拆分的解决方案（为了捕获示例 2 中的 !）在 Python 中不起作用，因为 re.split() 不会在零长度匹配。

第三个例子，甚至使单词的任何子字符串都成为潜在的匹配，由于存在许多可能的变化，所以事情变得更加复杂（对于 1234，我必须检查 1234、123、234、12、23、34、 1、2、3 和 4，每增加一个数字，排列的数量就会呈指数级增加。

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"
l1 = s1.split()
for item in l1:
   if item in s2:
      print item

This splits on whitespace.

A solution that also splits on word boundaries (in order to catch the ! in example 2) doesn't work in Python because re.split() won't split on zero-length matches.

The third example, making even any substring of the words a potential match, is making things a lot more complicated because of the many possible variations (for 1234, I'd have to check against 1234, 123, 234, 12, 23, 34, 1, 2, 3 and 4, and with each extra digit, the number of permutations increases exponentially.

回复收藏 0 原文

红尘作伴 2024-09-05 22:05:51

完整的 Ruby 解决方案：

def start_similar(i, j)
    front = ''
    for ix in (0...([i.size, j.size].min))
      if i[ix] == j[ix] then
        front += i[ix].chr
      else
        break
      end
    end
    return front
end

def back_similar(i, j)
    back = ''
    for ix in (0...([i.size, j.size].min)).to_a.reverse
      dif = i.size < j.size ? j.size - i.size : i.size - j.size
      ci = i[ i.size < j.size ? ix : ix + dif ]
      cj = j[ i.size > j.size ? ix : ix + dif ]
      if ci == cj then
        back = ci.chr + back
      else
        break
      end
    end
    return back
end

def scan(x, y)
    a, b = x.split(' '), y.split(' ')
    result = []
    for i in a do
      for j in b do
        result << start_similar(i, j)
        result << back_similar(i, j)
      end
    end
    return result.uniq.select do |r| not r.empty? end
end

puts scan(
    "12 November 2010 - 1 visitor",
    "6 July 2010 - 100 visitors"
).inspect
# ["1", "2010", "0", "-", "visitor"]

puts scan(
    "Welcome, John!",
    "Welcome, Peter!"
).inspect
# ["Welcome,", "!"]

puts scan(
    "Welcome, Sam!",
    "Welcome, Tom!"
).inspect
# ["Welcome,", "m!"]

The complete Ruby solution:

def start_similar(i, j)
    front = ''
    for ix in (0...([i.size, j.size].min))
      if i[ix] == j[ix] then
        front += i[ix].chr
      else
        break
      end
    end
    return front
end

def back_similar(i, j)
    back = ''
    for ix in (0...([i.size, j.size].min)).to_a.reverse
      dif = i.size < j.size ? j.size - i.size : i.size - j.size
      ci = i[ i.size < j.size ? ix : ix + dif ]
      cj = j[ i.size > j.size ? ix : ix + dif ]
      if ci == cj then
        back = ci.chr + back
      else
        break
      end
    end
    return back
end

def scan(x, y)
    a, b = x.split(' '), y.split(' ')
    result = []
    for i in a do
      for j in b do
        result << start_similar(i, j)
        result << back_similar(i, j)
      end
    end
    return result.uniq.select do |r| not r.empty? end
end

puts scan(
    "12 November 2010 - 1 visitor",
    "6 July 2010 - 100 visitors"
).inspect
# ["1", "2010", "0", "-", "visitor"]

puts scan(
    "Welcome, John!",
    "Welcome, Peter!"
).inspect
# ["Welcome,", "!"]

puts scan(
    "Welcome, Sam!",
    "Welcome, Tom!"
).inspect
# ["Welcome,", "m!"]

回复收藏 0 原文

~没有更多了~