有效地建立 string2 包含 string1 中的所有字母(字符顺序无关)

发布于 2024-08-09 05:27:08 字数 366 浏览 5 评论 0原文

我正在尝试提出一种算法来将第一个字符串与第二个字符串不对称地匹配。它将注册包含 word1 中所有字母的任何 word2 的匹配项。
例如,rent 将匹配 tern,因为 tern 包含字母 r、e、n、t。

这种比较将在两组数千个单词上进行数百次。这只是整个代码的一小部分,所以我不希望它让一切陷入困境。

对于那些问“是”的人来说,过度匹配非常重要,例如 rent 也会匹配 ternicate
对于像 rent ⊂ ternicate 这样的匹配,ternicate匹配 rent

I am trying to come up with an algorithm to asymmetrically match a first string to a second one. It would register a match for any word2 that contains all the letters from word1.
For example rent would match tern because tern contains the letters r,e,n,t.

This comparison is going to be made on two sets of a few thousands words - hundreds of times. This is only a small part of the overall code so I don't want it to bog everything down.

For those who were asking yes overmatching would be very important for example rent would also match ternicate.
For a match like rent ⊂ ternicate, ternicate would not match rent.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(15

满身野味 2024-08-16 05:27:08

好吧,这是一个非常糟糕的主意,但是它太疯狂了,它可能会起作用!

  1. 创建前 26 个素数的列表。

    素数 = [2, 3, 5, 7, 11, 13, 17, 19, 23, ...]
    
  2. 对于单词的每个字母,找到相应的素数。 A → 2、B → 3、C → 5 等。

  3. 将这些素数相乘。您最终会得到一个(非常大的)数字。

具有相同字母的单词将具有相同的数字。具有不同字母的单词保证具有不同的数字。这是为什么?

因为我们将质数相乘,所以我们总是会得到独特的字母组合的独特乘积。这些数字可以分解回它们的质因数,而这些因数可以准确地告诉我们原始单词中有哪些字母。字母的顺序不会被保留,但单词中包含哪些字母以及有多少个字母会被保留。

例如,以“脸”和“咖啡馆”这两个词为例。

FACE = 13 * 2 * 5 * 11 = 1430  
CAFE = 5 * 2 * 13 * 11 = 1430

哈!有什么比简单的整数比较更有效的呢?

...

好吧,不,也许不是。这对于实际使用来说有点太荒谬了。不过它很整洁。

Okay, this is a really bad idea, but it's just so crazy it might work!

  1. Create a list of the first 26 prime numbers.

    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, ...]
    
  2. For each letter of a word, find the corresponding prime number. A → 2, B → 3, C → 5, etc.

  3. Multiply these prime numbers together. You will end up with a (very large) number.

Words that have the same letters will have the same number. Words with different letters are guaranteed to have different numbers. Why is that?

Because we're multiplying prime numbers we will always end up with unique products for unique combinations of letters. The numbers can be decomposed back into their prime factors, and the factors tell us precisely which letters were in the original word. The order of the letters isn't preserved but which letters were in the word and how many there were is.

For instance, take the words "face" and "cafe".

FACE = 13 * 2 * 5 * 11 = 1430  
CAFE = 5 * 2 * 13 * 11 = 1430

Ha! What could be more efficient than a simple integer comparison?

...

Okay, no, maybe not. This is a little too ridiculous to actually use. It's neat though.

粉红×色少女 2024-08-16 05:27:08

只需先对每个字符串的字符进行排序,然后进行比较即可。

rent == tern
enrt == enrt

Simply sort the characters of each string first, then compare them.

rent == tern
enrt == enrt
凉宸 2024-08-16 05:27:08

一种替代方法是计算每个字符串中每个字符的数量并比较计数。一个简单的实现应该花费 O(max(N, A)) 时间,其中 N 是较大字符串的长度,A 是用于存储计数的数组的大小。例如,在 Java 中:

public boolean equalIgnoringOrder(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Assuming characters in the range ASCII 0 to 127 
    int[] c1 = new int[128];
    int[] c2 = new int[128];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i)]++;
        c2[s2.charAt(i)]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}

对此有一些可能的改进。例如,您可以通过进行范围缩小来处理任意字符集;即对 s1s2 进行初始遍历,查找每个字符中的最小和最大字符,并使用它来确定 c1 的大小和c2 以及基本偏移量。这将平均使用更少的空间并减少初始化计数数组的时间。它还提供了用于比较的短路;例如,当 s1s2 的最小和最大字符不同时。

相比之下,比较使用堆排序或快速排序排序的字符串平均为 O(NlogN)O(N) 空间,其中 N 是较大字符串的长度。

然而,正如 @pst 指出的,比例常数可以使 O(NlogN) 甚至 O(N*N) 算法比 O( N) 如果 N 不大的话算法。在这种情况下,被比较的字符串的平均长度可能是最重要的因素。

上面的代码有效地执行了带有几个短路的基数排序。 (如果包括与范围缩小相关的短路,则为三个。)因此最终归结为快速排序/堆排序还是基数排序是否更好。这取决于输入字符串长度和字符范围。


采取不同的策略。 @John 的答案建议我们计算素数的乘积。如果我们使用任意精度表示进行计算,则对于每个不同的“相等忽略顺序”字符串集,结果值将是唯一的。不幸的是,计算量将是O(N*N)。 (每个中间产品都有 O(N) 位数字,将 N 位数字乘以常数是 O(N)。对 N 个字符执行此操作,您将得到 < code>O(N*N)。)

但是,如果我们以(例如)64 为模进行计算,结果实际上是一个对字符顺序不敏感的非常好的哈希值;例如

long hash = 1;
for (int i = 0; i < s.length(); i++) {
    hash = hash * primes[s.charAt(i)];
}

,我认为,用于比较随机生成的字符串的平均提供最佳性能和空间使用的算法可能采用以下形式:

if (s1.length() != s2.length()) {
    return false;
}
if (hash(s1) != hash(s2)) { // computed as above
    return false;
}
// Compare using sorting or character counting as above.

最后一点。如果我们假设字符串指针不相同并且字符串长度不等,则任何计算此等于谓词的算法都必须为 O(N)或更糟。它必须检查两个字符串中的每个字符才能做出此决定,这需要 O(N) 运算。

在这种情况下,任何对获取的值进行少于 2 * N 次提取或少于 2 * N 次进一步操作的算法都可能是不正确的。

One alternative is to count the numbers of each character in each string and compare the counts. A simple implementation should take O(max(N, A)) time where N is the length of the larger of the strings, and A is the size of the array you use to store counts. For example, in Java:

public boolean equalIgnoringOrder(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Assuming characters in the range ASCII 0 to 127 
    int[] c1 = new int[128];
    int[] c2 = new int[128];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i)]++;
        c2[s2.charAt(i)]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}

There are some possible improvements to this. For example, you can cope with an arbitrary character set by doing a range reduction; i.e. do an initial pass through s1 and s2 looking for the smallest and largest characters in each one, and use this to determine the size of c1 and c2 and a base offset. This will use less space on average and reduce the time to initialize the count arrays. It also offers a short circuit for the comparison; e.g. when the smallest and largest characters for s1 and s2 are not the same.

By comparison, comparing strings sorted using heapsort or quicksort would be O(NlogN) on average with O(N) space, where N is the larger string's length.

However, as @pst points out, the constants of proportionality can make an O(NlogN) or even O(N*N) algorithm better than an O(N) algorithm if N is not large. In this case, the average lengths of the strings being compared is probably the most important factor.

The code above is effectively performing a Radix Sort with a couple of short circuits. (Three if you include the short circuit associated with range reduction.) So ultimately it boils down to whether a Quick Sort / Heap Sort or a Radix Sort would be better. And that depends on input string lengths and character ranges.


On a different tack. @John's answer proposes that we compute a product of prime numbers. If we do the computation using an arbitrary precision representation, the resulting values will be unique for each distinct set of "equal ignoring order" strings. Unfortunately, the computation will be O(N*N). (Each intermediate product has O(N) digits, and multiplying an N-digit number by a constant is O(N). Do that for N characters and you get O(N*N).)

But if we do the computation modulo (say) 64, the result is effectively a really good hash that is insensitive to character order; e.g.

long hash = 1;
for (int i = 0; i < s.length(); i++) {
    hash = hash * primes[s.charAt(i)];
}

So, I would claim that the algorithm that gives best performance and space usage on average for comparing randomly generated strings is likely to be of the form:

if (s1.length() != s2.length()) {
    return false;
}
if (hash(s1) != hash(s2)) { // computed as above
    return false;
}
// Compare using sorting or character counting as above.

One final point. If we assume that the string pointers are not identical and that the strings have unequal lengths, any algorithm that computes this equals predicate has to be at O(N) or worse. It has to examine every character in both strings to make this determination, and that takes O(N) operations.

Any algorithm that does less than 2 * N fetches or less than 2 * N further operations on the fetched values in this scenario is provably incorrect.

深海夜未眠 2024-08-16 05:27:08

考虑到问题的含糊性,这里的关键是似乎没有必要计算任何字母出现的次数,只是它确实出现了。

因此,假设所有字母都在 az 范围内,并且还假设可以使用整数索引将原始单词列表索引为数组:

1. 创建两个数组 (每个列表一个)。

2. 对于两个列表中的每个单词,按如下方式计算位图:

bitmap = 0
foreach (character in word) {
    bitmap |= (1 << (character - 'a'))
}
arrayX[index] = bitmap;

该位图表示该单词中出现的所有字母的集合。

3. 然后,对于集合 A 中的每个单词,迭代集合 B,并在以下情况下进行匹配:

arrayA[indexA] | arrayB[indexB] == arrayB[indexB]

仅当单词 A 中的字符集是单词 B 的字符子集时,该测试才会为真。位集的“或”运算相当于实数集的并集 (∪) 运算符。

请参阅维基百科条目 集合数学 - A ⊆ B 当且仅当 A ∪ B = B。

顺便说一句,第 3 步的时间复杂度为 O(n^2),但仍然非常快,因为它只是按位比较。每个列表中的几千个单词(约 4M 测试)应该花费不到一秒的时间。

The key here, given the ambiguity in the question, is that it does not seem necessary to count how many times any letter appears, only that it does appear.

Therefore, assuming that all letters are in the range a-z, and also assuming that it's possible to index the original word lists as arrays using integer indices:

1. create two arrays (one for each list).

2. for every word in both lists calculate bitmap as follows:

bitmap = 0
foreach (character in word) {
    bitmap |= (1 << (character - 'a'))
}
arrayX[index] = bitmap;

this bitmap represents the set of all letters which appear in that word.

3. then for each word in set A, iterate over set B, and match when

arrayA[indexA] | arrayB[indexB] == arrayB[indexB]

That test will only be true if the set of characters in that word A is a subset of the characters of word B. The "or" operation for bitsets is the equivalent of the union (∪) operator for real sets.

See the Wikipedia entry on set mathematics - A ⊆ B if and only if A ∪ B = B.

BTW, Step 3 is O(n^2), but should still be very fast because it's just a bitwise comparison. A couple of thousand words in each list (~4M tests) should take less than a second.

鸢与 2024-08-16 05:27:08

我必须同意 Stephen C 的观点 - 这个定义还不够明确,无法回答

我不会投反对票,但您能否解释一下,例如,租金是否等于特伦特?有些回答者假设它是这样的(人们认为出现的次数并不重要,而其他回答者则假设最坏的情况。其中一组正在浪费时间。

此外,由于您关心的是性能,所以我们需要更多地了解您的调用模式。您能否解释一下您是否会多次查看一对集合,或者这些集合是否有所不同?

正如术语抽搐一样,您可能已经知道这一点,但按照当前的表述你的算法不是对称的。

你说租金会匹配租金,但显然,租金不会匹配所以你并不是真正在寻找等价性。你正在寻找类似“是”的东西。发现于”,或“可以由”制成。

这意味着你必须关心顺序 - 根据你访问集合的方式,你会得到不同的结果。

不要误会我的意思:这是一个有趣的问题......我只是不知道问题是什么。

I have to agree with Stephen C - this is not well enough defined to answer.

I'm not going to downvote, but could you explain, for example, whether rent is equivalent to terrent? You've got answerers who are assuming that it is (people thinking that number of occurrences doesn't matter, and other answerers who assume the worst. One of these groups is wasting their time.

Also, since your concern is about performance, we need to know more about your call pattern. Could you explain whether you'll look at a pair of sets more than once, or if the sets vary?

And just as a terminology twitch, you may already know this, but with the current formulation your algorithm isn't symmetric.

You say that rent would match ternicate, but obviously, ternicate would not match rent. So you're not really looking for equivalence. You're looking for something like "is found in", or "can be made from".

This means you have to care about order - you'll get different results depending on how you visit your sets.

Don't get me wrong: it's an interesting problem... I just don't know what the problem is.

蔚蓝源自深海 2024-08-16 05:27:08

我编写了很多与文字游戏和字谜游戏相关的代码。通常的方法是将单词转换为排序键,以便如上所述,“rent”与“tern”匹配,因为两者都映射到“enrt”。然而,一旦你开始这条路线,拥有一个字符和出现次数的字典就变得非常有用。下面是一些 Python 代码,它使用 (key=character, value=count) 将未排序的字符串转换为字典:

import collections

# Create a defaultdict(int) from a string
def create_collections_dict(key):
    dk = collections.defaultdict(int)
    for k in key:
        dk[k] += 1
    return dk

现在,您可以通过立即查看单词有多少个共同字母来对其他单词进行评分:

# Score the similarity of a defaultdict(int) against a string
# (which is temporarily converted to a defaultdict(int))
def score(dk, cand) :
    dc = create_collections_dict(cand)
    return sum(min(dk[k], dc[k]) for k in dk.keys() if k in dc)

if __name__ == '__main__':
    base = create_collections_dict('rent')
    for word in ['tern', 'ternicate', 'foobar']:
        print word, score(base, word)

结果:

tern 4
ternicate 4
foobar 1

I've done a lot of code that worked with word games and anagrams. The usual approach is to convert the word into a sorted key so that, as mentioned above, 'rent' matches 'tern' because both map to 'enrt'. Once you start on this route, though, it becomes really useful to have a dictionary of character and count of occurrence. Here is some python code that converts an unsorted string to a dictionary with (key=character, value=count):

import collections

# Create a defaultdict(int) from a string
def create_collections_dict(key):
    dk = collections.defaultdict(int)
    for k in key:
        dk[k] += 1
    return dk

Now you can score words against others by instantly seeing how many letters they have in common:

# Score the similarity of a defaultdict(int) against a string
# (which is temporarily converted to a defaultdict(int))
def score(dk, cand) :
    dc = create_collections_dict(cand)
    return sum(min(dk[k], dc[k]) for k in dk.keys() if k in dc)

if __name__ == '__main__':
    base = create_collections_dict('rent')
    for word in ['tern', 'ternicate', 'foobar']:
        print word, score(base, word)

Results:

tern 4
ternicate 4
foobar 1
装纯掩盖桑 2024-08-16 05:27:08

假设:

  1. 你的单词只包含 ascii 字符
  2. 大小写并不重要
  3. abc 匹配 abcde 并且 abcde 不匹配 abc

你可以通过匹配字符串 (s2) 计算字符数,然后通过值 (s1) 并检查所有字符是否存在于另一个,类似于(伪代码,未检查):

boolean matches(String s1, String s2) {
   int[]  counts = new int[256];
   char[] c1;
   char[] c2;

   c1 = s1.getCharArray();
   c2 = c2.getCharArray();

   // count char occurences in longest string
   for (int n = 0; n < c2.length; n++) {
       counts[(int)c2[n]]++;
   }

   // check all chars in shortest string are foud in the longest
   for (int n = 0; n < c1.length; n++) {
       if (0 == counts[(int)c1[n]]) {
          return false;
       }
   }

   return true;
}

对于参数长度的总和来说,这将是 O(n) 。

编辑:问题已更改为 s1 和 s2 之间的不对称函数。

Assuming that:

  1. your words only consist of ascii characters
  2. case doesnt matter
  3. abc matches abcde and abcde does not match abc

You can go through the match string (s2) counting chars, then go through the value (s1) and check all chars are present in the other, something like (pseudo code, not checked):

boolean matches(String s1, String s2) {
   int[]  counts = new int[256];
   char[] c1;
   char[] c2;

   c1 = s1.getCharArray();
   c2 = c2.getCharArray();

   // count char occurences in longest string
   for (int n = 0; n < c2.length; n++) {
       counts[(int)c2[n]]++;
   }

   // check all chars in shortest string are foud in the longest
   for (int n = 0; n < c1.length; n++) {
       if (0 == counts[(int)c1[n]]) {
          return false;
       }
   }

   return true;
}

This would be O(n) for the sum of argument lengths.

Edit: the question was changed to an asymmetric function between s1 an s2.

不羁少年 2024-08-16 05:27:08

也许不是最快的,但可能是使用 java+google-collections+guava (用于转换 char[]->List

import com.google.common.collect.ImmutableMultiset;
import com.google.common.primitives.Chars;

public class EqualsOrderignore {
    private static boolean compareIgnoreOrder(final String s1, String s2) {
        return ImmutableMultiset.copyOf(Chars.asList(s1.toCharArray()))
                .equals(ImmutableMultiset.copyOf(Chars.asList(s2.toCharArray())));
    } 
}

该算法运行时的 最短解决方案: O(s1.length + s2.length)

我非常确信此解决方案将在服务器虚拟机上与手工制作的 O(N1+N2) 解决方案执行相同的操作。

另外,这个解决方案适用于任何字符实例,而不仅仅是 aZ。

Maybe not the fastest, but likely the shortest solution using java+google-collections+guava (for casting char[]->List<Character>)

import com.google.common.collect.ImmutableMultiset;
import com.google.common.primitives.Chars;

public class EqualsOrderignore {
    private static boolean compareIgnoreOrder(final String s1, String s2) {
        return ImmutableMultiset.copyOf(Chars.asList(s1.toCharArray()))
                .equals(ImmutableMultiset.copyOf(Chars.asList(s2.toCharArray())));
    } 
}

runtime of this algorithm: O(s1.length + s2.length)

I am quite convinced this solution will perform en-par with handcrafted O(N1+N2) solution on a -server VM.

As a plus this solution will work for any instances of characters, not just a-Z.

心房的律动 2024-08-16 05:27:08

我认为你应该造一棵树。
我写了一些 python 代码来说明这个想法,但它可能有错误:

class knot():
    def __init__(self, char, is_word, string = "" way = 0):
        self.children = []
        self.shortest_way = way
        self.char = char
        self.word = is_word
        self.string = string

def comparing(strings):
    sorted_strings = []
    for string in strings:
        array_of_string = []
        for char in string:
            array_of_string.append(char)
        sorted_strings.append(array_of_string.sort())
    sorted_strings.sort()

    start = []
    matches = []

    for array_of_string in sorted_strings:
        matches += insert_string(array_of_string, start)

def insert_string(array, start):
    for match_string in test_string(array, start):
        matches += (array, match_string.string)
    add_string(array, start, 0):

def add_string(array, knots, n):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[n]):
            minimum = num
        elif (knots[num].char < array[n]):
            maximum = num
        elif (knots[num].char == array[n]):
            return add_string(array, knots[num], n+1)
    knots.append(new_knots(array, n))
    knots.sort

    """ more insertion routine needed"""

def search_children(array, knots):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[0]):
            minimum = num
        elif (knots[num].char < array[0]):
            maximum = num
        elif (knots[num].char == array[0]):
            return test_string(array, knots[num])
    return []

def test_string(array, target_knot):
    if len(array) > target_knot.sortest_way + 1:
        return []
    match_knots = []
    if len(array) == 1 and target_knot.is_word == True:
        match_knots.append(target_knot)
    for i in range(1, len(array)):
        match_knots += search_children(array[i:], target_knot.children)
    return match_knots

I think you should build a tree.
I have written a bit of python code to illustrate the idea, but it's probably buggy:

class knot():
    def __init__(self, char, is_word, string = "" way = 0):
        self.children = []
        self.shortest_way = way
        self.char = char
        self.word = is_word
        self.string = string

def comparing(strings):
    sorted_strings = []
    for string in strings:
        array_of_string = []
        for char in string:
            array_of_string.append(char)
        sorted_strings.append(array_of_string.sort())
    sorted_strings.sort()

    start = []
    matches = []

    for array_of_string in sorted_strings:
        matches += insert_string(array_of_string, start)

def insert_string(array, start):
    for match_string in test_string(array, start):
        matches += (array, match_string.string)
    add_string(array, start, 0):

def add_string(array, knots, n):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[n]):
            minimum = num
        elif (knots[num].char < array[n]):
            maximum = num
        elif (knots[num].char == array[n]):
            return add_string(array, knots[num], n+1)
    knots.append(new_knots(array, n))
    knots.sort

    """ more insertion routine needed"""

def search_children(array, knots):
    minimum = 0
    maximum = len(knots) - 1
    while minimum != maximum:
        num = int((minimum + maximum) / 2)
        if (knots[num].char > array[0]):
            minimum = num
        elif (knots[num].char < array[0]):
            maximum = num
        elif (knots[num].char == array[0]):
            return test_string(array, knots[num])
    return []

def test_string(array, target_knot):
    if len(array) > target_knot.sortest_way + 1:
        return []
    match_knots = []
    if len(array) == 1 and target_knot.is_word == True:
        match_knots.append(target_knot)
    for i in range(1, len(array)):
        match_knots += search_children(array[i:], target_knot.children)
    return match_knots
反差帅 2024-08-16 05:27:08

对于您选择的任何算法,都可能会针对相同长度的字符串进行优化。您所要做的就是对每个字符进行异或,如果结果为 0,则它们包含相同的字母。这在子字符串的情况下没有帮助,但它可能有助于短路更昂贵的比较。

For whatever algorithm you choose, an optimization might be made for strings of the same length. All you have to do is XOR each character, if the result is 0 then they contain the same letters. This doesn't help in the substring case, but it might be a help to short circuit a more expensive compare.

乙白 2024-08-16 05:27:08

对于区分大小写。

  1. 将两个字符串转换为小写/大写。
  2. 计算所有字符的总和,得到单词的总和 - 即租金 -> r+e+n+t(将添加每个字符的等效 ASCII 值)
  3. 对第二个字符串重复步骤 2,
  4. 如果两个总和相等,则

这意味着两个字符串包含完全相同的字符,与顺序无关。爪哇:

private static boolean matchIrrespectiveOfCharsOrder(String wordOne,String wordTwo){
    int sumOfAsciiCharsOne=sum(wordOne.toCharArray());
    int sumOfAsciiCharsTwo=sum(wordTwo.toCharArray());
        if(sumOfAsciiCharsOne==sumOfAsciiCharsTwo)
            return true;
    }       
    return false;
}
private static int sum(char[] arr) {
    int sum=0;
    for(char chh:arr) {
        sum+=chh;
    }
    return sum;
}

For Case sensitive.

  1. Convert both Strings to lowercase/uppercase.
  2. Calculate the sum of all chars together to get sum of word - i.e. rent -> r+e+n+t(equivalent ASCII value for each char will be added)
  3. repeat step 2 for second string
  4. if both the sums are equal that means both the string contains exactly same chars irrelevant of ordering.

Java:

private static boolean matchIrrespectiveOfCharsOrder(String wordOne,String wordTwo){
    int sumOfAsciiCharsOne=sum(wordOne.toCharArray());
    int sumOfAsciiCharsTwo=sum(wordTwo.toCharArray());
        if(sumOfAsciiCharsOne==sumOfAsciiCharsTwo)
            return true;
    }       
    return false;
}
private static int sum(char[] arr) {
    int sum=0;
    for(char chh:arr) {
        sum+=chh;
    }
    return sum;
}
从此见与不见 2024-08-16 05:27:08

这是相当模糊的,但我会使用关联数组来解决它:

使用每个单词的每个字母作为整数关联数组的键。
一个单词的字母会增加值,而另一个则会减少值。然后最后你可以运行 foreach 遍历所有键并检查所有值是否为零,然后它匹配。
这将为您提供基本的rent==tren 功能。

模糊性注意事项:

  1. 如果多个字母都可以,例如rent==rreenntt,那么在向数组添加字母时,检查该键是否存在,如果存在,则不要再次添加。
  2. 如果额外的字母没问题,例如rent==renter,但fernt!=renter,那么在最后检查数组值时,检查1和-1是否同时在数组中。换句话说,只有1和0可以,或者-1和0也可以,但1和-1不能同时在数组中。

我不知道这相对于其他方法有多快,但它很容易实现。

It is pretty vague, but I'd use an associative array to solve it:

Use each letter of each word as the key to an associative array of integers.
One word's letters increment the values, and the other would decrement. Then at the end you can run foreach through all the keys and check that all the values are zero, and then it matches.
This gets you basic rent==tren functionality.

Caveats for vagueness:

  1. If multiple letters are okay, eg rent==rreenntt, then when adding letters to the array, check if the key exists, and if it does, don't add it again.
  2. If extra letters are okay, eg rent==renter, but fernt!=renter, then when checking the array values at the end, check that 1 and -1 aren't both in the array at the same time. In other words, 1 and 0 only are okay, or -1 and 0 are okay, but not 1 and -1 can't be in the array at the same time.

I don't know how fast this would be relative to other approaches, but it would be easy to implement.

白昼 2024-08-16 05:27:08

假设您只是寻找子集,并且仅限于常见的英文字母,那么高效的直方图就可以了。我会考虑使用 64 位无符号整数,其中 2 位用于计算最多 2 次出现,额外的 12 位用于添加溢出标志并计算最多 3 次出现 'e ta o in s rh l d' 。位被填充而不是使用二进制(因此对于三个“e”,您将拥有 111,否则您需要比二进制 & 更复杂的东西来测试遏制)。要测试子集关系,请检查正在测试的子集的溢出位,如果未设置,则可以仅使用按位和来测试子集。
如果直方图确实溢出,则回退到 O(Length) 检查字符串的排序内容。

Assuming you're just looking for subsets, and are limited to the common English letters, then an efficient histogram will do. I'd look at using 64-bit unsigned integer, with 2 bits for counting up to 2 occurrences, and the extra 12 bits to add an overflow flag and to count up to 3 occurrences of 'e t a o i n s r h l d' . Bits are filled rather than using binary (so for three 'e's you would have 111, otherwise you need something more complicated than a binary & to test containment). To test the subset relation, you check the overflow bit of the subset you are testing and if not set, you can just use bitwise and to test for subset.
Fall back to O(Length) checking of the sorted contents of the string if the histogram does overflow.

木槿暧夏七纪年 2024-08-16 05:27:08

阅读要在两组上进行比较,比如说A&B,我仍然不知道想要的结果是什么。
Alnitak尝试了所有匹配对(单词 a 属于 A,单词 b 属于 B),使用集合操作并保留集合B 中单词的字符以供重用,每个“查询”都会迭代 B。
下面另一种方法预处理 B 并使用 Set 操作进行快速查询,每个“查询”迭代查询集(字符串 a 中的字符集)。 (受到包含超集的启发):

  • 建立一个Map 单词中的字符b∈B → < em>B 中包含这些单词的标识符集。
  • 对于“查询词”,例如来自 A 的一个词a
    • 如果不同字符多于 B 中的任何单词,则不匹配
    • 或者不在上述地图支持的“集合中”:
      获取第一个字符的“单词标识符”集
      对于后续字符 cc 的 ID 集相交:
      如果交集为空,则不保留超集。
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** allows to <code>add()</code> Sets, and <code>test()</code> whether
 *  a superset of a query-set has been added. (Synonym: <code>contains()</code>) */
// To allow return of one or all supersets, one would need to keep the sets.
// 
public class SupersetChecker<E> 
    extends java.util.AbstractCollection<Set<E>> 
//  implements java.util.function.Predicate<Set<E>>
{
    private Map<E, Set<Object>> elementToSet = new java.util.HashMap<>();
    private int
        size,
        maxCardinality;
    @Override
    public int size() { return size; }
    /** @return the maximal cardinality among all Sets 
     * <code>add()</code>ed to <code>this</code>. */
    public int maxCardinality() { return maxCardinality; }
//  Set<E> elements = new java.util.HashSet<>();  // elementToSet.keySet?!
//  @Override
    public boolean test(Set<E> querySet) { return contains(querySet); }

    @Override
    /** @return <code>Set os</code> is a subset of at least one
     * <code>Set add()</code>ed to <code>this</code>. */
    public boolean contains(Object /*Set<E>*/ os) {
        if (!(os instanceof Set<?>))
            return false;
        final Set<E> s = (Set<E>) os;
        final int qSize = s.size();
        if (maxCardinality < qSize)
            return false;
        if (qSize < 1)
            return 0 < maxCardinality;
//      if (!elementToSet.keySet().containsAll(s))  // ToDo: benchmark
//          return false;
        Set<Object> candidates = null;
        for (E e: s) {
            Set<Object> containingSets = elementToSet.get(e);
            if (null == containingSets)
                return false;
            if (candidates == null)
                candidates = new java.util.HashSet<>(containingSets);
            else if (candidates.retainAll(containingSets)
                     && candidates.isEmpty())
                return false;
        }
        return true;
    }
    @Override
    /** Subsets of sets already kept do not get added.<br/>
     * Does <em>not</em> keep any reference to any of <code>s</code>. */
    // When a superset of something already kept gets added, 
    //  no information currently gets pruned:
    // Where possible, enter probable supersets first.
    public boolean add(Set<E> s) {
        if (contains(s) )
            return false;
        final Object key = new Object();
        for (E e: s) {
            Set<Object> kept = elementToSet.get(e);
            if (null == kept)  // found merge() less readable
                elementToSet.put(e, new java.util.HashSet<>
                                        (Collections.singleton(key)));
            else
                kept.add(key);
        }
        size += 1;
        if (maxCardinality < s.size())
            maxCardinality = s.size();
        return true;
    }
//  boolean add(String s) { return add(charSet(s)); }
    static Set<Character> charSet(String s) {
        java.util.ArrayList<Character>
            l = new java.util.ArrayList<>(Math.min(25, s.length()));
        for (char c: s.toCharArray())
            l.add(c);
        return new java.util.HashSet<>(l);
    }
    public static void main(String[] args) {
        SupersetChecker<Character> checker = new SupersetChecker<>();
        String B[] = { "leg", "bengal", "", "beagle", "eagle", "vulture" },
            A[] = { "x", "eagle", "gut", "legs" };
        for (String b: B) {
            checker.add(charSet(b));
            System.out.println(b + ": " + checker.size() + " sets");
        }
        for (String a: A)
            System.out.println("contains superset of " + charSet(a) + ": "
                + checker.test(charSet(a)));
    }

    @Override
    public Iterator<Set<E>> iterator() { return null; }
}

Reading The comparison is going to be made on two sets, say, A&B, I still don't know what the result wanted is.
Alnitak made a stab at all matching pairs (word a∈A, word b∈B), using Set operations and keeping the Sets of characters for words from B for reuse, each "query" iterating B.
Below another approach preprocessing B and using Set operations for fast queries, each "query" iterating the query set (set of characters in string a). (inspired a.o. by Contains superset):

  • establish a Map characters in a word b∈BSet of identifiers for words in B containing them.
  • for a "query word", say, a word a from A,
    • no match if more different characters than any word in B
    • or not "in" collection backed by above Map:
      get Set of "word identifiers" for first Character
      for subsequent characters c intersect with c's ID set:
      no superset kept if intersection empty.
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** allows to <code>add()</code> Sets, and <code>test()</code> whether
 *  a superset of a query-set has been added. (Synonym: <code>contains()</code>) */
// To allow return of one or all supersets, one would need to keep the sets.
// 
public class SupersetChecker<E> 
    extends java.util.AbstractCollection<Set<E>> 
//  implements java.util.function.Predicate<Set<E>>
{
    private Map<E, Set<Object>> elementToSet = new java.util.HashMap<>();
    private int
        size,
        maxCardinality;
    @Override
    public int size() { return size; }
    /** @return the maximal cardinality among all Sets 
     * <code>add()</code>ed to <code>this</code>. */
    public int maxCardinality() { return maxCardinality; }
//  Set<E> elements = new java.util.HashSet<>();  // elementToSet.keySet?!
//  @Override
    public boolean test(Set<E> querySet) { return contains(querySet); }

    @Override
    /** @return <code>Set os</code> is a subset of at least one
     * <code>Set add()</code>ed to <code>this</code>. */
    public boolean contains(Object /*Set<E>*/ os) {
        if (!(os instanceof Set<?>))
            return false;
        final Set<E> s = (Set<E>) os;
        final int qSize = s.size();
        if (maxCardinality < qSize)
            return false;
        if (qSize < 1)
            return 0 < maxCardinality;
//      if (!elementToSet.keySet().containsAll(s))  // ToDo: benchmark
//          return false;
        Set<Object> candidates = null;
        for (E e: s) {
            Set<Object> containingSets = elementToSet.get(e);
            if (null == containingSets)
                return false;
            if (candidates == null)
                candidates = new java.util.HashSet<>(containingSets);
            else if (candidates.retainAll(containingSets)
                     && candidates.isEmpty())
                return false;
        }
        return true;
    }
    @Override
    /** Subsets of sets already kept do not get added.<br/>
     * Does <em>not</em> keep any reference to any of <code>s</code>. */
    // When a superset of something already kept gets added, 
    //  no information currently gets pruned:
    // Where possible, enter probable supersets first.
    public boolean add(Set<E> s) {
        if (contains(s) )
            return false;
        final Object key = new Object();
        for (E e: s) {
            Set<Object> kept = elementToSet.get(e);
            if (null == kept)  // found merge() less readable
                elementToSet.put(e, new java.util.HashSet<>
                                        (Collections.singleton(key)));
            else
                kept.add(key);
        }
        size += 1;
        if (maxCardinality < s.size())
            maxCardinality = s.size();
        return true;
    }
//  boolean add(String s) { return add(charSet(s)); }
    static Set<Character> charSet(String s) {
        java.util.ArrayList<Character>
            l = new java.util.ArrayList<>(Math.min(25, s.length()));
        for (char c: s.toCharArray())
            l.add(c);
        return new java.util.HashSet<>(l);
    }
    public static void main(String[] args) {
        SupersetChecker<Character> checker = new SupersetChecker<>();
        String B[] = { "leg", "bengal", "", "beagle", "eagle", "vulture" },
            A[] = { "x", "eagle", "gut", "legs" };
        for (String b: B) {
            checker.add(charSet(b));
            System.out.println(b + ": " + checker.size() + " sets");
        }
        for (String a: A)
            System.out.println("contains superset of " + charSet(a) + ": "
                + checker.test(charSet(a)));
    }

    @Override
    public Iterator<Set<E>> iterator() { return null; }
}
橘和柠 2024-08-16 05:27:08

我们可以简单地对字符串中的所有字符进行排序并进行比较吗?

private static boolean compareSrings(String first, String second) {
    char[] firstArray = first.toCharArray();
    Arrays.sort(firstArray);
    char[] secondArray = second.toCharArray();
    Arrays.sort(secondArray);

    return new String(firstArray).equals(new String(secondArray));
}

Can we simply sort all the characters in the Strings and compare ?

private static boolean compareSrings(String first, String second) {
    char[] firstArray = first.toCharArray();
    Arrays.sort(firstArray);
    char[] secondArray = second.toCharArray();
    Arrays.sort(secondArray);

    return new String(firstArray).equals(new String(secondArray));
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文