从c中的字符串中删除最常见的单词

发布于 2025-02-10 11:06:40 字数 2813 浏览 2 评论 0原文

我需要从C中的字符串中删除所有最常见单词的出现。

如果文本中有几个单词重复相同的次数字符串。省略单词时，您不应省略周围空间和其他字符。如果接收的字符串不包含任何单词，则该函数无需执行任何操作。

一个单词定义为大写和小写字母的数组。该函数不需要区分大写和小写字母，

的算法是：

次
的
查找最常见的单词出现在字符串中的最常见单词
我最常见的单词

代码：

#include <stdio.h>
#include <limits.h>
#include <ctype.h>

int number_of_word_occurrence(char *s, char *start, char *end) {
  int number = 0;
  while (*s != '\0') {
    char *p = start;
    char *q = s;
    while (p != end) {
      if (*p != *q)break;
      p++;
      q++;
    }
    if (p == end)number++;
    s++;
  }
  return number;
}

int length(char *s) {
  char *p = s; int number = 0;
  while (*p != '\0') {
    p++;
    number++;
  }
  return number;
}

char *remove_most_common(char *s) {
  int n, max = INT_MIN;
  char *p = s;
  // Find max occurrence
  while (*p != '\0') {
    char *start = p;
    int word_found = 0;
    while (toupper(*p) >= 'A' && toupper(*p) <= 'Z' && *p != '\0') {
      word_found = 1;
      p++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, p);
      if (n > max)max = n;
    }
    p++;
  }
  p = s;
  int len = length(s);
  char *end = s + len;
  int i;
  // Removing most common word
  while (p != end) {
    char *start = p, *pp = p;
    int word_found = 0;
    while (toupper(*pp) >= 'A' && toupper(*pp) <= 'Z' && pp != end) {
      word_found = 1;
      pp++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, pp);
      // If word has max, then remove it
      if (n == max) {
        while (pp != end) {
          *start = *pp;
          start++;
          pp++;
        }
        end -= max; // resize end of string
        len-=max;
      }
    }
    p++;
  } 
  s[len+=2]='\0';
  return s;
}

int main() {
  char s[1000] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
  printf("%s ", remove_most_common(s) );
  return 0;
}

应该删除的单词在 BOLD 中

示例1：“ koristio sam auto- 停止 da dodjem do znaka 停止 ali prije stopa sam otvorio dekstop kompjutera kompjutera stop stop stop ”
输出：“ koristio sam auto-da dodjem do znaka ali prije stopa sam otvorio dekstop kompjutera”
示例2：“ 这个是字符串。”
输出：“是字符串。”
示例3：“ 1 pst 1 pst 2 3 pst pstpst pst ”;
输出：“” 11 2 3 PSTPST“
示例4：“ oneeebigggggggwooorrrddd ”;
输出：“”

您能帮我修复我的代码吗？我在删除字符时有一些错误。另外，您还可以帮助我删除最接近开始的单词？

如果所有单词出现都是相同的，则代码> sprintf sscanf 在求解任务时不允许使用stdio.h库。

原文

I need to remove all occurrences of most common word from string in C.

If there are several words in the text that are repeated the same number of times, the function should remove the one of the most common words that is closest to the beginning of the string. When omitting words, you should not omit surrounding spaces and other characters. If the received string does not contain any words, the function does not need to do anything.

A word is defined as an array of uppercase and lowercase letters.
The function does not need to distinguish between uppercase and lowercase letters

My algorithm is the following:

find how many times the most common word appears in string
then go word by word through string
check if word occurrence is equal to occurrence of most common word
remove found most common word

Code:

#include <stdio.h>
#include <limits.h>
#include <ctype.h>

int number_of_word_occurrence(char *s, char *start, char *end) {
  int number = 0;
  while (*s != '\0') {
    char *p = start;
    char *q = s;
    while (p != end) {
      if (*p != *q)break;
      p++;
      q++;
    }
    if (p == end)number++;
    s++;
  }
  return number;
}

int length(char *s) {
  char *p = s; int number = 0;
  while (*p != '\0') {
    p++;
    number++;
  }
  return number;
}

char *remove_most_common(char *s) {
  int n, max = INT_MIN;
  char *p = s;
  // Find max occurrence
  while (*p != '\0') {
    char *start = p;
    int word_found = 0;
    while (toupper(*p) >= 'A' && toupper(*p) <= 'Z' && *p != '\0') {
      word_found = 1;
      p++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, p);
      if (n > max)max = n;
    }
    p++;
  }
  p = s;
  int len = length(s);
  char *end = s + len;
  int i;
  // Removing most common word
  while (p != end) {
    char *start = p, *pp = p;
    int word_found = 0;
    while (toupper(*pp) >= 'A' && toupper(*pp) <= 'Z' && pp != end) {
      word_found = 1;
      pp++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, pp);
      // If word has max, then remove it
      if (n == max) {
        while (pp != end) {
          *start = *pp;
          start++;
          pp++;
        }
        end -= max; // resize end of string
        len-=max;
      }
    }
    p++;
  } 
  s[len+=2]='\0';
  return s;
}

int main() {
  char s[1000] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
  printf("%s ", remove_most_common(s) );
  return 0;
}

words that should be removed are in bold

EXAMPLE 1: "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop"
OUTPUT: "Koristio sam auto- da dodjem do znaka ali prije stopa sam otvorio dekstop kompjutera "
EXAMPLE 2: " This is string. "
OUTPUT: " is string. "
EXAMPLE 3: "1PsT1 psT2 3Pst pstpst pst";
OUTPUT: " "11 2 3 pstpst "
EXAMPLE 4: "oneeebigggwwooorrrddd";
OUTPUT: ""

Could you help me to fix my code? I have some errors while removing characters. Also, could you help me to remove the word closest to the beginning if all of word occurrences are the same?

Note: Functions from the string.h, stdlib.h libraries, as well as the sprintf and sscanf functions from the stdio.h library are not allowed when solving the task. It is not allowed to create auxiliary strings in function or globally.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

迷爱 2025-02-17 11:06:40

所有主要问题都是由于该字符串是信息的源，同时被主动更改。

通常，单词不正确地标记。

用输入“ Hello World”作为示例，Hello，Ello，llo，lo和o在寻找 word 时将字符串化时，被视为单词。该程序仅在扫描单词时通过一个字符推进字符串。

该程序应延长字符串的当前令牌的长度。

number_of_word_occurrence在进行比较时考虑任何有效 Word 的子字符串。

对于输入

koristio sam auto- 停止 da dodjem do znaka 停止 ali prije 停止 sam otvorio dek stop kompjutera 停止

the maximum count is incorrectly found to be 5, for stop.该问题与上面的问题相加，并开始删除报告这种情况数的错误令牌数据。

要概括一下，这种方法的一个很大的问题是，当您从字符串中删除a word 时，下次的出现数量将有所不同找到它。 Looking at

hello hello hello world world

The max occurrence count of a word here is 3, for hello.循环删除最大 word 将首次查看Hello，检查其出现数量，找到它为3，max和max，并且删除它。

 hello hello world world

对于第二个Hello，由于字符串已更改，因此检查其出现计数现在将返回2。这不是3的最大值，因此字符串没有变化。

此外，由于更改的字符串不会被终止 - 仅此后。意味着搜索一个单词可能会读取过时的数据，从而产生不良结果。

对程序可以使用的功能的严格限制（对动态内存和辅助缓冲区的限制的特殊性）确实可以实现非常详尽的解决方案。

一种解决方案是首先发现字符串中任何 word 的最大出现 count 首先在字符串中出现。然后，做count times操作删除 word 的出现，它允许您始终拥有 first 外观作为比较点。

通过在字符串中覆盖所有内容（包括null末端字节），将其覆盖 word 。

这是一个粗略的示例 - 在很大程度上未经测试，但为所示示例提供了正确的结果。用-ddebug编译以查看其他信息。

#include <ctype.h>
#include <stdio.h>

typedef struct word {
    const char *base;
    size_t length;
} word;

#define length(s) (span((s), NULL))
size_t span(const char *base, int (*test)(int))
{
    const char *end = base;

    while (test ? test((unsigned char) *end) : *end)
        end++;

    return (size_t) (end - base);
}

int eql(word a, word b)
{
#ifdef DEBUG
    fprintf(stderr, "DEBUG: A{%zu}<<%.*s>> <=> B{%zu}<<%.*s>>\n",
            a.length, (int) a.length, a.base,
            b.length, (int) b.length, b.base);
#endif

    if (!a.length || !b.length || a.length != b.length)
        return 0;

    if (a.base == b.base)
        return 1;

    for (size_t i = 0; i < a.length; i++)
        if (tolower((unsigned char) a.base[i]) != tolower((unsigned char) b.base[i]))
            return 0;

    return 1;
}

word get_word(const char *s, const char **end)
{
    word w = { 0 };

    while (*s && !isalpha((unsigned char) *s))
        s++;

    w.base = s;
    w.length = span(s, isalpha);

    *end = (s + w.length);

    return w;
}


word find_last(const char *s, word mark, unsigned *count)
{
    word last = { 0 };
    unsigned c = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(mark, current)) {
            last = current;
            c++;
        }
    }

    if (count)
        *count = c;

    return last;
}

word find_most_common(const char *s, unsigned *count)
{
    word most_common = { 0 };

    *count = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(most_common, current))
            continue;

        unsigned c;

        (void) find_last(s, current, &c);

        if (c > *count) {
            most_common = current;
            *count = c;
        }
    }

    return most_common;
}

void copy(char *dest, char *source, size_t length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] = source[i];
}

void remove_most_common(char *s)
{
    unsigned count = 0;
    word common = find_most_common(s, &count);

#ifdef DEBUG
    if (count)
        fprintf(stderr, "DEBUG: MOST COMMON WORD: [[%.*s]]x%u\n",
                (int) common.length, common.base, count);
#endif

    size_t len = length(s);

    while (count--) {
        word last = find_last(s, common, NULL);

        copy(
            (char *) last.base,
            (char *) last.base + last.length,
            len - (size_t) (last.base - s) + 1);

        len -= last.length;
    }
}

int main(void)
{
    char buffer[4096];

    if (!fgets(buffer, sizeof buffer, stdin))
        return 1;

    size_t len = length(buffer);

    if (len && buffer[len - 1] == '\n')
        buffer[len - 1] = 0;

    printf("<<%s>>\n", buffer);
    remove_most_common(buffer);
    printf("<<%s>>\n", buffer);
}

All the major issues are due to the fact that the string is a source of information, while being actively altered.

In general, words are not tokenized properly.

With the input "hello world" as an example, each of hello, ello, llo, lo, and o are considered words when tokenizing the string while looking for the word to remove. The program only advances the string by one character when scanning for words.

The program should advance the string by the length of the current token.

number_of_word_occurrence considers any substring a valid word when making comparisons.

For the input

Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop

the maximum count is incorrectly found to be 5, for stop. This problem compounds with the problem above, and starts removing incorrectly tokenized data that reports this occurrence count.

To generalize, a large problem with this approach is that as you remove a word from the string, the number of occurrences is going to be different for that word, the next time it is found. Looking at

hello hello hello world world

The max occurrence count of a word here is 3, for hello. Looping through to remove the maximum word will see hello the first time, check its occurrence count, find it to be 3, the max, and remove it.

 hello hello world world

For the second hello, checking its occurrence count now will return 2, since the string was altered. This is not the max of 3, so the string is unchanged.

Additionally, the string is not null-terminated as it is altered - only afterwards. Meaning searching for a word might read stale data, giving bad results.

The strict limitation on what functionality the program can utilize (particularity the restrictions on dynamic memory and auxiliary buffers) does make for very exhaustive solutions.

One solution is to first discover the maximum occurrence count of any word in the string, and hold a pointer to this word's first appearance in the string. Then, do count times operations removing the last appearance of the word, which allows you to always have the first appearance as a point of comparison.

A word is removed by overwriting it with everything that follows it in the string (including the null-terminating byte).

Here is a cursory example - largely untested, but provides the correct results for the examples shown. Compile with -DDEBUG to see additional information.

#include <ctype.h>
#include <stdio.h>

typedef struct word {
    const char *base;
    size_t length;
} word;

#define length(s) (span((s), NULL))
size_t span(const char *base, int (*test)(int))
{
    const char *end = base;

    while (test ? test((unsigned char) *end) : *end)
        end++;

    return (size_t) (end - base);
}

int eql(word a, word b)
{
#ifdef DEBUG
    fprintf(stderr, "DEBUG: A{%zu}<<%.*s>> <=> B{%zu}<<%.*s>>\n",
            a.length, (int) a.length, a.base,
            b.length, (int) b.length, b.base);
#endif

    if (!a.length || !b.length || a.length != b.length)
        return 0;

    if (a.base == b.base)
        return 1;

    for (size_t i = 0; i < a.length; i++)
        if (tolower((unsigned char) a.base[i]) != tolower((unsigned char) b.base[i]))
            return 0;

    return 1;
}

word get_word(const char *s, const char **end)
{
    word w = { 0 };

    while (*s && !isalpha((unsigned char) *s))
        s++;

    w.base = s;
    w.length = span(s, isalpha);

    *end = (s + w.length);

    return w;
}


word find_last(const char *s, word mark, unsigned *count)
{
    word last = { 0 };
    unsigned c = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(mark, current)) {
            last = current;
            c++;
        }
    }

    if (count)
        *count = c;

    return last;
}

word find_most_common(const char *s, unsigned *count)
{
    word most_common = { 0 };

    *count = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(most_common, current))
            continue;

        unsigned c;

        (void) find_last(s, current, &c);

        if (c > *count) {
            most_common = current;
            *count = c;
        }
    }

    return most_common;
}

void copy(char *dest, char *source, size_t length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] = source[i];
}

void remove_most_common(char *s)
{
    unsigned count = 0;
    word common = find_most_common(s, &count);

#ifdef DEBUG
    if (count)
        fprintf(stderr, "DEBUG: MOST COMMON WORD: [[%.*s]]x%u\n",
                (int) common.length, common.base, count);
#endif

    size_t len = length(s);

    while (count--) {
        word last = find_last(s, common, NULL);

        copy(
            (char *) last.base,
            (char *) last.base + last.length,
            len - (size_t) (last.base - s) + 1);

        len -= last.length;
    }
}

int main(void)
{
    char buffer[4096];

    if (!fgets(buffer, sizeof buffer, stdin))
        return 1;

    size_t len = length(buffer);

    if (len && buffer[len - 1] == '\n')
        buffer[len - 1] = 0;

    printf("<<%s>>\n", buffer);
    remove_most_common(buffer);
    printf("<<%s>>\n", buffer);
}

回复收藏 0 原文

初与友歌 2025-02-17 11:06:40

我决定编写一个新功能，该功能从字符串中删除单词的所有出现，这确实表现出您想要的方式。您只需要提供需要删除的源和单词即可。

代码：

#include <stdio.h>
#include <ctype.h> // toupper
#include <string.h> // strlen
#include <stdbool.h> // bool

void removeWord(char* source, char* removeThis)
{
    int i, j;
    bool wordFound;
    int sourceLength, removeLength;

    sourceLength = strlen(source);
    removeLength = strlen(removeThis);

    for(i = 0; i < sourceLength; ++i)
    {
        wordFound = true;

        for(j = 0; j < removeLength; ++j)
        {
            if(toupper(source[i + j]) != toupper(removeThis[j]))
            {
                wordFound = false;
                break;
            }
        }

        // It is not a word if the previous character or after the last one is alphabetic
        if(wordFound && (isalpha(source[i + j]) || (i > 0 && isalpha(source[i - 1]))))
        {
            wordFound = false;
        }

        if(wordFound)
        {
            for(j = i; j <= sourceLength - removeLength; ++j)
            {
                source[j] = source[j + removeLength];
            }

            --i;
            sourceLength -= removeLength;
        }
    }
}

int main()
{
    char string1[] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
    removeWord(string1, "stop");
    printf("\n%s", string1);

    char string2[] = {"This is string."};
    removeWord(string2, "this");
    printf("\n%s", string2);

    char string3[] = "1PsT1 psT2 3Pst pstpst pst";
    removeWord(string3, "pst");
    printf("\n%s", string3);

    char string4[] = "oneeebigggwwooorrrddd";
    removeWord(string4, "oneeebigggwwooorrrddd");
    printf("\n%s", string4);

    char string5[] = "Tomtom";
    removeWord(string5, "tom");
    printf("\n%s", string5);

    return 0;
}

输出：

Koristio sam auto- da dodjem do znaka  ali prije stopa sam otvorio dekstop kompjutera
 is string.
11 2 3 pstpst

Tomtom

基于此，您应该能够编写零件以找到最常见的单词，存储并将其馈送到此功能。

I decided to write a new function that removes all occurrences of a word from a string, which exactly behaves how you want it to. You only need to provide the source and the word that needs to be removed.

Code:

#include <stdio.h>
#include <ctype.h> // toupper
#include <string.h> // strlen
#include <stdbool.h> // bool

void removeWord(char* source, char* removeThis)
{
    int i, j;
    bool wordFound;
    int sourceLength, removeLength;

    sourceLength = strlen(source);
    removeLength = strlen(removeThis);

    for(i = 0; i < sourceLength; ++i)
    {
        wordFound = true;

        for(j = 0; j < removeLength; ++j)
        {
            if(toupper(source[i + j]) != toupper(removeThis[j]))
            {
                wordFound = false;
                break;
            }
        }

        // It is not a word if the previous character or after the last one is alphabetic
        if(wordFound && (isalpha(source[i + j]) || (i > 0 && isalpha(source[i - 1]))))
        {
            wordFound = false;
        }

        if(wordFound)
        {
            for(j = i; j <= sourceLength - removeLength; ++j)
            {
                source[j] = source[j + removeLength];
            }

            --i;
            sourceLength -= removeLength;
        }
    }
}

int main()
{
    char string1[] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
    removeWord(string1, "stop");
    printf("\n%s", string1);

    char string2[] = {"This is string."};
    removeWord(string2, "this");
    printf("\n%s", string2);

    char string3[] = "1PsT1 psT2 3Pst pstpst pst";
    removeWord(string3, "pst");
    printf("\n%s", string3);

    char string4[] = "oneeebigggwwooorrrddd";
    removeWord(string4, "oneeebigggwwooorrrddd");
    printf("\n%s", string4);

    char string5[] = "Tomtom";
    removeWord(string5, "tom");
    printf("\n%s", string5);

    return 0;
}

Output:

Koristio sam auto- da dodjem do znaka  ali prije stopa sam otvorio dekstop kompjutera
 is string.
11 2 3 pstpst

Tomtom

Based on this you should be able to write the part to find the most common word, store it, and feed it to this function.

回复收藏 0 原文

~没有更多了~

关于作者

心的位置

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

从c中的字符串中删除最常见的单词

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

达拉崩吧

PANGOO

kkgtx

WordPress小学生

酷炫老祖宗

硪扪都還晓

友情链接

从c中的字符串中删除最常见的单词

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

达拉崩吧

PANGOO

kkgtx

WordPress小学生

酷炫老祖宗

硪扪都還晓

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。