Sorted array and unique array

Posted on 2024-10-28 18:23:23

I wrote a script in Python that scans a file and extracts strings from it into a big array. I do something like:

while (delimiter #1 found)
    search for delimiter #2
    if the string between #1 and #2 is not in the "final array", add it

It took me 1 hour to write the script in Python, but it's just too slow for big files (8 minutes for 400 files is far too long), so I decided to rewrite this batch job in C. After one day I still haven't finished it.

I've already looked at things like sorted arrays (GNU C sorted arrays). I'd like to check whether the string between #1 and #2 is already in an array of strings, and if not, add it. I thought there would be obvious functions for adding a string to a pre-sorted array (and keeping it sorted), and/or adding a string to a pre-sorted array only if it's not already there.

The only solutions I've found are:

  1. use lsearch()
  2. use bsearch(), and if the string is not found, add it and re-sort the array

The second approach takes ages (qsort() is too slow), and the first one gets too slow after thousands of elements (because they're not sorted).

Do you know where I could look / what I could do / which library I could use? I guess I'm not the only one on earth who wants to put a string in a pre-sorted string array only if it's not present (and keep it sorted)! ;)

Comments (2)

烙印 2024-11-04 18:23:23

I don't know of an ANSI C library that does this, but it's not that hard to implement yourself. You want to write a "sorted array list" for strings. Here's a rough idea of what that would look like:

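#include <stdlib.h>
#include <string.h>

// Start from a zeroed list: struct SortedArrayList list = {0, 0, NULL};
struct SortedArrayList {
    int size;        // number of strings stored
    int capacity;    // number of allocated slots
    char **element;  // string pointers, kept sorted by strcmp()
};

// returns: >= 0 if the element is contained, < 0 (-insertPos-1) if not
int GetIndexPos(const struct SortedArrayList *list, const char *text)
{
    // Binary search through the list of strings
    int left = 0, right = list->size - 1;

    while (left <= right) {
        int center = left + (right - left) / 2; // avoids overflow of left+right
        int cmp = strcmp(list->element[center], text);
        if (cmp == 0) return center;     // found
        if (cmp < 0) left = center + 1;  // continue right
        else right = center - 1;         // continue left
    }
    return -left - 1; // not found, return the encoded insert position
}

// returns: 1 if added, 0 if already present, -1 if out of memory
int Add(struct SortedArrayList *list, char *text)
{
    int pos = GetIndexPos(list, text);
    if (pos >= 0) return 0; // already present
    pos = -pos - 1;

    // Expand the array when it is full
    if (list->size >= list->capacity)
    {
        int newCapacity = list->capacity > 0 ? list->capacity * 2 : 16;
        char **grown = realloc(list->element, newCapacity * sizeof(char*));
        if (grown == NULL) return -1; // the old array is still valid
        list->element = grown;
        list->capacity = newCapacity;
    }

    // Shift the tail up by one slot, then store the pointer (the string is not copied)
    memmove(&list->element[pos + 1], &list->element[pos], sizeof(char*) * (list->size - pos));
    list->element[pos] = text;
    list->size++;
    return 1;
}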

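This gives you an O(log(n)) lookup for the duplicate check. The insertion itself still has to shift up to n pointers, so it is O(n) in the worst case, but that is a single memmove() of pointers, which is far cheaper than re-sorting the whole array with qsort() after every insert. If you want to improve the runtime some more, you can use a better-suited data structure such as a hash map.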
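As a rough illustration of the hash map suggestion, here is a minimal sketch of a string set using open addressing with linear probing. The names StrSet, strset_init, strset_add and strset_free are made up for this example, and the FNV-1a hash is just one reasonable choice. Like the code above, it stores the caller's pointers rather than copying the strings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string set; all names here are illustrative only. */
typedef struct {
    size_t count;       /* number of stored strings */
    size_t capacity;    /* number of slots, always a power of two */
    const char **slot;  /* NULL marks an empty slot */
} StrSet;

/* 32-bit FNV-1a string hash. */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Allocate a table of `capacity` slots, all set to NULL. */
static const char **alloc_slots(size_t capacity)
{
    const char **slot = malloc(capacity * sizeof *slot);
    if (slot != NULL)
        for (size_t i = 0; i < capacity; i++) slot[i] = NULL;
    return slot;
}

/* Index of the slot holding `text`, or of the empty slot where it belongs.
   Terminates because the table is never more than half full. */
static size_t find_slot(const char **slot, size_t capacity, const char *text)
{
    size_t i = hash_str(text) & (capacity - 1);
    while (slot[i] != NULL && strcmp(slot[i], text) != 0)
        i = (i + 1) & (capacity - 1);
    return i;
}

/* Returns 0 on success, -1 if out of memory. */
int strset_init(StrSet *set, size_t min_capacity)
{
    assert(set != NULL);
    size_t cap = 16;
    while (cap < min_capacity) cap *= 2;  /* round up to a power of two */
    set->slot = alloc_slots(cap);
    if (set->slot == NULL) return -1;
    set->count = 0;
    set->capacity = cap;
    return 0;
}

/* Double the table and re-insert every stored pointer. */
static int grow(StrSet *set)
{
    size_t new_cap = set->capacity * 2;
    const char **new_slot = alloc_slots(new_cap);
    if (new_slot == NULL) return -1;
    for (size_t i = 0; i < set->capacity; i++)
        if (set->slot[i] != NULL)
            new_slot[find_slot(new_slot, new_cap, set->slot[i])] = set->slot[i];
    free(set->slot);
    set->slot = new_slot;
    set->capacity = new_cap;
    return 0;
}

/* Returns 1 if added, 0 if already present, -1 if out of memory. */
int strset_add(StrSet *set, const char *text)
{
    assert(set != NULL && text != NULL);
    /* Keep the load factor at or below 1/2. */
    if ((set->count + 1) * 2 > set->capacity && grow(set) != 0)
        return -1;
    size_t i = find_slot(set->slot, set->capacity, text);
    if (set->slot[i] != NULL) return 0;
    set->slot[i] = text;
    set->count++;
    return 1;
}

void strset_free(StrSet *set)
{
    assert(set != NULL);
    free(set->slot);
    set->slot = NULL;
    set->count = set->capacity = 0;
}
```

Lookups and inserts are O(1) on average here, but the set is unordered. If sorted output is needed, the non-NULL slots would have to be collected into an array and sorted once at the end, which is still only one qsort() call instead of one per insert.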

痕至 2024-11-04 18:23:23

Use a linked list of strings while reading the file, so you can insert the current string instead of having to shift / sort the strings for each insert.

There are several ways in which you could optimize the search / insertion (like using indexes, hashmaps, triemaps or whatever), but it's hard to say which would be appropriate for your use, and I won't try to list / explain them all.

Once you are done (and know the size your array actually needs), you can allocate the memory needed, and copy the string pointers from the linked list into the allocated array, releasing the list nodes in the process.
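A minimal sketch of that idea, under a few assumptions of my own: the list is kept sorted while reading, duplicates are rejected on insert, and the names Node, list_insert_unique and list_to_array are hypothetical. Note that finding the insert position in a plain linked list is still a linear scan; only the shifting is avoided.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical singly linked list node; stores the pointer, not a copy. */
typedef struct Node {
    const char *text;
    struct Node *next;
} Node;

/* Insert `text` into the sorted list unless an equal string is present.
   Returns 1 if added, 0 if already present, -1 if out of memory. */
int list_insert_unique(Node **head, const char *text)
{
    assert(head != NULL && text != NULL);
    Node **link = head;  /* points at the link we may need to rewrite */
    while (*link != NULL) {
        int cmp = strcmp((*link)->text, text);
        if (cmp == 0) return 0;  /* duplicate */
        if (cmp > 0) break;      /* insert before this node */
        link = &(*link)->next;
    }
    Node *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    node->text = text;
    node->next = *link;
    *link = node;
    return 1;
}

/* Copy the string pointers into a newly allocated array of exactly the
   needed size, freeing the list nodes along the way. Stores the element
   count in *count. Returns NULL if the list is empty or if allocation
   fails (in which case the list is left untouched). */
const char **list_to_array(Node **head, size_t *count)
{
    assert(head != NULL && count != NULL);
    size_t n = 0;
    for (const Node *p = *head; p != NULL; p = p->next) n++;
    *count = n;
    if (n == 0) return NULL;

    const char **array = malloc(n * sizeof *array);
    if (array == NULL) return NULL;

    size_t i = 0;
    while (*head != NULL) {
        Node *node = *head;
        array[i++] = node->text;
        *head = node->next;
        free(node);
    }
    return array;
}
```

The caller owns the returned array and releases it with free(); the strings themselves are never copied or freed by these functions.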

(Or, as pmg correctly commented, simply continue using that linked list / map directly.)
