Bucketing strings in C

Posted 2024-12-15 16:30:02

I have an assignment to complete. It says I have to read a file which contains 3 million strings.
I have to read the file and build a structure to hold the strings. The system must be able to answer the question "is this new string present?"

I am also expected to break the list down into "buckets" of strings so that the 'string to match' is able to choose the correct bucket to search in (quickly), and that bucket should contain no more than total/hashMask strings or so (i.e. 3,000,000 / 0xFFF == 732 objects per bucket).

So far I have created the structures for the hash table and the list, a function to read the file, and add and delete functions. But I have no clue about the requirement that was typed in bold (the bucket paragraph above). Do I need to implement something for it in the hash function?

Below is my sample code

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_NAME 100

    /* Linked list node: one stored string per node */
    typedef struct list
    {
        char *string;
        int index;
        struct list *next;
    } list_t;

    /* Hash table structure */
    typedef struct hashTable
    {
        int size;        /* size of the table (number of buckets) */
        list_t **table;  /* the table elements (bucket heads) */
    } hash_table_t;

    hash_table_t *createHashTable(size_t size)
    {
        /* allocate hash table .. I know how to do it */
        (void)size;
        return NULL; /* placeholder */
    }

    /* Multiply-by-31 string hash, reduced to a bucket index */
    unsigned int hash(hash_table_t *hashTable, const char *str)
    {
        uint64_t hashVal = 0;

        while (*str != '\0')
        {
            hashVal = (unsigned char)*str + (hashVal << 5) - hashVal;
            str++;
        }
        return (unsigned int)(hashVal % hashTable->size);
    }

    void addToHashList(hash_table_t *list, void *obj, uint64_t hash)
    {
        /* add item of new list to table --> have an idea how to do it */
    }

    void removeFromHashList(hash_table_t *list, void *criterion, uint64_t hash)
    {
        /* got an idea how to do it */
    }

    /*
       This function will read the file (assume one string per line)
       and create the list of lists (list of buckets), adding one object per string.
    */
    hash_table_t *loadDataSet(char *filename, int hashMask)
    {
        /* to read a file */
        char readString[MAX_NAME];
        FILE *fp;

        if ((fp = fopen(filename, "r")) == NULL)
        {
            printf("failed to open the file\n");
            exit(EXIT_FAILURE);
        }
        while (fgets(readString, sizeof readString, fp) != NULL)
        {
            /* need to break the list down into "buckets" of strings so the
               'string to match' is able to choose the correct bucket to search
               in (quickly), and that bucket should contain no more than
               total/hashMask strings or so
               (i.e. 3,000,000 / 0xFFF == 732 objects per bucket). */
        }
        fclose(fp);
        return NULL; /* placeholder: the bucket logic above is the open question */
    }

触ぅ动初心 2024-12-22 16:30:02

I believe you've chosen the incorrect data structure for your hash tables:

typedef struct hashTable
{   
  char key[MAX_NAME];
  int index;
  struct hashTable *next;
  struct hashTable *prev;
};

One of the primary benefits of a hash table is being able to jump directly to the bucket that contains the element you're searching for. The structure above instead looks like one node in a linked list of hash buckets, which means you must walk through about 4096/2 buckets on average (half of the 0xFFF + 1 buckets) on every lookup or insertion. That will not give you the performance you need.

Your hash table should instead be an array of structs; each struct should have a pointer to a string (or direct storage for short strings) and a pointer to the next struct in the bucket. (This struct hashTable could also be the in-bucket structure, but it is a rare hash table that needs both next and prev links within the buckets, which is why I guessed this data structure is intended for the table itself.)
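To make the array-of-buckets layout concrete, here is a minimal sketch of one way it could look. All names (struct node, struct table, hash_str, table_create, table_insert, table_contains) are made up for illustration and do not come from the assignment, and the multiply-by-31 hash is only a placeholder.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; none of these come from the assignment. */
struct node {
    char *str;            /* heap copy of the stored string */
    struct node *next;    /* next entry in the same bucket */
};

struct table {
    size_t nbuckets;        /* number of buckets (array length) */
    struct node **buckets;  /* buckets[i] is the head of chain i */
};

/* Placeholder hash (multiply by 31); any reasonable string hash would do. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Allocates a table with nbuckets empty chains; returns NULL on failure. */
struct table *table_create(size_t nbuckets)
{
    assert(nbuckets > 0);
    struct table *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->buckets = malloc(nbuckets * sizeof *t->buckets);
    if (!t->buckets) {
        free(t);
        return NULL;
    }
    for (size_t i = 0; i < nbuckets; i++)
        t->buckets[i] = NULL;
    t->nbuckets = nbuckets;
    return t;
}

/* Returns 1 if s is stored, 0 otherwise; only one chain is walked. */
int table_contains(const struct table *t, const char *s)
{
    const struct node *n = t->buckets[hash_str(s) % t->nbuckets];
    for (; n; n = n->next)
        if (strcmp(n->str, s) == 0)
            return 1;
    return 0;
}

/* Pushes a copy of s onto the front of its bucket; returns 0 on success. */
int table_insert(struct table *t, const char *s)
{
    size_t i = hash_str(s) % t->nbuckets;
    struct node *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->str = malloc(strlen(s) + 1);
    if (!n->str) {
        free(n);
        return -1;
    }
    strcpy(n->str, s);
    n->next = t->buckets[i];
    t->buckets[i] = n;
    return 0;
}
```

The point of the layout is that table_contains indexes straight into buckets[] and then compares against the strings in that one chain only, so the cost of a lookup depends on the chain length rather than on the number of buckets.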

You also need to select a good hash function. There is a ton of research into good hash functions, but for a homework assignment you only need something that is better than horrible. The input to the hash function is one of your strings, and the output should be an integer. Take the output modulo (%) the size of your array (pick a prime near 5000) to figure out which bucket to use.
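As a rough illustration of the bucket arithmetic, the sketch below shows both ways of turning a hash value into a bucket number: masking (as the hashMask in the question suggests) and taking the remainder by the array size. The function names are hypothetical, and 5003 is just one prime near 5000 picked for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: keep only the low bits selected by mask.
   With mask == 0xFFF the result is in the range 0..4095. */
unsigned int bucket_by_mask(unsigned int h, unsigned int mask)
{
    return h & mask;
}

/* Hypothetical helper: remainder by the number of buckets.
   Works for any nbuckets > 0, including a prime such as 5003. */
unsigned int bucket_by_mod(unsigned int h, unsigned int nbuckets)
{
    assert(nbuckets > 0);
    return h % nbuckets;
}

/* Average chain length if the hash spreads the strings evenly
   (integer division, so the result is rounded down). */
size_t expected_per_bucket(size_t total, size_t nbuckets)
{
    assert(nbuckets > 0);
    return total / nbuckets;
}
```

With the numbers from the question, expected_per_bucket(3000000, 0xFFF) gives 732, and with 5003 buckets it gives 599. Those are averages under an even spread; a weak hash function can still leave some buckets much fuller than others.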

Here's a hash function from the stb.h library of convenient functions:

unsigned int stb_hash(char *str)
{
   unsigned int hash = 0;
   while (*str)
      hash = (hash << 7) + (hash >> 25) + *str++;
   return hash + (hash >> 16);
}

One short hint: although the stb.h code is in the public domain, it would be very wise to cite the source in your program. Professors, lawyers, and in the future your co-workers will thank you for naming the source of things you didn't write yourself.
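One way to check whether a hash function keeps the buckets near the size the assignment asks for is to count how many strings land in each bucket. The sketch below is an assumption-laden illustration: max_bucket_load is a made-up name, the 100-byte line buffer mirrors MAX_NAME from the question (longer lines would be split), and stb_hash is repeated from above so the block stands alone.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stb_hash as quoted above (from stb.h, public domain). */
static unsigned int stb_hash(char *str)
{
    unsigned int hash = 0;
    while (*str)
        hash = (hash << 7) + (hash >> 25) + *str++;
    return hash + (hash >> 16);
}

/* Hypothetical helper: reads one string per line from fp, counts how many
   strings fall into each of the (mask + 1) buckets, and returns the largest
   count. Returns -1 if the counter array cannot be allocated. */
long max_bucket_load(FILE *fp, unsigned int mask)
{
    assert(fp != NULL);
    char line[100];   /* same limit as MAX_NAME in the question */
    long max = 0;
    long *counts = calloc((size_t)mask + 1, sizeof *counts);
    if (!counts)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
        unsigned int b = stb_hash(line) & mask;
        if (++counts[b] > max)
            max = counts[b];
    }
    free(counts);
    return max;
}
```

Running this over the real data file with mask 0xFFF and comparing the result with the 732 figure from the question would show how far the fullest bucket is from the average.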

寂寞花火° 2024-12-22 16:30:02

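Hash functions can be defined not only for integers but also for characters or strings (hint: character encoding).
Create a hash function for strings.
When you submit, it must be submitted or run together with the output file.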

鞋纸虽美，但不合脚ㄋ〞 2024-12-22 16:30:02

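Note: this answer depends on how strict your assignment text is about using "buckets", since I interpret your question a bit more liberally than your sample code does.

Without a doubt, the best data structure for this task is a Trie (http://en.wikipedia.org/wiki/Trie) or a generalization of it. You can build a tree in which each node contains a "tiny" hash table that stores one atom of the string. For example, an atom of the string could be a single character. You can parameterize your data structure to change the size of the atom (e.g. each node has a fixed array of 16 sub-tries, so that your atoms are 4 bits long). This array approach allows constant-time descent but requires a relatively large amount of memory. But as I said, instead of the fast lookup arrays you can use tiny hash tables (which would fit your assignment better).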
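As a rough sketch of the fixed-array variant with 4-bit atoms, the code below splits every byte into two nibbles and descends one 16-way node per nibble. The names (trie_node, trie_new, trie_insert, trie_contains) are invented for this example, and nodes are never freed to keep it short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 16-way trie node: one child slot per 4-bit atom. */
struct trie_node {
    struct trie_node *child[16];
    int is_end;   /* nonzero if a stored string ends at this node */
};

/* Allocates an empty node; returns NULL on allocation failure. */
struct trie_node *trie_new(void)
{
    struct trie_node *n = malloc(sizeof *n);
    if (n) {
        for (int i = 0; i < 16; i++)
            n->child[i] = NULL;
        n->is_end = 0;
    }
    return n;
}

/* Inserts s, creating nodes as needed; returns 0 on success, -1 on failure. */
int trie_insert(struct trie_node *root, const char *s)
{
    assert(root != NULL);
    struct trie_node *n = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        unsigned int atoms[2] = { c >> 4, c & 0x0Fu };  /* high, low nibble */
        for (int i = 0; i < 2; i++) {
            if (!n->child[atoms[i]]) {
                n->child[atoms[i]] = trie_new();
                if (!n->child[atoms[i]])
                    return -1;
            }
            n = n->child[atoms[i]];
        }
    }
    n->is_end = 1;
    return 0;
}

/* Returns 1 if s was inserted earlier, 0 otherwise. */
int trie_contains(const struct trie_node *root, const char *s)
{
    assert(root != NULL);
    const struct trie_node *n = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        n = n->child[c >> 4];
        if (!n)
            return 0;
        n = n->child[c & 0x0Fu];
        if (!n)
            return 0;
    }
    return n->is_end;
}
```

Each lookup takes two array steps per character regardless of how many strings are stored, which is the constant-time descent mentioned above; the price is 16 pointers per node, so memory use can grow quickly with 3 million strings.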
