Where should I store a list of stop words?
My function parses texts and removes short words, such as "a", "the", "in", "on", "at", etc.
The list of these words might be modified in the future. Switching between different lists (e.g., for different languages) might also be an option.
So, where should I store such a list?
- About 50-200 words
- Many reads every minute
- Almost no writes (modifications) - for example, once in a few months
I have these options in my mind:
- A list inside the code (fastest, but it doesn't sound like good practice)
- A separate file "stop_words.txt" (how fast is reading from a file? Should I re-read the same data from the same file every few seconds, each time I call the function?)
- A database table. Would it really be efficient when the list of words is almost static?
I am using Ruby on Rails (if that makes any difference).
Comments (2)
If it's only about 50-200 words, I'd store it in memory in a data structure that supports fast lookup, such as a hash map (I don't know what such a structure is called in Ruby).
You could use option 2 or 3 (persist the data in a file or database table, depending on what's easier for you), then read the data into memory at the start of your application. Store the time at which the data was read and re-read it from the persistent storage if a request comes in and the data hasn't been updated for X minutes.
That's basically a cache. It might be possible that Ruby on Rails already provides such a mechanism, but I know too little about it to answer that.
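For what it's worth, here is a minimal Ruby sketch of that timed cache. In Ruby the fast-lookup structure would be a `Hash` or a `Set`. The class name `StopWordCache`, the `ttl` parameter, and the one-word-per-line file format are all assumptions made for illustration, not anything from the question.

```ruby
require "set"

# Hypothetical helper: keeps the stop words in memory and re-reads them
# from the file once the cached copy is older than `ttl` seconds.
class StopWordCache
  def initialize(path, ttl: 300)
    @path = path        # assumed format: one word per line
    @ttl = ttl          # maximum age of the cached list, in seconds
    @words = nil
    @loaded_at = nil
    @mutex = Mutex.new  # guards the reload when several threads serve requests
  end

  # True if `word` is in the (possibly just reloaded) stop-word list.
  def stop_word?(word)
    words.include?(word.downcase)
  end

  private

  # Returns the cached Set, reloading it first if it is missing or stale.
  def words
    @mutex.synchronize do
      reload if @words.nil? || monotonic_now - @loaded_at > @ttl
      @words
    end
  end

  def reload
    @words = File.readlines(@path, chomp: true)
                 .map { |w| w.strip.downcase }
                 .reject(&:empty?)
                 .to_set
    @loaded_at = monotonic_now
  end

  # Monotonic clock, so system clock adjustments do not affect the TTL.
  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

You would create one instance at application start and call `stop_word?` from the parsing function. In a Rails app, `Rails.cache.fetch` with an `expires_in` option may cover the same need, though whether it beats a plain in-process `Set` for a list this small is something to measure rather than assume.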
Since look-up of the stop-words needs to be fast, I'd store the stop-words in a hash table. That way, verifying if a word is a stop-word has amortized O(1) complexity.
Now, since the list of stop-words may change, it makes sense to persist the list in a text file, and read that file upon program start (or every few minutes / upon file modification if your program runs continuously).
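A rough Ruby sketch of the "reload upon file modification" variant, using a `Set` as the hash-based structure. The class name `StopWordList`, its methods, and the one-word-per-line file format are hypothetical; the sketch also assumes words are separated by whitespace only and does not handle punctuation.

```ruby
require "set"

# Hypothetical helper: holds the stop words in a Set and re-reads the file
# only when its modification time differs from the one seen at last load.
class StopWordList
  def initialize(path)
    @path = path      # assumed format: one word per line
    @mtime = nil
    @words = Set.new
  end

  # Hash-based membership test after checking the file for changes.
  def include?(word)
    refresh
    @words.include?(word.downcase)
  end

  # Removes stop words from whitespace-separated text.
  def filter(text)
    refresh
    text.split.reject { |word| @words.include?(word.downcase) }.join(" ")
  end

  private

  # Re-reads the file if its mtime changed since the last load.
  def refresh
    mtime = File.mtime(@path)
    return if mtime == @mtime

    @words = File.readlines(@path, chomp: true)
                 .map { |w| w.strip.downcase }
                 .reject(&:empty?)
                 .to_set
    @mtime = mtime
  end
end
```

Each call costs one `File.mtime` check, which is usually cheap compared with re-reading the file. For several languages, one such object per file (say, a `Hash` keyed by locale) would probably be enough.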