Storing tokenized text in a database?
I have a simple question. I'm doing some light crawling, so new content arrives every few days. I've written a tokenizer and would like to use it for some text mining purposes. Specifically, I'm using Mallet's topic modeling tool, and one of the pipes tokenizes the text before further processing can be done. With the amount of text in my database, it takes a substantial amount of time to tokenize the text (I'm using regex here).
As such, is it the norm to store the tokenized text in the DB so that the tokenized data is readily available and tokenization can be skipped if I need it for other text mining purposes such as topic modeling or POS tagging? What are the cons of this approach?
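For illustration, a regex tokenizer of the kind described might look roughly like this (the actual pattern used in the question is not shown, so this one is hypothetical):

import re

# Hypothetical pattern: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Storing tokenized text in the DB?"))
# ['Storing', 'tokenized', 'text', 'in', 'the', 'DB', '?']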
2 Answers
Caching Intermediate Representations
It's pretty normal to cache the intermediate representations created by slower components in your document processing pipeline. For example, if you needed dependency parse trees for all the sentences in each document, it would be pretty crazy to do anything except parsing the documents once and then reusing the results.
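A rough sketch of that idea in Python (the cache file name and the expensive_parse placeholder are hypothetical, standing in for whatever slow step is in your pipeline):

import json
import os

CACHE_PATH = "parse_cache.json"

def expensive_parse(text):
    # Placeholder for the slow step (tokenization, parsing, etc.).
    return text.split()

def load_cache():
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def get_parsed(doc_id, text, cache):
    # Reuse the cached result when available; compute and store it otherwise.
    if doc_id not in cache:
        cache[doc_id] = expensive_parse(text)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[doc_id]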
Slow Tokenization
However, I'm surprised that tokenization is really slow for you, since the stuff downstream from tokenization is usually the real bottleneck.
What package are you using to do the tokenization? If you're using Python and you wrote your own tokenization code, you might want to try one of the tokenizers included in NLTK (e.g., TreebankWordTokenizer).
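For example, a minimal use of NLTK's TreebankWordTokenizer looks like this (assuming NLTK is installed):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']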
Another good tokenizer, albeit one that is not written in Python, is the PTBTokenizer included with the Stanford Parser and the Stanford CoreNLP end-to-end NLP pipeline.
I store tokenized text in a MySQL database. While I don't always like the overhead of communication with the database, I've found that there are lots of processing tasks that I can ask the database to do for me (like search the dependency parse tree for complex syntactic patterns).
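As a rough sketch of this idea (using Python's built-in sqlite3 so the example is self-contained; the answer above uses MySQL, and the table and column names here are made up):

import json
import sqlite3

conn = sqlite3.connect("tokens.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS doc_tokens (
        doc_id TEXT PRIMARY KEY,
        tokens TEXT  -- JSON-encoded list of tokens
    )
""")

def save_tokens(doc_id, tokens):
    # Store the token list once so later passes can skip re-tokenizing.
    conn.execute(
        "INSERT OR REPLACE INTO doc_tokens (doc_id, tokens) VALUES (?, ?)",
        (doc_id, json.dumps(tokens)),
    )
    conn.commit()

def load_tokens(doc_id):
    row = conn.execute(
        "SELECT tokens FROM doc_tokens WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None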