Is there an algorithm to find matches given only the type of match you want, without using regular expressions?
I mean, is there an algorithm to automatically find matches given only the type of match you want. For instance, given "disease", is there a modern algorithm, probably using ML techniques (I am just guessing) or any other techniques, to find all the disease names in a given piece of text?
How do you think this can be done without regexes?
Thanks
1 Answer
Topic-based searching is non-trivial at best, though it's rarely done using regexes (or at least not primarily with regexes, anyway).
For topic based searching, you typically use something that looks/acts (oddly enough) rather similar to a spam filter. In fact, assuming it used a pure Bayesian model, you could probably get a typical spam filter to do a decent job of classifying documents into those (probably) related to a particular topic, and those that (probably) aren't, just by using the right training data (i.e., instead of training it based on spam/non-spam, you train it on, in this case, medical/non-medical).
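To make the spam-filter analogy concrete, here is a minimal sketch of a naive Bayes classifier trained on medical/non-medical documents instead of spam/non-spam. The training sentences and labels are toy data I've made up for illustration; a real system would use thousands of documents and proper tokenization.

```python
from collections import Counter
import math

def train(docs):
    """Count word frequencies per class from (text, label) pairs."""
    counts = {}        # label -> Counter of words
    priors = Counter() # label -> number of training documents
    for text, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts, priors

def classify(text, counts, priors):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing for unseen words."""
    vocab = {w for c in counts.values() for w in c}
    total_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab)
        score = math.log(priors[label] / total_docs)
        for w in text.lower().split():
            score += math.log((c[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data: "medical" vs "other" instead of spam vs non-spam.
training = [
    ("patient diagnosed with diabetes and hypertension", "medical"),
    ("the flu virus spreads through respiratory droplets", "medical"),
    ("doctors treat the disease with antibiotics", "medical"),
    ("the stock market rallied on strong earnings", "other"),
    ("the team won the championship game last night", "other"),
    ("new laptop models were announced at the expo", "other"),
]

counts, priors = train(training)
print(classify("the doctor prescribed antibiotics for the virus", counts, priors))  # medical
print(classify("earnings reports moved the market", counts, priors))                # other
```

Note this only tells you whether a document is *about* diseases; extracting the disease names themselves is a separate (named-entity recognition) problem.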
That really only works for one topic at a time though. You have to train it separately for each topic. If you want to manage multiple topics more or less simultaneously, you probably want to look at something like Latent Semantic Indexing (which is more commonly used for machine learning types of things). This will support (for example) taking a few thousand documents, and separating them into a number of groups, rather than just those related to a specific topic, and everything else.
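The core of Latent Semantic Indexing is a truncated SVD of the term-document matrix, which places documents in a low-dimensional "topic" space where similar documents cluster together. A tiny sketch with made-up documents (the term list and the choice of k=2 topics are assumptions for illustration):

```python
import numpy as np

# Toy corpus: documents 0-1 are about medicine, 2-3 about finance.
terms = ["disease", "patient", "doctor", "stock", "market", "earnings"]
docs = [
    "the doctor examined the patient for signs of disease",
    "the patient asked the doctor about the disease",
    "the stock market fell after weak earnings",
    "earnings growth lifted the stock market",
]

# Term-document matrix: rows = terms, columns = documents.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# Truncated SVD: keep k latent dimensions ("topics");
# each document becomes a point in that k-dimensional space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cos(a, b):
    """Cosine similarity in the latent space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_coords[0], doc_coords[1]))  # near 1: both medical
print(cos(doc_coords[0], doc_coords[2]))  # near 0: different topics
```

Clustering the document vectors (e.g., with k-means) then separates the corpus into groups without naming any topic in advance.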
Depending on the kinds of searches you want to support, there are also automated keyword extraction algorithms, but I won't try to get into this, since it's not clear that you care about it.
Since somebody mentioned using regexes for dealing with different forms of words, and for misspellings, I'll add that normally regexes are not typically used for either of those purposes. There are algorithms (e.g., Porter's stemmer) specifically for removing suffixes to get a (probable) base word. There are others (e.g., Levenshtein distance) that are more often used to deal with spelling errors.
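For a sense of what those two techniques look like, here is a standard dynamic-programming Levenshtein distance, plus a deliberately crude suffix stripper standing in for a real stemmer (Porter's algorithm has many more rules and conditions than this):

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def naive_stem(word):
    """Toy suffix stripper -- NOT Porter's stemmer, just the general idea."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(levenshtein("disease", "desease"))  # 1: one substitution fixes the typo
print(naive_stem("diseases"))             # disease
```

A search engine would typically stem both the query and the indexed text, and use a small edit-distance threshold to suggest corrections for likely misspellings.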