Extracting URLs from text and using a dictionary to translate free-text mentions such as "facebook" to facebook.com

Posted 2024-10-15 16:58:51


I need to extract websites from text survey responses. The algorithm should match broadly. For example, "patients like me" or "patientslikeme" should be recognized as "patientslikeme.com".

I have included responses from the data set below, each followed by the result I expect. I started writing some scripts to do this but realized I am not using a robust design pattern that will accept additional filters and dictionaries. A simple regular expression wasn't working because the match was either too precise or too general to catch a sufficient number of matches. In a perfect world I could also use something like aspell to correct spelling mistakes, or use the Levenshtein algorithm to match words.

Thanks in advance for pointing me in the direction of any data cleansing algorithms, frameworks, or resources.

The whole beauty of "online communities" is that they are, to a large degree, anonymous. However: Accessible Gardening Forum, Davesgarden.com; Patientslikeme.com; and of course FACEBOOK.

$sites = array("davesgarden.com","patientslikeme.com","facebook.com");

Patient Like Me Ms Society Facebook
Thisisms

$sites = array("patientslikeme.com","mssociety.org","facebook.com","thisisms.com");

yaoo webmd.co

$sites = array("yahoo.com","webmd.com");

MS treatment options.com

$sites = array("mstreatmentoptions.com");
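
The Levenshtein idea mentioned in the question can be sketched in Ruby as below. This is only an illustration under stated assumptions: the SITES list, the fuzzy_sites helper, and the max_distance threshold of 1 are hypothetical names and values chosen for the example, not part of the original post.

```ruby
# Edit distance between two strings (classic dynamic-programming Levenshtein).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical dictionary of known sites (an assumption for this sketch).
SITES = %w[yahoo.com webmd.com facebook.com patientslikeme.com]

# Return the known domains whose name (domain minus its TLD) is within
# max_distance edits of at least one word in the text.
def fuzzy_sites(text, max_distance = 1)
  words = text.downcase.scan(/[a-z0-9]+/)
  SITES.select do |site|
    name = site.sub(/\.\w+\z/, "")
    words.any? { |w| levenshtein(w, name) <= max_distance }
  end
end

p fuzzy_sites("yaoo webmd.co")
```

For the sample response "yaoo webmd.co" this selects yahoo.com and webmd.com. The sketch compares single words only, so multi-word mentions such as "patients like me" still need a spacing-tolerant pattern, and a small threshold is advisable to avoid false matches on short words.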


久随 2024-10-22 16:58:51

Here's a Ruby script that turns each domain into a loose, case-insensitive regex.

Feed it a list of domains in this format, named inputfile.txt:

myurl.com
otherurl.com

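Save this in a file called convert.rb:

# Read one domain per line from standard input and print a loose regex for each.
while line = gets
        # Capture everything before the top-level domain, e.g. "myurl.com" -> "myurl";
        # skip blank or malformed lines instead of failing on a nil capture.
        next unless line.strip =~ /(.+)\.\w+$/
        print "/"
        # After each character, allow any run of non-word characters (spaces, punctuation).
        $1.each_char{|c|
                print "#{Regexp.escape(c)}\\W*"
        }
        print "/i"
        puts
end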

Then run this command: cat inputfile.txt | ruby convert.rb > outputfile.txt

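outputfile.txt now contains a list of regexes, one per domain. Take those and try to match each one against your input text; every regex that matches identifies a site mentioned in the response.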
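
The matching step can be sketched in the same language. This is a minimal illustration, not the answerer's code: loose_regex and extract_sites are hypothetical helper names, and the regexes are built in memory rather than read from outputfile.txt.

```ruby
# Build a spacing-tolerant, case-insensitive regex for a domain:
# the TLD is dropped and \W* is allowed between consecutive characters.
def loose_regex(domain)
  name = domain.sub(/\.\w+\z/, "")
  pattern = name.each_char.map { |c| Regexp.escape(c) }.join('\W*')
  Regexp.new(pattern, Regexp::IGNORECASE)
end

# Return every known domain whose loose regex matches the response text.
def extract_sites(text, domains)
  domains.select { |d| text.match?(loose_regex(d)) }
end

domains = %w[davesgarden.com patientslikeme.com facebook.com mstreatmentoptions.com]
p extract_sites("Davesgarden.com; Patientslikeme.com; and of course FACEBOOK.", domains)
p extract_sites("MS treatment options.com", domains)
```

This handles differences in spacing, punctuation, and case. It does not handle misspellings: "Patient Like Me" (without the "s") or "yaoo" will not match, so those cases need an alias dictionary or an edit-distance check on top.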
