How to download Yahoo Groups?
I want to download some Yahoo Groups (files, photos, messages, memberlist) and I've found these scripts:
- http://freshmeat.net/projects/grabyahoogroup/
- http://sourceforge.net/project/showfiles.php?group_id=62034
I've downloaded ActivePerl and the needed modules from CPAN (nothing fancy; they're very easy to find). I've managed to install them, but when I run the script I get an error after it tells me that I've successfully logged in:
"Use of uninitialized value $cells in pattern match (m//) at yahoogroups_files.pl line 244, line 2."
I'm guessing that Yahoo changed the layout of the page or something, but I'm not able to update the script myself. I'm a newbie when it comes to Perl and to understanding the way Yahoo generates its pages; I only know some basic C++. I want to mention that I'm not lazy: I'll try to fix it myself, but I need your help (hints, advice, anything).
PS: I've contacted the author, but he isn't willing to update the scripts.
You would need knowledge in the following fields:
- use of an HTML parser
- HTTP knowledge (GET/POST/HEAD)
- web scraping
I suggest you focus on WWW::Mechanize, since it's capable of all these things (and more).
EDIT: Another solution (one that doesn't need programming) is this: log in to Yahoo Groups with your browser, store the cookie, and then run wget, passing the stored cookie as a parameter. This way you'll get the task accomplished very fast.
Find your browser's cookies.txt file on your hard drive, and then call wget with its --load-cookies option pointing at that file (if I remember the option correctly).
The full wget man page covers the details.
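The same cookie idea can be sketched in Python for anyone who prefers a script over wget. This is only an illustration, assuming the browser's cookies were exported in the Netscape cookies.txt format; the function names and the example URL are made up for this sketch and are not part of grabyahoogroup.

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(cookie_path):
    """Build a urllib opener that sends the cookies stored in a
    Netscape-format cookies.txt file (the format wget also reads)."""
    jar = http.cookiejar.MozillaCookieJar()
    # Keep session cookies and expired entries; browser exports often
    # contain both, and the login cookie may be one of them.
    jar.load(cookie_path, ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar


def fetch(opener, url):
    """Download one page with the logged-in cookies attached."""
    with opener.open(url) as response:
        return response.read()


# Example use (hypothetical file name and URL, not executed here):
#   opener, jar = opener_from_cookie_file("cookies.txt")
#   html = fetch(opener, "http://example.com/some/group/page")
```

Whether the site accepts the exported cookies depends on how its login session works, so treat this as a starting point rather than a guaranteed recipe.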
EDIT2: Another option is to use WebDriver to automate Firefox. You can use this article as a guide on how to accomplish this.
By the filename I'm assuming you're using Yahoo Group archiver found here: http://sourceforge.net/projects/grabyahoogroup/
I ran the files script against the SubEthaEdit group and it works great. All of the files downloaded without incident.
Looking at the code it seems to barf while processing an html table in a while loop if $cells is empty.
Considering the code did work when I tested it, it's possible there's something going on with the listing of that group's files. You'll want to try outputting $content and figure out where and why the regular expression on line 243 isn't able to process that HTML.
EDIT: If you don't mind posting the group this is happening with, I'm sure I or someone else here can try it out and troubleshoot it. It's tough to pinpoint what's up when the issue can't be duplicated. Also, try the same group I did and see if it works for you. If it does, the problem is certainly with the group you're trying.
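To make the debugging advice concrete, here is a small Python sketch of the same idea: save the fetched page so you can inspect it, then skip any table row whose cell match comes back empty instead of matching against an undefined value. The patterns and names are hypothetical stand-ins; they are not the regular expression the script actually uses on line 243.

```python
import re

# Hypothetical patterns standing in for whatever the script matches.
ROW_PATTERN = re.compile(r"<tr[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_PATTERN = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL | re.IGNORECASE)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_rows(content, dump_path=None):
    """Return the text of the <td> cells in each table row.

    Rows without any <td> cell (header rows, empty layout rows) are
    skipped rather than processed, which is the guard the failing
    loop appears to be missing.
    """
    if dump_path is not None:
        # Keep a copy of the raw page so it can be compared with the pattern.
        with open(dump_path, "w", encoding="utf-8") as handle:
            handle.write(content)

    rows = []
    for row_html in ROW_PATTERN.findall(content):
        cells = CELL_PATTERN.findall(row_html)
        if not cells:
            continue  # nothing matched: skip instead of using an empty value
        rows.append([TAG_PATTERN.sub("", cell).strip() for cell in cells])
    return rows
```

If the dumped page shows the file listing in a different structure than the pattern expects, that difference is the place to start fixing the script.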
Dunno if it will help you, but here's what I did to get the message-download working:
(I only used message-download, I didn't look at file-download)
I was tinkering with this a while ago to back up my girlfriend's group messages and files from uni. While debugging the latest scripts, I found that there seems to be a bug in the group_domain declaration (there's also a group declaration bug that I found in yahoo2maildir.pl of the same project, see $request). In this case, I've overwritten the $request var under the function sub download_folder() with
grabyahoogroup works well in the latest edition, which can be found at the svn repo:
http://grabyahoogroup.svn.sourceforge.net/viewvc/grabyahoogroup/trunk/yahoo_group/
The version at sourceforge.net/projects/grabyahoogroup/files/ HAS BUGS AND DID NOT WORK FOR ME.
I've been looking for a tool that collects messages/conversations from Yahoo! Groups. After struggling to make my own and searching everywhere on the internet, I finally found this tool, which converts your Yahoo! Groups messages into MBOX format.
Download tools
Both of the following are Google Chrome extensions.
Plain string to Base64 binary data
Some time after September 16, 2010 (at least for me), the messages retrieved were no longer plain text but Base64-encoded data (ASCII). Using this Swiss converter tool lets you read the data as it was written.
Sample content from the MBOX format
VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4=
Sample result after conversion
The quick brown fox jumps over the lazy dog.
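If you'd rather not depend on a browser extension, the same conversion can be done with Python's standard library. This is a minimal sketch, assuming the message body is plain Base64 of UTF-8 text; the function name is made up for illustration.

```python
import base64


def decode_body(encoded):
    """Decode a Base64-encoded message body into readable text.

    Assumes the decoded bytes are UTF-8; real messages may declare a
    different charset in their headers.
    """
    return base64.b64decode(encoded).decode("utf-8")


sample = "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4="
print(decode_body(sample))
```

Run on the sample above, it prints the same sentence shown in the conversion result.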
for cause, as of 2019/09
https://github.com/csaftoiu/yahoo-groups-backup