Is this a bad way to cache pages for my screen scraper?
I wrote a simple screen-scraper to help me make vocabulary flash cards for my Greek class. It scrapes the words from an online dictionary, and outputs them in a format that my flash card manager can understand.
I don't want to bombard the dictionary with requests every time I run the scraper, so I cache each page to my hard drive the first time I load it (of course, this also makes the scraper much faster). I've never done any caching before, so I'm not sure what the best practices for this sort of thing are. Here is my solution:
    using System;
    using System.IO;
    using System.Net;
    using System.Web;

    public class PerseusDocument
    {
        readonly string url;

        public PerseusDocument (string url)
        {
            this.url = url;
            ... // (load the DOM with a third-party library)
        }

        static string cacheUrl;

        // Cache directory under the local application data folder, created on first use.
        static string CacheUrl {
            get {
                if (cacheUrl == null) {
                    cacheUrl = Path.Combine (Environment.GetFolderPath (Environment.SpecialFolder.LocalApplicationData), "perseus");
                    Directory.CreateDirectory (cacheUrl);
                }
                return cacheUrl;
            }
        }

        // Path of the cache file for this page: the URL-encoded url inside the cache directory.
        string FullCacheUrl {
            get { return Path.Combine (CacheUrl, HttpUtility.UrlEncode (url)); }
        }

        bool IsCached {
            get { return File.Exists (FullCacheUrl); }
        }

        // Returns the cached HTML if present; otherwise downloads the page and caches it.
        string Html {
            get {
                if (IsCached)
                    return File.ReadAllText (FullCacheUrl);
                string html;
                // Dispose the WebClient once the download has finished.
                using (WebClient client = new WebClient ())
                    html = client.DownloadString (url);
                using (StreamWriter file = new StreamWriter (FullCacheUrl))
                    file.Write (html);
                return html;
            }
        }
    }
In other words, I simply check whether a file with the same name as the URL exists in the cache. If it does, I load it; if not, I download the page and save the HTML to a new file. Are there any glaring issues with doing things this way?
Comments (1)
You don't need to build your own cache. All your requests will automatically be routed through WinINetCache. To turn on the cache, simply do:
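A minimal sketch of what turning on the cache could look like, assuming .NET Framework on Windows. The class name CachedDownload and its two methods are hypothetical and only there for illustration; WebClient.CachePolicy, RequestCachePolicy and RequestCacheLevel.Default are the framework's own types and members.

```csharp
using System;
using System.Net;
using System.Net.Cache;

public static class CachedDownload
{
    // Hypothetical helper: builds a WebClient whose requests may be
    // satisfied from the cache, depending on the content's age and the
    // caching headers the server sent.
    public static WebClient CreateCachingClient ()
    {
        WebClient client = new WebClient ();
        client.CachePolicy = new RequestCachePolicy (RequestCacheLevel.Default);
        return client;
    }

    // Hypothetical helper: downloads a page through the caching client
    // and disposes the client afterwards.
    public static string Download (string url)
    {
        using (WebClient client = CreateCachingClient ())
            return client.DownloadString (url);
    }
}
```

With something like this in place, the Html property in the question could call CachedDownload.Download (url) instead of managing its own files. As far as I know, newer .NET runtimes accept the CachePolicy property but do not perform this caching, so the sketch is only expected to help on .NET Framework.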
As long as the server has marked its pages as cacheable, caching will happen automatically.