Is this a bad way to cache pages for my screen scraper?

Posted 2024-10-14 18:24:18

I wrote a simple screen-scraper to help me make vocabulary flash cards for my Greek class. It scrapes the words from an online dictionary, and outputs them in a format that my flash card manager can understand.

I don't want to bombard the dictionary with requests every time I run the scraper, so I cache each page to my hard drive the first time I load it (of course, this also makes it much faster). I've never done any caching before, so I'm not sure what the best practices for this sort of thing are. Here is my solution:

using System;
using System.IO;
using System.Net;
using System.Web;

public class PerseusDocument
{
    readonly string url;

    public PerseusDocument (string url)
    {
        this.url = url;

        ... // (load the DOM with a third-party library)
    }

    static string cacheUrl;
    static string CacheUrl {
        get {
            if (cacheUrl == null) {
                cacheUrl = Path.Combine (Environment.GetFolderPath (Environment.SpecialFolder.LocalApplicationData), "perseus");
                Directory.CreateDirectory (cacheUrl);
            }

            return cacheUrl;
        }
    }

    string FullCacheUrl {
        get { return Path.Combine (CacheUrl, HttpUtility.UrlEncode (url)); }
    }

    bool IsCached {
        get { return File.Exists (FullCacheUrl); }
    }

    string Html {
        get {
            if (IsCached)
                return File.ReadAllText (FullCacheUrl);

            WebClient client = new WebClient ();
            string html = client.DownloadString (url);

            using (StreamWriter file = new StreamWriter (FullCacheUrl))
                file.Write (html);

            return html;
        }
    }
}

In other words, I simply check whether a file with the same name as the URL exists in the cache. If so, I load it; if not, I download the page and save the HTML to a new file. Are there any glaring issues with doing things this way?
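
The check-then-load-or-download logic described above can also be written as a small standalone helper, which makes it easy to try without touching the network. This is only a minimal sketch: the names `PageCache`, `PathFor`, and `GetOrAdd` are hypothetical, the download step is passed in as a delegate, and `Uri.EscapeDataString` stands in for `HttpUtility.UrlEncode` so that the sketch needs no reference to `System.Web`.

```csharp
using System;
using System.IO;

public static class PageCache
{
    // Maps a URL to a file path inside cacheDir by percent-encoding the URL.
    // (Assumption: Uri.EscapeDataString replaces HttpUtility.UrlEncode here.)
    public static string PathFor (string cacheDir, string url)
    {
        return Path.Combine (cacheDir, Uri.EscapeDataString (url));
    }

    // Returns the cached text for url if a cache file exists; otherwise
    // calls fetch, writes the result to the cache file, and returns it.
    public static string GetOrAdd (string cacheDir, string url, Func<string, string> fetch)
    {
        Directory.CreateDirectory (cacheDir);
        string path = PathFor (cacheDir, url);

        if (File.Exists (path))
            return File.ReadAllText (path);

        string html = fetch (url);
        File.WriteAllText (path, html);
        return html;
    }
}
```

With a real downloader, the delegate could be something like `u => new WebClient ().DownloadString (u)`. Very long URLs may still exceed the file system's file-name limit under this scheme.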

Comments (1)

最近可好 2024-10-21 18:24:18

You don't need to build your own cache. WebClient requests can be routed through the WinINet cache for you. To turn the cache on, simply set:

webClient.CachePolicy = new RequestCachePolicy(RequestCacheLevel.Default);

As long as the server marks its pages as cacheable, caching will happen automatically.
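
For context, here is a slightly fuller sketch of how that one-liner might sit in a program. It assumes the .NET Framework on Windows, where the WinINet cache mentioned above is available; the names `CachedFetch`, `CreateCachingClient`, and `Download` are made up for illustration. Note that `RequestCachePolicy` and `RequestCacheLevel` live in the `System.Net.Cache` namespace.

```csharp
using System;
using System.Net;
using System.Net.Cache;

public static class CachedFetch
{
    // Creates a WebClient whose requests use the default HTTP cache policy,
    // so a fresh cached copy is used when one is available.
    public static WebClient CreateCachingClient ()
    {
        WebClient client = new WebClient ();
        client.CachePolicy = new RequestCachePolicy (RequestCacheLevel.Default);
        return client;
    }

    // Downloads a page as text and disposes the client afterwards.
    public static string Download (string url)
    {
        using (WebClient client = CreateCachingClient ())
            return client.DownloadString (url);
    }
}
```

Whether a given page is actually served from the cache still depends on the caching headers the server sends, so it may be worth checking a few responses before relying on this.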
