Find whether a list of strings contains the same element more than once

Posted 2025-01-02 04:29:35 · 1124 characters · 5 views · 0 comments

I am writing my own web crawler for product-selling websites. Because these sites are coded very poorly, I keep getting different URLs that point to the same page.

Example one

http://www.hizlial.com/bilgisayar/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm

For example, the page above is the same as the one below

http://www.hizlial.com/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm

As you can see, the first URL contains the "bilgisayar" element twice when you split it on the '/' character

So what I want is to split URLs like this

 string[] lstSPlit = srURL.Split('/');

After that, I want to check whether the resulting list contains the same element more than once. Any element. If it does, I will skip the URL, because I will already have extracted the real URL from some other page. So what is the best way of doing this?

Longer but working version

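string[] lstSPlit = srHref.Split('/');
bool blDoNotAdd = false;
HashSet<string> splitHashSet=new HashSet<string>();
foreach (var vrLstValue in lstSPlit)
{
    // Ignore empty and single-character segments (e.g. the "" between the two slashes in "http://")
    if (vrLstValue.Length > 1)
    {
        if (splitHashSet.Contains(vrLstValue) == false)
        {
            splitHashSet.Add(vrLstValue);
        }
        else
        {
            // Segment seen before: mark the URL as a duplicate and stop scanning
            blDoNotAdd = true;
            break;
        }
    }
}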

Comments (2)

放手` 2025-01-09 04:29:35
if (list.Distinct().Count() < list.Count)

This ought to be faster than grouping (I haven't measured).

You can make it even faster by writing your own extension method that adds items to a HashSet<T> and returns false immediately if Add() returns false.

You can even do that using a wicked shorthand:

if (!list.All(new HashSet<string>().Add))
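The extension method described above could be sketched roughly as follows. The class name `DuplicateExtensions` and the method name `HasDuplicates` are made up for illustration and do not come from the answer; the only library behaviour relied on is that `HashSet<T>.Add` returns false when the item is already in the set.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateExtensions
{
    // Hypothetical helper (name is an assumption, not from the answer):
    // returns true as soon as any element is seen for the second time.
    public static bool HasDuplicates<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
        {
            // HashSet<T>.Add returns false if the item was already present,
            // so the loop can stop at the first repeated element.
            if (!seen.Add(item))
            {
                return true;
            }
        }
        return false;
    }
}
```

With this in scope, the check in the question would read `if (lstSPlit.HasDuplicates()) { /* skip this URL */ }`. Unlike `Distinct().Count()`, it does not have to walk the whole list once a repeat has been found.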
私藏温柔 2025-01-09 04:29:35
if(lstSPlit.GroupBy(i => i).Where(g => g.Count() > 1).Any())
{
    // found more than once
}
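As a rough sketch of how the grouping approach could be applied to the URLs from the question: the helper below returns the segments that occur more than once. The names `UrlSegmentCheck` and `RepeatedSegments` are hypothetical. It also assumes that empty segments should be dropped, since splitting "http://..." on '/' yields an empty string between the two slashes and a trailing slash would add another one; the question's own loop handles this with its `Length > 1` test instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlSegmentCheck
{
    // Hypothetical helper: lists every '/'-separated segment that appears
    // more than once in the given URL.
    public static List<string> RepeatedSegments(string url)
    {
        return url
            // Assumption: empty segments are ignored so that "//" or a
            // trailing '/' cannot count as a repeated element.
            .Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(segment => segment)
            .Where(group => group.Count() > 1)
            .Select(group => group.Key)
            .ToList();
    }
}
```

For the first example URL in the question this returns a list containing only "bilgisayar", and for the second it returns an empty list, so `RepeatedSegments(srHref).Count > 0` could serve as the skip condition.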