Check whether a list of strings contains the same element more than once

Posted on 2025-01-02 04:29:35

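I am writing my own specialized web crawler for product-selling websites. Because these sites are very poorly coded, I keep getting different URLs that point to the same page.

Example one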

http://www.hizlial.com/bilgisayar/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm

For example, the page above is the same as the one below:

http://www.hizlial.com/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm

As you can see, the first URL contains two "bilgisayar" elements when you split it on the '/' character.

So what I want is to split the URLs like this:

 string[] lstSPlit = srURL.Split('/');

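After that, I want to check whether the resulting list contains the same element more than once. Any element. If it does, I will skip the URL, because I will already have extracted the real URL from some other page. What is the best way to do this?

Longer but working version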

string[] lstSPlit = srHref.Split('/');
bool blDoNotAdd = false;
HashSet<string> splitHashSet = new HashSet<string>();
foreach (var vrLstValue in lstSPlit)
{
    // Ignore empty and single-character segments, such as the empty one after "http:".
    if (vrLstValue.Length > 1)
    {
        // HashSet<T>.Add returns false when the segment has already been seen.
        if (!splitHashSet.Add(vrLstValue))
        {
            blDoNotAdd = true;
            break;
        }
    }
}



放手` 2025-01-09 04:29:35

if (list.Distinct().Count() < list.Count)

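This ought to be faster than grouping. (I haven't measured.)

You can make it even faster by writing your own extension method that adds items to a HashSet<T> and returns false immediately if Add() returns false.

You can even do that using a wicked shorthand: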

if (!list.All(new HashSet<string>().Add))
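
The extension method described above could look roughly like the following. This is only a sketch: the class name `DuplicateExtensions` and the method name `HasDuplicates` are made up for illustration, and it returns true when a duplicate is found (the inverse of the "returns false" wording above), which reads more naturally at the call site.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateExtensions
{
    // Hypothetical helper: returns true as soon as any element is seen a second time.
    public static bool HasDuplicates<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
        {
            // HashSet<T>.Add returns false when the item is already in the set,
            // so enumeration stops at the first repeated element.
            if (!seen.Add(item))
                return true;
        }
        return false;
    }
}
```

With this in place the check becomes `if (srHref.Split('/').HasDuplicates())`. Unlike the loop in the question, this sketch does not skip empty or single-character segments, so a URL that ends in '/' would be reported as containing a duplicate (two empty segments).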


私藏温柔 2025-01-09 04:29:35

if(lstSPlit.GroupBy(i => i).Where(g => g.Count() > 1).Any())
{
    // found more than once
}
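
If it is also useful to know which segments repeat, the same grouping idea can return them. The sketch below is an illustration rather than part of the answer: `SegmentReport` and `RepeatedSegments` are hypothetical names, and it drops empty segments with `StringSplitOptions.RemoveEmptyEntries`, which is slightly different from the `Length > 1` filter used in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SegmentReport
{
    // Hypothetical helper: returns the '/'-separated segments that occur more than once.
    public static List<string> RepeatedSegments(string url)
    {
        return url
            // Drop empty segments, e.g. the one between the two slashes in "http://".
            .Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(segment => segment)
            .Where(group => group.Count() > 1)
            .Select(group => group.Key)
            .ToList();
    }
}
```

For the first example URL in the question this returns a list containing only "bilgisayar"; for the second URL it returns an empty list.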

