Reading a large file is very slow, please help

Posted on 2024-12-01 15:35:41

This code takes about 30 minutes to run and uses a lot of CPU. What is the problem?

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

Is the slowness caused by the comparison, by reading the file, or by the regex?

EDIT

Profiling the code like this:

        Dim myTimer As New System.Diagnostics.Stopwatch()
        Dim t1 As Integer = 0
        Dim t2 As Integer = 0
        Dim t3 As Integer = 0
        'read the file line by line, collecting valid proxy
        Do
            'Read a line from the file
            myTimer.Reset()
            myTimer.Start()
            strLine = objReader.ReadLine()
            If strLine Is Nothing Then
                Exit Do
            End If
            myTimer.Stop()
            t1 = myTimer.Elapsed.Milliseconds
            'check valid proxy
            myTimer.Reset()
            myTimer.Start()
            m = Regex.Match(strLine.Trim, strProxyParttern)
            strMatch = m.Value.Trim
            If String.IsNullOrEmpty(strMatch) = True OrElse _
                strMatch.Contains("..") = True Then
                Continue Do
            End If
            myTimer.Stop()
            t2 = myTimer.Elapsed.Milliseconds
            ' create proxy
            myTimer.Reset()
            myTimer.Start()
            tmpProxy.IP = strMatch.Substring(0, strMatch.IndexOf(":"))
            tmpProxy.Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
            tmpProxy.Status = "new"

            ' check 
            If lstProxys.Contains(tmpProxy) = True Then
                Continue Do
            End If
            lstProxys.Add(tmpProxy)
            myTimer.Stop()
            t3 = myTimer.Elapsed.Milliseconds
            Debug.Print(String.Format("Read={0}, Match={1}, Add={2}", t1, t2, t3))
        Loop Until strLine Is Nothing

gave these results:

Read=0, Match=0, Add=1
Read=0, Match=0, Add=1
Read=0, Match=0, Add=2
...
Read=0, Match=0, Add=9
Read=0, Match=0, Add=9
Read=0, Match=0, Add=10
...
...
Read=0, Match=0, Add=39
Read=0, Match=0, Add=39
Read=0, Match=0, Add=40
etc

So it looks like the code is fine, except for adding to the list.

梦在夏天 2024-12-08 15:35:41

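The speed issue is because you are using a List(Of Structure). The List.Contains method is a linear search (it goes through each item of the list to see whether it matches), so it takes increasingly longer the more unique items you add to the list.

Because you are dealing with a large number of items, change lstProxys into a HashSet(Of T). You should see a huge performance boost. All you should need to do is change the definition of lstProxys: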

Dim lstProxys As New HashSet(Of structure) ' "structure" stands for the name of your proxy structure type
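
To make the suggestion concrete, here is a minimal sketch of de-duplicating with a HashSet(Of T). The structure name ProxyEntry, its equality rule (IP and port only, ignoring Status), and the sample lines are assumptions made for illustration; the question does not show the real type. Implementing IEquatable(Of T) and GetHashCode explicitly is a design choice here, not a requirement, so that lookups do not depend on the default structure comparison.

```vbnet
Imports System
Imports System.Collections.Generic

Module HashSetSketch
    ' Hypothetical stand-in for the asker's proxy structure.
    Structure ProxyEntry
        Implements IEquatable(Of ProxyEntry)

        Public IP As String
        Public Port As Integer
        Public Status As String

        ' Assumption: two entries are the same proxy when IP and port match.
        Public Overloads Function Equals(other As ProxyEntry) As Boolean _
            Implements IEquatable(Of ProxyEntry).Equals
            Return String.Equals(IP, other.IP) AndAlso Port = other.Port
        End Function

        Public Overrides Function Equals(obj As Object) As Boolean
            Return TypeOf obj Is ProxyEntry AndAlso Equals(DirectCast(obj, ProxyEntry))
        End Function

        ' Must agree with Equals: only IP and Port take part in the hash.
        Public Overrides Function GetHashCode() As Integer
            Dim ipHash As Integer = If(IP Is Nothing, 0, IP.GetHashCode())
            Return ipHash Xor Port
        End Function
    End Structure

    Sub Main()
        Dim proxies As New HashSet(Of ProxyEntry)()
        ' Made-up sample lines; the third repeats the first.
        Dim lines As String() = {"10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.1:8080"}

        For Each line As String In lines
            Dim sep As Integer = line.IndexOf(":"c)
            Dim entry As ProxyEntry
            entry.IP = line.Substring(0, sep)
            entry.Port = CInt(line.Substring(sep + 1))
            entry.Status = "new"
            ' Add returns False when an equal entry is already present,
            ' so no separate Contains call is needed.
            proxies.Add(entry)
        Next

        Console.WriteLine(proxies.Count) ' prints 2
    End Sub
End Module
```

With a hash set, each membership check takes roughly constant time on average instead of scanning every stored item, so the total work grows about linearly with the number of lines rather than quadratically.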


薄凉少年不暖心 2024-12-08 15:35:41

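Disk I/O is usually the limiting factor for this kind of problem. Depending on the disk speed, you can expect a throughput of roughly 5-20 megabytes per second.

Regular expressions can be slow if they contain expressions that cause a lot of backtracking, so that is a possibility, but the pattern would have to be pretty bad to be slower than the disk I/O.

As there is never more than one item in the proxy list, the comparison should not be a problem. You are not creating any new proxy objects; you are reusing the same one, which means you are changing the properties of the object that is already in the list. Because you are comparing the object to itself, the list will always contain the object after the first iteration, and it is never added a second time.

Does the proxy class do anything when you assign values to its properties? If it does something like creating a connection, that could take a long time.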
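
Whether the reuse described above actually happens depends on whether the proxy type is a Class or a Structure, which the question does not show. The following sketch contrasts the two cases; ProxyClass, ProxyStruct, and the sample values are hypothetical names made up for the illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module ReuseSketch
    ' Hypothetical reference-type proxy.
    Class ProxyClass
        Public IP As String
        Public Port As Integer
    End Class

    ' Hypothetical value-type proxy.
    Structure ProxyStruct
        Public IP As String
        Public Port As Integer
    End Structure

    Sub Main()
        Dim ports As Integer() = {8080, 3128, 1080}

        ' Class: the list stores a reference to the one reused object, so
        ' Contains is True from the second iteration on and nothing is added.
        Dim classList As New List(Of ProxyClass)()
        Dim reusedObject As New ProxyClass()
        For Each p As Integer In ports
            reusedObject.IP = "10.0.0.1"
            reusedObject.Port = p
            If Not classList.Contains(reusedObject) Then classList.Add(reusedObject)
        Next
        Console.WriteLine(classList.Count) ' prints 1

        ' Structure: Add stores a copy, so later changes to the variable do not
        ' touch stored items, and Contains compares the field values.
        Dim structList As New List(Of ProxyStruct)()
        Dim reusedValue As ProxyStruct
        For Each p As Integer In ports
            reusedValue.IP = "10.0.0.1"
            reusedValue.Port = p
            If Not structList.Contains(reusedValue) Then structList.Add(reusedValue)
        Next
        Console.WriteLine(structList.Count) ' prints 3
    End Sub
End Module
```

If the list count printed by the original loop keeps growing, the proxy type is behaving like the structure case, and the cost of each Contains call grows with the list.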

梨涡少年 2024-12-08 15:35:41

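Is the slowness caused by the comparison, by reading the file, or by the regex?

We could make educated guesses, but why not measure it instead?

For example, run the following three tests separately, in release mode and without the debugger attached, and see how long each one takes: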

'Test 1 Just IO

Do
  strLine = objReader.ReadLine()

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 2 IO + Regex

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 3 IO + regex and Compare
Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If
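
One possible way to time each test as a whole is to wrap it in a Stopwatch and read ElapsedMilliseconds once at the end, rather than timing individual iterations. The sketch below is only an illustration: TimeIt, ReadOnlyTest, and ReadAndMatchTest are hypothetical helpers, the input is synthetic in-memory text rather than the real file, and the regex is an assumed stand-in for strProxyParttern, which the question does not show.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module TimingSketch
    ' Runs one test body and returns the total elapsed milliseconds.
    Function TimeIt(label As String, body As Action) As Long
        Dim sw As Stopwatch = Stopwatch.StartNew()
        body()
        sw.Stop()
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds)
        Return sw.ElapsedMilliseconds
    End Function

    ' Test 1: read every line and do nothing else.
    Sub ReadOnlyTest(data As String)
        Using reader As New StringReader(data)
            While reader.ReadLine() IsNot Nothing
            End While
        End Using
    End Sub

    ' Test 2: read every line and run the regex on it.
    Sub ReadAndMatchTest(data As String, pattern As Regex)
        Using reader As New StringReader(data)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                Dim m As Match = pattern.Match(line.Trim())
                line = reader.ReadLine()
            End While
        End Using
    End Sub

    Sub Main()
        ' Synthetic input so the sketch runs without a file on disk.
        Dim sb As New StringBuilder()
        For i As Integer = 0 To 99999
            sb.AppendLine(String.Format("10.0.{0}.{1}:8080", (i \ 256) Mod 256, i Mod 256))
        Next
        Dim data As String = sb.ToString()

        ' Assumed pattern, created once and reused for every line.
        Dim pattern As New Regex("\d{1,3}(\.\d{1,3}){3}:\d+")

        TimeIt("Test 1, read only", Sub() ReadOnlyTest(data))
        TimeIt("Test 2, read + regex", Sub() ReadAndMatchTest(data, pattern))
    End Sub
End Module
```

Comparing the totals shows which stage dominates: the difference between test 2 and test 1 approximates the regex cost, and the same wrapper can be applied to the third test to isolate the list comparison.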
