Reading a large file is very slow, please help

Posted 2024-12-01 15:35:41 · 2721 characters · 0 views · 0 comments

This code takes about 30 minutes to run and uses a lot of CPU. What is the problem?

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

Is the slowness coming from the comparison, from reading the file, or from the regex?

EDIT

Profiling the code like this:

        Dim myTimer As New System.Diagnostics.Stopwatch()
        Dim t1 As Integer = 0
        Dim t2 As Integer = 0
        Dim t3 As Integer = 0
        'read the file line by line, collecting valid proxy
        Do
            'Read a line from the file
            myTimer.Reset()
            myTimer.Start()
            strLine = objReader.ReadLine()
            If strLine Is Nothing Then
                Exit Do
            End If
            myTimer.Stop()
            t1 = myTimer.Elapsed.Milliseconds
            'check valid proxy
            myTimer.Reset()
            myTimer.Start()
            m = Regex.Match(strLine.Trim, strProxyParttern)
            strMatch = m.Value.Trim
            If String.IsNullOrEmpty(strMatch) = True OrElse _
                strMatch.Contains("..") = True Then
                Continue Do
            End If
            myTimer.Stop()
            t2 = myTimer.Elapsed.Milliseconds
            ' create proxy
            myTimer.Reset()
            myTimer.Start()
            tmpProxy.IP = strMatch.Substring(0, strMatch.IndexOf(":"))
            tmpProxy.Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
            tmpProxy.Status = "new"

            ' check 
            If lstProxys.Contains(tmpProxy) = True Then
                Continue Do
            End If
            lstProxys.Add(tmpProxy)
            myTimer.Stop()
            t3 = myTimer.Elapsed.Milliseconds
            Debug.Print(String.Format("Read={0}, Match={1}, Add={2}", t1, t2, t3))
        Loop Until strLine Is Nothing

gave these results:

Read=0, Match=0, Add=1
Read=0, Match=0, Add=1
Read=0, Match=0, Add=2
...
Read=0, Match=0, Add=9
Read=0, Match=0, Add=9
Read=0, Match=0, Add=10
...
...
Read=0, Match=0, Add=39
Read=0, Match=0, Add=39
Read=0, Match=0, Add=40
etc

It looks like the code is OK, except for the add to the list.

Comments (3)

梦在夏天 2024-12-08 15:35:41

The speed issue is because you are using a List(Of Structure). The List.Contains method is a linear search (it goes through each item of the list to see if it matches) so it takes increasingly longer the more unique items you add to the list.

Because you're dealing with a large number of items, change lstProxys into a HashSet(Of T). You should see a huge performance boost. All you should need to do is change the definition of lstProxys:

Dim lstProxys as New HashSet(Of structure)
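
As a rough illustration of the change suggested above, the sketch below swaps the List.Contains check for HashSet(Of T).Add, which returns False when the item is already present. The Proxy structure here is a hypothetical stand-in for the one in the question, and treating two proxies as duplicates when IP and Port match (ignoring Status) is an assumption, not something the question states.

```vb
Imports System
Imports System.Collections.Generic

Module ProxyDedupSketch
    ' Hypothetical stand-in for the question's proxy structure.
    Structure Proxy
        Implements IEquatable(Of Proxy)

        Public IP As String
        Public Port As Integer
        Public Status As String

        ' Assumption: two proxies are duplicates when IP and Port match.
        Public Overloads Function Equals(other As Proxy) As Boolean _
            Implements IEquatable(Of Proxy).Equals
            Return String.Equals(IP, other.IP) AndAlso Port = other.Port
        End Function

        Public Overrides Function Equals(obj As Object) As Boolean
            Return TypeOf obj Is Proxy AndAlso Equals(DirectCast(obj, Proxy))
        End Function

        ' Must agree with Equals: equal proxies produce equal hash codes.
        Public Overrides Function GetHashCode() As Integer
            Dim ipHash As Integer = If(IP Is Nothing, 0, IP.GetHashCode())
            Return ipHash Xor Port
        End Function
    End Structure

    ' Collects unique proxies from "ip:port" strings.
    Function CollectUnique(entries As IEnumerable(Of String)) As HashSet(Of Proxy)
        Dim unique As New HashSet(Of Proxy)()
        For Each entry As String In entries
            Dim colon As Integer = entry.IndexOf(":"c)
            If colon < 0 Then Continue For
            Dim p As New Proxy With {
                .IP = entry.Substring(0, colon),
                .Port = CInt(entry.Substring(colon + 1)),
                .Status = "new"
            }
            ' Add does the lookup and the insert in one hashed operation;
            ' it returns False (and changes nothing) for a duplicate.
            unique.Add(p)
        Next
        Return unique
    End Function

    Sub Main()
        Dim sample As String() = {"1.2.3.4:80", "1.2.3.4:80", "5.6.7.8:8080"}
        Console.WriteLine(CollectUnique(sample).Count)
    End Sub
End Module
```

Implementing IEquatable(Of Proxy) and overriding GetHashCode is optional for correctness, since a structure gets field-by-field equality by default, but it lets the set hash and compare items without falling back on the slower default ValueType implementation.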

薄凉少年不暖心 2024-12-08 15:35:41

Disk I/O is usually the limiting factor for something like this. Depending on the disk speed, you could expect a throughput of about 5-20 megabytes per second.

Regular expressions can be slow if they contain expressions that cause a lot of backtracking, so that is a possibility, but the pattern would have to be pretty bad for it to be noticeable compared to the disk I/O.

As there will never be more than one item in the proxy list, that comparison can't be the problem. You are not creating any new proxy object but reusing the same one, which means that you change the properties of the object that you have already put in the list. As you are comparing the object with itself, the list will always contain the object after the first iteration, and it will never be added a second time.

Does the proxy class do anything when you assign values to its properties? If it does something like creating a connection, that might be what's taking so long.
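
Whether the reuse argument above holds depends on whether the proxy type is a Class or a Structure, which the question does not show. The sketch below, using two hypothetical types, contrasts the cases: a reused Class instance is found by List.Contains on every pass after the first, while a Structure is copied into the list on each Add, so the list keeps growing.

```vb
Imports System
Imports System.Collections.Generic

Module ReuseSketch
    ' Hypothetical reference-type proxy (no Equals override).
    Class ProxyClass
        Public IP As String
        Public Port As Integer
    End Class

    ' Hypothetical value-type proxy.
    Structure ProxyStruct
        Public IP As String
        Public Port As Integer
    End Structure

    ' Reuses one Class instance: Contains compares references, so after
    ' the first Add it always finds that same instance.
    Function CountWithClass(addresses As String()) As Integer
        Dim list As New List(Of ProxyClass)()
        Dim tmp As New ProxyClass()
        For Each address As String In addresses
            tmp.IP = address
            tmp.Port = 8080
            If Not list.Contains(tmp) Then list.Add(tmp)
        Next
        Return list.Count
    End Function

    ' Reuses one Structure variable: Add stores a copy, and Contains
    ' compares field values, so each distinct address is added.
    Function CountWithStruct(addresses As String()) As Integer
        Dim list As New List(Of ProxyStruct)()
        Dim tmp As ProxyStruct
        For Each address As String In addresses
            tmp.IP = address
            tmp.Port = 8080
            If Not list.Contains(tmp) Then list.Add(tmp)
        Next
        Return list.Count
    End Function

    Sub Main()
        Dim sample As String() = {"1.1.1.1", "2.2.2.2", "3.3.3.3"}
        Console.WriteLine(CountWithClass(sample))
        Console.WriteLine(CountWithStruct(sample))
    End Sub
End Module
```

The growing Add times reported in the question's edit are what one would expect from the Structure case, where the list really does gain an item per unique line.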

梨涡少年 2024-12-08 15:35:41

Is the slowness coming from the comparison, from reading the file, or from the regex?

We could make educated guesses, but why not measure it instead?

For example, run the following three tests separately in release mode, without the debugger attached, and see how long each one takes:

'Test 1 Just IO

Do
  strLine = objReader.ReadLine()

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 2 IO + Regex

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 3 IO + regex and Compare
Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If
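
One possible way to time each of the tests above is a small Stopwatch wrapper. This is only a sketch: MeasureMilliseconds is a hypothetical helper name, and each test is assumed to be wrapped in its own Sub so it can be passed in as a delegate.

```vb
Imports System
Imports System.Diagnostics
Imports System.Threading

Module TimingSketch
    ' Runs the supplied work once and returns the total elapsed
    ' wall-clock time in milliseconds.
    Function MeasureMilliseconds(work As Action) As Long
        Dim sw As Stopwatch = Stopwatch.StartNew()
        work()
        sw.Stop()
        ' ElapsedMilliseconds is the total elapsed time, unlike
        ' Elapsed.Milliseconds, which is only the 0-999 component.
        Return sw.ElapsedMilliseconds
    End Function

    Sub Main()
        ' Stand-in workload; replace with a Sub that runs one of the tests.
        Dim elapsed As Long = MeasureMilliseconds(Sub() Thread.Sleep(50))
        Console.WriteLine("Elapsed ms: " & elapsed.ToString())
    End Sub
End Module
```

Timing each whole test once, rather than each loop iteration, avoids sub-millisecond readings that round down to zero.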