How to extract text from a text string

Posted on 2024-11-30 10:53:29, 408 words, 1 view, 0 comments

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract each string of digits that begins with "gi|" and ends with "|". For some rows this might mean as many as 5 gi numbers; for others it will be just one.

I'd like the output to look something like this:

297848936,297338191

168010496

etc...

Comments (5)

梦年海沫深 2024-12-07 10:53:29

Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

You would call it like this, assuming that first string is in A1:

=RegexExtract(A1,"gi[|](\d+)[|]")

Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question it would give you this result:

297848936, 297338191

Just run this down the column and you're all done!

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

    Dim allMatches As Object
    Dim RE As Object
    Dim i As Long, j As Long
    Dim result As String

    ' Late-bound VBScript regex object, so no library reference is needed
    Set RE = CreateObject("vbscript.regexp")
    RE.Pattern = extract_what
    RE.Global = True            ' find every match, not just the first
    Set allMatches = RE.Execute(text)

    ' Collect every sub-match (parenthesised group) of every match
    For i = 0 To allMatches.Count - 1
        For j = 0 To allMatches.Item(i).SubMatches.Count - 1
            result = result & (separator & allMatches.Item(i).SubMatches.Item(j))
        Next
    Next

    ' Strip the separator that was prepended to the first item
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If

    RegexExtract = result

End Function
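
To sanity-check the pattern outside Excel, the same idea can be sketched in Python. This is only an illustration, assuming the rows have been exported as plain text. The name `regex_extract` is made up to mirror the VBA function above, and Python's `re` module stands in for the VBScript regex object, so small dialect differences are possible.

```python
import re


def regex_extract(text: str, pattern: str, separator: str = ", ") -> str:
    """Join every captured group of every match (hypothetical helper)."""
    pieces = []
    for match in re.finditer(pattern, text):
        # groups() holds the sub-matches (the parts inside parentheses);
        # a group that did not take part in the match is None and is skipped.
        pieces.extend(g for g in match.groups() if g is not None)
    return separator.join(pieces)


row = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"
print(regex_extract(row, r"gi[|](\d+)[|]"))  # 297848936, 297338191
```

Passing "," as the third argument gives the exact comma-only output asked for in the question.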
未蓝澄海的烟 2024-12-07 10:53:29

Here it is (assuming data is in column A)

=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))

Not the nicest formula, but it extracts the first number.

I just noticed that you can have several values per row, with the output separated by commas. You will need to check whether there is a second match, a third match and so on to make it work for multiple numbers per cell.

For your exact sample (assuming at most 2 values per cell), the following formula works:

=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ), 
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) 
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2)) 
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))

How's that for ugly? A VBA solution may be better for you, but I'll leave this here anyway.

To go up to 5 numbers, study the pattern and repeat the nesting by hand in the formula. It will get long!
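
The find-then-slice logic of these formulas, repeated for every occurrence rather than a fixed maximum, can be sketched in Python as follows. This is a rough illustration, not a replacement for the worksheet formula. The function name `gi_numbers_by_find` is invented here, and the all-digits check is an addition that the formulas above do not perform.

```python
def gi_numbers_by_find(cell: str, separator: str = ",") -> str:
    """Collect the text between each "gi|" and the next "|" using only find()."""
    found = []
    start = cell.find("gi|")
    while start != -1:
        begin = start + len("gi|")       # first character after "gi|"
        end = cell.find("|", begin)      # position of the closing pipe
        if end == -1:
            break                        # no closing "|", so nothing more to extract
        candidate = cell[begin:end]
        if candidate.isdigit():          # assumption: keep all-digit values only
            found.append(candidate)
        start = cell.find("gi|", begin)  # move on to the next occurrence
    return separator.join(found)


row = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"
print(gi_numbers_by_find(row))  # 297848936,297338191
```

The loop plays the role of the manual nesting: each pass corresponds to one more FIND with a later start position.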

走过海棠暮 2024-12-07 10:53:29

I'd probably split the data on the | delimiter first, using the Convert Text to Columns wizard.
In Excel 2007 it is on the Data tab, in the Data Tools group: choose Text to Columns, then specify Other: with | as the delimiter.

From the sample data you posted, it looks like the numbers will all end up in the same columns after the split, so you could then just delete the columns you don't want.
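
To preview what such a split produces for the two sample rows, here is a small Python sketch. It assumes the rows are available as plain strings, and `text_to_columns` is a made-up name that only imitates a delimiter-based split, not the Excel wizard itself.

```python
def text_to_columns(row: str, delimiter: str = "|") -> list[str]:
    """Split one row on the delimiter, keeping the fields unchanged."""
    return row.split(delimiter)


rows = [
    "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom",
    "2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate",
]

for row in rows:
    # Print 1-based column positions next to their contents
    for position, field in enumerate(text_to_columns(row), start=1):
        print(position, repr(field))
    print()
```

For this sample the gi numbers land in the 2nd and 6th columns. The 8th column of the first row ("23343") is also numeric but is not a gi number, so it is safer to pick columns by position than by whether they look numeric.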

静谧幽蓝 2024-12-07 10:53:29

The others have presented solutions without VBA, so I'll present one that does use it. Whether to use it is your call.

I just saw that @Issun posted a regex solution, very nice! Either way, here is a 'modest' solution to the question using only 'plain' VBA.

Option Explicit
Option Base 0

Sub findGi()

    Dim oCell As Excel.Range
    Set oCell = Sheets(1).Range("A1")

    'Loops through every row until empty cell
    While Not oCell.Value = ""

        oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
        Set oCell = oCell.Offset(1, 0)

    Wend

End Sub

Private Function GetGi(ByVal sValue As String) As String

    Dim sResult As String
    Dim vArray As Variant
    Dim vItem As Variant
    Dim iCount As Integer

    vArray = Split(sValue, "|")
    iCount = 0

    'Loops through the array...
    For Each vItem In vArray

        'Searches for the 'Gi' factor...
        If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then

            'Concatenates the results...
            sResult = sResult & vArray(iCount + 1) & ","

        End If

        iCount = iCount + 1

    Next vItem

    'And removes the trailing comma
    If Len(sResult) > 0 Then

        sResult = Left(sResult, Len(sResult) - 1)

    End If

    GetGi = sResult

End Function
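
For readers who want to trace the logic of GetGi without opening the VBA editor, the same split-and-look-ahead idea can be sketched in Python. This is an approximation under the assumption that a case-sensitive "ends with gi" test matches the behaviour of the Like "*gi" comparison. The name `get_gi` is chosen here only to echo the VBA function.

```python
def get_gi(value: str) -> str:
    """Return the fields that directly follow a field ending in "gi"."""
    fields = value.split("|")
    results = []
    for index, field in enumerate(fields):
        # Mirror the VBA guard: the number must itself be followed by
        # another "|", i.e. at least one more field must come after it.
        if field.endswith("gi") and len(fields) - 1 > index + 1:
            results.append(fields[index + 1])
    return ",".join(results)


row = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"
print(get_gi(row))  # 297848936,297338191
```

Like the VBA version, this sketch does not check that the extracted field is numeric. It relies on the data always having a number after "gi|".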
久隐师 2024-12-07 10:53:29

Open your Excel file in Google Sheets and use a regular expression with REGEXEXTRACT.

Sample Usage

=REGEXEXTRACT("My favorite number is 241, but my friend's is 17", "\d+")

Tip: REGEXEXTRACT returns 241 in this example because it returns only the first match.

In your case (this likewise returns only the first gi number in each cell):

=REGEXEXTRACT(A1,"gi[|](\d+)[|]")
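
The difference between taking the first match and taking all matches is easy to see in a short Python sketch. This only illustrates the regex behaviour. Python's `re` module stands in for the Sheets regex engine, which is an assumption, and the variable names are invented for the example.

```python
import re

PATTERN = re.compile(r"gi[|](\d+)[|]")
row = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"

# First match only, comparable to what the tip above describes
first = PATTERN.search(row)
print(first.group(1) if first else "")  # 297848936

# Every match, which is what the question actually asks for
all_numbers = PATTERN.findall(row)
print(",".join(all_numbers))            # 297848936,297338191
```

So for rows with more than one gi number, a first-match function alone will not produce the comma-separated list from the question.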