Most efficient way to identify substrings within a string in Python?

Posted 2024-10-11 20:27:22

I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.

At the moment I'm doing this with a simple for loop and str.find().

The problem is that if a CPV code is listed in a slightly different format, this algorithm won't find it.

What's the most efficient way of searching for all the different variations of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and calling str.find() for each variant?

The different formats could look like this:

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

丿*梦醉红颜 2024-10-18 20:27:22

Try a regular expression:

>>> import re
>>> cpv = re.compile(r'([0-9]+[-. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']

(Modify the pattern until it closely matches the CPVs in your data.)
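As one possible modification, here is a minimal sketch of a looser pattern that also tolerates spaces around the separator, as in "30124120 - 1". It assumes a code is always eight digits plus one check digit, which is taken from the examples in the question rather than from any specification; the name CPV_LOOSE is made up for illustration.

```python
import re

# Assumed shape (from the question's examples, not from a spec):
# 8 digits, optional spaces/tabs around an optional '-' or '.', 1 check digit.
CPV_LOOSE = re.compile(r'\b\d{8}[ \t]*[-.]?[ \t]*\d\b')

# The five formats listed in the question.
samples = [
    "30124120-1",
    "301241201",
    "30124120 - 1",
    "30124120 1",
    "30124120.1",
]

for s in samples:
    text = "foo " + s + " bar"
    # findall returns the matched text exactly as written in the input
    print(CPV_LOOSE.findall(text))
```

Each sample should come back as a single match, in whatever format it was written. Note that any nine-digit number will match as well, so real data may need a tighter pattern.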


笑忘罢 2024-10-18 20:27:22

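Try any of the functions in re (Python's regular expression module). See the documentation for details.

You can write a regex that accepts several different formats of these codes, then use re.findall or something similar to extract the information. I'm not sure what a CPV is, so I don't have a regex for it (though maybe you could check whether Google turns one up?)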
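One of the "similar" functions is re.sub. As a rough sketch, it can rewrite every variant into a single canonical form, after which a plain str.find() works again. The pattern and the helper name normalize_cpv are assumptions for illustration (eight digits, optional separator, one check digit), not something defined by the re module or by the question.

```python
import re

# Assumed layout: 8 digits, optional separator with optional spaces, 1 check digit.
CPV = re.compile(r'\b(\d{8})[ \t]*[-.]?[ \t]*(\d)\b')

def normalize_cpv(text):
    """Rewrite every CPV-like code in text as 'NNNNNNNN-N' (hypothetical helper)."""
    return CPV.sub(r'\1-\2', text)

text = "items: 301241201, 30124120 - 1 and 30124120.1"
clean = normalize_cpv(text)

print(clean)                     # all three variants now share one format
print(clean.find("30124120-1"))  # plain str.find() locates the first one
```

The trade-off is that the text itself is changed, so match positions in the cleaned string no longer line up with the original.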


老娘不死你永远是小三 2024-10-18 20:27:22

import re

# ex is the text to search, e.g. your sample data
# the hyphen goes last in the class so it is a literal, not a range
cpv = re.compile(r'(\d{8})(?:[ .\t/\\-]*)(\d{1}\b)')

for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))

Applied to your sample data, this prints 30124120-1 for each of the five variants.

The regular expression can be read as

(\d{8})         # eight digits

(?:             # followed by a sequence which does not get returned
  [ .\t/\\-]*   #   consisting of 0 or more
)               #   spaces, periods, tabs, forward- or backslashes, hyphens

(\d{1}\b)       # followed by one digit, ending at a word boundary
                #   (i.e. not followed by a letter, digit or underscore)

Hope that helps!
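To tie this back to the question about 10,000 codes: rather than calling str.find() once per code and per format, a single regex pass can extract every candidate, normalise it, and check it against a set. The sketch below assumes the known codes are stored in "NNNNNNNN-N" form; the function name find_known_codes and the sample set are made up for illustration, and the pattern follows the one above with the hyphen placed last in the character class so it is read literally.

```python
import re

# 8 digits, any run of separators, 1 check digit ending at a word boundary.
CPV = re.compile(r'(\d{8})[ .\t/\\-]*(\d)\b')

def find_known_codes(text, known_codes):
    """Return the known codes that occur in text, whatever their formatting."""
    found = set()
    for m in CPV.finditer(text):
        code = "{0}-{1}".format(*m.groups())  # normalise to NNNNNNNN-N
        if code in known_codes:               # set lookup, O(1) on average
            found.add(code)
    return found

# Illustrative values only, not a real CPV list.
known = {"30124120-1", "21966823-1", "12345678-9"}
text = "foo 30124120 - 1 bar 219668231 baz 99999999-9"

print(sorted(find_known_codes(text, known)))
```

The cost is then one scan of the text plus a set lookup per match, independent of how many codes are in the list.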
