Most efficient way to identify substrings within a string in Python?

Posted on 2024-10-11 20:27:22

I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.

At the moment I'm doing this with a simple for loop and str.find().

The problem is that if a CPV code has been listed in a slightly different format, this approach won't find it.

What's the most efficient way of searching for all the different variations of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and calling str.find() for each variation?

Examples of the different formats:

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

Comments (3)

丿*梦醉红颜 2024-10-18 20:27:22

Try a regular expression:

>>> import re
>>> cpv = re.compile(r'([0-9]+[-. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']

(Adjust the pattern until it closely matches the CPV codes in your data.)
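
One possible tightening, as a sketch only: if a CPV code is assumed to be exactly eight digits plus one trailing digit (as in the question's examples), the digit counts can be pinned and lookarounds added so the pattern does not fire inside longer digit runs. The name CPV_STRICT and the sample text are illustrative, not part of the original answer.

```python
import re

# Assumption: a code is 8 digits, an optional single separator, then 1 digit.
# (?<!\d) and (?!\d) stop a match from starting or ending inside a longer number.
CPV_STRICT = re.compile(r'(?<!\d)\d{8}[-. ]?\d(?!\d)')

text = 'foo 30124120-1 bar 21966823.1 baz 2024 and 1234567890123'
matches = CPV_STRICT.findall(text)
print(matches)
```

Here the four-digit year and the 13-digit number are skipped, while both code-like tokens are still returned.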

笑忘罢 2024-10-18 20:27:22

Try using the functions in re (Python's regular expression module). See the docs for more info.

You can craft a regular expression that accepts a number of different formats for these codes, and then use re.findall or something similar to extract them. I'm not certain what a CPV is, so I don't have a regular expression for it (though maybe you could see if Google has any?)
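
As a rough sketch of how this could address the efficiency concern in the question: rather than reformatting all 10,000 codes, scan the text once for anything shaped like a code, normalize each hit to a single canonical form, and check it against a set. Everything named here (KNOWN_CPVS, CANDIDATE, find_known_cpvs) is hypothetical, the set contents are placeholders, and the separator list is an assumption about the data.

```python
import re

# Placeholder for the real list of codes, stored in one canonical form "DDDDDDDD-D".
KNOWN_CPVS = {'30124120-1', '21966823-1'}

# Assumed shape: 8 digits, any run of spaces/hyphens/periods/tabs/slashes, 1 digit.
CANDIDATE = re.compile(r'(?<!\d)(\d{8})[ \-.\t/\\]*(\d)(?!\d)')

def find_known_cpvs(text):
    """Return known codes found in text, in canonical form, in order of appearance."""
    found = []
    for m in CANDIDATE.finditer(text):
        code = '{0}-{1}'.format(m.group(1), m.group(2))
        if code in KNOWN_CPVS:  # one set lookup per candidate, not one str.find() per code
            found.append(code)
    return found

result = find_known_cpvs('a 30124120 - 1 b 301241201 c 99999999.9 d 21966823/1')
print(result)
```

The text is scanned once regardless of how many codes are in the set, and unknown codes such as 99999999-9 are filtered out by the membership test.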

老娘不死你永远是小三 2024-10-18 20:27:22

import re

ex = "30124120-1 301241201 30124120 - 1 30124120 1 30124120.1"  # the question's samples
# Hyphen escaped: unescaped, " -." would be a range from space to period
cpv = re.compile(r'(\d{8})(?:[ \-.\t/\\]*)(\d\b)')

for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))

applied to your sample data returns

30124120-1
30124120-1
30124120-1
30124120-1
30124120-1

The regular expression can be read as

(\d{8})         # eight digits

(?:             # followed by a sequence which does not get returned
  [ \-.\t/\\]*  #   consisting of 0 or more
)               #   spaces, hyphens, periods, tabs, forward- or backslashes

(\d\b)          # followed by one digit, ending at a word boundary
                #   (i.e. not directly followed by a letter, digit or underscore)

Hope that helps!
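
For what it's worth, the annotated reading above can be kept inside the pattern itself with re.VERBOSE. This is a sketch, not the answer's original code: the name CPV_VERBOSE and the samples list are illustrative, and the hyphen is written escaped so it is treated as a literal character rather than as part of a range.

```python
import re

# Same idea as the pattern above, with the comments living inside the regex.
# In verbose mode, whitespace inside a character class is still significant,
# so the leading space in [ \-.\t/\\] continues to match a literal space.
CPV_VERBOSE = re.compile(r'''
    (\d{8})          # eight digits
    [ \-.\t/\\]*     # zero or more spaces, hyphens, periods, tabs, slashes, backslashes
    (\d)\b           # one digit, ending at a word boundary
''', re.VERBOSE)

samples = ['30124120-1', '301241201', '30124120 - 1', '30124120 1', '30124120.1']
normalized = ['-'.join(CPV_VERBOSE.search(s).groups()) for s in samples]
print(normalized)
```

All five sample formats normalize to the same string, 30124120-1.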
