Reading shapefiles from nested zip archives

Posted on 2025-02-05 20:28:10

I have a big zip archive "Polska_SHP.zip" that contains other zip archives (named "02_SHP.zip", "04_SHP.zip", etc.). Each of these contains further zip archives (for example, "02_SHP.zip" holds "0201_SHP.zip", "0202_SHP.zip", and so on). Finally, those innermost archives contain many shapefiles, and I need to read all shapefiles with "SWRS" in the name into one geopandas dataframe. So far I've been able to find the names of these shapefiles, and I've tried to read them:

import zipfile
from io import BytesIO
import geopandas as gpd

with zipfile.ZipFile("Polska_SHP.zip", "r") as main_zfile:
    for name in main_zfile.namelist():  # list of archives in the main folder
        print("name: ", name)
        if ".zip" in name:
            zfiledata = BytesIO(main_zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    print("name2: ", name2)
                    if ".zip" in name2:
                        zfiledata2 = BytesIO(zfile2.read(name2))
                        with zipfile.ZipFile(zfiledata2) as zfile3:
                            for name3 in zfile3.namelist():
                                if "SWRS" in name3 and ".shp" in name3:
                                    print("name3: ", name3)
                                    gdf = gpd.read_file(name3)
                                    gdf.head()

and it prints the name I need:

name:  32_SHP.zip
name2:  32/3209_SHP.zip
name3:  PL.PZGiK.339.3209__OT_SWRS_L.shp

but it fails when it comes to reading the shapefile:

CPLE_OpenFailedError Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()
CPLE_OpenFailedError: PL.PZGiK.339.3209__OT_SWRS_L.shp: No such file or directory

Comments (2)

初熏 2025-02-12 20:28:10

As of the time of this answer, geopandas supports paths that point inside a zip file. You just need to separate the archive path from the inner path with !

import geopandas

zipfile_path = "archive.zip"
nested_file_path = "folder/geofile.shp"

gdf = geopandas.read_file(f"{zipfile_path}!{nested_file_path}")
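
To apply this to an archive with many members, the combined paths can be built from the archive's own listing. The following is a minimal sketch, not part of the original answer: the helper name shapefile_uris and its keyword parameter are made up for illustration, and it only covers a single archive level. Whether the ! separator can be chained through nested archives is not stated in the answer, so inner zips may still need to be extracted first.

```python
import zipfile


def shapefile_uris(zip_path, keyword=""):
    """Return 'archive.zip!inner/path.shp' strings for matching .shp members.

    Hypothetical helper: it only builds the strings, it does not read any data.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return [
            f"{zip_path}!{member}"
            for member in zf.namelist()
            if member.lower().endswith(".shp") and keyword in member
        ]


# Possible usage (requires geopandas, not executed here):
# for uri in shapefile_uris("archive.zip", keyword="SWRS"):
#     gdf = geopandas.read_file(uri)
```
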
山色无中 2025-02-12 20:28:10

The name3 variable you pass to gpd.read_file() is just the name of a member inside the ZIP, not a path on disk, so for this to work you would first have to extract the archive.
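
A rough sketch of that extract-first route for archives nested to any depth is shown here. It is an illustration under assumptions, not the answerer's code: the function name extract_shapefiles, the SHAPEFILE_PARTS tuple and the keyword default are invented for this example, nested zips are loaded fully into memory, and member base names are assumed to be unique across all archives (files are written flat into one directory, so a repeated name would be overwritten).

```python
import io
import posixpath
import os
import zipfile

# Extensions that commonly make up one shapefile dataset (assumed list).
SHAPEFILE_PARTS = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def extract_shapefiles(zip_source, out_dir, keyword="SWRS"):
    """Walk nested zips, extract matching shapefile parts, return .shp paths.

    zip_source may be a filename or a binary file-like object.
    """
    shp_paths = []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Member names may carry a directory prefix such as "32/".
            base = posixpath.basename(info.filename)
            lower = base.lower()
            if lower.endswith(".zip"):
                # Nested archive: read it into memory and recurse.
                inner = io.BytesIO(zf.read(info))
                shp_paths += extract_shapefiles(inner, out_dir, keyword)
            elif keyword in base and lower.endswith(SHAPEFILE_PARTS):
                # Write under the base name only, which also keeps
                # odd member paths from escaping out_dir.
                target = os.path.join(out_dir, base)
                with open(target, "wb") as fh:
                    fh.write(zf.read(info))
                if lower.endswith(".shp"):
                    shp_paths.append(target)
    return shp_paths


# Possible usage (requires pandas and geopandas, not executed here):
# with tempfile.TemporaryDirectory() as tmp:
#     paths = extract_shapefiles("Polska_SHP.zip", tmp)
#     gdf = pd.concat([gpd.read_file(p) for p in paths], ignore_index=True)
```

Extracting the .shx, .dbf and other companion files next to each .shp matters because the reader looks for them beside the .shp file on disk.
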

Another option is to pass the file-like object of the archive itself, though this assumes the zip contains only one dataset and that the .shp file and all its companion files sit in the top-level directory. Note that my sample had only 2 levels of nested archives. Its shapefiles also had different attributes, so a list of GeoDataFrames, gdfs, is used to collect all the data. In your case you probably want to use pandas.concat().
(BTW, your current loop overwrites gdf on each iteration.)

# python     : 3.8.13
# geopandas  : 0.10.2
# fiona      : 1.8.18

import geopandas as gpd
import zipfile
import re

# shp_regex = r"SWRS.*\.shp$"
shp_regex = r"^ne_.*\.shp$"

# list of geodataframes
gdfs = []
with zipfile.ZipFile("nat_earth.zip", "r") as main_zfile:
    main_zfile.printdir()
    print("- " * 40)
    # only cycle through *.zip files
    for name in [fname for fname in main_zfile.namelist() if fname.endswith(".zip")]:
        print(f'>> {name}:')
        with main_zfile.open(name, "r") as zipped_shp:
            zipped_shp_namelist = zipfile.ZipFile(zipped_shp).namelist()
            print(", ".join(zipped_shp_namelist))
            # check if any of the files actually matches the pattern
            if any(re.search(shp_regex, level2_fname) for level2_fname in zipped_shp_namelist):
                # for gpd.read_file() file position must be changed back to 0
                zipped_shp.seek(0)
                gdfs.append(gpd.read_file(zipped_shp))
                rows, cols = gdfs[-1].shape
                print(f'GeoDataFrame: {rows} rows, {cols} columns\n')

# head of first gdf
print(gdfs[0].head())

Output:

File Name                                             Modified             Size
ne_50m_admin_0_countries.zip                   2021-12-08 03:47:44       792663
ne_50m_lakes.zip                               2021-12-08 03:49:54       252615
ne_50m_ocean.zip                               2021-09-04 08:56:52       461745
ne_50m_rivers_lake_centerlines.zip             2021-12-08 03:49:54       504454
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>> ne_50m_admin_0_countries.zip:
ne_50m_admin_0_countries.README.html, ne_50m_admin_0_countries.VERSION.txt, 
ne_50m_admin_0_countries.cpg, ne_50m_admin_0_countries.dbf, 
ne_50m_admin_0_countries.prj, ne_50m_admin_0_countries.shp, 
ne_50m_admin_0_countries.shx
GeoDataFrame: 242 rows, 162 columns

>> ne_50m_lakes.zip:
ne_50m_lakes.README.html, ne_50m_lakes.VERSION.txt, ne_50m_lakes.cpg, 
ne_50m_lakes.dbf, ne_50m_lakes.prj, ne_50m_lakes.shp, ne_50m_lakes.shx
GeoDataFrame: 412 rows, 40 columns

>> ne_50m_ocean.zip:
ne_50m_ocean.README.html, ne_50m_ocean.VERSION.txt, ne_50m_ocean.cpg, 
ne_50m_ocean.dbf, ne_50m_ocean.prj, ne_50m_ocean.shp, ne_50m_ocean.shx
GeoDataFrame: 1 rows, 4 columns

>> ne_50m_rivers_lake_centerlines.zip:
ne_50m_rivers_lake_centerlines.README.html, 
ne_50m_rivers_lake_centerlines.VERSION.txt, 
ne_50m_rivers_lake_centerlines.cpg,  
ne_50m_rivers_lake_centerlines.dbf, ne_50m_rivers_lake_centerlines.prj, 
ne_50m_rivers_lake_centerlines.shp, ne_50m_rivers_lake_centerlines.shx
GeoDataFrame: 478 rows, 37 columns

        featurecla  scalerank  LABELRANK SOVEREIGNT SOV_A3  ADM0_DIF  LEVEL  \
0  Admin-0 country          1          3   Zimbabwe    ZWE         0      2   
1  Admin-0 country          1          3     Zambia    ZMB         0      2   
2  Admin-0 country          1          3      Yemen    YEM         0      2   
3  Admin-0 country          3          2    Vietnam    VNM         0      2   
4  Admin-0 country          5          3  Venezuela    VEN         0      2   
... 
                                            geometry  
0  POLYGON ((31.28789 -22.40205, 31.19727 -22.344...  
1  POLYGON ((30.39609 -15.64307, 30.25068 -15.643...  
2  MULTIPOLYGON (((53.08564 16.64839, 52.58145 16...  
3  MULTIPOLYGON (((104.06396 10.39082, 104.08301 ...  
4  MULTIPOLYGON (((-60.82119 9.13838, -60.94141 9...  

[5 rows x 162 columns]
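
One detail to watch when adapting the pattern: the sample pattern is anchored with ^ to the start of the member name, while the question's output shows member names that carry a directory prefix (32/3209_SHP.zip). A small sketch of matching against the base name only follows; SHP_PATTERN and matching_members are illustrative names that do not appear in the answer, and the case-insensitive flag is an added assumption.

```python
import posixpath
import re

# Unanchored at the start, anchored at the end so "x.shp.xml" is not matched.
SHP_PATTERN = re.compile(r"SWRS.*\.shp$", re.IGNORECASE)


def matching_members(names, pattern=SHP_PATTERN):
    """Keep archive member names whose base name matches the pattern."""
    return [n for n in names if pattern.search(posixpath.basename(n))]
```

Matching the base name keeps a directory called, say, SWRS_dir from pulling in unrelated shapefiles stored below it.
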