How do I look up data from one CSV in another CSV?

Posted on 2025-02-13 04:06:17


  • In the crq_data file I have cities and states from a user-uploaded *.csv file.
  • In the cityCoordinates.csv file I have a library of American cities and states along with their coordinates. I would like this to be a sort of "look-up tool" that compares the uploaded .csv file against the library to find each city's coordinates for mapping in Folium.

Right now it reads line by line, so it appends the coordinates one at a time (n seconds each). I would like it to run much faster, so that if there are 6,000 rows the user doesn't have to wait 6,000 seconds.

Here is part of my code:

crq_file = askopenfilename(filetypes=[('CSV Files', '*csv')])
crq_data = pd.read_csv(crq_file, encoding="utf8")
coords = pd.read_csv("cityCoordinates.csv")

for crq in range(len(crq_data)):
    task_city = crq_data.iloc[crq]["TaskCity"]
    task_state = crq_data.iloc[crq]["TaskState"]

    for coordinates in range(len(coords)):
        cityCoord = coords.iloc[coordinates]["City"]
        stateCoord = coords.iloc[coordinates]["State"]
        latCoord = coords.iloc[coordinates]["Latitude"]
        lngCoord = coords.iloc[coordinates]["Longitude"]

        if task_city == cityCoord and task_state == stateCoord:
            crq_data["CRQ Latitude"] = latCoord
            crq_data["CRQ Longitude"] = lngCoord
                
            print(cityCoord, stateCoord, latCoord, lngCoord)

This is an example of the current terminal output (screenshot omitted).

Example of the uploaded .csv file (screenshot omitted).


Comments (1)

两仪 2025-02-20 04:06:20


I see this not as a problem of optimizing Pandas, but of finding a good data structure for fast lookups: and a good data structure for fast lookups is the dict. A dict takes memory, though; you'll need to evaluate that cost for yourself.

I mocked up what your cityCoordinates CSV could look like:

| City     | State | Latitude   | Longitude   |
|----------|-------|------------|-------------|
| Portland | OR    | 45°31′12″N | 122°40′55″W |
| Dallas   | TX    | 32°46′45″N | 96°48′32″W  |
| Portland | ME    | 43°39′36″N | 70°15′18″W  |

import csv
import pprint


def cs_key(city_name: str, state_name: str) -> str:
    """Make a normalized City-State key."""
    return city_name.strip().lower() + "--" + state_name.strip().lower()


# A dict of { "City_name-State_name": (latitude, longitude), ... }
coords_lookup = {}

with open("cityCoordinates.csv", newline="") as f:
    reader = csv.DictReader(f)  # your coords file appears to have a header
    for row in reader:
        city = row["City"]
        state = row["State"]
        lat = row["Latitude"]
        lon = row["Longitude"]

        key = cs_key(city, state)
        coords_lookup[key] = (lat, lon)


pprint.pprint(coords_lookup, sort_dicts=False)

When I run that, I get:

{'portland--or': ('45°31′12″N', '122°40′55″W'),
 'dallas--tx':   ('32°46′45″N', '96°48′32″W'),
 'portland--me': ('43°39′36″N', '70°15′18″W')}
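
The memory cost flagged above can be gauged roughly once the table is built. This is a minimal sketch, not part of the original answer, using only the standard library and the coords_lookup built above; sys.getsizeof reports shallow sizes only, so treat the result as a lower bound:

import sys

# Shallow size of the dict plus its keys and value tuples; the strings
# nested inside the tuples are not counted, so this is a lower bound.
approx_bytes = sys.getsizeof(coords_lookup)
approx_bytes += sum(
    sys.getsizeof(key) + sys.getsizeof(value)
    for key, value in coords_lookup.items()
)
print(f"lookup table: ~{approx_bytes / 1024:.1f} KiB for {len(coords_lookup)} entries")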

Now, iterating the task data looks pretty much the same: we take a pair of City and State, make a normalized key out of them, then try to look up that key for known coordinates.

I mocked up some task data:

| TaskCity   | TaskState |
|------------|-----------|
| Portland   | OR        |
| Fort Worth | TX        |
| Dallas     | TX        |
| Boston     | MA        |
| Portland   | ME        |

and when I run this:

with open("crq_data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        city = row["TaskCity"]
        state = row["TaskState"]

        key = cs_key(city, state)
        coords = coords_lookup.get(key, (None, None))
        if coords != (None, None):
            print(city, state, coords[0], coords[1])

I get:

Portland OR 45°31′12″N 122°40′55″W
Dallas TX 32°46′45″N 96°48′32″W
Portland ME 43°39′36″N 70°15′18″W
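
Since the end goal is mapping in Folium, the matched coordinates can go straight onto a map. This is a minimal sketch, not part of the original answer: it assumes the matches are gathered into a list instead of printed, and that coordinates are DMS strings like the mock data above (if your file already stores decimal degrees, skip the conversion):

import re
import folium

DMS_RE = re.compile(r"(\d+)°(\d+)′(\d+)″([NSEW])")


def dms_to_decimal(dms: str) -> float:
    """Convert a string like 45°31′12″N into signed decimal degrees."""
    deg, minutes, seconds, hemi = DMS_RE.match(dms).groups()
    value = int(deg) + int(minutes) / 60 + int(seconds) / 3600
    return -value if hemi in ("S", "W") else value


# Hypothetical: (city, state, lat, lon) tuples collected in the lookup loop above.
matches = [("Portland", "OR", "45°31′12″N", "122°40′55″W")]

fmap = folium.Map(location=[39.8, -98.6], zoom_start=4)  # roughly centered on the US
for city, state, lat, lon in matches:
    folium.Marker(
        location=[dms_to_decimal(lat), dms_to_decimal(lon)],
        popup=f"{city}, {state}",
    ).add_to(fmap)
fmap.save("crq_map.html")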

This solution is going to be much faster in principle because you're not doing a quadratic cityCoordinates-rows × taskData-rows loop. In practice, Pandas also suffers when doing row iteration^1; I'm not sure whether the same holds for indexing (iloc), but in general Pandas is built for manipulating columns of data, and I would say it is not suited to row-oriented problems/solutions.
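
For completeness: since the question already loads both files with pandas, a single vectorized merge would also eliminate the quadratic loop. A minimal sketch, assuming the column names from the question (it skips the key normalization shown above, which could be replicated with lower-cased helper columns):

import pandas as pd

crq_data = pd.read_csv("crq_data.csv", encoding="utf8")
coords = pd.read_csv("cityCoordinates.csv")

# One left merge on (city, state) replaces the nested loops; unmatched task
# rows keep NaN coordinates instead of being silently skipped.
merged = crq_data.merge(
    coords.rename(columns={"Latitude": "CRQ Latitude", "Longitude": "CRQ Longitude"}),
    how="left",
    left_on=["TaskCity", "TaskState"],
    right_on=["City", "State"],
).drop(columns=["City", "State"])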
