Optimizing the processing of a large CSV file in Python
I have a CSV file of roughly 175 million rows (2.86 GB) made up of three columns, as shown below:
I need to get the value in the "val" column for a given "id1" and "id2". I query this dataframe over and over with different combinations of id1 and id2, and each combination is unique across the entire file.
I tried using pandas as shown below, but it is taking a lot of time.
def is_av(Qterm, Cterm, df):
    # Look up 'val' for the (Qterm, Cterm) pair in the MultiIndexed dataframe;
    # return 0 if the pair is not present.
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0
Given that the value I need lives in a single row of the whole file, is there a faster way to access it than going through the CSV like this? If not, can you look at this code and tell me what is making it so slow?
for nc in L:  # iterate over ID1 values
    score = 0.0
    for ni in id_list:  # iterate over ID2 values
        e = is_av(ni, nc, df_g)              # look up val for the (ni, nc) pair
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.
The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both free and very efficient options for your current use case). While Pandas is an incredibly strong library, indexing and fast point lookups are not what it excels at, given that it needs to either load the data into memory or stream over it with little control compared to a DBMS.
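To make the DBMS route concrete, here is a minimal sketch using Python's built-in sqlite3 purely as a stand-in for the MySQL/PostgreSQL setup recommended above; the file path, table name, and column names are assumptions, not anything from the question:

import csv
import sqlite3

# One-time setup: load the CSV into a table and build a composite index on (id1, id2).
# For ~175M rows this load takes a while, but it only happens once.
conn = sqlite3.connect("lookup.db")
conn.execute("CREATE TABLE IF NOT EXISTS pairs (id1 INTEGER, id2 INTEGER, val REAL)")
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row, assuming the file has one
    conn.executemany("INSERT INTO pairs VALUES (?, ?, ?)", reader)
conn.execute("CREATE INDEX IF NOT EXISTS idx_pairs ON pairs (id1, id2)")
conn.commit()

def is_av(qterm, cterm):
    # The composite index turns this into a B-tree lookup instead of a full scan.
    row = conn.execute(
        "SELECT val FROM pairs WHERE id1 = ? AND id2 = ?", (qterm, cterm)
    ).fetchone()
    return row[0] if row else 0

value = is_av(3108, 4813)

Every repeated query then hits the index rather than re-reading the file, which is exactly the control Pandas does not give you here.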
Consider your use case, where you have many lookups and want to skip over irrelevant rows. Say you're looking for (ID1, ID2) = (3108, 4813). You want to discard every row that starts with anything other than 3, then anything other than 31, and so on, and then any row that does not continue with 3108,4 (assuming your CSV delimiter is a comma), until you hit exactly the ID1 and ID2 you're looking for; this is reading the data at the character level.
Pandas does not allow you to do this (as far as I know; someone can correct this answer if it does). The other option, Dask, is a library designed out of the box to handle data much larger than RAM, but it is not optimized for index management the way a DBMS is. Don't get me wrong, Dask is good, but not for your use case.
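For reference, the Dask pattern alluded to would look roughly like this (a sketch, assuming dask is installed and the columns are named id1, id2, and val); the key point is that each query still scans the partitions rather than jumping to the row via an index:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so memory is not the problem here,
# but every lookup below still scans the data instead of using an index.
ddf = dd.read_csv("data.csv")  # hypothetical path
match = ddf[(ddf["id1"] == 3108) & (ddf["id2"] == 4813)]["val"].compute()
value = match.iloc[0] if len(match) else 0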
Another very basic alternative would be to index your data on ID1 and ID2, store it indexed, and look values up by actually reading the file: skip lines that do not start with your designated ID1, then skip lines that do not match your ID2, and so on (see the sketch below). The best practice, however, is still to use a DBMS, which gives you caching, read optimization, and many other serious advantages, reducing the I/O read time from your disk.
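A bare-bones version of that file-scan alternative could look like the following; this is only a sketch, assuming a comma-delimited file whose rows are id1,id2,val, and it has to re-read the file on every call, which is why the DBMS approach wins:

import csv

def lookup(path, id1, id2):
    # Stream the file row by row and skip any row whose id1/id2 does not match.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row[0] != id1:     # wrong ID1: skip this line
                continue
            if row[1] != id2:     # right ID1 but wrong ID2: keep scanning
                continue
            return float(row[2])  # found the unique (ID1, ID2) pair
    return 0                      # pair not present anywhere in the file

value = lookup("data.csv", "3108", "4813")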
You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/
You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/