A Data Science Power Tool: Diskcache

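Whether at work or in competitions, one major problem we often run into is reading and loading data: the datasets are large, there are many files, and preprocessing is slow. This post uses diskcache to tackle the data-loading side of that problem, step by step.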

Generating a lot of data

I didn't have a large dataset on hand, so I'll generate one first.

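import os
import numpy as np

os.makedirs('data', exist_ok=True)  # np.save does not create the directory
for i in range(10000):
    x = np.random.rand(512, 512)
    np.save(f'data/file-{i}', x)  # np.save appends the .npy extension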

Each file is about 2 MB (512**2*8/1024/1024). No single file is large, but we generated 10,000 of them, about 20 GB in total, which adds up.
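As a quick sanity check on the sizes above, the arithmetic can be reproduced in a few lines of plain Python. This is a back-of-the-envelope sketch: it assumes 8-byte float64 values (the default dtype of np.random.rand) and ignores the small .npy header, and the variable names are only illustrative.

```python
# Rough size estimate for the generated dataset.
# Assumes float64 (8 bytes per value) and ignores the small .npy header.
side = 512
bytes_per_value = 8
num_files = 10_000

file_mib = side * side * bytes_per_value / 1024 / 1024
total_gib = file_mib * num_files / 1024

print(f"per file: {file_mib:.1f} MiB")
print(f"total:    {total_gib:.1f} GiB")
```

In binary units this comes to 2 MiB per file and a little under 20 GiB overall, which matches the rounded figures in the text.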

Measuring performance

Next, let's read the data with a conventional PyTorch DataLoader.

import numpy as np
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm


def get_data(i):
    x = np.load(f'data/file-{i}.npy')
    x = np.linalg.inv(x)  # simulate an expensive operation
    return x


class D(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 10000

    def __getitem__(self, index):
        return get_data(index)


loader = DataLoader(D(), batch_size=4)
for i in tqdm(loader):
    pass

On my MacBook Pro this runs at [02:28<00:00, 16.89it/s], which is very slow.

Using diskcache

First, install it:

pip install diskcache

Then use it:

import numpy as np
from diskcache import FanoutCache

cache = FanoutCache(directory='cache', shards=8, timeout=1, size_limit=3e11)


@cache.memoize(typed=True)
def get_data(i):
    x = np.load(f'data/file-{i}.npy')
    x = np.linalg.inv(x)  # simulate an expensive operation
    return x

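This has to be run twice. The first run is very slow, because every result still has to be computed and written to the cache, but the second run is much faster: [00:18<00:00, 135.58it/s]. That is roughly 8 times the data-loading throughput (16.89 to 135.58 it/s), so give it a try!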
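To see why only the second run is fast, here is a minimal sketch of disk-backed memoization written with the standard library alone. It is not diskcache's implementation, and the names disk_memoize and slow_square are hypothetical. It only illustrates the idea: the first call with a given argument does the work and stores the result on disk, and later calls with the same argument read the stored result back.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_memoize(directory):
    """Cache a function's return values as pickle files in `directory`."""
    os.makedirs(directory, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            # Derive a file name from the function name and its arguments.
            key = hashlib.sha256(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(directory, key + ".pkl")
            if os.path.exists(path):
                # Cache hit: load the stored result instead of recomputing.
                with open(path, "rb") as f:
                    return pickle.load(f)
            # Cache miss: compute, store, then return.
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []  # records every real computation


@disk_memoize(tempfile.mkdtemp())
def slow_square(i):
    calls.append(i)
    return i * i


first = slow_square(12)   # computed and written to disk
second = slow_square(12)  # read back from disk; the function body is skipped
print(first, second, calls)
```

The calls list ends up with a single entry, showing that the function body ran only once. The real library does considerably more than this sketch, for example enforcing the size_limit passed above, so treat it as an illustration of the mechanism rather than a replacement.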
