tqdm tricks
The notebook source code is here. Open this notebook at Google Colab.
This came up when processing the Stack Exchange Dataset with @Oggai. The dataset contains very large (70 GB+) XML files, which should be processed line by line rather than read into memory all at once.
The idea is to iterate over the file line by line as follows:
with open(filename) as f:
    for line in f:
        handle_line(line)
while keeping track of the bytes processed so far (i.e. counting the bytes of each line).
Combining the running byte count with the total file size obtained from os.path.getsize(), we get a progress bar showing how many bytes have been processed so far, along with an ETA.
However, the byte count of each line is not so easy to obtain, since len(line) returns the number of characters, not the number of bytes. This answer provides a solution:
def utf8len(s):
    return len(s.encode('utf-8'))
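To see why the distinction between characters and bytes matters, here is a small check of utf8len against len. The sample strings are made up for illustration and do not come from the dataset:

```python
def utf8len(s):
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode('utf-8'))

# Illustrative samples: plain ASCII, an accented letter, CJK text, an emoji
samples = ['abc', 'naïve', '日本語', '😀']

for s in samples:
    # ascii() keeps the output printable regardless of the terminal encoding
    print(f'{ascii(s)}: {len(s)} characters, {utf8len(s)} bytes')
```

Only the pure-ASCII string has matching counts; for the others, len() undercounts the bytes read from disk, so a progress bar driven by len(line) would fall short of 100% on non-ASCII input.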
The result is a function:
def tqdm_read_file_line_by_line(filename, encoding='utf-8'):
    """Reads a text file and yields it line by line, while showing a tqdm progress bar."""
    import os
    from tqdm import tqdm

    # get byte count of each line
    def line_size(line):
        return len(line.encode(encoding, 'replace'))

    # total byte count
    total_size = os.path.getsize(filename)
    with open(filename, encoding=encoding) as f, \
            tqdm(total=total_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar:
        for line in f:
            yield line
            # update running byte-count
            pbar.update(line_size(line))
filename = 'sample_data/california_housing_train.csv'
lc = 0
for line in tqdm_read_file_line_by_line(filename):
    # perform some operation based on each line
    lc += 1
    for _ in range(10000):
        pass
print()
print(f'Total line count: {lc}')
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.63M/1.63M [00:04<00:00, 349kB/s]
Total line count: 17001
This is only a rough estimate of the running byte count, because of some possible UTF-8-related caveats (like this one concerning the Byte Order Mark). But it is good enough as a progress indication.
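If an exact count is ever needed, one possible alternative (not the approach used in this notebook) is to open the file in binary mode, where len() of each raw line is its true on-disk size, and to decode each line afterwards. The sketch below assumes UTF-8 or another ASCII-compatible encoding, since it splits lines on the b'\n' byte. The function name and the on_bytes callback are hypothetical names chosen for this illustration:

```python
import os
import tempfile

def read_lines_with_exact_byte_count(filename, encoding='utf-8', on_bytes=None):
    """Yield decoded lines; report each line's exact on-disk size to on_bytes."""
    # Binary mode: every line is a bytes object, so len() is the real byte count
    # (no newline translation, and a leading BOM is counted as well)
    with open(filename, 'rb') as f:
        for raw_line in f:
            yield raw_line.decode(encoding, 'replace')
            if on_bytes is not None:
                on_bytes(len(raw_line))

# Small self-check on a temporary file containing a BOM, a CRLF line ending
# and multi-byte characters
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('\ufeffnaïve\r\n日本語\n'.encode('utf-8'))
    path = tmp.name

seen = []  # byte counts reported through the callback
lines = list(read_lines_with_exact_byte_count(path, on_bytes=seen.append))
total_matches = sum(seen) == os.path.getsize(path)
print(seen, total_matches)
os.remove(path)
```

With tqdm, the update method of a progress bar created with total=os.path.getsize(filename) could be passed as on_bytes, so the bar ends at exactly 100%. The trade-off is that decoded lines keep their original line endings (e.g. '\r\n'), which text mode would otherwise normalize.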