tqdm tricks

The notebook source code is here. Open this notebook at Google Colab.

Show progress in bytes when reading a large file line by line

This came up when processing the Stack Exchange Dataset with @Oggai. The dataset contains very large (70 GB+) XML files, which should be processed line by line in a memory-efficient way (i.e. without reading the whole file into memory).

The idea is to iterate over the file object itself, which only holds one line in memory at a time:

with open(filename) as f:
    for line in f:
        handle_line(line)

while keeping track of the bytes processed so far (i.e. count the bytes of each line).

Combining the running byte-count with the total file size obtained from os.path.getsize(), we can have a progress bar showing how many bytes we have processed so far, and an ETA.

However, the byte count of each line is not so easy to obtain, since len(line) returns the number of characters, not the number of bytes. This answer provides a solution:

def utf8len(s):
    return len(s.encode('utf-8'))
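
To make the difference between characters and bytes concrete, here is a small sketch that compares len() with utf8len() on a few strings. The sample strings are arbitrary and chosen only for illustration.

```python
def utf8len(s):
    """Return the number of bytes `s` occupies when encoded as UTF-8."""
    return len(s.encode('utf-8'))


# Illustrative samples: ASCII, a 3-byte symbol, and CJK text
samples = ['abc', '€', '日本語']
for s in samples:
    print(f'{s!r}: {len(s)} characters, {utf8len(s)} bytes')
```

For ASCII-only text the two counts agree, which is why the mismatch is easy to miss until the file contains non-ASCII characters.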

The result is a function:

In [1]:
def tqdm_read_file_line_by_line(filename, encoding='utf-8'):
    """Yield the lines of a text file one by one, while showing a tqdm progress bar in bytes."""
    import os
    from tqdm import tqdm

    # get byte count of each line
    def line_size(line):
        return len(line.encode(encoding, 'replace'))

    # total byte count
    total_size = os.path.getsize(filename)

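    # newline='' keeps each line's original terminator (LF or CRLF),
    # so the line-ending bytes are counted as they appear on disk
    with open(filename, encoding=encoding, newline='') as f, \
            tqdm(total=total_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar: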
        for line in f:
            yield line
            # update running byte-count
            pbar.update(line_size(line))


filename = 'sample_data/california_housing_train.csv'
lc = 0
for line in tqdm_read_file_line_by_line(filename):
    # perform some operation based on each line
    lc += 1
    for _ in range(10000):
        pass

print()
print(f'Total line count: {lc}')
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.63M/1.63M [00:04<00:00, 349kB/s]
Total line count: 17001

This is only a rough estimate of the running byte count, because of some possible UTF-8-related caveats (like this one concerning the Byte Order Mark). But it is good enough as a progress indication.
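
If an exact count matters, one possible alternative is to read the file in binary mode, so that the length of each raw line is its on-disk byte count, and to decode each line afterwards. The sketch below is not part of the original notebook: the names iter_lines_with_sizes and tqdm_read_file_exact are hypothetical, and it assumes an ASCII-compatible encoding such as UTF-8 (binary iteration splits on the b'\n' byte only, so it would not work for UTF-16, and lone '\r' line endings are not treated as line breaks).

```python
import os


def iter_lines_with_sizes(filename, encoding='utf-8'):
    """Yield (line, n_bytes) pairs, where n_bytes is the raw size of the line on disk.

    Assumes an ASCII-compatible encoding (e.g. UTF-8), since lines are split on b'\\n'.
    """
    # Binary mode: no decoding and no newline translation while reading
    with open(filename, 'rb') as f:
        for raw in f:
            yield raw.decode(encoding, 'replace'), len(raw)


def tqdm_read_file_exact(filename, encoding='utf-8'):
    """Hypothetical variant of the function above that updates the bar with raw byte counts."""
    from tqdm import tqdm  # imported lazily, as in the original function

    total_size = os.path.getsize(filename)
    with tqdm(total=total_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar:
        for line, n_bytes in iter_lines_with_sizes(filename, encoding):
            yield line
            pbar.update(n_bytes)
```

The trade-off is that lines keep their original terminators and any leading Byte Order Mark stays in the first decoded line, so the caller has to strip those if they matter.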