Dev Ups

Published 2022-09-20 in python

Measuring performance of indexed arrays vs dicts, comprehensions vs explicit loops

Python has an advantage over PHP in the server-side scripting stakes: indexed arrays. PHP maintains insertion order in its associative arrays (which all PHP arrays are). Until Python 3.7 you needed OrderedDict for that guarantee; Python 3.7+ guarantees to maintain insertion order in a plain dict.
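
A quick illustration of that guarantee:

insertion_order = {}
insertion_order["first"] = 1
insertion_order["second"] = 2
insertion_order["third"] = 3
# On Python 3.7+ this always prints ['first', 'second', 'third']
print(list(insertion_order.keys()))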

I wanted to (re-)evaluate the speed of using different structures at scale. I've had to make similar optimisations in LeetCode-style challenges in the past.

The results were surprising, and not what I had experienced whilst optimising code for Java and Android, where allocation is more costly, and recycling better compensated.

Problem statement

I have 38 thousand lines of NGINX logs I want to analyse for the current month. My server is by no means exceptionally busy, and I would like to be able to run the analysis on demand, including for previous months if required.
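
For context, each value in req_dict is a list of parsed request records in which index 2 holds the HTTP status code. Here is a minimal sketch of how such a structure might be built, assuming the standard combined log format (the regex and the build_req_dict helper are my illustration here, not the parser actually used):

import re
from collections import defaultdict

# Hypothetical parser: combined-format lines become
# {ip: [(method, path, status), ...]}, so req[2] is the status code.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3})')

def build_req_dict(log_path: str) -> dict[str, list]:
    req_dict: dict[str, list] = defaultdict(list)
    with open(log_path) as fh:
        for line in fh:
            match = LINE_RE.match(line)
            if match:
                ip, method, path, status = match.groups()
                req_dict[ip].append((method, path, int(status)))
    return dict(req_dict)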

This is what I was looking to optimise:

def most_failures(req_dict: dict[str, list]):
    # Pre-allocate a zeroed count per IP, then tally the 4xx responses.
    fails_per_ip = {k: 0 for k in req_dict.keys()}
    for ip, reqs in req_dict.items():
        fails_per_ip[ip] = sum([1 for req in reqs if 400 <= req[2] < 500])
    return fails_per_ip
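
A toy run, with made-up addresses from the documentation ranges, shows the shape of the input and output:

reqs = {
    "203.0.113.7": [("GET", "/", 200), ("GET", "/admin", 403)],
    "198.51.100.9": [("POST", "/login", 401), ("POST", "/login", 401)],
}
print(most_failures(reqs))  # {'203.0.113.7': 1, '198.51.100.9': 2}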

Test rig

Although Python 3.7+ guarantees the ordering of keys in its dict, I still wanted to pit a plain indexed list against it. These are the three variants of the original algorithm under test:

import time

def most_failures_using_list(req_dict: dict[str, list], indexed_ips: list[str], times: list):
    # Variant 1: a plain list of counts, positions matching indexed_ips.
    start = time.process_time()
    fails_per_ip_list = [sum([1 for req in req_dict[ip] if 400 <= req[2] < 500]) for ip in indexed_ips]
    t1 = time.process_time() - start
    times.append(t1)
    return fails_per_ip_list

def most_failures_explicit(req_dict: dict[str, list], times: list):
    # Variant 2, the original: a pre-allocated dict filled by an explicit loop.
    start = time.process_time()
    fails_per_ip = {k: 0 for k in req_dict.keys()}
    for ip, reqs in req_dict.items():
        fails_per_ip[ip] = sum([1 for req in reqs if 400 <= req[2] < 500])
    t2 = time.process_time() - start
    times.append(t2)
    return fails_per_ip

def most_failures_using_dict(req_dict: dict[str, list], times: list):
    # Variant 3: a dict comprehension builds the result in one pass.
    start = time.process_time()
    fails_per_ip2 = {ip: sum([1 for req in reqs if 400 <= req[2] < 500]) for ip, reqs in req_dict.items()}
    t3 = time.process_time() - start
    times.append(t3)
    return fails_per_ip2

I've given each of them their own times list because, in the interests of fairness, I will shuffle their order of execution.

Before I reveal how these are being run, let's compare the results.

Results

Comprehensions are very good for performance; possibly good enough to be worth the complaints from colleagues that they are unreadable.

I ran over 100 iterations on 10+ occasions, and the list was never beaten for speed. Typically it was a few hundred microseconds faster than the dict comprehension, about 1%.

The original, explicitly iterated function is about 3% slower than the dict comprehension.

The differences are much smaller than I anticipated, but very consistent. The processing algorithm takes the lion's share of the time; the comprehensions mostly just reduce memory-allocation overhead. That made me want to test one more variant:

def most_failures_explicit_no_pre_alloc(req_dict: dict[str, list], times: list):
    # A fourth variant: as the original explicit loop, but the dict starts empty.
    start = time.process_time()
    fails_per_ip = {}
    for ip, reqs in req_dict.items():
        fails_per_ip[ip] = sum([1 for req in reqs if 400 <= req[2] < 500])
    t2 = time.process_time() - start
    times.append(t2)
    return fails_per_ip

For completeness, here is another variant, pushing to a list instead of using a comprehension:

def most_failures_using_list_append(req_dict: dict[str, list], indexed_ips: list[str], times: list):
    # A fifth variant: as the list comprehension, but appending in an explicit loop.
    start = time.process_time()
    fails_per_ip_list = []
    for ip in indexed_ips:
        fails_per_ip_list.append(sum([1 for req in req_dict[ip] if 400 <= req[2] < 500]))
    t1 = time.process_time() - start
    times.append(t1)
    return fails_per_ip_list

Here's the pytest function I've commandeered to orchestrate the tests (run with pytest -s so the timing prints aren't swallowed by pytest's output capturing):

from functools import partial
import random

def test_most_failures(req_dict):
    # req_dict is a fixture condensing 38,642 log lines into a couple thousand distinct IPs.
    ip_list = list(req_dict.keys())
    # One times sub-list per variant; each partial binds its own, so the
    # order of execution can be shuffled without mixing up measurements.
    times = [[], [], [], [], []]
    funcs = [
        partial(most_failures_using_list, req_dict, ip_list, times[0]),
        partial(most_failures_using_list_append, req_dict, ip_list, times[1]),
        partial(most_failures_explicit, req_dict, times[2]),
        partial(most_failures_explicit_no_pre_alloc, req_dict, times[3]),
        partial(most_failures_using_dict, req_dict, times[4]),
    ]
    iter_cnt = 1000
    for _ in range(iter_cnt):
        random.shuffle(funcs)
        results = []
        for func in funcs:
            results.append(func())
        assert len(results[0]) == len(results[1]) == len(results[2]) == len(results[3]) == len(results[4])
    print("list iteration of failures took {:.04f}".format(sum(times[0]) / iter_cnt))
    print("list appending failures took {:.04f}".format(sum(times[1]) / iter_cnt))
    print("Pre declared dict of failures took {:.04f}".format(sum(times[2]) / iter_cnt))
    print("Ad-hoc allocated dict, of failures, took {:.04f}".format(sum(times[3]) / iter_cnt))
    print("dict comprehension of failures took {:.04f}".format(sum(times[4]) / iter_cnt))

Revised results

Indexed arrays are no longer always the fastest; over 10,000 iterations they work out the same as the equivalent associative array. Comprehensions are always faster.

By not pre-allocating the dict to zeros in the explicitly iterated function, we save about 1.5% of the time. This surprised me most; I am used to one pre-allocated memory block costing much less time than repeated, smaller allocation requests.

Here are the results for 10,000 iterations, as average seconds per call:

list iteration of failures took 0.0182
list appending failures took 0.0184
Pre declared dict of failures took 0.0186
Ad-hoc allocated dict, of failures, took 0.0182
dict comprehension of failures took 0.0183

When I ran it earlier, without the "list appending" variant, the two comprehensions were equal:

list of failures took 0.0175
Pre declared dict of failures took 0.0180
Ad-hoc allocated dict, of failures, took 0.0177
dict comprehension of failures took 0.0175

Discussion, and more testing

I wanted to say comprehensions are fast, but those last 10,000 iterations showed an explicit loop to be joint fastest, faster even than the dict comprehension. Even over 10,000 iterations, though, I'm sceptical: it could be that the last sample was just well optimised by an intelligent interpreter, or that adding the fifth variant caused more cache misses.

I'm using Python 3.9.5 on Ubuntu.

I decided to dumb down the functions so that memory allocation featured more heavily. I did this by simply removing the less-than and greater-than tests, if 400 <= req[2] < 500, leaving the inner comprehension ripe for optimising away:

sum([1 for req in reqs])
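
What remains just counts reqs, so it is equivalent to len(reqs); note also that the square brackets make sum allocate a throwaway list, which a bare generator expression would not:

# All three produce the same count; only the first allocates an intermediate list.
assert sum([1 for req in reqs]) == sum(1 for req in reqs) == len(reqs)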

The results are closer to what I was expecting before introducing the fifth variant:

list iteration of failures took 0.0089
list appending failures took 0.0091
Pre declared dict of failures took 0.0093
Ad-hoc allocated dict, of failures, took 0.0090
dict comprehension of failures took 0.0089

So comprehensions are fast. To test my caching theory I reinstated the 4xx test and removed one of the variants, the pre-declared dict:

list iteration of failures took 0.0165
list appending failures took 0.0169
Ad-hoc allocated dict, of failures, took 0.0167
dict comprehension of failures took 0.0166

Conclusion

Different structures make only a tiny difference, and that difference has to be measured in the context in which the structures will be used.