Avenger Penguin

Python and Set Theory

Read more posts on Software

The Python programming language has a nice declarative syntax for data sructures known as comprehensions. They provide a quick, concise way to set up lists, sets, generators and dictionaries.

List comprehensions

So, for example, we can quickly create a list of square numbers:

In [1]:
squares = [n * n for n in range(1, 11)]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

Compare this with an imperative approach as is required in many other languages:

In [2]:
squares = []
for n in range(1, 11):
    squares.append(n * n)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

This is perhaps only a little more verbose, but the difference is even greater once we introduce conditions:

In [3]:
even_squares = [
    n * n for n in range(1, 11) if n % 2 == 0
[4, 16, 36, 64, 100]

Compared to:

In [4]:
even_squares = []
for n in range(1, 11):
    if n % 2 == 0:
        even_squares.append(n * n)
[4, 16, 36, 64, 100]

But the real expressivity comes from the fact that a single-expression syntax can be used anonymously without the need to declare a variable at all:

In [5]:
[n * n for n in range(1, 11) if n % 2 == 0]
[4, 16, 36, 64, 100]

Other comprehensions

Additionally, comprehensions go beyond lists as we can do similar for generators, dictionaries and sets. See if you can understand what these data structures are:

In [6]:
import string

{string.ascii_uppercase[n]: n for n in range(0,6)}
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5}
In [7]:
# set of unique letters in sentence
{letter for letter in 'python' + 'set' + 'theory'}
{'e', 'h', 'n', 'o', 'p', 'r', 's', 't', 'y'}
In [8]:
from itertools import count, takewhile

# Creates infinite list of squares
square_generator = (n * n for n in count())
# Print out the first four

There are many blog posts out there on how to use comprehensions in different contexts, but here I want to focus on one of my favourite applications: working with applications of set theory.

Set theory and Python comprehensions

In mathematics, one might describe a basic set of six numbers as:

\begin{equation*} A = \{1, 2, 3, 4, 5, 6\} \end{equation*}

And Python’s set comprehension nicely happens to be identical here:

In [9]:
A = {1, 2, 3, 4, 5, 6}
{1, 2, 3, 4, 5, 6}

In mathematics, we might describe subset of all the even numbers in $A$ as:

\begin{equation*} B = \{n \in A \: | \: even(n) \} \end{equation*}

And in Python this becomes:

In [10]:
even = lambda x: x % 2 == 0

B = {n for n in A if even(n)}
{2, 4, 6}

It should be clear that there is a nice one-to-one correspondence between n for n in and $\in$ here. Additionally, the vertical bar | corresponds to the if within the comprehension. This makes it pretty easy to transliterate mathematical expressions directly into Python code with simple syntax mappings.

More imperitive languages such as C, Java or JavaScript would need a bit more step-by-step code just to create such mathematical objects, adding some overhead on top of code that actually then works with those objects.

I personally like the simple correspondence as implementing mathematical concepts become easier. I have found this useful for rapidly prototying ideas from academic papers.

This is especially powerful when working with practical data structures that follow set-theoretic principles, such as RDF graphs. I will cover this in more depth in future articles, but for now I can give one simpler example where I have applied this.

How might we actually apply this?

An example of where this can be applied is in synchronising users (or other data) between two systems. Imagine we have a directory of users (e.g. LDAP or Active Directory) that we want to keep in sync with a user database behind another system in your organisation such as a directory lookup application. In our organisation a user record is only made up of a first name and the department they work in:

In [11]:
from collections import namedtuple

User = namedtuple('User', ['name', 'department'])

We might also have some code that can query the lists of users in each system which we can stub out here for for illustrative purposes:

In [12]:
def get_users_from_directory():
    return [
        User(name='Alice', department='R&D'),
        User(name='Bob', department='HR'),
        User(name='Clare', department='Sales'),

def get_users_from_lookup_service():
    return [
        User(name='Bob', department='HR'),
        User(name='Clare', department='HR'),
        User(name='David', department='IT'),

Note how in this example we are simulating three situations: firstly, Alice is new to the organisation and is yet to be synced to the lookup service. Secondly, Clare has recently moved from Sales to HR. Thirdly, David used to work in the IT department but has left the organisation since these systems were last synchronised.

If we are using Python and comprehensions, we can use sets and set-theoretic operations as part of our synchronisation:

In [13]:
synced_users = {
    user for user in get_users_from_lookup_service()
actual_users = {
    user for user in get_users_from_directory()

discrepancies = synced_users ^ actual_users
{User(name='David', department='IT'), User(name='Clare', department='HR'), User(name='Alice', department='R&D'), User(name='Clare', department='Sales')}

So, what’s happened here? We’ve declared two sets: synced_users is the set of user profiles that are currently in our target lookup service and actual_users is the authoritative set of users that are actually employed in our organisation.

We can then do a symmetric difference (aka an XOR operation) to get a list of user profiles that are missing, need to be moved or are otherwise out of sync.

Finally, we just need to loop through the discrepancies and process each as appropriate:

In [14]:
def remove_user_from_loop_service(user):
    print('Removing: {}'.format(user))

def add_user_to_loop_service(user):
    print('Adding:   {}'.format(user))

for user in discrepancies:
    if user in synced_users:
    if user in actual_users:
Removing: User(name='David', department='IT')
Removing: User(name='Clare', department='HR')
Adding:   User(name='Alice', department='R&D')
Adding:   User(name='Clare', department='Sales')

So, in a small amount of (hopefully) readable code, we can do a basic bidirectional sync between two systems. This simple approach makes the assumption that it is ok to remove and re-add Clare as she has moved departments. This might not be appropriate for all cases and in some implementations of this I have explicitly looked for the same user appearing twice in the discrepancies set to process them differently to new or removed users.

Python has a reputation as an expressive and concise language and I think that comprehensions go a long way to drive that. The fact that I can transcribe mathematical expressions directly into the language is both useful for me but also shows how small bits of declarative syntax — in a language that is relatively imperative in other ways — can really create clear, readable code without lots of distracting noise around it.