S3 path/pseudo-folder locking

Efficient concurrent access when using S3 as a filesystem, via read/write tree locking

AWS S3 is a key/value store, whose operations act on a single key at a time. There is no native concept of a folder: the closest thing is a group of keys with the same prefix. This means that in applications that treat S3 as a filesystem, operations on such pseudo-folders, such as renames or copies, are not atomic: if performed by different users at the same time, they can corrupt the key structure. Here, I present a method to mitigate the chance of this, based on locking.

Some familiarity with S3 is assumed, especially the PUT/GET/DELETE APIs for objects.

Pseudo-folders, paths, and key prefixes

We describe pseudo-folders through an example. Consider 4 objects stored in an S3 bucket, with slashes in their keys. Other than the fact that they are in the same bucket, there is no structure.

    • \( a/b/c \)
    • \( a/b/d \)
    • \( e/f/g \)
    • \( e/f/h \)

However, since these keys have slashes and common prefixes, we can see them as though they are part of a tree [also known as a hierarchy] of paths, below the root path \(/\), where the leaves of the tree map to the objects in the bucket.

  • \(/\)
    • \(/a\)
      • \(/a/b\)
        • \(/a/b/c → a/b/c\)
        • \(/a/b/d → a/b/d\)
    • \(/e\)
      • \(/e/f\)
        • \(/e/f/g → e/f/g\)
        • \(/e/f/h → e/f/h\)

The pseudo-folders in this example are those at the paths \(/\), \(/a\), \(/a/b\), \(/a/b/c\), \(/a/b/d\), \(/e\), \(/e/f\), \(/e/f/g\) and \(/e/f/h\).

This list of pseudo-folders includes the paths that map to objects, such as \(/a/b/c\) and \(/e/f/h\). While a traditional filesystem does not allow a path to be both a folder and a file, S3 does not forbid the equivalent: for example, both \(a/b\) and \(a/b/c\) could be keys to objects. This fact means that every path should be treated as a pseudo-folder.

Note that the List Objects API allows you to list the objects in a pseudo-folder, i.e. objects with the same key prefix, using the prefix and delimiter options, and the AWS console uses this to provide a reasonable illusion of folders. However, there is no API that provides atomic operations on all the objects in such a list.
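
For example, using boto3, listing the immediate contents of the pseudo-folder \(/a/b\) might look like the below [the bucket name is illustrative, and only the first page of results is fetched; a paginator would be needed for more than 1,000 keys].

import boto3

s3 = boto3.client("s3")

# List the immediate contents of the pseudo-folder /a/b: objects directly
# under the prefix are returned in Contents, while deeper pseudo-folders
# are rolled up into CommonPrefixes
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="a/b/", Delimiter="/")

for obj in response.get("Contents", []):
  print("object:", obj["Key"])

for common_prefix in response.get("CommonPrefixes", []):
  print("pseudo-folder:", common_prefix["Prefix"])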

Notation

S3 object keys will not start with a slash. Corresponding paths and pseudo-folders will always be written in full, starting with a forward slash /.

Scope

We consider 4 categories of operations on pseudo-folders that we would like to be atomic.

  • Write & Delete: a write of one pseudo-folder
  • Rename: a write of two pseudo-folders
  • Copy: a read of one pseudo-folder and a write of another
  • Read: a read of one pseudo-folder

We don't try to maintain atomicity in the case of the server going down mid-way through such operations. We also don't concern ourselves with the eventual-consistency properties of some operations on S3. We only consider locking schemes, and of those, only schemes where locking and unlocking a given path takes a number of operations that is independent of both the number of locks currently held and the number of descendants the path has, i.e. is constant-time.

Single-object GETs

It should be noted that a GET of a single object in S3 is atomic: it will either succeed with a consistent object, or fail with a 404. Therefore GETs of single objects won't typically need to be protected by locks. Note that whenever we mention a read of a pseudo-folder, we mean a read that needs an atomic view of the entire pseudo-folder: for example, making a zip file from all of its contents.

Single-object PUTs, however, even though they are also atomic, may need locking in order to maintain whatever key structure the application enforces.

Locking interface & usage

Knowing the categories of operations we are interested in, we can design an API for a class, PathLock. If implemented, we could write something like the below.

import asyncio
from path_lock import PathLock

lock = PathLock()

async def delete(path):
  async with lock(read=[], write=[path]):
    ...

async def write(path, ...):
  async with lock(read=[], write=[path]):
    ...

async def rename(path_from, path_to):
  async with lock(read=[], write=[path_from, path_to]):
    ...

async def copy(path_from, path_to):
  async with lock(read=[path_from], write=[path_to]):
    ...

async def read(path):
  async with lock(read=[path], write=[]):
    ...

Each path argument can be some object that represents the path of a pseudo-folder, such as a PurePosixPath.

Granularity

As is typical with locks, we have the problem of making the lock as granular as possible: it should only block operations that would be unsafe to run concurrently, and other operations should proceed unhampered. In our case, the unsafe combinations of operations on a pseudo-folder are "writes and writes", and "writes and reads". We aim to make a locking system that blocks these, but allows everything else to proceed.

At first thought, it might appear that a read/write lock on each pseudo-folder would be enough. However, given that pseudo-folders can be nested, it takes some thought to use such locks around these operations while keeping the locking both granular and constant-time.

We'll start with an example, and then generalise.


Consider a copy of \(/a/b\) to \(/a/b'\). This is a read of the pseudo-folder at the path \(/a/b\), i.e. from all possible keys prefixed with \(a/b/\), and a write of \(/a/b'\), i.e. to keys prefixed with \(a/b'/\).

  • \(/\)
    • \(/a\)
      • \(/a/b\) (read)
        • \(/a/b/c → a/b/c\)
        • \(/a/b/d → a/b/d\)
      • \(/a/b'\) (write)
        • \(/a/b'/c → a/b'/c\)
        • \(/a/b'/d → a/b'/d\)
    • \(/e\)
      • \(/e/f\)
        • \(/e/f/g → e/f/g\)
        • \(/e/f/h → e/f/h\)

During this copy, the following operations are not compatible, and should be blocked to ensure atomicity:

  • reads of the pseudo-folder \(/a/b'\)
  • reads of any descendant pseudo-folders of \(/a/b'\)
  • reads of the ancestor pseudo-folders of \(/a/b'\), namely \(/a\) and \(/\)
  • writes of the pseudo-folders \(/a/b\) and \(/a/b'\)
  • writes of any descendant pseudo-folders of \(/a/b\) and \(/a/b'\)
  • writes of the ancestor pseudo-folders of \(/a/b\) and \(/a/b'\), namely \(/a\) and \(/\)

All other read and write operations are compatible.

We can write these rules more succinctly using the concept of lineage. We define \(\mathbb{L}(p)\) as the lineage of a path \(p\): the union of all of \(p\)'s ancestors, descendants, and \(\{p\}\) itself. We also define \(R(A)\) as all the read operations on the set of paths \(A\), and \(W(A)\) as all the write operations on the set of paths \(A\). Using these terms, we can say that a copy from the pseudo-folder at path \(/a/b\) to \(/a/b'\) should block

  • \(W(\mathbb{L}(/a/b))\)
  • \(R(\mathbb{L}(/a/b'))\)
  • \(W(\mathbb{L}(/a/b'))\).

We generalise this further and state that for a path \(p\)

  • \(R(\{p\})\) should block \(W(\mathbb{L}(p))\)
  • \(W(\{p\})\) should block \(W(\mathbb{L}(p))\) and \(R(\mathbb{L}(p))\).

We can express these rules a different way, and say that for any path \(p\)

  • \(R(\mathbb{L}(p))\) should block \(W(\mathbb{L}(p))\)
  • \(W(\mathbb{L}(p))\) should block \(W(\mathbb{L}(p))\) and \(R(\mathbb{L}(p))\).

The second rule above implies the first, so we just need:

  • \(W(\mathbb{L}(p))\) should block \(W(\mathbb{L}(p))\) and \(R(\mathbb{L}(p))\).

We can see that our aim is to be able to construct a read/write lock on \(\mathbb{L}(p)\) for all paths \(p\).
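
As an aside, lineage membership is cheap to check in code: \(q\) is in \(\mathbb{L}(p)\) exactly when the components of one path are a prefix of the components of the other. A minimal sketch using PurePosixPath [the helper name is mine]:

from pathlib import PurePosixPath

def in_lineage(p, q):
  # q is in the lineage of p if q is an ancestor of p, a descendant of p,
  # or p itself, i.e. if the shorter path's components are a prefix of
  # the longer path's components
  shorter = min(len(p.parts), len(q.parts))
  return p.parts[:shorter] == q.parts[:shorter]

# in_lineage(PurePosixPath("/a/b"), PurePosixPath("/a"))      -> True
# in_lineage(PurePosixPath("/a/b"), PurePosixPath("/a/b/c"))  -> True
# in_lineage(PurePosixPath("/a/b"), PurePosixPath("/e"))      -> False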

We can also express this as a compatibility table, showing what operations should, and should not, run concurrently for two tasks. We also define \(\mathbb{L}^c(p)\) as the set of all paths not in the lineage of \(p\) in order to better show how locking schemes differ in the concurrency they allow.

Task 1 | Task 2 | Requirement
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | must block
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}(p))\) | must block
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}^c(p))\) | should allow
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | should allow
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | should allow
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | should allow

This table also gives us a way to compare locking schemes. All schemes must block the 2 incompatible combinations of operations, but that leaves 4 combinations they should allow for maximum concurrency, and so we can assign each potential scheme a score out of 4.

Global exclusive lock: 0/4

As an initial example, we can consider an extremely simple implementation of PathLock that ignores its arguments and defers to a single instance of asyncio.Lock.
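
A minimal sketch of such an implementation, matching the PathLock interface above [the class name is mine], could be the below.

import asyncio
from contextlib import asynccontextmanager

class GlobalExclusivePathLock:
  def __init__(self):
    self._lock = asyncio.Lock()

  @asynccontextmanager
  async def __call__(self, read, write):
    # The lists of paths to read and write are ignored entirely: every
    # operation, read or write, takes the same global lock
    async with self._lock:
      yield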

As you might expect, this doesn't have great concurrency properties:

Task 1 | Task 2 | Behaviour
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}^c(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | blocked
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | blocked

But the implementation is tremendously simple. Keeping in mind that single-object GETs don't need to be locked, this may be good enough for many situations.

Global read/write lock: 2/4

Better locking can be achieved by using a global read/write lock, with two modes \(\overline{R}\) and \(\overline{W}\), with the compatibility table

| \(\overline{R}\) | \(\overline{W}\)
\(\overline{R}\) | compatible | conflicts
\(\overline{W}\) | conflicts | conflicts

[To distinguish read and write locks from the similarly named operations, we write locks with a horizontal bar.]

which could be implemented as below.
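
One possible sketch, built on an asyncio.Condition [the class name is mine; writer fairness and task-cancellation handling are ignored for brevity]:

import asyncio
from contextlib import asynccontextmanager

class GlobalReadWritePathLock:
  def __init__(self):
    self._cond = asyncio.Condition()
    self._readers = 0      # number of tasks holding the lock in read mode
    self._writing = False  # whether a task holds the lock in write mode

  @asynccontextmanager
  async def __call__(self, read, write):
    # Any write path means the whole operation takes the global write
    # mode; otherwise it takes the global read mode. The paths themselves
    # are ignored.
    if write:
      async with self._cond:
        await self._cond.wait_for(lambda: not self._writing and self._readers == 0)
        self._writing = True
      try:
        yield
      finally:
        async with self._cond:
          self._writing = False
          self._cond.notify_all()
    else:
      async with self._cond:
        await self._cond.wait_for(lambda: not self._writing)
        self._readers += 1
      try:
        yield
      finally:
        async with self._cond:
          self._readers -= 1
          self._cond.notify_all()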

As you might expect, this allows more concurrency than the global exclusive lock.

Task 1 | Task 2 | Behaviour
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}^c(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | blocked
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | allowed
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | allowed

As when using the global exclusive lock, this may well be good enough for many cases.

Ancestor-locking and read/write locks on each path: 3/4

To improve upon the above, we construct an algorithm inspired by [1]. Instead of a global read/write lock, we maintain a read/write lock for each path \(p\).

| \(\overline{R}(p)\) | \(\overline{W}(p)\)
\(\overline{R}(p)\) | compatible | conflicts
\(\overline{W}(p)\) | conflicts | conflicts

When we need access, either read or write, to pseudo-folders, we acquire the write lock \(\overline{W}(p)\) on the path \(p\) of each pseudo-folder, and a read lock \(\overline{R}(p)\) on each of their ancestor paths.

In order to access pseudo-folders at paths \(P\), we acquire the locks

  • \(\overline{W}(p)\) for all \(p\) in \(P\)
  • \(\overline{R}(p)\) for all \(p\) that are ancestors of the paths in \(P\), except any in \(P\)

To avoid deadlock, the locks are acquired ancestor-first, and then in lexicographical order on path-component name; and each task attempts to acquire at most one lock at a time [proof]. For example, if a task needed to access \(/a/b\) and \(/a/b'\) it would

  1. acquire a read lock on \(/\),
  2. acquire a read lock on \(/a\),
  3. acquire a write lock on \(/a/b\), and
  4. acquire a write lock on \(/a/b'\).

If a concurrent task wanted to lock anything in \(\mathbb{L}(/a/b)\), it would get blocked. For example, if another task tried to access an ancestor of \(/a/b\), such as \(/a\), it would try to acquire the write lock on \(/a\), and get blocked by the original task's read lock on \(/a\). Alternatively, if a task tried to access a descendant of \(/a/b\), such as \(/a/b/c\), it would get blocked by the write lock on \(/a/b\).

A downside of this is that there is no distinction between reads and writes, which means that concurrent reads of the same lineage are forbidden, as can be seen in the below compatibility table.

Task 1 | Task 2 | Behaviour
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}^c(p))\) | allowed
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | allowed
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | allowed

However, keeping in mind the single-object GET exception, this may be acceptable.

Note also that if a path isn't locked, then the lock object itself doesn't need to be retained in memory. As in the implementation below, a weak-reference cache can be used for these.
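
A sketch of such an implementation, matching the PathLock interface [the class and helper names are mine, and fairness and cancellation handling are again ignored for brevity]:

import asyncio
import weakref
from contextlib import asynccontextmanager
from pathlib import PurePosixPath

class ReadWriteLock:
  # A minimal per-path read/write lock
  def __init__(self):
    self._cond = asyncio.Condition()
    self._readers = 0
    self._writing = False

  async def acquire(self, write):
    async with self._cond:
      if write:
        await self._cond.wait_for(lambda: not self._writing and self._readers == 0)
        self._writing = True
      else:
        await self._cond.wait_for(lambda: not self._writing)
        self._readers += 1

  async def release(self, write):
    async with self._cond:
      if write:
        self._writing = False
      else:
        self._readers -= 1
      self._cond.notify_all()

class AncestorPathLock:
  def __init__(self):
    # Locks are created on demand; entries for paths that no task is
    # currently using can be garbage collected
    self._locks = weakref.WeakValueDictionary()

  def _lock_for(self, path):
    lock = self._locks.get(path)
    if lock is None:
      lock = ReadWriteLock()
      self._locks[path] = lock
    return lock

  @asynccontextmanager
  async def __call__(self, read, write):
    # This scheme does not distinguish reads from writes: every requested
    # path gets a write lock, and every ancestor gets a read lock
    paths = {PurePosixPath(p) for p in [*read, *write]}
    ancestors = {ancestor for p in paths for ancestor in p.parents} - paths

    # Ancestor-first, then lexicographical on path-component name: sorting
    # on the tuple of components gives exactly this order
    plan = sorted(
      [(p, True) for p in paths] + [(p, False) for p in ancestors],
      key=lambda path_write: path_write[0].parts,
    )

    acquired = []
    try:
      for path, is_write in plan:
        lock = self._lock_for(path)
        await lock.acquire(write=is_write)
        acquired.append((lock, is_write))  # strong reference keeps it cached
      yield
    finally:
      for lock, is_write in reversed(acquired):
        await lock.release(write=is_write)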

Ancestor-locking and read/write/ancestor locks on each path: 4/4

We can improve on the above by maintaining a different sort of lock on each path. Instead of a read/write lock with two modes, \(\overline{R}\) and \(\overline{W}\), we construct one with four: \(\overline{R}\), \(\overline{W}\), \(\overline{R_A}\) (read-ancestor), and \(\overline{W_A}\) (write-ancestor).

| \(\overline{R_A}(p)\) | \(\overline{R}(p)\) | \(\overline{W_A}(p)\) | \(\overline{W}(p)\)
\(\overline{R_A}(p)\) | compatible | compatible | compatible | conflicts
\(\overline{R}(p)\) | compatible | compatible | conflicts | conflicts
\(\overline{W_A}(p)\) | compatible | conflicts | compatible | conflicts
\(\overline{W}(p)\) | conflicts | conflicts | conflicts | conflicts
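
As a sketch, a lock with these four modes could be built on asyncio by granting waiting tasks in arrival order whenever their requested mode is compatible with every mode currently held [the class name is mine; the set of conflicting pairs is my reconstruction of the table above, and cancellation handling is omitted for brevity].

import asyncio
from collections import deque
from enum import Enum

class Mode(Enum):
  READ_ANCESTOR = "read-ancestor"
  READ = "read"
  WRITE_ANCESTOR = "write-ancestor"
  WRITE = "write"

# Conflicting pairs of modes; every other pair is compatible. A
# single-element frozenset represents a mode that conflicts with itself.
_CONFLICTS = {
  frozenset({Mode.WRITE}),
  frozenset({Mode.WRITE, Mode.READ}),
  frozenset({Mode.WRITE, Mode.READ_ANCESTOR}),
  frozenset({Mode.WRITE, Mode.WRITE_ANCESTOR}),
  frozenset({Mode.READ, Mode.WRITE_ANCESTOR}),
}

def _compatible(a, b):
  return frozenset({a, b}) not in _CONFLICTS

class ModeLock:
  def __init__(self):
    self._held = []          # modes currently held, one entry per holder
    self._waiters = deque()  # (mode, future) pairs, in arrival order

  async def acquire(self, mode):
    if not self._waiters and all(_compatible(mode, held) for held in self._held):
      self._held.append(mode)
      return
    future = asyncio.get_running_loop().create_future()
    self._waiters.append((mode, future))
    await future

  def release(self, mode):
    self._held.remove(mode)
    # Grant waiters, in order, while they are compatible with all holders
    while self._waiters and all(
        _compatible(self._waiters[0][0], held) for held in self._held):
      granted, future = self._waiters.popleft()
      self._held.append(granted)
      future.set_result(None)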

In order to read the pseudo-folders at paths \(P_R\) and write the pseudo-folders at paths \(P_W\), we acquire the locks

  • \(\overline{W}(p)\) for all \(p\) in \(P_W\)
  • \(\overline{W_A}(p)\) for all \(p\) that are ancestors of the paths in \(P_W\), except any in \(P_W\)
  • \(\overline{R}(p)\) for all \(p\) in \(P_R\), except any in \(P_W\) and any ancestors of \(P_W\)
  • \(\overline{R_A}(p)\) for all \(p\) that are ancestors of the paths in \(P_R\), except any in \(P_W\), and any ancestors of \(P_W\), and any in \(P_R\)

To avoid deadlock, these locks are acquired ancestor-first, and then in lexicographical order on path-component name; and each task attempts to acquire at most one lock at a time [proof].
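
A sketch of computing which lock mode to acquire on which path, in this order [the function name is mine]:

from pathlib import PurePosixPath

def lock_plan(read_paths, write_paths):
  # Returns a list of (path, mode) pairs, ancestor-first and then in
  # lexicographical order on path-component name
  reads = {PurePosixPath(p) for p in read_paths}
  writes = {PurePosixPath(p) for p in write_paths}

  write_ancestors = {a for p in writes for a in p.parents} - writes
  plain_reads = reads - writes - write_ancestors
  read_ancestors = (
    {a for p in reads for a in p.parents}
    - writes - write_ancestors - reads
  )

  modes = {}
  modes.update({p: "read-ancestor" for p in read_ancestors})
  modes.update({p: "read" for p in plain_reads})
  modes.update({p: "write-ancestor" for p in write_ancestors})
  modes.update({p: "write" for p in writes})

  # Sorting on the tuple of components gives ancestor-first, then
  # lexicographical order
  return sorted(modes.items(), key=lambda path_mode: path_mode[0].parts)

For the copy example earlier, lock_plan(["/a/b"], ["/a/b'"]) gives a write-ancestor lock on \(/\) and \(/a\), a read lock on \(/a/b\), and a write lock on \(/a/b'\).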

Using this set of locks, we can then achieve a perfect concurrency score of 4/4, as shown in the following compatibility table.

Task 1 | Task 2 | Behaviour
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}(p))\) | blocked
\(W(\mathbb{L}(p))\) | \(W(\mathbb{L}^c(p))\) | allowed
\(W(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | allowed
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}(p))\) | allowed
\(R(\mathbb{L}(p))\) | \(R(\mathbb{L}^c(p))\) | allowed

As per the version that uses read/write locks, if a path isn't locked, then the lock object itself doesn't need to be retained in memory: a weak-reference cache can be used.

Proof that ancestor-locking is deadlock-free

The locks are always acquired along a sequence: ancestor-first and then lexicographic, and therefore the locks form a total order. So we prove that, for any algorithm where each task acquires its locks in a strictly increasing sequence along a total order, waiting for at most one lock at a time, deadlock is impossible. The proof doesn't depend on properties of the modes of the locks, and so can be applied to both of the ancestor-locking methods presented in this post.

Conditions

Let \(L=\{L_i\}_{i \in S} \) for some \(S \subset \mathbb{Z} \) be a set of independent locks, each with at least one mode, and let \(<_L\) be a strict total order relation between elements of \(L\). Let \(T = \{T_i\}_0^{M-1}\) for some \(M \in \mathbb{N} \) be a set of tasks, where each acquires locks in \(L\) in a strictly increasing order according to \(<_L\), and at any given moment is waiting to acquire at most one lock.

Statement

It is not possible for \(T\) to be in deadlock.

Proof

Assume \(T\) are in deadlock. We'll show that this leads to a contradiction.

Without loss of generality, we index the \( \{L_i\}_{i \in S} \) such that they form a strictly increasing sequence according to \(<_L\)

\[ L_0 <_L L_1 <_L \ldots .\]

By definition of deadlock, \(M \geq 2\), each \(T_i\) has acquired at least one lock, and each is waiting for a lock held by another task in \( T \). For each \(T_i\), let \( A_i \) be the indexes of the locks that \(T_i\) has acquired, and let \(w_i\) be the index of the lock that it waits for. Since all tasks acquire their locks in increasing order,

\[ L_a <_L L_{w_i}, \forall a \in A_i \]

or equivalently

\[ a < w_i, \forall a \in A_i \tag{1}\label{eq:a} .\]

Without loss of generality, we index the tasks such that each \( T_i \) is waiting for a lock held by \( T_{i+1 \pmod{M}} \) [each task waits for exactly one other, so following the waits-for relation from any task must eventually revisit a task, giving such a cycle; we restrict \(T\) to the tasks in it]. So each \(w_i \in A_{i+1 \pmod{M}}\), and so by \eqref{eq:a} \(w_i < w_{i + 1 \pmod{M}} \). This is true for all \( i \in \{0, \ldots, M-1\} \), and so

\[ w_0 < w_{1 \pmod{M}} < \ldots < w_{M-1\pmod{M}} < w_{M \pmod{M}} \]

and therefore

\[ w_0 < w_{M \pmod{M}} .\]

However, \(M \pmod{M} = 0\), which implies \[ w_0 < w_0 .\]

This is a contradiction. \(\square\)