I implemented a double-ended queue type (deque) in Swift to satisfy a particular need in one of my programs. The algorithm itself isn’t very interesting and I’ll only discuss it briefly.
Instead, I’ll focus on the challenges involved in implementing a copy-on-write type in Swift and optimizing the code to satisfy my core aim of running at least as fast as `Array` when used as a first-in-first-out (FIFO) queue for queue lengths around 20 items.
It turns out to be a tricky optimization task due to problems with the functions used to access the copy-on-write buffer. Join me for a look at optimizing a Swift copy-on-write collection.
Introduction
The Swift standard library doesn’t have many built-in container types. It was Swift 1.2 before it introduced a basic `Set` type. While there are a number of utility storage types, like `CollectionOfOne`, in general Swift offers `Array`, `Dictionary` and `Set` as its only general purpose collection types.
A common data structure for FIFO queues is a double-ended queue or `Deque` type but Swift lacks this or any other type well suited to use as a FIFO queue.
You won’t find a double-ended queue in the Foundation library either but `CFArray`/`NSMutableArray` addresses this deficiency by devolving into a deque under queue-like operations. This lets you use an `NSMutableArray` as a double-ended queue. A million years ago (in 2005) the Ridiculous Fish blog wrote about this in an article elaborately titled: “Array”. It’s a fascinating read, if you haven’t read it before. However, elements in an `NSMutableArray` need to be heap allocated and the container itself is not copy-on-write – neither of which seems ideal in Swift so I didn’t want to use this option.
There is an existing, commonly used `Deque` implementation which appears to be solid and well maintained. However, I was predominantly interested in queues of length 0 to 20 with an initial capacity of zero and, under these conditions, I found this implementation performed slower than `Array`. I chose not to use this option because I wanted a sharper focus on this small queue, small memory footprint use-case.
I could try to graft a Swift copy-on-write wrapper onto a C++ `std::deque` but ultimately I chose not to take this option either – although I’d still like to try it in future as I’m curious to work out what would be involved in using C++ to store Swift data (I assume C++ would function as little more than byte-level storage).
Ultimately, I found myself in a position where I would need to write the type myself. I wanted to satisfy the following aims:
- Copy-on-write
- No heap allocation until the first element is added
- Automatic growing and downsizing of storage (down to a minimum capacity)
- As fast as `Array` (or faster) for pushing 10 elements then popping them all in a FIFO fashion, and considerably faster for 50
- Implement all of `RandomAccessCollection`, `RangeReplaceableCollection`, `ExpressibleByArrayLiteral` and `CustomDebugStringConvertible`
Since it is the easiest type of double-ended queue to implement, I’ll be implementing a “circular-buffer” style double-ended queue – one where the first element is allowed to have an “offset” from the front of the storage, and where successive pops from one end and pushes to the other cause the initialized values to shift through the storage until they wrap around at the end.
If you only append elements to the end of a double-ended queue, it will behave like a standard array. Differences start to occur when you remove elements from the front.
If you remove elements from the front of an `Array`, it will move the values so index `0` is at the start of the buffer. By comparison, the circular-buffer allows the index `0` element to be offset within the buffer – in the last diagram, the index `0` element is at buffer start plus 3. The caveat is that the storage might not be contiguous – in the last diagram, the index `3` element has wrapped around to the start of the buffer.
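To make the wrap-around behavior concrete, here’s the index arithmetic a circular buffer implies (a hypothetical standalone helper, not code from the implementation):
// Hypothetical helper illustrating circular-buffer index mapping: a logical
// index 0..<count maps to a physical location that may wrap past the end.
func physicalIndex(forLogical index: Int, offset: Int, capacity: Int) -> Int {
   let unwrapped = offset + index
   return unwrapped < capacity ? unwrapped : unwrapped - capacity
}

// Assuming a capacity of 6 with an offset of 3 (matching the diagrams
// described above): logical index 0 is at physical index 3 and logical
// index 3 wraps around to physical index 0.
assert(physicalIndex(forLogical: 0, offset: 3, capacity: 6) == 3)
assert(physicalIndex(forLogical: 3, offset: 3, capacity: 6) == 0)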
This is not usually the optimal approach to a double-ended queue (that would involve a more complex split-buffer or heap storage) but we’ll see how it goes.
Copy-on-write in Swift
In Swift, idiomatic collection types are usually “copy-on-write”. This means that assigning a collection to a new variable merely increases the reference count on the original. A subsequent attempt to mutate a multiply referenced collection causes the mutating reference to copy the contents to a unique location and mutate that unique, rather than shared, version.
This is how `Array`, `Dictionary` and `Set` behave.
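You can see the behavior directly with `Array`:
// No copy occurs at assignment; the copy happens at the first shared mutation
let a = [1, 2, 3]
var b = a        // `a` and `b` share one storage buffer
b.append(4)      // storage is shared, so `b` copies before mutating
assert(a == [1, 2, 3])
assert(b == [1, 2, 3, 4])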
For this to work, the underlying storage of the collection must be a reference counted class type. To avoid a redundant allocation (one for the reference counted class instance and one for the raw storage buffer) we use the `ManagedBuffer<Header, Element>.create` function to allocate the instance (a subclass of `ManagedBuffer<Header, Element>`) and a raw storage buffer in a single allocation.
Combined with the `isKnownUniquelyReferenced` function (which tests whether a class instance is singly or multiply referenced) this gives us all the tools required.
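Put together, the basic pattern looks something like this sketch (the names here are illustrative, not taken from the `Deque` implementation; a plain class stands in for the `ManagedBuffer`):
// A minimal copy-on-write pattern built on isKnownUniquelyReferenced
final class ExampleStorage {
   var values: [Int]
   init(values: [Int] = []) { self.values = values }
}

struct ExampleCow {
   var storage = ExampleStorage()

   mutating func append(_ value: Int) {
      if !isKnownUniquelyReferenced(&storage) {
         // Storage is shared with another value: copy before mutating
         storage = ExampleStorage(values: storage.values)
      }
      storage.values.append(value)
   }
}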
Overview of the Deque design
A rough description of a `Deque` type using a `ManagedBuffer<Header, Element>` would then look like this:
public struct Deque<T>: RandomAccessCollection, RangeReplaceableCollection,
ExpressibleByArrayLiteral, CustomDebugStringConvertible {
public typealias Index = Int
public typealias Indices = CountableRange<Int>
public typealias Element = T
var buffer: ManagedBuffer<DequeHeader, T>? = nil
public init()
public init(arrayLiteral: T...)
public var debugDescription: String
public subscript(_ at: Index) -> T
public var startIndex: Index
public var endIndex: Index
public mutating func replaceSubrange<C>(_ subrange: Range<Int>, with newElements: C)
where C: Collection, C.Iterator.Element == T {
if isKnownUniquelyReferenced(&buffer) && /* values fit in existing storage */ {
// Mutate in place
} else {
// Use the `create` method on ManagedBuffer
}
}
}
struct DequeHeader {
var offset: Int
var count: Int
var capacity: Int
}
That’s the entire set of required methods, properties and typealiases. You get quite a bit for free from the `Collection` protocols, all of which ultimately gets channeled through the `subscript` or the `replaceSubrange` function. It gets a lot more difficult if the `Index` isn’t `Int` but fortunately that’s not the case here.
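As an illustration of what “for free” means here, every call in the following sketch comes from protocol extension default implementations built on the members declared above:
// With only the members in the sketch above, all of these already work
var deque: Deque<Int> = [1, 2, 3]  // ExpressibleByArrayLiteral
deque.append(4)                    // default implementation over replaceSubrange
let first = deque.removeFirst()    // default implementation over replaceSubrange
let doubled = deque.map { $0 * 2 } // iteration via startIndex/endIndex/subscript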
Storing values safely in an uninitialized buffer
The first tricky hurdle to overcome is safely maintaining instance lifecycles for data stored in an unsafe buffer.
In typical Swift programming, strictly enforced initialization, reference counting and cleanup let us ignore most aspects of instance lifecycles. Working with an unsafe buffer is not typical Swift programming. When the buffer is allocated, it is “uninitialized” (exists in an unknown state). If we were to try to assign values into the buffer normally like this:
buffer.withUnsafeMutablePointerToElements { pointer in
pointer[index] = newValue
}
Swift’s automatic reference counting would try to automatically release the object at `pointer[index]` before assigning `newValue` to that location. This is a problem: when uninitialized, the location could contain garbage data and attempting to release its contents could cause an EXC_BAD_ACCESS or other crash or misbehavior.
To handle an unsafe buffer correctly, we must manually perform the setup and teardown of any instance locations in the buffer. This means that when we copy a value into the buffer for the first time, we must use `initialize`:
buffer.withUnsafeMutablePointerToElements { pointer in
pointer.advanced(by: index).initialize(to: newValue)
}
and when we move values from one unsafe buffer location to another:
oldBuffer.withUnsafeMutablePointerToElements { oldPointer in
   newBuffer.withUnsafeMutablePointerToElements { newPointer in
      newPointer.advanced(by: newIndex).moveInitialize(
         from: oldPointer.advanced(by: oldIndex), count: moveCount)
   }
}
and when we’re done, we need to clean up the locations:
buffer.withUnsafeMutablePointerToElements { pointer in
pointer.advanced(by: offset).deinitialize(count: valueCount)
}
The requirement to deinitialize when we’re done with values generally means that we need to perform `deinitialize` in a `deinit` method – so that it will be performed when the `Deque` itself is finally released. Custom work in a `deinit` means that we can’t rely on the default `ManagedBuffer` class alone; while it calls `deinitialize` on the `Header`, it doesn’t do anything with the `Element` buffer. At a minimum, we need to subclass `ManagedBuffer` and add a `deinitialize` for any values that are still in the buffer at `deinit` time.
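A minimal sketch of such a subclass might look like this (using the `DequeHeader` shown earlier; the real implementation differs, and note that when the queue has wrapped around, the live values must be deinitialized in two passes):
// Deinitializes any remaining elements when the buffer is released
final class DequeBuffer<T>: ManagedBuffer<DequeHeader, T> {
   deinit {
      withUnsafeMutablePointers { header, elements in
         let h = header.pointee
         // First pass: the values from `offset` up to the end of the storage
         let frontCount = Swift.min(h.count, h.capacity - h.offset)
         elements.advanced(by: h.offset).deinitialize(count: frontCount)
         // Second pass: any values that wrapped around to the start
         if h.count > frontCount {
            elements.deinitialize(count: h.count - frontCount)
         }
      }
   }
}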
Here’s one I prepared earlier
I went ahead and implemented the entire `Deque` based on the information I’ve discussed so far. The implementation is a little tedious but isn’t very difficult.
While not required, I included optimized versions of the following functions to improve performance for commonly expected operations:
- `append(_:)`
- `insert(_:at:)`
- `remove(at:)`
The implementation also includes a runtime specified `minimumCapacity` so that the queue can dynamically grow and shrink but won’t shrink below the minimum (the default value is zero and falling back to zero capacity causes the entire `buffer` object to be deallocated).
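For example (assuming an initializer along these lines; check CwlDeque.swift for the exact signature):
// Reserve space for 20 elements up front; the queue never shrinks below that
var queue = Deque<Int>(minimumCapacity: 20)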
Since the key metric that I’m using to gauge success or failure is how this compares relative to `Array` for small queues, I’ll be evaluating performance with the following test:
let outerCount = 100_000
let innerCount = 20
var accumulator = 0
for _ in 1...outerCount {
var deque = Deque<Int>()
for i in 1...innerCount {
deque.append(i)
accumulator ^= (deque.last ?? 0)
}
for _ in 1...innerCount {
accumulator ^= (deque.first ?? 0)
deque.remove(at: 0)
}
}
XCTAssert(accumulator == 0)
This test case appends 20 elements sequentially at the end of a new queue and then removes them sequentially from the front. This process is then repeated 100,000 times.
I’ll compare against `Array`. The only difference between the two tests is that the `var deque = Deque<Int>()` line becomes `var deque = Array<Int>()` in the `Array` version.
Test case | Time taken (seconds)
---|---
Initial `Deque` implementation | 3.898
`Array` | 0.130
Around 30 times slower than `Array`. That’s not good. What’s the problem?
Problem number 1: Specialization needed
The first difficulty is that, relative to third-party code, the standard library cheats.
My `Deque` implementation was compiled into a separate module and therefore received no generic specialization. The Swift standard library, by contrast, is always compiled as though it is part of the calling module – so it can always take advantage of generic specialization.
The separate module for my `Deque` implementation is preventing specialization of the `Deque` type and functions, causing the following accessor into the buffer:
buffer.withUnsafeMutablePointers { header, elements in
   elements.advanced(by: header.pointee.offset).initialize(to: newValue)
}
to use capturing closures with heap allocation overhead, and this is responsible for most of the execution time.
Performance problems associated with separate modules are something I’ve discussed previously. The solution is similar: ensure that the `Deque` and the test case are linked as part of the same module and ensure that “Whole module optimization” is enabled.
Test case | Time taken (seconds)
---|---
Initial `Deque` compiled in the same module | 0.179
`Array` | 0.135
Okay, that’s more than an order of magnitude better but it’s still not great. What’s the problem now?
Problem number 2: deinit closure overheads
It turns out that despite being fully specialized, Swift is still wasting about a third of its time allocating closures in the `DequeBuffer.deinit` function when it calls `withUnsafeMutablePointers` to get a pointer to the header and elements. I can’t really understand the reason but it left me unable to work around the problem while still using `ManagedBuffer` the way it is intended to be used.
I therefore implemented my own functions to calculate a pointer to the header and elements and bypass the `withUnsafeMutablePointers` function in `deinit`.
// `HeapObject` mirrors the Swift runtime's object header layout and `roundUp`
// performs alignment rounding; both are defined elsewhere in the file.
private static var headerOffset: Int {
   // The DequeHeader starts immediately after the runtime's object header
   return Int(roundUp(UInt(MemoryLayout<HeapObject>.size),
      toAlignment: MemoryLayout<DequeHeader>.alignment))
}
private static var elementOffset: Int {
   // The elements start after the header, aligned for T
   return Int(roundUp(UInt(headerOffset) + UInt(MemoryLayout<DequeHeader>.size),
      toAlignment: MemoryLayout<T>.alignment))
}
private var bodyPtr: UnsafeMutablePointer<T> {
   return Unmanaged<DequeBuffer<T>>.passUnretained(self).toOpaque()
      .advanced(by: DequeBuffer<T>.elementOffset).assumingMemoryBound(to: T.self)
}
private var headerPtr: UnsafeMutablePointer<DequeHeader> {
   return Unmanaged<DequeBuffer<T>>.passUnretained(self).toOpaque()
      .advanced(by: DequeBuffer<T>.headerOffset).assumingMemoryBound(to: DequeHeader.self)
}
As you can see, all I’m really doing is advancing the `self` pointer to the correct locations for the `Header` and the `Element` storage in the buffer allocated by `ManagedBuffer` – which is the same work that `ManagedBuffer` itself would perform.
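The `roundUp` and `HeapObject` helpers used above come from elsewhere in the file and aren’t shown here. As a rough sketch of what they might look like (the real runtime object header layout is a Swift implementation detail, which is exactly the risk discussed below):
// Round an offset up to the next multiple of a (power-of-two) alignment
func roundUp(_ offset: UInt, toAlignment alignment: Int) -> UInt {
   let x = UInt(bitPattern: alignment)
   return (offset &+ x &- 1) & ~(x &- 1)
}

// A stand-in for the Swift runtime's heap object header: a metadata pointer
// followed by reference counts (two words in total on current platforms)
struct HeapObject {
   var metadata: UnsafeRawPointer? = nil
   var refCounts: UInt = 0
}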
These functions, and their usage in `deinit`, are wrapped by `#if true` blocks in the code so they can be disabled easily.
Test case | Time taken (seconds)
---|---
Final `Deque` implementation | 0.062
`Array` | 0.135
We’re finally faster than Swift’s built-in `Array` type in this test case.
Isn’t this risky?
I’m offsetting a pointer into the middle of a buffer I didn’t allocate and assuming the layout within the buffer won’t change in future.
Isn’t this an ugly hack? Doesn’t this go against the “maintainable apps” aim that I declared for Cocoa with Love in “A new era for Cocoa with Love”?
Yes and yes.
It’s disappointing but the reality is that if we implement the `Deque` using `ManagedBuffer` to allocate the underlying storage, we’re forced to choose between an “optimized” version that delivers reasonable performance or an “unoptimized” version that is so slow that it isn’t really competitive with `Array` until queue lengths reach 80 to 200 (depending on initial capacity).
Frankly, the performance problems with the “unoptimized” version limit its usefulness so much that it’s practically useless. If you’re paranoid about memory safety, then you’d be better off simply using an `Array` or writing a `Deque` without `ManagedBufferPointer`.
I’ve chosen to go with the optimized version and guard against any potential unsafety with `assert` statements on construction of `DequeBuffer` that confirm the values returned by my accessors remain the same as the results `ManagedBuffer` would produce. This should guard against the layout changing and the pointers becoming invalid.
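A sketch of the kind of check described, assuming a freshly created buffer named `newBuffer` inside `DequeBuffer`’s construction function (the actual assertions in the code may be structured differently):
// In debug builds, confirm the manually computed pointers agree with the
// pointers ManagedBuffer itself reports; this traps during development if
// the runtime's buffer layout ever changes.
assert(newBuffer.withUnsafeMutablePointers { header, elements in
   header == newBuffer.headerPtr && elements == newBuffer.bodyPtr
})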
Comparing with std::deque
Beating the Swift `Array` isn’t really a giant accomplishment: it’s not designed to be high performance in this case. Let’s instead compare with something designed for use as a FIFO queue.
Here’s how `Deque` compares with libc++’s `std::deque`:
Test case | Time taken (seconds)
---|---
Final `Deque` implementation | 0.062
`std::deque` | 0.043
Ouch. How did we get beaten again?
While `std::deque`’s “split-buffer” is a more efficient design than our “circular-buffer” (since extending the capacity doesn’t require moving the existing contents to the newly allocated buffer), that’s not the source of the problem here.
The source of the problem is that these small tests are dominated by memory allocation: `std::deque` performs just 2 allocations to exceed a capacity of 20 (two allocations of 16 elements) whereas my implementation takes 3 allocations (reallocations at capacities 5, 12 and 26).
If I set a minimum capacity of 20 on construction of my `Deque`, the numbers go the other way:
Test case | Time taken (seconds)
---|---
Final `Deque` with minimum capacity of 20 | 0.025
`std::deque` | 0.045
Hardly a fair fight when I rig the conditions like this but the point is that the two are competitive, give or take allocation patterns, and that’s sufficient.
Usage
The project containing this `Deque` implementation is available on github: mattgallagher/CwlUtils.
The CwlDeque.swift file is fully self-contained so you can just copy the file, if that’s all you need. Otherwise, the ReadMe.md file for the project contains detailed information on cloning the whole repository and adding the framework it produces to your own projects.
NOTE: the CwlUtils project compiles CwlDeque.swift inside its own module. As described in this article, the best performance requires taking the CwlDeque.swift file and compiling it inside your own module instead.
Conclusion
This was partly fun, partly an exercise in frustration.
It was fun to implement a basic copy-on-write, circular-buffer deque in Swift. Needing to manually call `initialize` and `deinitialize` feels like a delightful throwback to the C era but it’s straightforward once you realize that it’s required.
The frustration came as I tried to understand the cause of the performance problems. While I know a little about the Swift compiler internals, optimizing code like this is still largely a black-box exercise for me: looking at profiling results and haphazardly trying things on problem areas, hoping the situation will improve somehow. I don’t really know why the `ManagedBuffer.withUnsafeMutablePointers` function caused unwanted allocations in the `deinit` function or whether I could have fixed the problem a different way. I do know that I shouldn’t have needed to do anything.
The final solution delivers excellent performance but I’m not happy with the unsafe pointer arithmetic. With the `assert` statements double checking the pointer arithmetic, it should remain correct but ultimately, I look forward to disabling this code once Swift addresses the underlying cause.
Looking forward…
In a future article, I will share my primary use case for this `Deque` type. (Edit from the future: it’s used to queue messages in CwlSignal.)