ImapGoose is a small program to keep local mailboxes in sync with an IMAP
server. The wording “keep […] in sync” implies that it does so continuously,
rather than performing a one-time sync. ImapGoose is designed as a daemon,
monitoring both
the IMAP server and the local filesystem, and immediately synchronising changes.
When the IMAP server receives an email, it shows up in the filesystem within a
second. When an email is deleted on another email client, it is removed
from the filesystem within a second.
ImapGoose is highly optimised to reduce the amount of network traffic and tasks
performed. To do so, it relies on a few modern IMAP extensions and only supports
modern email servers. “Modern servers” in the context of email means servers
that support extensions standardised between 2005 and 2009.
ImapGoose uses the CONDSTORE extension (standardised in 2006),
which basically allows it to tell the server “I last saw this mailbox when it
was in state XYZ, please tell me what’s new”. This avoids the need to download
an entire message list (which can be tens of thousands of emails), making
incremental syncs much more efficient. It also uses the QRESYNC extension
(standardised in 2008) so that the server includes a list of deleted
messages too (i.e. VANISHED). Finally, ImapGoose uses the NOTIFY extension
(standardised in 2009), which allows an IMAP client to tell the server
“please let me know when there are changes to these mailboxes”, and then leave a
connection open. NOTIFY has two nice consequences: (1) the client doesn’t need
to ask the server if there have been any changes at regular intervals, and (2)
the client is informed of any changes immediately, so they can be processed
without delay. Unlike the older IDLE extension (from 1997), NOTIFY allows
monitoring multiple mailboxes per connection, rather than just one.
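Since these extensions are hard requirements, a client of this kind has to check the server's advertised capabilities before doing anything else. The following is a minimal sketch of such a check, not ImapGoose's actual code; `requiredCaps` and `missingCaps` are hypothetical names, and the capability list is passed in as a plain space-separated string for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredCaps lists the extensions discussed above (hypothetical name).
var requiredCaps = []string{"CONDSTORE", "QRESYNC", "NOTIFY"}

// missingCaps returns the required extensions that are absent from the
// space-separated capability list advertised by a server.
func missingCaps(advertised string) []string {
	have := make(map[string]bool)
	for _, c := range strings.Fields(advertised) {
		// Capability names are case-insensitive, so normalise them.
		have[strings.ToUpper(c)] = true
	}
	var missing []string
	for _, c := range requiredCaps {
		if !have[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	// An example server that offers IDLE but not NOTIFY.
	caps := "IMAP4rev1 IDLE CONDSTORE QRESYNC"
	fmt.Println(missingCaps(caps)) // prints [NOTIFY]
}
```

A server for which this returns a non-empty list would simply be reported as unsupported.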
In this article, I’ll cover some of the general design details, inner workings
and other development details.
First off, ImapGoose keeps a small status database with some minor metadata
about the last-seen status of both the server and local Maildirs. This includes
the mapping between server UIDs and filesystem filenames. Its general design is
strongly inspired by how OfflineIMAP works.
At start-up, ImapGoose lists all mailboxes in the server and in the local
filesystem. It then starts monitoring them (the server via NOTIFY, the local
filesystem via inotify/kqueue), so we receive notifications of any changes that
may happen
after our initial listing. This ensures that, for example, if we receive a new
email while performing the initial sync, we get a notification for it.
Once monitoring is set up, ImapGoose queues a task to perform a full sync of
each mailbox. First, we determine whether this is the first time we have seen
this mailbox by checking for its absence from the status database. If this
mailbox has not been seen
before, then we request all messages. The server returns all of these along with
a HIGHESTMODSEQ, which we store in the status database. This HIGHESTMODSEQ
is a numeric property of each mailbox and increases every time a change occurs
inside that mailbox. If a mailbox has been seen before, then we can ask the
server for changes since that HIGHESTMODSEQ, which delivers only the minimal
amount of data which we need, and nothing else about all the other thousands of
unchanged messages.
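The choice between a full and an incremental sync can be sketched as follows. This is an illustration only: `statusDB`, `syncPlan` and `planSync` are hypothetical names, and the status database is reduced to an in-memory map from mailbox name to the last stored HIGHESTMODSEQ.

```go
package main

import "fmt"

// statusDB stands in for the status database: mailbox name mapped to the
// HIGHESTMODSEQ stored after the last successful sync (assumed layout).
type statusDB map[string]uint64

// syncPlan describes what to request from the server.
type syncPlan struct {
	Full        bool   // true: request every message in the mailbox
	SinceModSeq uint64 // only meaningful when Full is false
}

// planSync picks a full sync for mailboxes never seen before, and an
// incremental "changes since" sync for all others.
func planSync(db statusDB, mailbox string) syncPlan {
	modseq, seen := db[mailbox]
	if !seen {
		return syncPlan{Full: true}
	}
	return syncPlan{SinceModSeq: modseq}
}

func main() {
	db := statusDB{"INBOX": 4711}
	fmt.Printf("%+v\n", planSync(db, "INBOX"))   // incremental since 4711
	fmt.Printf("%+v\n", planSync(db, "Archive")) // never seen: full sync
}
```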
When a message is present in the server and absent in the filesystem (or vice
versa), we need to determine whether it is a new message, or if it is a message
that was previously present in both and deleted from the local filesystem. To
determine this, we use the status database and apply the same algorithm as
OfflineIMAP. It’s simple and well tested.
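The general idea behind this kind of three-way comparison can be sketched per message as below. This is my simplified reading of the approach described above, not code taken from either project; the `action` values and the `decide` function are hypothetical.

```go
package main

import "fmt"

type action string

const (
	download     action = "download to filesystem"
	upload       action = "upload to server"
	deleteLocal  action = "delete from filesystem"
	deleteRemote action = "delete from server"
	none         action = "nothing to do"
)

// decide returns what to do for one message, given whether it is currently
// on the server, in the local Maildir, and recorded in the status database
// (i.e. it was present on both sides after the previous sync).
func decide(onServer, onDisk, inStatus bool) action {
	switch {
	case onServer && !onDisk && !inStatus:
		return download // never seen before: new on the server
	case onServer && !onDisk && inStatus:
		return deleteRemote // was synced before: deleted locally
	case !onServer && onDisk && !inStatus:
		return upload // never seen before: new on the filesystem
	case !onServer && onDisk && inStatus:
		return deleteLocal // was synced before: deleted on the server
	default:
		// Present on both sides, or gone from both sides.
		return none
	}
}

func main() {
	fmt.Println(decide(true, false, false)) // download to filesystem
	fmt.Println(decide(true, false, true))  // delete from server
}
```

The status entry is what distinguishes "new on one side" from "deleted on the other"; without it, the two cases look identical.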
At times, ImapGoose may disconnect from the server (for example, due to a laptop
disconnecting from Wi-Fi, or going into sleep mode). It will try to re-connect
automatically using an exponential back-off: after 1 second, then after 2
seconds, 4 seconds, 8 seconds, 16 seconds, 32 seconds,… all the way up to 17
minutes. Then it will continue retrying every 17 minutes. This means users don’t
really have to worry about ImapGoose’s current state, whether it’s still
working, etc. It knows how to back-off when there’s no network and how to get
back to work when it is feasible again.
As mentioned above, ImapGoose “queues” sync tasks. Internally, it uses a task
queue; when changes are detected on the server, a task to sync that entire
mailbox is queued. A worker picks this up from the queue, asks for changes in
that mailbox, and synchronises them. When changes are detected in the
filesystem, a task to sync that particular message is queued. It may happen that
multiple messages arrive in quick succession for the same mailbox. In this case,
we don’t want to trigger multiple syncs of the same mailbox, and we especially
don’t want two workers to sync the same mailbox concurrently: this would quickly
lead to duplicate emails.
To work around concurrent syncs and redundant mailbox updates, ImapGoose uses a
“dispatcher”, which hands off sync tasks to workers. When a task to sync a
specific mailbox is handed to a worker, that mailbox is marked as “busy”, and we
don’t process other tasks for that mailbox until that worker notifies us that it has
finished its work on that mailbox. While a worker is synchronising a mailbox, we
may receive several notifications that changes have happened to that mailbox.
These changes could be the result of the changes made by the worker, or they
could be new emails being delivered, so we have to queue another task to sync
that mailbox. These tasks are kept in queue until the worker frees up the
mailbox, and the dispatcher additionally de-duplicates them: synchronising a
mailbox just once after the last change notification is enough to synchronise
the changes in all the notifications.
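The busy-marking and coalescing behaviour can be sketched as a small state machine. This is an illustrative sketch under my own assumptions, not ImapGoose's implementation; the `dispatcher` type and its `notify` and `done` methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher tracks which mailboxes are being synced right now and which
// need one more sync once the current one finishes.
type dispatcher struct {
	mu      sync.Mutex
	busy    map[string]bool
	pending map[string]bool
}

func newDispatcher() *dispatcher {
	return &dispatcher{busy: map[string]bool{}, pending: map[string]bool{}}
}

// notify records a change notification for a mailbox. It reports true if a
// worker should start syncing now, and false if the mailbox is busy and the
// request was merged into a single pending re-sync.
func (d *dispatcher) notify(mailbox string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.busy[mailbox] {
		d.pending[mailbox] = true
		return false
	}
	d.busy[mailbox] = true
	return true
}

// done is called when a worker finishes a mailbox. It reports true if
// changes arrived in the meantime; the mailbox then stays busy and the
// caller should sync it once more.
func (d *dispatcher) done(mailbox string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending[mailbox] {
		delete(d.pending, mailbox)
		return true
	}
	delete(d.busy, mailbox)
	return false
}

func main() {
	d := newDispatcher()
	fmt.Println(d.notify("INBOX")) // true: start a sync
	fmt.Println(d.notify("INBOX")) // false: busy, remembered
	fmt.Println(d.notify("INBOX")) // false: merged with the previous one
	fmt.Println(d.done("INBOX"))   // true: sync once more
	fmt.Println(d.done("INBOX"))   // false: mailbox is idle again
}
```

However many notifications arrive during a sync, they collapse into one follow-up sync, and no two workers ever hold the same mailbox.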
When a message changes in the filesystem, ImapGoose receives an inotify (or
kqueue) event.
This doesn’t trigger a sync of the full mailbox, but instead a “targeted” sync,
which focuses only on that email message. We know that a single message has
changed, so there’s no point in re-scanning the thousands of messages in the
mailbox. These targeted syncs are taken into account in deduplication; they only
get de-duplicated if the path for them is the same.
While the connection which is listening for changes from the server is kept
alive by sending periodic NOOP commands, the connections for workers are allowed
to time out. If no activity is happening, these connections simply time out, but
a connection is re-established once a worker needs it again. Great care has been
taken to avoid unnecessary churn in all possible aspects.
Before developing ImapGoose, I studied prior art in the field. In particular,
OfflineIMAP does a great job at synchronising mailboxes. However, it doesn’t
“keep in sync” in the same way; OfflineIMAP needs to execute periodic syncs,
doesn’t rely on modern extensions, and tends to “hang” when there are network
time-outs. ImapGoose is new and has no existing users, so it can just require
modern extensions or declare other scenarios as unsupported. Existing tools have
to maintain compatibility for existing users, who might rely on some legacy
email server. If I couldn’t rely on NOTIFY, implementing ImapGoose in such a
clean, efficient way would not have been possible. If I couldn’t rely on
CONDSTORE and QRESYNC, I would have had to download lists of thousands of
emails each time even a single one changed. Thanks to UIDPLUS, the server
returns the UID of a newly uploaded message, and we don’t need any ugly
workarounds to retrieve it.
If someone needs to sync data from legacy servers, plenty of tools are still out
there, providing the best experience which those servers can offer.
When working on ImapGoose, I focused exactly on my needs for my particular use
case: keep my local mailboxes in sync with an IMAP server. There’s no other
supported scenario, there’s no fallback for legacy servers, and there’s no
support for alternative email backends. All these constraints allowed me to
focus on making a tool that’s great for a single use case: it does one thing and
does it well.
I strongly believe that keeping tight constraints (e.g. focusing on just one
use case, ignoring support for legacy servers, keeping things as simple as
possible) helped me develop this much faster and with much cleaner results.
I started with a very clear picture of how the whole thing would work. I was
also familiar with go-imap, and knew it to be a well designed and well
implemented IMAP library. My immense appreciation goes to emersion and the
contributors who’ve worked on it. I didn’t need to worry about the inner details
of talking to an IMAP server, parsing responses, tracking connection state, etc.
go-imap provides a simple idiomatic Go interface for IMAP commands and their
responses.
go-imap was lacking two features which I needed: support for the NOTIFY command
and for VANISHED (RFC 5162). While still standing on the shoulders of giants, I
implemented both of these and sent patches for both of them
(NOTIFY, VANISHED). Until those are merged, ImapGoose is built
using my own (temporary) fork which has those two patches applied.
For configuration, I opted for the very simple and straightforward scfg
configuration format. The configuration file looks something like:
account example {
    server imap.example.com:993
    username hugo@example.com
    password-cmd pass show email/example
    local-path ~/mail/example
}
For the name, I wanted something easy to remember, easy to pronounce and that
won’t yield
thousands of unrelated search engine results. There’s also room for an obvious
mascot/logo: a goose wearing a postman’s hat carrying an envelope, using the
colour palette from the Go ecosystem. Please reach out if you are an illustrator
willing to contribute with artwork.
ImapGoose is open source and distributed under the terms of the ISC
licence. The source code is available via git. Feedback is
welcome, including bug reports.
