phil karlton said there are only two hard problems in computer science. they are, in order: cache invalidation, naming things, and off-by-one errors. this post is about the first one, which is to say, it is about all of them.
the symptom
stale data, served confidently. our cache layer was 'eventually consistent', which is industry shorthand for 'wrong, but only sometimes, and only when it matters'. a customer would update their email, see the new email on the dashboard, and then receive password resets at the old one. nobody on the team could reproduce it. nobody on the team disbelieved it either.
@cache(ttl=300)
def get_user(uid):
    return db.fetch(uid)
# elsewhere, in a different process, at 4am:
db.update(uid, {'email': new_email})
# the cache, of course, knows nothing.
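for anyone who wants to watch it happen on their own machine, here is a minimal sketch of the same failure. it is not our code: `ttl_cache`, `FAKE_DB`, and `update_email` are made-up stand-ins, and the "other process" is just a function that writes to the dict without telling the cache.

```python
import time

FAKE_DB = {42: {"email": "old@example.com"}}  # hypothetical stand-in for the real db


def ttl_cache(ttl):
    """cache a one-argument function's results for `ttl` seconds."""
    def decorator(fn):
        store = {}  # key -> (expires_at, value)

        def wrapper(key):
            hit = store.get(key)
            if hit is not None and hit[0] > time.monotonic():
                return hit[1]  # still fresh as far as the cache can tell
            value = fn(key)
            store[key] = (time.monotonic() + ttl, value)
            return value

        return wrapper
    return decorator


@ttl_cache(ttl=300)
def get_user(uid):
    # return a copy so the cached value is a snapshot, not a live reference
    return dict(FAKE_DB[uid])


def update_email(uid, new_email):
    # writes straight to the db; nothing here touches the cache
    FAKE_DB[uid]["email"] = new_email


before = get_user(42)["email"]        # warms the cache
update_email(42, "new@example.com")   # the 4am write
cached = get_user(42)["email"]        # answered from the cache
actual = FAKE_DB[42]["email"]         # what the db believes
print(cached, actual)
```

for the next 300 seconds, anything that reads through `get_user` sees the old address and anything that reads the db directly sees the new one. which one a given feature sees depends on which path it happens to take.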
the fix, and why it took three tries
write-through caches solve the problem when you have one writer. we had four. event-driven invalidation solves the problem if your bus delivers exactly once. ours delivers somewhere between zero and several. eventually we reached for the classic answer: stop caching things you can't afford to be wrong about.
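the first two tries fail in ways that are easy to sketch. what follows is an illustration under assumptions, not our production code: `write_direct` stands in for any writer that doesn't know about the cache, and `FlakyBus` is a hypothetical bus whose delivery counts are scripted so the run is deterministic.

```python
db = {}
cache = {}


def write_through(uid, row):
    """the one well-behaved writer: updates db and cache together."""
    db[uid] = row
    cache[uid] = row


def write_direct(uid, row):
    """every other writer (hypothetical): only knows about the db."""
    db[uid] = row


def read(uid):
    """read-through: fill the cache on a miss, trust it on a hit."""
    if uid not in cache:
        cache[uid] = db[uid]
    return cache[uid]


# try 1: write-through, plus a writer that bypasses it
write_through(1, {"email": "first@example.com"})
write_direct(1, {"email": "second@example.com"})
stale_after_bypass = read(1)["email"]  # cache still holds the first row


class FlakyBus:
    """delivers each published event a scripted number of times (0 = dropped)."""

    def __init__(self, deliveries):
        self.deliveries = list(deliveries)
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        copies = self.deliveries.pop(0)
        for _ in range(copies):
            for handler in self.handlers:
                handler(event)


# try 2: event-driven invalidation over a bus that duplicates, then drops
bus = FlakyBus(deliveries=[2, 0])
bus.subscribe(lambda uid: cache.pop(uid, None))  # eviction is idempotent

bus.publish(1)                         # delivered twice: harmless
after_duplicate = read(1)["email"]     # refilled from the db

write_direct(1, {"email": "third@example.com"})
bus.publish(1)                         # delivered zero times
after_drop = read(1)["email"]          # stale until something else evicts it
```

duplicates turn out to be the easy half, because evicting twice is the same as evicting once. the dropped event is the one that hurts, and nothing on the consumer side can detect an event it never received.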
"if your cache and your db disagree, it is never the db that's lying."
we kept the cache for read-heavy listings where 'pretty close' is fine. we ripped it out from anything related to identity, money, or notifications. the dashboard got 40ms slower. the support queue got 60% lighter. tradeoffs.
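the policy itself is small enough to sketch. this is an illustration, not what we shipped: `SelectiveCache`, `CACHEABLE_KINDS`, and `fake_fetch` are hypothetical names, and using an allowlist (so anything new is uncached by default) is an assumption of the sketch rather than something described above.

```python
import time

# hypothetical allowlist: only data where 'pretty close' is acceptable
CACHEABLE_KINDS = {"listing"}


class SelectiveCache:
    def __init__(self, fetch, ttl=300):
        self.fetch = fetch  # fetch(kind, key) -> value, e.g. a db read
        self.ttl = ttl
        self.store = {}     # (kind, key) -> (expires_at, value)

    def get(self, kind, key):
        if kind not in CACHEABLE_KINDS:
            # identity, money, notifications: always ask the db
            return self.fetch(kind, key)
        now = time.monotonic()
        hit = self.store.get((kind, key))
        if hit is not None and hit[0] > now:
            return hit[1]
        value = self.fetch(kind, key)
        self.store[(kind, key)] = (now + self.ttl, value)
        return value


calls = []  # records every trip to the (fake) db


def fake_fetch(kind, key):
    calls.append((kind, key))
    return f"{kind}:{key}"


reader = SelectiveCache(fake_fetch)
reader.get("listing", "page-1")
reader.get("listing", "page-1")  # cache hit, no second fetch
reader.get("user", 42)
reader.get("user", 42)           # fetched again, on purpose
```

the second `user` read costs a round trip every time. that round trip is the 40ms.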