I am a generalist by nature. I have spent most of the last 5-6 years writing code on, using and deploying fairly large-scale distributed systems as part of my work. It annoys me a little that my understanding of them remains that of a dilettante.
Combining that annoyance with some free time, thanks to the lull between jobs, I decided to make an attempt to grok distributed systems. In this post I describe my approach and recommend some foundational material which I have gone through myself.
As a practitioner, my intention is to get to the real-world distributed systems (DynamoDB, Spanner or more recently Flighttracker for example) as soon as possible. I wasn't exactly starting from scratch: I had read Designing Data Intensive Applications and even done the course that Martin released last year. So I tried reading the papers describing these systems right away. I felt like a 5-year-old reading Shakespeare. The words made sense, but I am not sure I understood them at a fundamental level. Clearly I needed to do more reading.
My approach is to group the topics discussed by both Murat and Henry, and to find and read approachable material which contextualises the results and ideas - bridge material. If you choose to follow along on this journey, you can expect links to these resources as well as a summary of my own understanding of each topic.
I have roughly arrived at the following grouping of topics.
Failure and Time
Impossibility Results (CAP, FLP, Two Generals, Byzantine Generals)
Broadcasting, Replication and Consensus
Real World Systems
This may not be perfect, but I have made peace with the fact that the most logical grouping may only become apparent once I understand the material well enough. Please send me a message if you think there is a better grouping or there are topics I have missed.
Right, so let’s get to foundations. For the most part this is copied right from Henry’s list. I have made a few additions including adding Martin’s course.
The distributed system equivalent of the commandments. A lot of the system models that we will encounter are based on these fallacies.
The Distributed Data section (Chapters 5-9) of this book is a very solid introduction to distributed systems. If you read these chapters thoroughly and internalize them, I think you are already ahead of the curve. Personally, I find that I learn best by revisiting the same content repeatedly from different angles.
Recommended in the paper trail blog post. I quickly skimmed over this and found some things that did not sit comfortably with me, for example 2 Phase commit being presented as a CA (CAP theorem) system. My preference would be to stick to Martin’s book or his video course. However, in the spirit of “all models are wrong, but some are useful”, one could also go through this as an introduction.
My preferred way would be to go over this set of videos from Martin Kleppmann. They are part of a course he gave to second-year students at the University of Cambridge. I have personally done this course. It is fairly approachable and just around 8 hours of videos.
I prefer the video course over the book as I think it covers a few more topics and is more approachable. Just like with the book, you could choose to stop here and you will have more of an understanding than most. Personally, I feel that while it’s a start, the course wasn’t enough for me to internalize all that I learnt.
Distributed Systems in Practice
While this was posted more than 5 years ago, most of it holds up. I found myself nodding vigorously at almost every point.
This paper from 1994 argues that distributed computing is fundamentally different from ‘local’ computing because of differences in latency expectations, partial failures and concurrency. It’s an approachable read. You could also see how even some of the current developments (for example service meshes) are an attempt to tackle the same difficulties in distributed systems that are raised in the paper. It also illustrates the problems in distributed computing with an example of NFS. It’s interesting that you could see the tug of war between safety and liveness even then. I have written a summary here.
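To make the partial failure point a bit more concrete for myself, here is a minimal Python sketch. It is my own toy illustration and is not taken from the paper: `FlakyCounterService` and `increment_with_retry` are hypothetical names, and the "network" is simulated in-process. The service applies an update but the reply is lost, so a caller that naively retries ends up applying the update twice. A local function call has no equivalent failure mode.

```python
class FlakyCounterService:
    """Toy stand-in for a remote counter (hypothetical, no real network)."""

    def __init__(self, drop_reply_on_calls):
        self.value = 0
        self.calls = 0
        # Call numbers (1-based) whose reply is "lost" on the way back.
        self.drop_reply_on_calls = set(drop_reply_on_calls)

    def increment(self):
        self.calls += 1
        self.value += 1  # the "server" applies the update first
        if self.calls in self.drop_reply_on_calls:
            # The update happened, but the caller never hears about it.
            raise TimeoutError("no reply received")
        return self.value


def increment_with_retry(service, attempts=3):
    """Naive client: retry on timeout, assuming the call did not happen."""
    for _ in range(attempts):
        try:
            return service.increment()
        except TimeoutError:
            continue  # cannot tell "never applied" from "applied, reply lost"
    raise TimeoutError("gave up after retries")


# The caller wants exactly one increment; the first reply is dropped.
service = FlakyCounterService(drop_reply_on_calls=[1])
result = increment_with_retry(service)
print(result, service.value)
```

The caller asked for one increment, yet the counter has moved twice, because the timeout alone cannot say whether the first request took effect. Real systems usually deal with this using ideas like idempotent operations or request IDs, which this sketch deliberately leaves out.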
I must confess I have only skimmed over this article and it’s only tangentially related to distributed systems. However, a lot of the advice there is timeless (the paper is from 1983).
Right, so that's that. If you find the list overwhelming, just do Martin’s course. You will be fine.
If you liked that, you should follow me on this journey by subscribing to this newsletter, following me on Twitter, or both. Onto Failure and Time next.