Distributed Systems Learning Resources
Most links point to readings on architecture rather than code.
Bootcamp
Read things here before you start.
- CAP Theorem, also a plain English explanation
- Fallacies of Distributed Computing, expect everything to break
- Distributed systems theory for the distributed engineer, most of the papers/books in the blog reappear in this list. Still a good BFS approach to distributed systems.
- FLP Impossibility Result (paper), and an easier blog post to follow along with
- An Introduction to Distributed Systems, @aphyr's excellent introduction to distributed systems
Books
- Distributed Systems for fun and profit [Free]
- Distributed Systems Principles and Paradigms, Andrew Tanenbaum [Free with registration]
- Scalable Web Architecture and Distributed Systems [Free]
- Principles of Distributed Systems [Free][ETH Zurich University]
- Making reliable distributed systems in the presence of software errors, [Free] Joe Armstrong's (creator of Erlang) PhD thesis
- Designing Data Intensive Applications [Amazon Link]
- Distributed Computing, By Hagit Attiya and Jennifer Welch
- Distributed Algorithms, Nancy Lynch [Amazon Link]
- Impossibility Results for Distributed Computing [paywall]
- Designing Distributed Systems, Brendan Burns [Free with registration]
- Making Sense of Edge Computing
Papers
Must-read papers on distributed systems. Nearly all of Lamport's work should feature here; this list adds only a few that must be read.
- Time, Clocks, and the Ordering of Events in a Distributed System, Lamport's paper, the quintessential distributed systems primer
- Session Guarantees for Weakly Consistent Replicated Data, a '94 paper that recommends session guarantees for eventually consistent systems. Many of its terms, like monotonic reads and read your writes, are standard vocabulary in other distributed systems papers.
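As a quick taste of the logical clocks introduced in Lamport's paper above, here is a minimal sketch. The class and method names are hypothetical and not taken from the paper; the sketch only assumes the two usual rules: increment the counter on every local event, and on receipt jump to one past the larger of the local and received timestamps.

```python
class LamportClock:
    """Scalar logical clock for a single process (illustrative names)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Rule 1: every local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the returned timestamp travels
        # with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Rule 2: move past both the local clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()            # local event on a, clock becomes 1
stamp = a.send()    # a sends a message stamped 2
b.receive(stamp)    # b's clock becomes 3, so the send is ordered before the receive
```

The timestamps give an ordering consistent with causality, but two events with ordered timestamps are not necessarily causally related.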
Storage & Databases
- Dynamo: Amazon's Highly Available Key Value Store. Paraphrasing @fogus from their blog, it is very rare for a paper describing an active production system to influence the state of active research in any industry; this is one of those seminal distributed systems papers that solves the problem of a highly available and fault-tolerant database in an elegant way, later paving the way for systems like Cassandra and many other AP systems using consistent hashing.
- Bigtable: A Distributed Storage System for Structured Data
- The Google File System
- Cassandra: A Decentralized Structured Storage System. Inspired heavily by Dynamo, and now open source
- CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, the algorithm at the basis of the Ceph distributed storage system; for the architecture itself, read RADOS
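Consistent hashing, mentioned in the Dynamo entry above, can be sketched in a few lines. This is an illustrative toy, not Dynamo's or Cassandra's implementation: the `HashRing` class, the `vnodes` parameter and the MD5-based `_position` helper are all assumptions made for the example.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    # Map a string to a point on the ring (first 8 bytes of its MD5 digest).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes=(), vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._positions = []   # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at several points to even out the load.
        for i in range(self.vnodes):
            pos = _position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._positions, _position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")  # one of the three node names
```

The property that makes this attractive is that removing a node only reassigns the keys that node owned; every other key keeps its owner.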
Messaging systems
- The Log: What every software engineer should know about real-time data's unifying abstraction, a somewhat long read, but it brilliantly covers logs, which are at the heart of most distributed systems
- Kafka: a Distributed Messaging System for Log Processing
Distributed Consensus and Fault-Tolerance
- Practical Byzantine Fault Tolerance
- The Byzantine Generals Problem
- Impossibility of Distributed Consensus with One Faulty Process
- The Part-Time Parliament, Lamport's original Paxos paper, a bit difficult to understand, may require multiple passes
- Paxos Made Simple, a terser, more readable Paxos paper by Lamport himself. Shorter and easier than the original.
- The Chubby Lock Service for loosely coupled distributed systems, Google's lock service used for loosely coupled distributed systems. Sort of Paxos as a Service for building other distributed systems. Primary inspiration behind other Service Discovery & Coordination tools like Zookeeper, etcd, Consul etc.
- Paxos made live - An engineering perspective, Google's lessons from implementing systems atop Paxos. Demonstrates various practical issues encountered while implementing a theoretical concept.
- Raft Consensus Algorithm, an alternative to Paxos for distributed consensus that is much simpler to understand. Do check out an interesting visualization of Raft
- Conflict-free Replicated Data Types presents an approach for Strong Eventual Consistency which has been applied in projects such as Riak, Redis and Akka. A great talk on the subject by Martin Kleppmann can be found here
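To make the CRDT idea from the last entry concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest state-based CRDTs. The class and method names are hypothetical, and the sketch ignores networking entirely: replicas exchange state by calling `merge` directly.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


left, right = GCounter("left"), GCounter("right")
left.increment(2)
right.increment(3)
left.merge(right)
right.merge(left)
# Both replicas now report a value of 5.
```

Because merging the same state twice changes nothing, duplicated or reordered messages are harmless, which is what lets replicas reach the same value without coordination.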
Testing, monitoring and tracing
While designing distributed systems is hard enough, testing them is even harder.
- Dapper, Google's large scale distributed-systems tracing infrastructure, this was also the basis for the design of open source projects such as Zipkin, Apache SkyWalking, Pinpoint and HTrace.
Programming Models
- Distributed Programming Model
- PSync: a partially synchronous language for fault-tolerant distributed algorithms Video: Conference Video
- Programming Models for Distributed Computing
- Logic and Lattices for Distributed Programming
Verification of Distributed Systems
- Jepsen, a framework for distributed systems verification, with fault injection. @aphyr has featured enough times in this list already, but Jepsen and the blog posts that go with it are a quintessential addition to any distributed systems reading list.
- Verdi A Framework for Implementing and Formally Verifying Distributed Systems Paper
Videos
- Distributed Deep Dive interview series by Ably Realtime.
- Distributed Systems in One Lesson by Tim Berglund
Courses
- Reliable Distributed Algorithms, Part 1, KTH Sweden
- Reliable Distributed Algorithms, Part 2, KTH Sweden
- Cloud Computing Concepts, University of Illinois
- CMU: Distributed Systems in Go Programming Language
- Software Defined Networking, Georgia Tech.
- ETH Zurich: Distributed Systems
- ETH Zurich: Distributed Systems Part 2, covers distributed control algorithms, communication models and fault tolerance, among other things. In particular, fault-tolerance issues (models, consensus, agreement) and replication issues (2PC, 3PC, Paxos), which are critical to understanding distributed systems, are explained in great detail.
- Distributed Systems Course, a beginner course on distributed systems by Chris Colohan, a Google employee who contributed to SUIF, MapReduce, TCMalloc, Percolator, Caffeine, Borg, Omega, and Piper.
Blogs and other reading links
- Amazon Builder's Library, a collection of Amazon's learnings on distributed systems
- How we implemented consistent hashing efficiently
- Notes on Distributed Systems for Young Bloods
- High Scalability, several architectures of huge internet services, e.g. Twitter, WhatsApp
- There is No Now, Problems with simultaneity in distributed systems
- Turing Lecture: The Computer Science of Concurrency: The Early Years, An article by Leslie Lamport on concurrency
- The Paper Trail blog, a very readable blog covering various aspects of distributed systems
- aphyr, the posts in the Jepsen series are pretty awesome
- All Things Distributed - Werner Vogels' (Amazon CTO) blog on distributed systems
- Distributed Systems: Take Responsibility for Failover
- The C10K problem
- On Designing and Deploying Internet-Scale Services
- Files are hard, a blog post on filesystem consistency, pretty important to read if you are into distributed storage or databases.
- Distributed Systems Testing: The Lost World. Testing distributed systems is hard enough; a well-researched blog post which again covers a lot of links to various approaches and other papers
- SWIM Protocol explained, a blog post on the popular SWIM failure detector
Meta Lists
Other lists like this one
- Readings in distributed systems
- Distributed Systems meta list
- List of required readings for Distributed Systems Part of CMU's Engineering Distributed Systems course
- The Distributed Reader
- A Distributed Systems Reading List, A collection of material, mostly papers on Distributed Systems Theory as well as seminal industry papers
- Distributed Systems Readings, A comprehensive list of online courses related to distributed systems
- Awesome Distributed Consensus, Another list of materials on distributed consensus protocols
- Distributed Systems Journey - 📚 Roadmap to becoming a distributed systems specialist 🎓
Organization Design / Team Dynamics
- How Do Committees Invent? 🔸PDF - Melvin E. Conway, Datamation magazine 1968. The original article defining Conway's Law.
- Service per Team - Each team is responsible for one or more business functions (e.g. business capabilities). A team owns a code base consisting of one or more modules. Its code base is sized so as to not exceed the cognitive capacity of the team. The team deploys its code as one or more services. A team should have exactly one service unless there is a proven need to have multiple services.
- Start with Team Cognitive Load - Team Topologies 🔺YT - DOES19 London. The "monoliths vs microservices" debate often focuses on technological aspects, ignoring strategy and team dynamics. Instead of technology, smart-thinking organizations are beginning with team cognitive load as the guiding principle for modern software. In this talk, we explain how and why, illustrated by real case studies.
Real Life Stories
- Clean microservice architecture
- Failing at microservices
- How to talk to your friends about microservices
- How we build microservices at Karma
- How we ended up with microservices at SoundCloud
- Microservices: lessons from the frontline
- Monolith first
- Scaling microservices at Gilt with Scala, Docker and AWS
- From a Monolithic Big Data System to a Microservices Event-Driven Architecture: Challenges and Lessons Learned
Enterprise & Verticals
- Commercetools ![c] - Headless commerce platform.
- Flamingo - Framework to build flexible and modern e-commerce applications.
- Interact ![c] - CRM microservices for rapid delivery of tailored solutions.
- Moltin ![c] - E-commerce API for developers.
- Predix ![c] - Industrial microservices platform.
- Skava ![c] - Provides microservices for all the functions of your store, and the glue to hold them together.
Theory
Articles & Papers
- A sidecar for your service mesh - A short service mesh introduction.
- AKF Scale Cube - Model depicting the dimensions to scale a service.
- Benchmark Requirements for Microservices Architecture Research 🔸PDF - Set of requirements that may be useful in selecting a community-owned architecture benchmark to support repeatable microservices research.
- Building Microservices? Here is What You Should Know - A practical overview, based on real-world experience, of what one would need to know in order to build microservices.
- CALM 🔸PDF - Consistency as logical monotonicity.
- Canary Release - Technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.
- CAP Theorem - States that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability and Partition tolerance.
- Formal Foundations of Serverless Computing 🔸PDF - The serverless computing abstraction exposes several low-level operational details that make it hard for programmers to write and reason about their code. This paper sheds light on this problem by presenting λ, an operational semantics of the essence of serverless computing.
- Introducing Domain-Oriented Microservice Architecture - Introduction to Uber Engineering's generalized approach to microservice architectures, named “Domain-Oriented Microservice Architecture” (DOMA).
- Java Microservices: A Practical Guide - You can use this guide to understand what Java microservices are, how you architect and build them. Also: A look at Java microservice libraries & common questions.
- Microservice Architecture - Particular way of designing software applications as suites of independently deployable services.
- Microservices – Please, don’t - Critical advice about some problems regarding a microservices approach.
- Microservices RefCard - Getting started with microservices.
- Microservices Trade-Offs - Guide to ponder costs and benefits of the microservices architectural style.
- Reactive Manifesto - Reactive systems definition.
- Reactive Streams - Initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.
- ROCAS 🔸PDF - Resource Oriented Computing for Adaptive Systems.
- SECO 🔸PDF - Understanding software ecosystems: a strategic modeling approach.
- Service Discovery in a Microservice Architecture - Overview of discovery and registration patterns.
- Testing Strategies in a Microservice Architecture - Approaches for managing the additional testing complexity of multiple independently deployable components.
- Your Server as a Function 🔸PDF - Describes three abstractions which combine to present a powerful programming model for building safe, modular, and efficient server software: Composable futures, services and filters.
- Microservices - The Journey So Far and Challenges Ahead 🔸PDF - Overview of the state of microservices in both industry and academia.
Talks
- 10 Tips For Failing Badly at Microservices 🔺YT - A presentation at Voxxed Days by David Schmitz.
- Bla Bla Microservices Bla Bla - A talk at the O’Reilly Software Architecture Conference, April 2016.
- Challenges in Implementing MicroServices 🔺YT - A presentation at GOTO 2015 by Fred George.
- Mastering Chaos - A Netflix Guide to Microservices 🔺YT - A presentation at QCon 2016 by Josh Evans.
- Microservices 🔺YT - A presentation at GOTO Berlin 2014 by Martin Fowler.
- Principles Of Microservices 🔺YT - A presentation at Devoxx Belgium by Sam Newman.
Tutorials
- Developing a RESTful Microservice in Python - A story of how an aging Java project was replaced with a microservice built with Python and Flask.
- Developing and Testing Microservices With Docker - An example of the processes involved in creating a simple Docker-packaged Node microservice.
- Microservices without the Servers - Step by step demo-driven talk about serverless architecture.
- Microservices in C#: Part 1, Part 2, Part 3, Part 4, Part 5.
- Microservices with Python, RabbitMQ and Nameko
- Reactive Microservices - Project showcasing different microservice communication styles using Scala, Akka, Play and other tools from Scala ecosystem.
- Using Packer and Ansible to build immutable infrastructure
Books
- Building Microservices 🔸PDF - Building Microservices: Designing Fine-grained Systems. Sam Newman. Preview Edition.
- Istio in Action - Teaches you how to implement a full-featured Istio-based service mesh to manage a microservices application.
- Microservice Architecture: Aligning Principles, Practices, and Culture - Practical advice for the strategy and design of Microservices.
- Microservices in Action - A practical book about building and deploying microservice-based applications.
- Microservices in .NET Core - A comprehensive guide to building microservice systems using the .NET stack.
- Microservice Patterns - Teaches how to build applications with the microservice architecture and how to refactor a monolithic application to microservices.
- Microservices from Theory to Practice - Microservices from Theory to Practice: Creating Applications in IBM Bluemix Using the Microservices Approach. IBM Redbooks publication.
- Migrating to Cloud Native Application Architectures - This O’Reilly report defines the unique characteristics of cloud native application architectures such as microservices and twelve-factor applications.
- Pulsar in Action - A practical book about developing microservice-based applications using Apache Pulsar by Manning Press.
- Testing Microservices with Mountebank - Provides a testing strategy using mountebank for service virtualization, promoting independent releases of microservices.
- The Art of Scalability - The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise. Martin L. Abbott, Michael T. Fisher.
- The New Stack eBook Series - A Comprehensive Overview of the Docker and Container Ecosystem.
- Book 1: The Docker Container Ecosystem.
- Book 2: Applications & Microservices with Docker & Containers.
- Book 3: Automation & Orchestration with Docker & Containers.
- Book 4: Network, Security & Storage with Docker & Containers.
- Book 5: Monitoring & Management with Docker & Containers.
- The Tao of Microservices - Teaches the path to understanding how to apply microservices architecture with your own real-world projects.
- Micro Frontends in Action - A practical guide that teaches how to develop large software projects with multiple independent teams.
Sites & Organizations
- Cloud Native Computing Foundation - The Cloud Native Computing Foundation builds sustainable ecosystems and fosters a community around a constellation of high-quality projects that orchestrate containers as part of a microservices architecture.
- CNCF Cloud Native Interactive Landscape - Interactive landscape of cloud native technologies.
- Microservices Resource Guide - Martin Fowler's choice of articles, videos, books, and podcasts that can teach you more about the microservices architectural style.
- Microservice Patterns - Microservice architecture patterns and best practices.
- Microservice Antipatterns and Pitfalls - The best-known microservice antipatterns and pitfalls.
Emerging Technologies
- Enso - Visual and textual functional programming language with a focus on productivity, collaboration and development ergonomics.
- Holochain - A framework for distributed applications, allowing you to build apps without any network constraints. This means every user controls their own data, and it can't be sold or exposed to third parties.
- Ops - Free open source tool to build, run and deploy existing linux applications as unikernels.
- SAFE Network - Powered by the spare capacity of everyday computers, SAFE replaces the vulnerable structure of the existing Web with a decentralised, autonomous network. One that is secure, and accessible to everyone.
- Solid - Empowers users and organizations to separate their data from the applications that use it. It allows people to look at the same data with different apps at the same time. It opens brand new avenues for creativity, problem-solving, and commerce.