Distributed Systems & team size
In 2011, a million simultaneous requests per minute against a central inventory system would have been considered a lot. It would have been a remarkable system to have built in Spring. Working as a senior consultant, I was brought in as a replacement Java architect to cover for one who was departing. My job was to comprehend the departing architect's design, evolve it as we discovered implementation problems, and head off new challenges by altering the prescribed design as the situation evolved. At 12 members with a half dozen stakeholders, it would be the largest team I'd managed in my career.
Team Sizes and Coordination Costs
I normally prefer teams of 3 to 6 developers. I like to have contact with each developer at least once a week. The more attention I can pay each person, the better I can learn their strengths and weaknesses. Above 5 developers I start having to partition work, and my ability to cycle through the whole team starts to lag. Beyond 10 developers I essentially have to create two teams.
In practice this means breaking the team into sub-teams, which is fine if I know my delegates well and trust they'll follow through on our shared vision. The issue becomes the coordination cost between group members. That's effectively the same problem distributed systems have.
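To put a rough number on that coordination cost: the number of pairwise communication channels grows quadratically with head-count, n(n-1)/2 for n people. A back-of-the-envelope sketch (the team sizes are just the ones from this post):

```java
public class CoordinationCost {
    // Pairwise communication channels among n people: n * (n - 1) / 2.
    static long channels(int teamSize) {
        return (long) teamSize * (teamSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 6, 12}) {
            System.out.printf("team of %2d -> %3d channels%n", n, channels(n));
        }
    }
}
```

A team of 3 has 3 channels to keep warm, a team of 6 has 15, and a team of 12 has 66, which is why partitioning into sub-teams becomes unavoidable.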
CAP Theorem and Human Teams
The now well-known CAP Theorem wasn't popular yet, so I was carrying the concept around intuitively. There's an intuitive correctness to the idea that a distributed group is some compromise between being consistent, available, and partition-tolerant. Extending the analogy to people, for a human team the CAP Theorem is about the team having a consistent vision for its work, the team members being available for each other to collaborate with, and the team's ability to handle partitioning, as with workers in other remote locations.
For a workplace in 2011, a common solution to the CAP problem was to co-locate the teams in a single large open space. This avoided partitioning by physically putting everyone together, and putting everyone on the same mandatory work schedule avoided availability problems. But what avoids the consistency issue, in all the different ways a team of developers has to remain consistent? In simple human terms: technical leadership and management.
And then COVID-19
With the COVID-19 inspired remote-work trend, the old trick of short-circuiting consistency and availability problems by refusing to tolerate partitions at all (no remote work) just doesn't work anymore. This forces software development technical leads and managers to adopt different strategies. You are going to be partitioned at the finest grain possible… partitioned by the individual person. So what do we do now?
Consensus Algorithms in human terms
The Relational Database Management System (RDBMS) should be a friendly reminder of the inventions of the 1980s that created the 1990s. The RDBMS was also the source of The Vietnam of Computer Science. The various RDBMSs solve consistency problems in a variety of ways: row locks, index partitions, organic indexes, read-only replicas, and multi-master architectures.
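As one concrete example of those classic techniques, here is a minimal sketch of a pessimistic row lock through plain JDBC. The connection URL, table, and SKU are placeholders invented for illustration; the SELECT … FOR UPDATE clause is the standard way most RDBMSs keep a read-modify-write consistent in the face of concurrent writers.

```java
import java.sql.*;

public class RowLockSketch {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection; a real system would pull this from a pool.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/inventory")) {
            conn.setAutoCommit(false);
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT quantity FROM stock WHERE sku = ? FOR UPDATE")) {
                lock.setString(1, "SKU-123");
                try (ResultSet rs = lock.executeQuery()) {
                    // Until this transaction commits, no other writer can touch the row,
                    // so the read-modify-write below stays consistent.
                    if (rs.next()) {
                        int qty = rs.getInt(1);
                        try (PreparedStatement update = conn.prepareStatement(
                                "UPDATE stock SET quantity = ? WHERE sku = ?")) {
                            update.setInt(1, qty - 1);
                            update.setString(2, "SKU-123");
                            update.executeUpdate();
                        }
                    }
                }
            }
            conn.commit();
        }
    }
}
```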
In these classic techniques, distributed consensus is handled the way I resolved my large developer team problems: by appointing (officially or organically) local leaders who have higher authority over sub-domains of the unified problem domain. In human terms this means a database team, a model team, a service team, a UI/UX team, and so on. When an issue cuts across teams, you have to up-level the discussion and appeal either to a group consensus arrived at by good-faith argument and voting, or to a higher authority, as in the sketch below.
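A toy sketch of the jurisdiction idea, assuming hypothetical path prefixes and team names: a change that stays inside one sub-team's slice is decided locally, while anything that crosses slices gets up-leveled.

```java
import java.util.*;

public class Jurisdictions {
    // Hypothetical ownership map: each sub-team has authority over one slice of the code base.
    static final Map<String, String> OWNERS = Map.of(
        "db/",      "database team",
        "model/",   "model team",
        "service/", "service team",
        "ui/",      "UI/UX team"
    );

    // Decide who settles a change: the local owner, or an up-leveled cross-team discussion.
    static String decide(List<String> changedPaths) {
        Set<String> teams = new TreeSet<>();
        for (String path : changedPaths) {
            OWNERS.forEach((prefix, team) -> {
                if (path.startsWith(prefix)) teams.add(team);
            });
        }
        if (teams.size() == 1) {
            return "decided locally by the " + teams.iterator().next();
        }
        return "cross-cutting: up-level to group vote or a higher authority " + teams;
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of("service/InventoryService.java")));
        System.out.println(decide(List.of("db/schema.sql", "ui/InventoryView.js")));
    }
}
```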
I've been calling the last-ditch "appeal to higher authority" gambit appealing to a jurisdiction. That's because, when we look at law as a technology, it has the same basic problems any human-centric technology has. Human court systems need appellate courts, as well as the ability to bend or break the rules when a situation demands it. So too will a computerized system run into out-of-bounds situations that a central original designer simply won't be able to anticipate.
CI/CD as a distributed consensus system
In pursuing my goal of distributed teams unlocked from having to work at the same times and in the same places, I've discovered that the CI/CD system can actually act as an asynchronous consensus system. But, much as NoSQL systems lack features of an RDBMS, this system lacks some features around consistency and availability (in terms of rapid-fire conversations). It tolerates teams partitioned by time zones, and possibly by work goals, better, but it has a long consistency tail. Arriving at a consistent vision simply takes longer and requires leaders to be more amenable to allowing competing visions to coexist.
Automated Technical Leadership
This lack of consistent vision and practice shifts the technical lead's role away from creating PowerPoints and peer reviews and toward creating automations that enforce consistency where it matters. The tight-fisted technical leader will simply work themselves sick trying to engender lock-step consistency of practice and vision across all their developers.
In this spirit, I've tried creating pressure on teams to pass the CI/CD pipeline as a kind of automated specification-review system. The idea is to codify which people need to be gatekeepers for which changes, and to put guidelines like test coverage, documentation standards, formatting, and specifications into a development team's pre-build and test systems, along the lines of the sketch below.
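A small, hypothetical example of one such codified guideline: a gate step that fails the build when measured coverage drops below a threshold the team agreed on. The property name, default value, and the way the measurement arrives are made up for illustration; a real pipeline would wire this to whatever coverage tool the build already runs.

```java
public class CoverageGate {
    public static void main(String[] args) {
        // Threshold the team agreed on, overridable per project via a JVM property.
        double threshold = Double.parseDouble(System.getProperty("coverage.min", "0.80"));
        // Measured line coverage, passed in by the pipeline (e.g. parsed from a coverage report).
        double measured = args.length > 0 ? Double.parseDouble(args[0]) : 0.0;

        if (measured < threshold) {
            System.err.printf("Line coverage %.0f%% is below the agreed %.0f%% gate%n",
                    measured * 100, threshold * 100);
            System.exit(1); // a red build is the automated "specification review" saying no
        }
        System.out.printf("Line coverage %.0f%% meets the %.0f%% gate%n",
                measured * 100, threshold * 100);
    }
}
```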
A lot of managers hear this and think: "that's QE/QA work." While that isn't completely wrong, it misses the point. This early "quality" test can't possibly replace real Quality Engineering work. For one, the pre-build quality we're testing here sits at a completely different level of abstraction. This kind of work attempts to heat up issues in the consistency of developer practice and understanding so that they can be found without having to scrutinize every individual contribution.
A full Quality Engineering (QE) solution would not necessarily cover issues internal to the semantics of the developers' practice. Instead, that separate and powerful practice is involved in isolating software system behaviors in increasingly accurate simulations of the production environment. QE is interested in identifying mismatches between the code produced and the implicit requirements imposed on it by the real-world operating environment. This is subtly related to, but independent from, the vision, understanding, and execution of how the software achieves its goals.
Developer Consensus in Distributed Development
Unlike my first experience with a team writing code for a large distributed system, after COVID-19 we (developers) are not centrally located, time-zone synchronized, and kept individually consistent by top-down management. Instead, the teams creating distributed systems have problems similar to those of the distributed systems they themselves are creating. The trade-offs are similar, and not necessarily intuitive.
The formula for succeeding at this distributed development endeavor is, unsurprisingly: provide a clear vision and clean boundaries, with jurisdictions and appeals. It's not as clean as central control, but it is more human. And the human factors are more important in the long run.