Who Says Distributed Monoliths are Bad?

I’m going down the rabbit hole of reading about microservices for an upcoming talk. As a UI guy, I’m always playing catch-up, more so on the back-end. I’m intentionally targeting my survivorship bias, trying to find _anything_ that cites Distributed Monoliths as good (most assume bad). This was originally posted on Twitter.

Maybe I’m just looking for terms. Either way, the groupthink on forgiving monoliths’ bad reputation as just “lack of tests & easy deployments” is fascinating. Like, in all 5 essays I’ve read. All these writers are clearly smart, experienced, and appear to have benevolent intents.

Many actually wrote short articles citing the core things to care about: know what you’re actually building from the business so you know what to abstract. Only then, when you grok what you’re building, do you carve off the odd things that are hard to manage.

… but they completely neglect to verbosely berate why Distributed Monoliths are bad. For example, here are things they imply are bad: a change in 1 service requires change(s) in other(s), deploying 1 requires other(s) deployed at the same time, lots of communication, same devs work across many services, many share the same datastore, code shares the same code or models. That list is a rubric for symptoms indicating you may have a distributed monolith. My favorite is the reverse, though…

Like pointing out some exceptions to the good. i.e. Can a dev change a service without affecting others (within reason)? Did you see that “within reason”? Software is full of caveats, sure, but let’s explore those with the distributed monolith “bad symptoms”, please?

I know many of you reading this _know_ of the various problems inherently. My point is acknowledge when it’s ok from your perspective to have a service that when changed requires the entire company to know, forcefully upgrade their code, and WHY you do that. 

Maybe even that is a series of microservices, but has its own abstraction on top. Example, upgrading a library used by 1 to many services. People think of “running code” but gloss over deps of said service(s). Or Python 2 to Python 3. Forced horror show vs. Node 10 to 12. 

This ideal of releasing a service and NO ONE is affected is like FP people who dream of no side effects (Elm doesn’t count). If a service could be released and never break anyone… what… purpose does it actually have? Clearly SOMETHING is talking to it.

Once we acknowledge that truth, you can understand that DLL hell being mitigated by static linking is similar to npm using shrinkwrap, and later package-lock.json… and later still many using yarn or SHA tested node_modules zips. Which then leads to schemas & contracts.

Our code is correct. How it talks has a contract. Contract testing, though, is rife with “Well, my version passes, so if it breaks, it’s not my fault, but I’ll end up fixing it anyway”. When someone says “microservices should be able to be deployed independently”, dude…

… sure, but like, WHO is she interacting with?

“We can’t test all those downstream deps. That’s their job, we have a published contract and the new code still uses it”. 

Anyone who’s done async programming knows contracts don’t cover time or errors.

“Yes, I’m getting the Protobuf response decoded successfully… but it’s _wrong_.”

“Yes, I’m getting the correct error code 2003, but I _shouldn’t_ be getting it.”

“Why am I getting 2 responses?”
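To make those three complaints concrete, here is a minimal sketch of what a shape-only contract check can and cannot see. Everything in it is an assumption for illustration: the field names, the `2003` code, and the `order_id` used to spot repeats are made up, not taken from any real service.

```python
# Hypothetical contract: every response has an int "code" and a str "order_id".
REQUIRED_FIELDS = {"code": int, "order_id": str}


def satisfies_contract(response: dict) -> bool:
    """Shape check only: the kind of thing a contract test verifies."""
    return all(
        isinstance(response.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def first_delivery_only(responses: list[dict]) -> list[dict]:
    """Drop repeats of the same order_id, keeping the first one seen."""
    seen: set[str] = set()
    unique = []
    for response in responses:
        if response["order_id"] not in seen:
            seen.add(response["order_id"])
            unique.append(response)
    return unique


# Both responses pass the contract, yet the consumer got the same answer twice,
# and nothing in the contract says whether code 2003 *should* have happened.
responses = [
    {"code": 2003, "order_id": "A-1"},
    {"code": 2003, "order_id": "A-1"},
]
```

The contract is happy with both messages. Noticing the duplicate, or that the error code is the wrong one for this situation, is left to whoever consumes them.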

I get they’re focusing on the basics. Independent deployment is good if your product releases without care of another. That doesn’t mean that deploying things together is bad. They just come from slow, error prone, multi-day releases & rollbacks, so lump ’em together.

New Relic, in regards to warning about microservices making multiple calls, was the first I found to acknowledge distributed monoliths “can actually perform pretty well, and may never experience significant issues with response times”.

Yet in the next paragraph they use the metric “2 or 3 calls” may indicate poor coupling. Says who? What’s a good metric? 1? 0.2? 5? At the beginning of the project or at the 4th production release? Is there a business/SLA/developer is exhausted/has no tech lead reason(s)?

As many readers will know, they’re referring to “if 3 services keep calling each other over REST, but they could just be 1 service calling each other via function/class method calls, just refactor to 1”. But hold on… why? Why is that implied? 

What was the instinct of these developers creating each call as 1 service? Whether monorepo or many, did they ENJOY having just 1 entire code base around a single REST call? What could be so complicated that a single REST call would need to be its own service?

Express hello-world: its own code base.

So wait, if you have 5 routes, do you have 5 code bases?

Depends on who you ask. The “fail forward” crowd says yes, and each has its own CI/CD pipeline, and independently deploys.

Serverless Framework/SAM users are the opposite.

They have a single monorepo, but can either deploy a single service, or all, in a single deploy action. What does that mean for the “independent deployment” proponents? Are we, or are we not, negatively affecting other services? Yes. No. Both?

Second, is that good if we can test the services & stack both independently & together, and it’s reasonably fast? Are we allowed, then, to break that rule? Sometimes?

You can also now share code, drastically reducing duplicated logic across services.

I think we can summarize that these developers did it because it was easy, they had something working quickly, they could easily deploy, see it work on a server, and the cognitive load was SUPER LOW. The sense of accomplishment early on; was it really based on making progress?

The articles imply “no, they should think ahead of time of these calls, and put them into a monolith to reduce insane latency”. What if they didn’t know? What if they learn, and then refactor? If it’s a monorepo vs. a bunch of repos, describe that refactor story in your head.

Many will point out, “right, but now you have tons of ‘things’ to manage”. We need to piece apart “manage”, though, like Yan Cui does in his article discussing many functions vs Lambdaliths: https://medium.com/hackernoon/aws-lambda-should-you-have-few-monolithic-functions-or-many-single-purposed-functions-8c3872d4338f

Now in recent years, I’ve focused more on just developer concerns, such as cognitive load, debugging, and feedback loop speed. But Yan also cites “scaling the team” which I like. Discoverability isn’t a concern for developers; we have everyone else’s code, but our own (sometimes).

Discoverability is also a problem with Monoliths/APIs, data, streams… you name it. It’s hard to find things. I know people whose partial role is strictly to reduce duplication within companies. Not just libraries, but like actual business apps.

Key line is his justification for smaller functions for debugging: “A monolithic function that has more branching & in general does more things, would understandably take more cognitive effort to comprehend & follow through to the code that is relevant to the problem at hand.”

Contrast that with our earlier example of a developer starting with just 1 function doing a REST call/Express hello-world in a repo. Yan has a quote for that as well:

“HTTP error or an error stack trace in the logs, to the relevant function & then the repo is the same regardless whether the function does one thing or many different things.” This works _for him_, but I think he’s just being humble/modest/rad.

I’d argue ALL DEVELOPERS want that. Yes, many developers get a rush finding a bug. That rush continues well into your career. Sometimes, though, errors and bugs are stressful. Sometimes they have horrible consequences. Some don’t like days of debugging.

Can we surmise, then, inadvertently developers WANT a distributed monolith merely by starting to code that way, but for performance reasons they should refactor back parts of it to more monolith functions? What are the other costs here?

Scaling your team/organization as a benefit aside, let’s focus on this quote: “Also, restricting a function to doing just one thing also helps limit how complex a function can become.”

NO ONE can argue a truism of software, and they are few, is ALL code grows over time. 

No matter how small, how simple, how perfect, Entropy affects code as well. Sometimes it’s for good reasons; Cyber finds a security flaw, someone finds a speed optimization/cost saving, or it was adapted to a new business need.

Other times it’s more plumbing or uncaring. Error handling, enterprise JSON logging, more verbose logging, metrics, manual trace statements, accidental pollution from senior developers who don’t know what’s going on, or juniors who are flailing adding random bits.

So developers have found, the best way to fight that is to start as small as possible. How do you do that? Deploy 1 function. Sure, a class with 1 method counts. That’s an intentional, responsible, benevolent start, not a “desire to use microservices because we heard it’s cool”.

Here’s the brain warping quote:

“To make something more complex you would instead compose these simple functions together via other means, such as with AWS Step Functions.”

I got into Python and Node for creating orchestrating layers.

Basically either a Back-end for a Front-End (Sam Newman describes it best https://samnewman.io/patterns/architectural/bff/ ) or just a simple API returning the JSON I need from back-end services that cannot/will not change, or there is too much political ill will, or even just time, required to change.  

Need to get a user, parse some XML, and hit a database? As a front-end developer, doing that on the front-end, while do-able, just ends up exposing how much technical debt your back-end has and kills the user experience. Instead, building 1 REST call to abstract the nasty.
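As a rough illustration of that “1 REST call to abstract the nasty”, here is a sketch of an orchestrating function. The legacy XML payload, the `orders` table, and every function name are hypothetical stand-ins; an in-memory SQLite database plays the part of the real one so the example is self-contained.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Stand-ins for back-end systems that cannot or will not change (all hypothetical).
LEGACY_USER_XML = "<user><id>42</id><name>Ada</name></user>"


def fetch_user_xml(user_id: int) -> str:
    """Pretend call to a legacy service that only speaks XML."""
    return LEGACY_USER_XML


def parse_user(xml_text: str) -> dict:
    """Turn the legacy XML into the plain dict the UI wants."""
    root = ET.fromstring(xml_text)
    return {"id": int(root.findtext("id")), "name": root.findtext("name")}


def fetch_orders(db: sqlite3.Connection, user_id: int) -> list[str]:
    """Look up the user's order items, sorted for a stable response."""
    rows = db.execute(
        "SELECT item FROM orders WHERE user_id = ? ORDER BY item", (user_id,)
    )
    return [item for (item,) in rows]


def get_profile(db: sqlite3.Connection, user_id: int) -> str:
    """The single call the UI makes; the nasty parts stay behind it."""
    user = parse_user(fetch_user_xml(user_id))
    return json.dumps({**user, "orders": fetch_orders(db, user_id)})


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(42, "book"), (42, "lamp"), (7, "pen")]
)
profile = get_profile(db, 42)
```

Each small function does one thing, and `get_profile` is the only place that knows they belong together. Swap that one function for a state machine and you have roughly the shape of what Yan describes next.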

However, Yan’s referring to AWS Step Functions. They’re a Tweetstorm in their own right. Suffice to say it’s a solution that removes the “web of services calling each other increasing latency and showing we created a distributed monolith, oh noes”. 

I know I’m neglecting release friction here; let’s cover it right quick. Slow release, risky release, slow testing, slow rollbacks, are all bad. If you can deploy something quickly, slowly roll it out, quickly roll it back, and testing is fast throughout; that’s good.

Whether single repo or monorepo, both small offerings and large ones behind Enterprise firewalls & red tape have greatly sped up and been simplified. Releasing a Lambda function is as simple as “click the save button” or a shell script in 4 seconds, or a 20 second CloudFormation deploy.

Many of us are still doing lift and shift: moving something old and using on-prem servers to the cloud with little to no architecture changes of the original app. That often means ignoring, on purpose, glaring problems of the app not taking advantage of what the cloud offers.

So these concerns ARE still relevant. There are various hybrid versions of this, and we do want to avoid large releases to avoid risk and increase our chance of success. These are a given. How you do that isn’t.

Back to more interesting things in New Relic’s article https://blog.newrelic.com/engineering/distributed-monolith-vs-microservices/

They cite using a shared datastore as a bad thing. However, that’s often a GOOD thing. Databases tend to be the single source of truth; thus you cannot have “their own” in finance for example.

In fact, S3, or Elastic File System which is built-in, are great ways to share the same data for many Lambda functions. Ask any programmer to deal with multiple sources of truth and they immediately ask “how do we correct this?” Yes, I know that’s not what New Relic meant, but “datastore” isn’t really helpful when people are learning stateless microservices and need to know where they’re supposed to put state. S3 is awesome, battle tested, and has lots of Google results for common problems. Data is hard. Having smart people handle that, so you don’t have to, is good.

This means your services have 1 less thing to go wrong. Seems trite, but you’ll see this “their own data store” thing come up a lot, I think because Netflix was big on it years ago; around the same time Circuit Breaker Pattern became the greatest design pattern of them all.

Finally, New Relic encourages scaling of services independently. While they don’t expound on it much, it seems to imply the more, the better, because each that has a spike can be independently scaled. Lambdas have reserved concurrency you can up; ECS/EKS more containers.

Hardly the Distributed Monolith bashing I was looking for. In fact, while I get New Relic is selling services, they’re literally fixing some of the problems having so many services bring, specifically tracing: “Visually showing a request go through all the things”.

Basically how you debug it all at once. This also includes monitoring, which now includes not just all your services, but decreases blast radius. If 1 fails, it no longer throws an Exception potentially bringing the monolith down or putting the server in a bad state.

However, failure/errors no longer mean what they used to. Let me give you an example of something that embraces errors: Erlang. Or even Apollo 11. Or Akka. Erlang popularized “let it crash”.

Using a process to watch another process, think of a try/catch that waits awhile. You then can upgrade your code WHILE it’s running:

https://ferd.ca/a-pipeline-made-of-airbags.html

The good part of original Object Oriented Programming, message passing.

While the author is sad, it is our life now. Things crash, and SOMEONE ELSE figures it out. Lambda fail? Don’t worry, we’ll try a couple more times. Docker crash? ECS will start a new one. So the meaning of “health” has changed. A monolith was pretty binary, which is why perf tests had stress tests: see what point she breaks at, and if she gets stressed, does she become healthy again? You still do those types of tests with microservices, but they are SUPER resilient against failures vs. your try/catch-fu combined with your compiler enforcing throwable.
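The “someone else restarts it” idea can be sketched in a few lines. This is a toy, not how Erlang supervisors, Lambda, or ECS actually work internally; `supervise`, `make_flaky`, and the default of 2 restarts are all assumptions picked for the example.

```python
def supervise(task, max_restarts: int = 2):
    """Run task; if it crashes, start it again, up to max_restarts extra attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_restarts:
                raise  # out of restarts: let the failure surface


def make_flaky(failures_before_success: int):
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def worker():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("crashed")
        return "ok"

    return worker


result, attempts = supervise(make_flaky(2))
```

The worker has no error handling of its own. It just crashes, and whether the system counts as “healthy” depends on how many crashes the thing watching it is willing to absorb.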

Health is now more transient. Pieces can be fixed in near-real time AS A DEPLOYMENT. Not many monoliths can do that save Erlang. Yes, many ECS/EKS/Kubernetes deployments just “spin up a new Docker container” so it uses that version, but I’m talking scalpel function level.

1 function broke? Fix it.

vs

1 function in code broke? Fix it, deploy new container, API weighted route will use it for Canary deployment.

Still struggling to find the article regaling me all the nuances in the above 2 statements.
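One of those nuances, the weighted route behind a Canary deployment, is small enough to sketch. The version names and the 90/10 split are assumptions, and a real API gateway or load balancer does this for you; the point is only what “weighted” means.

```python
import random


def pick_version(weights: dict[str, float], rng: random.Random) -> str:
    """Choose a version name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical canary: 10% of traffic goes to the freshly deployed version.
canary = {"v1": 0.9, "v2": 0.1}
rng = random.Random(0)  # seeded only so the sketch is repeatable
sample = [pick_version(canary, rng) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)
```

Rolling forward means nudging the `v2` weight up; rolling back means setting it to 0. Neither touches the code of the function that broke.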

I know WHY the New Relic articles are written like this; they’re selling their rad tech. They, like Datadog, have this “you need to build, monitor, and explore emergent behaviors in your microservices to change them over time” message.

A lot of the 2018 or earlier microservice articles made it sound like once you’re in Distributed Monolith land, you’re toast, or should have just done a monolith first. Pretty sure TDD/Red Green Refactor was still popular then too, oddly.

It’s framed as “troubleshooting” by New Relic for making better UX, but I’d argue it’s like a magnifying glass you use to paint small figures. It’s another tool for an artist to do their work properly. These tools are now de-facto, not something you _may_ want to try.

I really liked New Relic’s breaking of the narrative mold of “never stop developing”; somehow the word “iteration” seemed to be removed from all microservice blog posts. I guess because many viewed those projects as un-saveable back then.

I also liked Jonathan Owens’ final take here on reviewing the human cost: https://thenewstack.io/5-things-to-know-before-adopting-microservice-and-container-architectures/

I’ll never be a manager, but really appreciated my manager’s view on us “learning” all this stuff. Yes, I know how to write code, but…

in many Lambdas + Step Functions + deployed in prod? How does it scale? How do you fix scalability problems? How do you deploy within our company’s cyber & regulatory challenges? That’s a HUGE managerial effort and only supported by (I think) a delegation/trust/empowerment style.

While not specifically called out yet (still reading), it seems many of the worries of microservice architectures are implying Docker, and not Serverless. I’m still learning the limits, but it seems a _very_ different mindset in the different camps, the Docker camp being heavily Unix.

This Ops familiarity I find interesting as those are typically the traditional front-end heroes. As a long time UI developer, I knew nothing beyond FTP of deploying my code, and Unix Ops people would work magic and bring my UI to the world. This same crew now is heavy into the K8/EKS container world, and it’s fascinating how “we’re both doing microservices”, but different. I don’t use AMIs. I don’t refresh them. Excluding Batch, I don’t tweak things like file handle numbers, or care about Red Hat versions, or worry about global exception handling.

The nuances there are vast and I don’t see articles really cover this either. How do Distributed Monoliths in K8 compare to Serverless? Like do they have commonalities for anti-patterns or are there any interesting differences?

Many of these articles do NOT cover data models very much. They say “Model”, but they mean what you think of as your business problem you’re solving and the code wrapped around the data.

João Vazao Vasques covers that here: https://medium.com/@joaovasques/your-distributed-monoliths-are-secretly-plotting-against-you-4c1b20324a31

I remember reading this 2 years ago, and I stopped reading at “data”. I was too inexperienced to know what he meant. Now, with AWS EventBridge having built-in smart schemas, I get it: https://docs.aws.amazon.com/eventbridge/latest/userguide/eventbridge-schemas.html

But an interesting note you may gloss over is EventSourcing.

Yan Cui has another article called Choreography vs Orchestration, which I basically refer to as Reactive Architecture vs. Orchestration Architecture. https://medium.com/theburningmonk-com/choreography-vs-orchestration-in-the-land-of-serverless-8aaf26690889

Another thing easy to gloss over there is having EventBridge, not SNS/SQS, playing a key role.

EventSourcing, if you’re not aware, is basically like Git or Redux. Immutable events that stream, in order, to mean something. For Git it’s your current commit hash, or branch you’re currently on. For Redux, it’s what your UI is currently displaying for your Object.

Distributed _anything_ typically has some type of Event. For Docker, the message bus, if it’s not REST, could be Kafka/RabbitMQ, etc. For Serverless Lambdas or Step Functions… it’s an event as well. While typically JSON, the key here is people are thinking about data integrity.

John A De Goes (@jdegoes) helped make ZIO, a type-safe library for doing concurrency in Scala. If you’re an FP person, you’ll fall in love.
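If the Git/Redux comparison is new to you, a minimal sketch of event sourcing might help. The account events and the `apply`/`replay` names are hypothetical; the only claim is the shape: state is never stored directly, it is folded out of an ordered, immutable log.

```python
from functools import reduce

# Hypothetical account events, in the order they happened; never edited in place.
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrew", "amount": 30},
    {"type": "deposited", "amount": 5},
]


def apply(balance: int, event: dict) -> int:
    """Reducer: previous state + one event -> next state (same idea as Redux)."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrew":
        return balance - event["amount"]
    return balance  # unknown events are ignored, not fatal


def replay(log: list[dict]) -> int:
    """Fold the whole log, starting from an empty account."""
    return reduce(apply, log, 0)


current = replay(events)           # state "now"
as_of_second = replay(events[:2])  # state at any earlier point
```

Because the log is the source of truth, “what did we know at the time?” is just a replay over a shorter slice, which is a big part of why it makes debugging data problems less painful.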

https://github.com/zio/zio

Anyway, relevant quote by this guy around data with link to thread:

“Statically-typed programming language designers give almost no thought to data, even though data dominates everyday programming (reading, loading, transforming, decoding, encoding, validating, persisting, enriching).”

He’s right; this is exactly why I find JavaScript/Python so much more rewarding in the beginning with developing microservices, and hate them at the end. Even in a monolith, the data changes? ALL your code changes.

The biggest learning I had from my latest microservice (Distributed Monolith?) project was MANY of the issues related to data. While a single app, controlled by 1 to 2 developers, had a basic data model, man, 1 change could wreak havoc.

… or not. And that yet again goes to what Yan Cui and John A De Goes and João Vazao Vasques are referring to around data. The code is stateless. Same input, same output. It SHOULD be easy to test, right? With the data I had _at the time_, yes. Change? Maybe boom Bomb.

Interestingly, it was using Step Functions to wire all the microservices together. However, schemas are currently only for EventBridge & things like GraphQL (AppSync). Step Functions do JSON; whether it’s legit JSON or not is on you. For Python/JavaScript? Who cares.

João Vazao Vasques’ final assertion, “correct way to capture data changes is to have systems emit events that follow a specific contract”, seems to jibe with my experience. What’s interesting is whether Lambdaliths (a monolith in a single Lambda) using typed languages are better equipped.
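In a dynamic language, “events that follow a specific contract” mostly means checking at the boundary yourself. Here is one possible sketch, assuming a made-up `UserRenamed` event with two fields; a real project would more likely lean on a schema tool (EventBridge schemas, Avro, Protobuf) than on hand-rolled checks like these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRenamed:
    """Hypothetical event contract; the field names are made up for this sketch."""
    user_id: int
    new_name: str


def decode(raw: dict) -> UserRenamed:
    """Reject events that drift from the contract instead of passing them along."""
    expected = {"user_id": int, "new_name": str}
    if set(raw) != set(expected):
        raise ValueError(f"unexpected fields: {sorted(set(raw) ^ set(expected))}")
    for name, kind in expected.items():
        if not isinstance(raw[name], kind):
            raise ValueError(f"{name} should be {kind.__name__}")
    return UserRenamed(**raw)


def is_valid(raw: dict) -> bool:
    """Convenience wrapper: True when the raw event matches the contract."""
    try:
        decode(raw)
        return True
    except ValueError:
        return False
```

The stateless function behind `decode` can then trust its input. When the data shape changes upstream, the failure shows up at the edge with a readable message instead of three services later.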

Even in 2016, Ben Christensen from Facebook was citing tooling problems: https://infoq.com/news/2016/02/services-distributed-monolith/

I wonder, had we had those tools back then, what kind of changed narratives we would have on what’s included in the anti-pattern vs. not.

Here, someone other than me, explaining why developers go for the single REST call in a single Lambda example: 

“we too often optimize for the short-term since it feels more productive”

Any developer who’s released at least one product to prod and maintained it knows sometimes you have to do one, the other, or both. Client has a $1,000 budget? NBA game on Sunday so we have to release on Saturday come hell or high water?

Contrast that with assuming you get to pay off your technical debt, or _know_ what your long term even is. He says “delaying the cost of decoupling is very high” and we should use easy tools in the beginning. I don’t know what these tools are; the linked page 404s. ☹️

I’m guessing he meant schemas (Avro, Protobuf, etc). Here https://infoq.com/presentations/bbc-distributed-monolith-microservices/, Blanca Garcia Gil quoting Sam Newman describes “The distributed monolith because life is not hard enough”. Petabytes of data processing in their app. Drives their various platforms.

They immediately call out Batch, which I love. Whenever I struggle to scale something in serverless like Lambda or streaming Kinesis/SQS, I fall back on AWS Batch “because ECS without the drama”. It’s nice my gut feeling the BBC was like no bueno.

I deal with large data too (thousands, not billions), but the pain of digging through a failure is SO HARD. The lack of validation, heavy end to end tests, and no event sourcing capabilities. Lots of good nuggets in there, BUT

The best one is talking with the developers who didn’t like the data shape. People say “Protobuf” with a gruff way, but thinking “Oh well, I have the happy looking Go gopher so I’m not actually being gruff”: I get compile guarantees for my data, all is well, right?

As a long time UI developer, I hate all data. Yours, mine, analytics… it’s never right. I serve the almighty designer, and if he/she/they want a label or title formatted some way, I’ll first try formatting, then give up and just format it.

Asking back-end devs to change data for a Designer is like asking private security forces not to aim an AR at you; it’s pointless, and just leaves you angry. Better to run away and handle the problem yourself. The BBC focusing on Developer UX through data is SO RAD.

Again, you see her talk about “our architecture evolves over time”. You don’t see many of the microservice authors talk about this, nor continual learning/refactoring _once you’re in that architecture_. Their own event log forced an easier event sourcing test as well.

The natural language CLI to help developer onboarding, proactive monitoring, and reduce cognitive overhead is genius. Their event sourcing to clean bad data is MUCH more compelling than clean up services in a Saga Pattern: https://theburningmonk.com/2017/07/applying-the-saga-pattern-with-aws-lambda-and-step-functions/

Funny the new architecture is more complex than the first “because microservices”. Maybe Distributed Monoliths make microservices look bad so don’t label me as such?

Errands to run so will write more after reading 3 more blogs this afternoon.

Rehan van der Merwe has a great example here describing not only how to build a distributed monolith, but how to refactor it to microservices. 

https://rehanvdm.com/serverless/refactoring-a-distributed-monolith-to-microservices/index.html

Like me, he prefers Lambdaliths for APIs despite how easy API Gateway or Application Load Balancers make it to point to Lambda functions. I have other reasons because of the CICD pipeline my company forces us to use and our restricted list of AWS services we can use.

It’s important because he illustrates tight coupling that can happen. However, more important is how he _was_ able to build and deploy 3 microservices each on their own route hitting downstream systems, with e2e tests in place, ready to refactor. 😃

Interestingly I’d actually consider his first iteration a success. The latency in place, sure, but great job! I ponder what type of events would transpire to allow my team to refactor to EventBridge in a version. I always hated hearing “Phase 2” because it never arrived.

Oddly, he cites principles from OOP as justification. I say odd, but it’s not odd; OOP is notorious for “encompassing all of programming as the one, true way”. Yeah, ok, sounds great just don’t hurt me. 👍🏼

For what I would consider a small application, this would still take 3 months or more at some places I’ve worked. I can understand why developers who’ve experienced this, & never get the 3 months, write blog posts with prophetic warnings of Distributed Monoliths.

… that said, dude, chill out, your app works, is almost BASE, and you’ve done a great job documenting & coding it with e2e tests. Developers are so focused on clawing their way out of technical debt, they forget to stop, breathe, and embrace their awesome victories.

Sam Newman recently attacked Microservices directly: https://theregister.com/2020/03/04/microservices_last_resort/

If you want an article citing the pros and cons of monolith vs microservice, this helps. Key message: It’s “hard to do microservices well”. Dude, it’s hard to do software well.

He cites lockstep release. If you can deploy pieces of a Distributed Monolith independently, is it still a Distributed Monolith? “Coordinating between multiple teams”; what if it’s just you, but your data changes 3 services?

His message appears to heavily lean on continuous delivery actually being real. Meaning, if it works in QA, then it’ll work in prod because you have the same environments, your tests are 2 legit to quit, and all the things are automated.

The audience? Again, fear and loathing of “the big ball of mud”. Consensus, to me, developers like little code bases. Details of how they work together… perhaps a John Stuart Mill Utilitarianism compromise? A bit of cognitive load pain for coding pleasure?

I like how he acknowledges if u know your domain, u can slice & dice to microservices easily. Is that a problem that going to that architecture is easy? If we can’t easily define our domain boundaries and create hard to deploy balls of mud, is it better we just follow our hearts?

I sadly say this as a Functional Programming fan amongst a world of Object Oriented Programmers, lol. They have a saying in Sociology, people’s interpretations of events may be wrong, but their reaction to it is real. This natural aversion to monoliths; hype or something more?

Interestingly, others have noted that Lambdaliths can be great ways for developers to learn Serverless. While they may not like monoliths, it’s the evil they know vs. the distributed monolith being the evil they don’t yet know.

In summary it appears Distributed Monoliths have no good definitions with recent Continuous Deployment tools like AWS SAM & Serverless Framework negating some of the previous problems. It appears the best I can currently do is the following:

Code change requiring other services to change? It’s bad IF those other services are hard to change. SAM? Easy. Coordinating with another team causing deploy delay? Bad.

Deploying one service requires deploying others in lockstep? In SAM or Serverless: easy. If it requires coordination with another team, a hard to test app, or low monitoring visibility, bad.

Service overly chatty: Pssffff, it’s just version 1 and you wouldn’t have built it that way intentionally if you knew the business domain; refactor! 😃

Same developers work across multiple services: Seems like a good thing for knowledge share, but this one is confusing; don’t you have tests and schemas? Maybe they’re worried about Continuous Integration being a mess because 5 PRs have to be merged in order for “app to work”.

Many services share a datastore? I get this from a resiliency perspective, but from a source of truth and test perspective, I get confused.

Microservices sharing a lot of the same code and models: Using SAM / Serverless for libraries and common util functions – awesome. Sharing Models, though… “Model” I’m assuming being a Model from the OOP world meaning “data of your business domain”, yeah, that seems like a red flag.

Hopefully you can see why I’m confused. Many of the perceptions written over the past 3 years about Distributed Monoliths can be fixed with schemas, deployment tools, and testing. What’s also left out is scope. If you build “just an app” full of microservices…

… on your team, and it’s deployed as a single app in AWS (look on left hand nav in Lambda in AWS Console), and that “app” interfaces with some other team… how is that any different from 2 monoliths talking to each other?

Maybe 1 team likes SAM and microservices, and other likes the AWS CDK and Docker with their monolith? I can’t say I succeeded in finding totally why Distributed Monoliths are bad, but I sure did learn a lot! I hope you did too.

To learn more, beyond rando googling, I found a lot of @theburningmonk articles just “get me”

https://theburningmonk.com/