When AWS Lambda functions fail with exceptions, you can have alarms set up to let you know. The problem is the alarms don’t tell you what the error is, just that a Lambda crashed a bunch of times in a short time frame. Is it something you should keep an eye on? Did something catastrophic happen? Is it something you can ignore? Who knows 🤷🏻‍♂️.
Enter Lambda error parsing from CloudWatch filter pattern subscriptions. In this article, we’ll show how to do that in 2 languages, in a Functional Programming style: ReScript and Python. While subjective, we’ll do our best to compare and contrast the different approaches. I don’t specifically recommend either language for the task; I just find it interesting to compare and contrast, and hope you do too. I’ll cover each function, what it does, and how it fits into the larger whole.
Code – ReScript & Python on Github
Although you can skip around, I recommend reading the ReScript part first as I refer to the implementation details there from the Python code.
When Lambdas log to CloudWatch, it’s a stream of messages. Those streams can be subscribed to. Those streams can also be filtered to “only give me the ones that have errors in them”. You can then have that filtered stream invoke a Lambda with the error log message, parse it, and send the error somewhere. This allows you to build better reactive, and proactive, monitoring, like sending error messages to Slack/PagerDuty/xMatters. Instead of “something broke” you get “the database couldn’t be connected to” or “an upstream service had a blip of downtime, but all is fine now”. Much more useful and actionable, and nice to get in English on your phone when it’s 3am and you’re out of it, vs “some error, please acknowledge page”. “NO, I demand you tell me exactly what is wrong, waking me up at an un-godly hour!” “pLe@$3 ACK-kN0wl3DgE p@ge” “Stupid phone……”
Quick Note on Architecture
If you’ve never deployed a serverless application to AWS before, you may not understand the need for an error parsing Lambda. Here’s the crash course.
If you’re building an API, whether a REST one using API Gateway, or a GraphQL one using AppSync, you’ll have a monorepo with a bunch of Lambda functions in it. Each Lambda function is 1 to many files, and each function typically corresponds to a REST API route, or a GraphQL query/mutation. You’ll use something like Serverless Framework, AWS SAM, or AWS CDK to deploy. The deployment uses CloudFormation behind the scenes, which means when you first deploy, you’re deploying “everything”. If you change 1 line of code in 1 Lambda, you can deploy just that 1 Lambda function update: using Serverless deploy function, or AWS SAM Accelerate, you bypass CloudFormation to quickly test code in a dev/qa/stage environment, something common in Serverless. You can, and should, run and test your code locally, although for some runtimes (Go, Python with C++ libraries, custom runtimes, etc.) this can be difficult, so you just deploy and test in AWS only.
When you hit a URL and it runs your Lambda, or when you hit an AppSync URL and it runs a query/mutation Lambda, there is a chance it could fail. Whether it succeeds or fails, you’ll often see logs from your function in CloudWatch. It’s assumed you’re using a logger that formats as JSON for ease of log processing, but you don’t have to (read: you should). Most frameworks allow you to set up alarms as part of your code, so when Lambdas fail a number of times within a certain time frame, the CloudWatch alarm goes into an ALARM state and immediately sends a message to SNS. This SNS message can be routed to email, another URL endpoint, your phone, Slack, all of them, or whatever.
Alarms, however, merely say things like “Your Lambda crashed 3 times in the past 5 minutes”. They don’t answer why it crashed. In Imperative or Object Oriented Programming, crashes are fine; they’re often intentional. In Functional Programming, this is typically not done; crashes, while not academically defined as side effects, in practice are bad, unintentional side effects. However, the good news is AWS Lambda’s contract is “when your Lambda doesn’t work, you’re supposed to throw an Exception to signal to AWS”. This is good, and is how things like SQS retries work, how Step Functions decide whether they should retry or exponentially back off, etc. That’s good for operations, but not at all helpful for monitoring and support.
To help us know what the errors are, and handle the ones we care about while ignoring the rest, we do log processing, and ONLY process the errors. We can go from one extreme, sending every error with its informative contents to SNS, to the other, sending only the really bad/unexpected ones, or even put some business logic in there to handle error routing. The net result is our support systems such as PagerDuty or xMatters can page whoever is on call that day/week with helpful information, allowing them a final choice to ignore or act on the error. If you’ve ever had to support software before, it probably fell between the 2 extremes of “every error we just ignore” or “every error is a fire drill”. Using Lambda error log parsing is a step in helping those extremes go away. You can then build upon that stream of errors and further filter, route, etc. with other systems.
Below is an architecture of our application that for now forwards all error names and contents to SNS. We do some slight parsing and routing in xMatters which then gets sent to Slack (you can write JavaScript in xMatters to help make the filtering/routing easier). Whoever is on call that week also gets an email and text message.
ReScript
The ReScript programming language is like TypeScript; you write with types and it compiles to JavaScript. Its shining features are the fastest compiler on the planet, sound (not merely strict) types, which give you more confidence your code works when it compiles, and a functional style, while still having escape hatches to slowly integrate with your existing JavaScript codebase.
Pros:
- Compiles and runs quickly, giving you confidence in your code and a fast feedback loop, while leveraging the existing Node.js & JavaScript ecosystem for AWS.
- If you get an error message, you have confidence it is an actual error message.
- Confidence you can decorate it with additional metadata after the fact.
- Crashes from your error parse Lambda can be safely ignored (mostly).
- Currying is built in.
- Function composition via slim arrows is built in, whereas Promise chains require the Promise library. These would normally cancel each other out, except you can seamlessly use that library with ReScript’s operators, and it makes your code look more like JavaScript, so I’m keeping it in the Pros.
Cons:
- If any error messages even slightly deviate from a single structure, you’re in for a world of work. Types are best used by constraining your domain. If you have various ways your Lambdas can explode, and you need to know about them, that’s really hard to do in a soundly typed way.
- Lots of parsing code to ensure you correctly got an error message. Boilerplate types and parsing code are required. Ensure you like this structure upfront, then don’t change it.
- Parsing ReScript exceptions is painful… because they aren’t JavaScript exceptions. Meaning if you’re parsing other Lambdas written in ReScript, you’ll need to ensure “all of them are correct first” which isn’t trivial.
- ReScript does not support JavaScript’s optional chaining, and rightly so; you want to ensure your parsed JSON resolves to a type or not, no in between. However, for log parsing, this can quickly become a pain. For my project, it was fine because all Lambdas were in a monorepo and we agreed upon an error format. However, for general purpose use, or flexibility in the future, I feel like this is a weakness. Yes, I want something like ReScript to ensure I’m only woken at 2am on a Saturday for a real error, compared to “maybe it’s legit” using Python, but correctness is more than just types, something you’ll only learn with integration tests and time with your application in production. Again, this borders on a pro in some contexts, but I bring it up because we’re comparing ReScript to Python, which makes this flexibility a lot easier.
Imports and Code Order
In ReScript, you can bind and add types to any external JavaScript, like you do in TypeScript. However, sometimes it’s wayyyyyyy easier to just fix the JavaScript so when you integrate in ReScript, it’s gorgeous.
Rather than namespace everything, we just open all the things (e.g. it’s more fun to write resolve than Promise.resolve):
open Promise
open Jzon
open Environment
We need the Promise library as it is the best way to use Promises in ReScript. We utilize the Jzon library for deterministically parsing our JSON into sound ReScript types without using the super verbose Js.Json.classify syntax. Finally, we have an Environment module to help our Lambda know if it’s running in a QA, Stage, or Production environment.
ReScript code is just like F# or OCaml; it doesn’t hoist function declarations the way JavaScript does, so we have to define our functions and types before we can use them. That’s fine, but it makes explaining the code backwards (meaning you start at the bottom of the file and work your way up), so we’ll start at our Lambda handler and explain each part, regardless of where it’s defined.
ErrorParsing.res – Lambda Handler
let handler = event =>
sendErrorToSNS(publishSNS(publish), event)
When our Lambda is invoked, it gets 2 parameters, event and context. Typically ReScript and Python are super strict about function arity (how many parameters a function takes). In this top-level case, though, declaring just one is fine, since ReScript compiles to JavaScript, which doesn’t enforce arity at runtime. Additionally, since we’re using the Promise library, if everything works, our Lambda will be considered successful. If anything blows up, our Lambda’s catch will just log it, but still allow the exception to be thrown, so AWS will know our Error Lambda failed and attempt to retry. Alarms will be triggered normally for our Error Lambda failing (this is what we want).
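To see why declaring a single parameter is safe, here is a tiny plain-JavaScript sketch (not the article’s compiled output). The handler body and the context object are made up; it only illustrates that JavaScript ignores extra arguments.

```javascript
// The handler declares one parameter; any extra argument is simply ignored.
const handler = event =>
  event && event.awslogs ? "parsing log data" : "unknown event"

// Simulate the runtime calling handler(event, context); the context is made up
console.log(handler({ awslogs: { data: "H4sI..." } }, { functionName: "example" }))
// parsing log data
console.log(handler.length) // 1: the declared arity, not enforced at call time
```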
For you purity ivory tower developers, you’ll notice that the publish function is our core SNS publish function and creates side effects. Make no mistake; the handler is _not_ a pure function. It’s assumed 99% of our code is written in a pure way, and the handler is where the side effects have been pushed to the side, and we just integrate it from that point forward. Another way to think about it is Functional Core, Imperative Shell.
You’ll notice sendErrorToSNS takes in a publish function. That publish function refers to the JavaScript publish function we’re binding from a module:
@module("./sns.js")
external publish: (string, string, string) => Js.Promise.t<snsResult>
= "publish"
This is ReScript’s syntax for integrating a JavaScript function that you wish to use in ReScript. If you’re familiar with TypeScript, it’s similar to writing a type definition for a JavaScript module. We’ve defined it in the sns.js JavaScript file. It’s a function that takes in 3 strings and returns a Promise of an snsResult. Let’s show the JavaScript first, then we’ll show you the result type.
sns.js
The whole purpose of sns.js is to make publishing an SNS message easier and more predictable. In the v2 aws-sdk, this was reasonably straightforward using a Promise. In the v3 aws-sdk, they over-complicated it, making it require a lot more code. Exposing types for that is pointless, as all we care about is: “Did you send a message to SNS successfully or not?”
It’s easier to abstract away this side effect into a nicer, and simpler, function. “Here’s the info, send a message, lemme know if it worked”. While the goal of the v3 aws-sdk for JavaScript was to reduce file size, I’ve yet to see that come to pass since I don’t use TypeScript, and it just ends up requiring you to use more code. That’s ok, we can abstract away this insanity. Also, this ensures the ReScript types are much easier to write: a function takes typed inputs and returns a typed output we can trust… because we wrote it ourselves. And tested it.
const { SNSClient, PublishCommand } = require("@aws-sdk/client-sns")
If you’re not familiar with the v3 aws-sdk, you import the client you wish to operate on (SNS, SQS, etc.), and then the commands it supports. You configure these commands and then “send it”, which attempts to execute the command. It’s the Command design pattern from OOP, and the entire SDK uses it.
To publish a message to SNS, you need at minimum 3 things:
- what is the ARN (or URL to a thing in AWS) of the SNS Topic
- what is the subject of your message (like an email subject)
- what is the message? (a string or JSON string)
So that ends up being the signature of our publish function:
const publish = (arn, subject, message) =>
Then 1 of 2 things happens: it works or it doesn’t. However, if it doesn’t work, while the exception may include a hint as to why, that isn’t really a helpful type we can use. What we need is a Result type. JavaScript doesn’t have one built in, but we can create one as a plain Object and type it in ReScript. To ensure it always works, we return this Object whether the SNS publish works or fails. The way you ensure that always happens is to resolve the Promise successfully inside the Promise’s catch method.
Let’s set up the imperative nonsense AWS makes us do now. Sadly, all the TypeScript type information is lost… because we’re using ReScript, not TypeScript. Our publish function will wrap everything in a Promise that can’t fail; so instead of the typical (resolve, reject) pattern you see in Promises that wrap asynchronous operations such as callbacks, ours always succeeds, so we just use resolve:
const publish = (arn, subject, message) =>
new Promise(
resolve => { // <-- usually this is (resolve, reject)
And then an inordinate amount of classes & ceremony just to call 1 function:
const client = new SNSClient()
const command = new PublishCommand({
Message: message,
Subject: subject,
TargetArn: arn
})
return client.send(command)
Ok, 2 things to handle, success and failure. Handling success isn’t too bad; we just destructure the MessageId we get back, return our Result Object:
.then(
({ MessageId }) =>
resolve({ ok: true, result: MessageId})
)
A Result Object is just a pattern that says if your operation worked, the Object indicates it via ok: true. If it didn’t, it’s ok: false. If it did work, it probably has some data, so you include it, like result: theData. If it didn’t work, you include the error details, like reason and error: theError.
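The never-rejecting Promise plus Result Object combination can be sketched as a small, generic helper. This is an illustration under assumptions, not the author’s sns.js: toResult is a hypothetical name, and the reason field simply falls back to String(error) for rejections that aren’t Error instances.

```javascript
// Hypothetical helper (not part of the original sns.js): wrap any Promise so it
// always resolves with a Result Object and never rejects.
const toResult = promise =>
  promise
    .then(result => ({ ok: true, result }))
    // A rejection becomes a resolved { ok: false } Result instead of propagating
    .catch(error => ({ ok: false, reason: error?.message ?? String(error) }))

// Both outcomes resolve, so callers only ever need .then
toResult(Promise.resolve("message-id-123")).then(console.log)
// { ok: true, result: 'message-id-123' }
toResult(Promise.reject(new Error("AccessDenied"))).then(console.log)
// { ok: false, reason: 'AccessDenied' }
```

The design choice matches sns.js: callers branch on ok inside then, rather than splitting logic between then and catch.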
Let’s handle the error, and return a resolved Promise to ensure the Promise always works and always calls the then method back in ReScript:
.catch(
error => {
const { ok, result, reason } = safeStringify(error)
if(ok === true) {
resolve({ ok: false, reason: error?.message, error: result })
} else {
resolve({ ok: false, reason: error?.message, error: `Failed to stringify error: ${reason}` })
}
}
)
Hopefully, errors carry more information beyond their error.message. To ensure we get everything possible, we attempt to stringify the error. This’ll help if we deploy a Lambda and it fails to send to SNS. Typically this is an IAM Role permission error, but maybe it’s a simple JavaScript error. Either way, we want to know everything, so we attempt to pass it back to ReScript intact. If that fails for whatever reason, we just send back a generic error saying “Hey, we had a problem, attempted to get you more details, but couldn’t do that, so here is what we do know.” To do even that safely, we use optional chaining: error?.message.
The last detail is ensuring we can safely convert the error to a JSON string. To do that, we’ll use the Result pattern again, with a try/catch since it is synchronous. We could use a Promise, but since we’re doing this inside the Promise catch, it makes more sense to just make it synchronous.
const safeStringify = data => {
try {
const result = JSON.stringify(data)
return { ok: true, result }
} catch(error) {
return { ok: false, reason: error?.message }
}
}
If it works, great, return ok of true with the JSON data as a string. Otherwise, attempt to explain why it failed and return that.
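One caveat worth checking: an Error’s message and stack are non-enumerable own properties, so in Node.js JSON.stringify(new Error("x")) produces "{}", and the stringified error may carry less detail than hoped. Below is a minimal sketch of a variant, safeStringifyError (a hypothetical name), that copies those properties explicitly while keeping the same Result shape; circular structures still fall into the catch.

```javascript
// Hypothetical variant of safeStringify (not the original): Error instances
// keep message/stack as non-enumerable properties, which JSON.stringify skips.
const safeStringifyError = data => {
  try {
    const value =
      data instanceof Error
        ? // Copy every own property, enumerable or not, into a plain Object
          Object.fromEntries(
            Object.getOwnPropertyNames(data).map(key => [key, data[key]])
          )
        : data
    return { ok: true, result: JSON.stringify(value) }
  } catch (error) {
    // e.g. "Converting circular structure to JSON"
    return { ok: false, reason: error?.message }
  }
}

console.log(JSON.stringify(new Error("AccessDenied"))) // {}
console.log(safeStringifyError(new Error("AccessDenied")).result.includes("AccessDenied")) // true
```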
Ok, so all that code allows this ReScript module binding to be correct:
type snsResult = {
ok:bool,
result:Js.Nullable.t<string>,
reason:Js.Nullable.t<string>,
error:Js.Nullable.t<string>
}
@module("./sns.js")
external publish: (string, string, string) => Js.Promise.t<snsResult>
= "publish"
The ok on the snsResult is always there. However, we only have result if ok is true, and reason and error if ok is false. Since this is JavaScript, we can’t trust anything, so instead of indicating the data might be there using an Option, we use Js.Nullable to be safer. Js.Nullable handles undefined and null being different values. That rarely matters, but if you want to be 100% safe, that is the module to use.
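Conceptually, Js.Nullable.toOption treats null and undefined the same way. The plain-JavaScript sketch below only illustrates that idea; the { tag } Objects are a made-up representation, not how ReScript encodes options at runtime.

```javascript
// Illustrative only: both null and undefined collapse into None;
// anything else, even a falsy value, becomes Some(value).
const toOption = nullable =>
  nullable === null || nullable === undefined
    ? { tag: "None" }
    : { tag: "Some", value: nullable }

console.log(null === undefined) // false: JavaScript treats them as different values
console.log(toOption(null)) // { tag: 'None' }
console.log(toOption(undefined)) // { tag: 'None' }
console.log(toOption("")) // { tag: 'Some', value: '' }
```

Note the last case: an empty string is falsy in JavaScript but is still Some, because only null and undefined count as missing.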
publishSNS
Now that you know how we wrapped JavaScript in a safe way to send SNS messages, let’s show how we integrate with that function in ReScript. Our publishSNS function takes 4 parameters:
let publishSNS = (publishFunc, arn, subject, message) => {...
Although ReScript is a data-first programming language, I still follow a data-last style because of my Elm / ML influenced background. We handle our side effect via dependency injection; meaning, we allow the function to take the SNS publishing function as its first input. Where to actually send that message (the arn), what the subject is, and the message itself are the rest of the parameters. For now they’re strings that match what the JavaScript publish function needs, in the same order. It’d be better to make these Product types, but for now they’re just Strings.
In unit tests, the publishFunc function will be some stub that sends back either a happy path snsResult or an unhappy path error version of the same snsResult. In integration tests, it is the real JavaScript publish function from the sns.js file, sending real SNS messages. This allows our unit tests to be deterministic with simple stubs, and our integration tests to test the real functionality of our system, without our code having to change; it just takes a fake function in the unit tests and a real function in the integration tests.
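Here is a plain-JavaScript sketch of the same dependency-injection idea. It assumes hypothetical Ok/Err constructors in place of ReScript’s Ok/Error, and the Error message format is chosen for illustration. The stubs resolve with the snsResult shape from sns.js, so the unit test never touches AWS.

```javascript
// Hypothetical Result constructors standing in for ReScript's Ok/Error
const Ok = value => ({ tag: "Ok", value })
const Err = value => ({ tag: "Error", value })

// publishFunc is injected first: sns.js publish in production, a stub in tests
const publishSNS = (publishFunc, arn, subject, message) =>
  publishFunc(arn, subject, message)
    .then(snsResult =>
      snsResult.ok === true
        ? Ok(snsResult.result ?? "unknown message ID") // breadcrumb default
        : Err(
            `publishSNS failed, reason: ${snsResult.reason ?? "unknown reason"}` +
              `, error: ${snsResult.error ?? "unknown error"}`
          )
    )
    // Never reject: anything unexpected becomes an Error Result
    .catch(() => Err("publishSNS failed, unknown reason."))

// Deterministic stubs resolving with the same shape sns.js resolves with
const happyStub = () => Promise.resolve({ ok: true, result: "stub-message-id" })
const unhappyStub = () =>
  Promise.resolve({ ok: false, reason: "AccessDenied", error: null })

publishSNS(happyStub, "arn:hypothetical", "subject", "message")
  .then(result => console.log(result.tag, result.value))
// Ok stub-message-id
publishSNS(unhappyStub, "arn:hypothetical", "subject", "message")
  .then(result => console.log(result.tag, result.value))
// Error publishSNS failed, reason: AccessDenied, error: unknown error
```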
Calling it, we need to handle that custom Result type we created in the first Promise’s then:
publishFunc(arn, subject, message)
-> then(
resultFromSNS => {
if(resultFromSNS.ok === true) {
...
} else {
...
}
}
)
A few of you may think it’d be nicer to “just use the Promise interface” to avoid the awkwardness above of integrating some custom Result type vs. “dude, just use a Promise”. I tried that in the beginning with ReScript, but found that JavaScript is notorious for not having consistently shaped error messages. Whether native JavaScript, or some library you’re integrating with, it’s just easier to invest a little effort in formatting the error messages so when things do blow up in your Lambdas, you can more easily read and diagnose what went wrong with them. Additionally, when happy paths happen, you’re often not entirely sure it _is_ a happy path. Just because JavaScript gave you something back does not mean it’s correct. The ReScript types will help you verify that, or at the very least give you a little bit of confidence. Let’s attempt to get the SNS confirmation message ID it’s supposed to send back, for example:
if(resultFromSNS.ok === true) {
resultFromSNS.result
-> Js.Nullable.toOption
-> Belt.Option.getWithDefault("unknown message ID")
-> Ok
-> resolve
}
A lot is going on here, so let’s break it down. First, we’re making a huge assumption that if our ok is true, we _should_ have a valid result. However, the type is Js.Nullable, meaning it could be a valid result, or null or undefined; “because JavaScript”. We think we’re good, and we probably are, but ReScript will help guarantee we won’t be surprised in case our assumptions in how we wrote and typed our JavaScript are incorrect. That Js.Nullable.toOption will convert an undefined or null to ReScript’s version of a Maybe, called Option. If we have an undefined or null; cool, it just means None. If, however, result has some data that isn’t undefined or null, cool, we’ll get a Some(theData). However, we don’t want a None… the whole point of returning the message ID was to get a String ID we can use for logging and monitoring the SNS messages we’re publishing if something goes awry. That’s what Belt.Option.getWithDefault handles: a None becomes the text unknown message ID. If we start seeing unknown message ID, while we may have had a successfully published SNS message, the response is probably being handled incorrectly. So partial success in this case is better than none, and better than just throwing an Exception; we can investigate, fix the bug, and do a new deployment.
Now let’s talk about converting that to a Result of Ok vs. just resolving the Promise with data. Typically in JavaScript, Promises give you all the benefits of Functional Programming in a single data type. You get function composition, a Result style interface, and a Monadic way to compose functions together with great flexibility in types. In ReScript, though, we don’t want flexibility in types; we want them constrained. “Did it work or not” should be a black and white question, not nuanced. A Promise or a Result can both answer that question. However, an exception has an extremely important part in AWS Lambda. The entire serverless ecosystem in AWS is built around what I call “The Lambda Contract”. It means, “If your Lambda does not throw an Exception, we will assume it worked, otherwise the Exception being thrown tells us your Lambda failed”. While that leaves little room for nuance if _some things_ worked and _some didn’t_, this is the coding contract we’re operating under. Thus, we play by the rules.
This has an interesting effect on how we utilize Promises in ReScript, specifically for use within AWS Lambda. It means we _intentionally_ throw exceptions when we have confidence our code failed, and we do not when we have confidence our code worked. We still reserve the right to log all kinds of nuance on purpose, to perhaps indicate some part of a larger operation failed, or did something we weren’t expecting while things still appear ok. The way you do this is within your Promise, you pass Results internally to indicate success or failure with more control over the error messages. Then, at the very end, you say “If we have an Ok, we’ll assume all previous steps are correct. If not, we’ll intentionally throw an error with the information we have on hand as to why the operation failed.” This makes the happy path a lot more confident. For the unhappy path, you now have basically 2 reasons something failed: 1 is intentional, you returning an Error result with the information you have on hand. 2 is unintentional, and covers every possible part you missed, whether inside of ReScript or, more likely, in the JavaScript you’re integrating with.
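A minimal sketch of “pass Results internally, throw once at the end”, assuming Results are hypothetical { tag, value } Objects and that the throw happens inside an async handler, whose rejected Promise the Lambda Node.js runtime reports as a failed invocation:

```javascript
// Only this function throws; everything upstream returns { tag, value } Results
const unwrapOrThrow = result => {
  if (result.tag === "Ok") {
    return result.value // normal return: Lambda treats the invocation as a success
  }
  // Intentional, single throw: signals to AWS that this invocation failed
  throw new Error(result.value)
}

// Hypothetical async handler tail: throwing inside an async function rejects
// its Promise, which is how the failure reaches the Lambda runtime
const finish = async resultPromise => unwrapOrThrow(await resultPromise)

finish(Promise.resolve({ tag: "Ok", value: "message-id" })).then(console.log)
// message-id
finish(Promise.resolve({ tag: "Error", value: "parse failed" })).catch(error =>
  console.log(error.message)
)
// parse failed
```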
Below, we handle the else as a first step in gleaning the “known” error messages to the best of our ability, given how we typed things.
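} else {
let reason_ =
resultFromSNS.reason
-> Js.Nullable.toOption
-> Belt.Option.getWithDefault("unknown reason")
let error_ =
resultFromSNS.error
-> Js.Nullable.toOption
-> Belt.Option.getWithDefault("unknown error")
// resolve with an Error carrying both breadcrumbs (exact message format assumed)
Error("publishSNS failed, reason: " ++ reason_ ++ ", error: " ++ error_)
-> resolve
}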
We attempt to get the reason why the sns.js publish function failed, first converting undefined or null to an Option, and then providing a default in case there is no value. The “unknown reason” may appear to be a worthless log. In reality, it’s a breadcrumb. Whenever we encounter that error at runtime, we know that our JavaScript’s not right; something in there isn’t parsing the error correctly, and/or we have a new type of error we’ve never encountered before. A “global find” on that error text will lead you to where you need to work backwards in the code.
Sadly, this is a common tactic I’ve used with ReScript, and even TypeScript & JavaScript, when integrating with other JavaScript code such as libraries. Because error handling is so hard to make deterministic in JavaScript (guaranteeing the error has the Object shape you expect, AND ensuring the error messages indicate what is actually wrong), it sadly bleeds into the rest of the code. This pattern of “do your best in JavaScript to ensure no exceptions occur, and if they do, you just snag off the error message and pass it along” and then subsequently “don’t trust JavaScript, provide a breadcrumb default” becomes commonplace at integration points.
However, another way to look at it: ReScript’s types, AND the escape hatches it provides, can show you where the dangerous parts of your code are… which often correlate highly with where side effects are. These are where you should spend more time running integration tests, and property tests if you have time, to ensure you’ve covered them safely and maybe expose holes in the types you’ve used.
Let’s handle the error at the end. My rule of thumb for all functional code that needs to interface with imperative code is the functional core, imperative shell we mentioned before. All our functions, the publishSNS included, should be as pure as possible and return the correct type. The correct type for publishSNS is a Result; either publish worked or it did not.
-> catch(
_error =>
resolve(Error("publishSNS failed, unknown reason."))
)
Notice 2 things: First, we’re resolving the Promise instead of using reject. When you start composing functions together, and many of them can fail, this makes it a lot easier to chain things, whether sync or async. Typically this would be what Promise was built for, but remember we’re in AWS Lambda here. The contract is “an exception indicates it didn’t work, nothing indicates it did”. We want our code as pure as possible, and only in 1 place do we make the determination to intentionally throw an exception to indicate to AWS we failed. Given we’re dealing with JavaScript, this isn’t 100% infallible like it is with Elm, but that’s ok; if an error occurs we don’t know about, we’ll see it, and ReScript will at least ensure we know very close to where to look in the code.
Secondly, we’re leaving another breadcrumb. This will “probably never happen” given our pretty thorough error handling on the JavaScript side, but for our types to be sound, we handle it. This gives us something to search for, to find exactly where the code last failed if we ever see this at runtime in our CloudWatch logs.
That’s the entirety of both our JavaScript wrapper in sns.js and our ReScript integration, publishSNS. The key to making this testable is ReScript’s built-in curried functions.
For unit tests, we just define a stub:
let snsStub = (_, _, _) =>
Promise.resolve(Ok("some message id"))
It’s a function that takes 3 parameters that we don’t use, and we just return an “Ok, things worked” to indicate a happy path. Using it for our Lambda unit test:
let _ = sendErrorToSNS(snsStub, eventStub)
This ensures the side effects in our unit tests behave the same every time, so our tests are deterministic. More on these later.
However, to swap it out for the real thing, let’s revisit our Lambda handler:
let handler = event =>
sendErrorToSNS(publishSNS(publish), event)
The key there is publishSNS; that’s our function to send a message to SNS. In the real world, we want it to use our JavaScript publish function, so we call publishSNS with it. The publishSNS function takes 4 parameters:
let publishSNS = (publishFunc, arn, subject, message) => {
So just giving it 1 parameter when calling it returns a function that has a function signature like this:
let publishSNSPartial = publishSNS(publish)
// publishSNSPartial: (string, string, string) => Promise.t<result<string, string>>
Using the closure, it bakes in the publish function, and waits for the Lambda to call it with the arn, subject, and message, THEN actually attempts to publish to SNS.
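The same partial application can be sketched in plain JavaScript, which does not curry automatically, so the closure is written by hand. fakePublish is a hypothetical stand-in for the sns.js publish function, and this publishSNS skips the Result conversion to keep the focus on currying.

```javascript
// Supplying publishFunc returns a closure that waits for the other three arguments
const publishSNS = publishFunc => (arn, subject, message) =>
  publishFunc(arn, subject, message)

// Hypothetical stand-in for sns.js publish that echoes its arguments
const fakePublish = (arn, subject, message) =>
  Promise.resolve({ ok: true, result: `${arn}|${subject}|${message}` })

// Baking in the dependency yields a 3-parameter function, like publishSNSPartial
const publishSNSPartial = publishSNS(fakePublish)
console.log(publishSNSPartial.length) // 3
publishSNSPartial("arn", "subject", "message").then(r => console.log(r.result))
// arn|subject|message
```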
sendErrorToSNS
The easiest way to test your AWS Lambda code is to have the handler be 1 line of code that injects real dependencies into a single “do all” function. The do all, in this case “sendErrorToSNS”, takes stubs for unit tests, and real concrete implementations for integration tests and when your Lambda runs in AWS.
The sendErrorToSNS function composes all of our functions together for our Lambda to do its job. The goal in designing the function is to:
- ensure all side effects are pushed to the side
- make as much of the code as pure as possible
- make this THE ONLY PLACE to intentionally throw an exception
As such, the function has 2 distinct parts: the first is where we compose together all the functions to parse the error logs. The 2nd part is where we actually attempt to send it to SNS, and if it fails, throw the exception to signal to AWS we failed. We’ll cover it in that order.
Result Chain
The error logs come in a JSON format that’s zipped and then base64 encoded. It’s a bit of a process to snag out the errors, parse them, and then clean them up so you can send to SNS. The important part to remember here is “SNS is not the goal”. The whole goal of error parsing is when you get an alert, you know WHY you are getting the alert.
The alerts you set up by default with serverless are typically for Lambdas failing; either a runtime exception, a Lambda not having permission via its IAM Role, or a variety of other reasons. However, those alerts aren’t set up well for the errors themselves, which are most of what we get once we’ve been in production for a while. “A Lambda is having an error” is not what you want to see on a phone at 3am on a Saturday. Instead, what you want is “An upstream service notorious for problems has recovered multiple times” in Slack on Monday morning, indicating the problem happened on Saturday. If it is important enough to page you on a glorious Sunday morning before the crack of dawn, things like “Your getProducts Lambda failed because of a JSON parsing error, details & stack trace below” are much more helpful, AND actionable.
How you go about that last part depends on how your monitoring is set up. We’re using xMatters at work and, unlike PagerDuty, xMatters allows you to create flow charts, much like AWS Step Functions, that let you visually orchestrate what happens when you get an alert. These get pretty complex based on different environments, but suffice to say, when we get a Lambda alert from SNS, both email and Slack notifications are sent, as well as triggering an incident that may not self-resolve. We can only do this because we’ve taken the time to format these error messages, so by the time they get to xMatters, it’s (mostly) trivial to interpret and format them for various downstream alerting systems.
That’s what this chained together list of results does; parse and prep the message for monitoring purposes.
let sendErrorToSNS = (snsPublish, event) => {
let result =
parseAWSEvent(event)
-> Result.flatMap( parseAWSEventBody )
-> Result.flatMap( unzipData )
-> Result.flatMap( basicJSONParse )
-> Result.flatMap( cleanUpLogEventMessages )
-> Result.flatMap( formatMessageForSNS )
Ok, that’s a lot. If you’ve used Promises in JavaScript, you can mentally replace Result.flatMap with then and you’ll have a pretty good idea what’s going on: each step runs only if the one before it succeeded. Let’s break down each of these in order.
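If the Promise comparison doesn’t click, here is a minimal JavaScript sketch of the same idea. The { ok, value } / { ok, reason } objects and the Ok, Err, flatMap, and pipeline helpers are our own stand-ins, not ReScript’s or any library’s API. Each step runs only if the previous one succeeded, and the first Err short-circuits the rest, much like a rejected Promise skips later then callbacks.

```javascript
// A tiny stand-in for ReScript's Result type (hypothetical helper names).
const Ok = value => ({ ok: true, value })
const Err = reason => ({ ok: false, reason })

// flatMap: run the next step only if the previous one succeeded.
const flatMap = (result, fn) => (result.ok ? fn(result.value) : result)

// Chain steps left to right, stopping at the first Err.
const pipeline = (input, ...steps) => steps.reduce(flatMap, Ok(input))

// Example steps, loosely mirroring the shape of the real pipeline.
const parseNumber = s =>
  Number.isNaN(Number(s)) ? Err(`not a number: ${s}`) : Ok(Number(s))
const mustBePositive = n => (n > 0 ? Ok(n) : Err(`not positive: ${n}`))
const double = n => Ok(n * 2)

console.log(pipeline('21', parseNumber, mustBePositive, double))
// → { ok: true, value: 42 }
console.log(pipeline('abc', parseNumber, mustBePositive, double))
// → { ok: false, reason: 'not a number: abc' }
```

Unlike a Promise chain, nothing here is asynchronous. It’s the same early exit done synchronously, which is why Result fits these parsing steps so well.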
parseAWSEvent
Error Lambdas are invoked by CloudWatch log streams. Log messages come in as a stream, and each batch of matching messages triggers the Lambda. We only care about errors, so we put a filter on the CloudWatch trigger that says “only trigger us if the log message has the word ‘error’ in it”. This greatly reduces how many times our Lambda is fired, typically only for known error messages. ReScript makes this a lot easier: our types are sound, and we follow the Lambda contract, meaning all of our code is pure and we only intentionally throw exceptions in 1 place within the handler. Together, that gives us a lot of confidence in this error processing architecture.
The event is JSON and looks something like this:
{
  "awslogs": {
    "data": "H4sIAAANvtBw/aW83JT..."
  }
}
That “data” is the gzipped and base64 encoded error from one to many of our Lambdas. Our first step is to confirm our JSON can be decoded safely into strong types. If you’ve never used a strict or soundly typed language before, the challenge with those types is talking to the outside world. For example, if everything is typed, what do you do with things like JSON that aren’t? JSON has primitives, sure, but it’s got an Object-based structure. Some languages will do the basics for you and provide those primitives. Others require you to manually parse everything in excruciating detail. ReScript allows you to do either, choosing whichever you think is best.
That’s our first step: safely bring the JSON into our type system. If we were in JavaScript, we wouldn’t need to parse anything; AWS already runs a JSON.parse behind the scenes before invoking our Lambda. In ReScript, though, we’re using it instead of TypeScript because we believe in the power of sound types, and are willing to put in the parsing work to ensure those types work for us and give us confidence in our code.
To do that easily, we use Jzon. It’s a library that has you define the record-to-fields (encode) and fields-to-record (decode) functions, plus field-level types, so Jzon can parse your JSON with confidence. Unlike Elm, which separates encoding and decoding into 2 different modules (Json.Encode and Json.Decode), Jzon combines them. While its API is quite elegant and lightweight, exemplifying what ReScript/OCaml is good at, I almost never use the encode parts, but am still required to write them. A small price to pay, I think, but still frustrating.
Let’s take a look at this function:
let parseAWSEvent = event =>
  switch Jzon.decodeWith(event, Codecs.cloudWatchLogEvent) {
  | Error(reason) => Error(`parseAWSEvent failed, ${Jzon.DecodingError.toString(reason)}`)
  | Ok(data) => Ok(data)
  }
If our JSON matches our Jzon-defined codec, it gets converted to a nice, soundly typed ReScript Record, and we’ll get an Ok with our data in it. Otherwise, something is off with the JSON, and Jzon will tell us exactly which path and field it is, and what the incorrect shape or type was.
Crash Course on Jzon Codecs
If you’re coming from JavaScript, doing anything beyond JSON.parse seems strange. In typed languages, you want to leverage the types to ensure your code is correct. Converting something from the outside world into your types is always tricky to get right without writing inordinate amounts of parsing code. Jzon fills the gap: it ensures correctness with only a little code, gives good error messages when it fails, and the compiler still helps you write it.
There are basically 4 ways to use Jzon:
- decoding a JSON object from JavaScript into a typed ReScript type (what we’re doing)
- encoding a ReScript type to JSON (what we’re doing at the verrrrrry end)
- decoding a JSON string, like JSON.parse, but into a soundly typed ReScript type like a record
- encoding a ReScript type into a JSON string
All 4 require you to answer 4 basic questions:
- How many fields are there in this object?
- What JSON types do I convert my types into?
- What type am I parsing this JSON into?
- What is the data type of each individual field?
Let’s show our parsing code in that order.
module Codecs
… but first, a bit of organization. We’ll wrap all this parsing stuff in a module. I’ve seen this pattern used a lot, and I get why: it keeps all your parsing code near the types, and you can be free with the names because they live inside an inner module.
module Codecs = {
Then, inside the Codecs module, we’ll define the types for the JSON above, each followed by its codec. I like to give a type and its codec the same name:
type awsData = {
data: string
}
That one is pretty easy; it’s an Object with 1 property, data, that is a string we have to do a ton more parsing on. It’s wrapped in the CloudWatch event JSON:
type cloudWatchLogEvent = {
awslogs: awsData
}
K, so our record types match the JSON. Now let’s write the parsing function right below it:
let cloudWatchLogEvent = Jzon.object1(
  // encode function, decode function, and field definition go here
)
Notice how the type cloudWatchLogEvent matches up with the let cloudWatchLogEvent function. I like that, and it makes it easier to match up in your mind what type you’re parsing to and from. The Jzon.object1 function will return a parser; Jzon uses it to know how to do the 4 kinds of parsing. Our job is to tell it how many fields to expect, which we did using object1. The creator of the library is amazing and a fuckin’ baller and made tons of object functions, up to object25. In Elm, you’d cap out at 8 and get lower quality type errors using pipelines, lelz. Score 1 for ReScript here.
Step 1 is done. Step 2 is the function required to go from a cloudWatchLogEvent record to some JSON. Primitives are fine to simply lump into a tuple, but complex types you’d have to break down. Thankfully for us it’s quite easy. Before you go “Wait, we’re not going to convert our cloudWatchLogEvent record to JSON… we want JSON to the type,” remember that we have to provide both conversion functions; it’s just part of the Jzon library design. Score 1 for Elm here.
({ awslogs }) => ( awslogs ),
Ok, that was pretty simple. Destructure the record, which only has 1 property, awslogs, and shove it off to the right in the tuple.
Step 3: write a function that takes your primitive JSON types in a tuple and makes them into a Record. The only difference is this one can fail; you can’t guarantee the primitives look like you want, so instead of returning our Record, we return a Result with either our Record in it, or an Error saying why we couldn’t parse it successfully.
(( awslogs )) => Ok({ awslogs }),
So basically the opposite of the first; there’s nothing to fail here, so just take the awslogs object and shove it in a record; poof, you have a successfully parsed CloudWatch Log event. If things required parsing strings to variants, or perhaps integers to variants, you could implement switch statements that return Ok or Error.
Step 4 is to define the field name(s) we’re looking for and their type(s). Since we used object1, we only have to do this for 1 field, but if you used something like Jzon.object4, you’d have to do this 4 times. Get it?
Jzon.field("awslogs", awsData)
You can read that as, “When you get this Object, look for a field called ‘awslogs’, and use this parser to verify it’s the correct type.” Here is the parser in its entirety:
let cloudWatchLogEvent = Jzon.object1(
  ({ awslogs }) => ( awslogs ),
  (( awslogs )) => Ok({ awslogs }),
  Jzon.field("awslogs", awsData)
)
What is awsData? Another Jzon parser, for the awsData type we defined. Let’s write that one in 1 fell swoop:
let awsData = Jzon.object1(
({ data }) => ( data ),
(( data )) => Ok({ data }),
Jzon.field("data", Jzon.string)
)
Notice this one is easier type-wise. We have a JSON object like this:
{
"data": "some string stuff"
}
And we defined a ReScript Record type like this:
type awsData = {
data: string
}
So our parser is like “Yo, you’ll get an Object with 1 property, called ‘data’. It’s a string. Please convert it to a record, called awsData, that has a data property that is a string.” Simple, ya?
All together, here is our entire Codecs module so far:
module Codecs = {
  type awsData = {
    data: string
  }
  let awsData = Jzon.object1(
    ({ data }) => ( data ),
    (( data )) => Ok({ data }),
    Jzon.field("data", Jzon.string)
  )
  type cloudWatchLogEvent = {
    awslogs: awsData
  }
  let cloudWatchLogEvent = Jzon.object1(
    ({ awslogs }) => ( awslogs ),
    (( awslogs )) => Ok({ awslogs }),
    Jzon.field("awslogs", awsData)
  )
}
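Step 3 mentioned switch statements for turning strings into variants. The cloudWatchLogMessage we decode later carries a messageType string, which makes a decent candidate. Here is a hand-rolled JavaScript sketch of that kind of decode step. The 'Data'/'Control' tags are our own stand-ins for variant constructors. DATA_MESSAGE and CONTROL_MESSAGE are the values CloudWatch Logs subscriptions use, the latter mainly for checking that the destination is reachable.

```javascript
// Hypothetical decoder for a string field with a fixed set of allowed values,
// returning a Result-like object: the JavaScript cousin of switching on a
// string to build a ReScript variant.
const decodeMessageType = value => {
  switch (value) {
    case 'DATA_MESSAGE':
      return { ok: true, result: 'Data' }
    case 'CONTROL_MESSAGE':
      return { ok: true, result: 'Control' }
    default:
      return { ok: false, reason: `unknown messageType: ${value}` }
  }
}

console.log(decodeMessageType('DATA_MESSAGE')) // → { ok: true, result: 'Data' }
console.log(decodeMessageType('WAT').reason)   // → unknown messageType: WAT
```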
Now that we have a Codec to safely parse to and from JSON and JSON strings, we’ll use it in our parseAWSEvent function:
let parseAWSEvent = event =>
  switch Jzon.decodeWith(event, Codecs.cloudWatchLogEvent) {
  | Error(reason) => Error(`parseAWSEvent failed: ${Jzon.DecodingError.toString(reason)}`)
  | Ok(data) => Ok(data)
  }
If the event is a JSON Object, is shaped the correct way, the fields are named correctly, and their types are correct, then our codec will succeed, and we’ll have Records whose properties we can confidently dot into, knowing their types are right.
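To make concrete what that codec is checking, here is a hand-rolled JavaScript sketch of the same validation for the outer event. This is not how Jzon works internally. decodeCloudWatchLogEvent is a hypothetical name. It returns either the value or a reason naming the offending path, similar in spirit to Jzon’s error messages.

```javascript
// Hypothetical hand-written decoder for { awslogs: { data: string } }.
// Returns { ok: true, value } or { ok: false, reason } instead of throwing.
const isObject = v => typeof v === 'object' && v !== null && !Array.isArray(v)

const decodeCloudWatchLogEvent = event => {
  if (!isObject(event)) {
    return { ok: false, reason: 'expected the event to be an object' }
  }
  if (!isObject(event.awslogs)) {
    return { ok: false, reason: 'awslogs: expected an object' }
  }
  if (typeof event.awslogs.data !== 'string') {
    return { ok: false, reason: 'awslogs.data: expected a string' }
  }
  // Copy only the fields we know about, like a record would.
  return { ok: true, value: { awslogs: { data: event.awslogs.data } } }
}

// A misspelled key is caught with the path that failed.
console.log(decodeCloudWatchLogEvent({ awsLogs: { data: 'H4sI...' } }).reason)
// → awslogs: expected an object
```

The misspelled-key example is exactly the kind of mistake a decoder catches early: AWS sends awslogs, all lowercase.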
parseAWSEventBody
Parsing the aws event body is next. It’s a bit tricky because we have to write JavaScript, and JavaScript is dangerous. We’ll again use 3 techniques to safely integrate like we’ve done before:
- Make the JavaScript we write more functional: small, easy-to-use functions with known, typed inputs.
- Have the JavaScript return whether it was successful or not instead of throwing Exceptions.
- Write types in ReScript that represent what JavaScript sends back and use the built-in JavaScript types for safety.
Let’s take a quick trip into buffer.js and see a simplified version of a functional style parseBase64 function.
const parseBase64 = data => {
  try {
    // happy path: decode and return { ok: true, result }
  } catch(error) {
    // unhappy path: return { ok: false, reason }
  }
}
In the past, we used JavaScript Promises because they have built-in try/catch, and they let us compose our JavaScript functions with our ReScript ones since both use Promises. This one, however, isn’t asynchronous, so we’ll just use a normal, imperative try/catch. If we get a result, great, return some kind of Object that “looks like a ReScript Result with the data inside”. If we get an error, return some kind of Object that “looks like a ReScript Result with the error inside”.
The shape we’ll send back to ReScript looks something like this:
{ ok: false, result: Buffer, reason: "error message", error: "json stringified error class" }
Base64 decoding issues typically have complicated reasons why they fail, and sometimes other parts of the Error class can help. Sometimes not. Who knows, it’s JavaScript. For thoroughness, the original code includes the error, JSON-stringified (safely). That’s a bit too thorough for this already insanely overkill tutorial, so we’ll just exclude the error from the returned JSON for now. However, if you look at the original code, you can see how you can write JavaScript to help safely give ReScript insight into what went wrong.
The happy path is pretty straightforward:
const result = Buffer.from(data, 'base64')
return { ok: true, result }
We break it down into 2 statements in case Buffer.from fails; we want to ensure that works first. If so, then we can feel safe returning. The ok indicates to ReScript the function was successful, and it can convert it into an Ok.
The unhappy path, we’ll shorten for this tutorial:
} catch(error) {
return { ok: false, reason: error?.message || 'Unknown parse error in JavaScript.' }
}
We’ll come back to buffer.js later when unzipping; for now, it just exports that parseBase64 function, which takes in some data as a string.
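Put together, a minimal sketch of parseBase64 in buffer.js might look like this. The module.exports line assumes CommonJS, matching the require style used elsewhere in buffer.js. To our knowledge, Node’s Buffer.from(str, 'base64') is lenient and skips characters that aren’t valid base64 rather than throwing. In practice, then, the catch mostly fires for input that isn’t a string at all.

```javascript
// buffer.js (sketch): decode base64 without throwing; return a Result-like object.
const parseBase64 = data => {
  try {
    // Buffer.from tolerates odd characters, but throws a TypeError for
    // input it can't handle at all, such as undefined or a number.
    const result = Buffer.from(data, 'base64')
    return { ok: true, result }
  } catch (error) {
    return { ok: false, reason: error?.message || 'Unknown parse error in JavaScript.' }
  }
}

module.exports = { parseBase64 }

const decoded = parseBase64(Buffer.from('hello').toString('base64'))
console.log(decoded.ok, decoded.result.toString()) // → true hello
console.log(parseBase64(undefined).ok)             // → false
```

Note that the happy path returns a Buffer, not a string, even though the ReScript type calls it a string. That only works because ReScript never inspects the value and hands it straight to unzip. Converting the Buffer to a string first would corrupt the gzip bytes.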
On the ReScript side, embedding it looks like:
@module("./buffer.js") external parseBase64: string => parseBase64Result = "parseBase64"
That’s the function, and the type that defines the custom JavaScript return type is:
type parseBase64Result = {
ok: bool,
result: Js.Nullable.t<string>,
reason: Js.Nullable.t<string>
}
If it works, great, we mmaaaaaayyyy have a string in there. If it fails, we mmmaaayyyyy have a reason why string.
Using it, we wrap in ReScript like before (you know when I whip out the squiggly braces {} in ReScript, it’s about to get imperative in functional land…):
let parseAWSEventBody = event => {
let result = parseBase64(event.awslogs.data)
}
That result will be a Record of the parseBase64Result type. We now need to check if ok is true or false. Let’s handle the happy path of true, and attempt to snag off the data, failing if we can’t:
if(result.ok === true) {
switch Js.Nullable.toOption(result.result) {
| None => Error("Parsing the AWS event body was successful, but the JavaScript returned no result.")
| Some(result_) => Ok(result_)
}
Got data? No? WTF JavaScript. Yes? Cool, everything is ok.
If the function failed, let’s attempt to snag the helpful error:
} else {
switch Js.Nullable.toOption(result.reason) {
| None => Error("Parsing the AWS event body failed, but JavaScript didn't tell us why.")
| Some(reason_) => Error(reason_)
}
unzipData
Unzipping the data is next, and given we’re “dealing with low-level JavaScript data parsing”, let’s head back to buffer.js and add another function to make things more functional and safer.
This one requires an import:
const zlib = require('zlib')
Node’s zlib module has a few helpful functions. The one we want is gunzipSync; it’ll unzip our data synchronously. We use the same pattern as before, a try/catch:
const unzip = data => {
  try {
    const result = zlib.gunzipSync(data).toString()
    return { ok: true, result }
  } catch(error) {
    return { ok: false, reason: error?.message }
  }
}
In ReScript, same pattern as before: define the return type record, and bind the JavaScript function we’re calling from ReScript:
type unzipResult = {
ok: bool,
result: Js.Nullable.t<string>,
reason: Js.Nullable.t<string>
}
@module("./buffer.js") external unzip: string => unzipResult = "unzip"
Rad, and to wrap in ReScript, about the same style as before, nothing new here; checking for ok, and unwrapping the result or error:
let unzipData = data => {
  let result = unzip(data)
  if(result.ok === true) {
    switch Js.Nullable.toOption(result.result) {
    | None => Error("Successfully unzipped the AWS event data, but the JavaScript returned no result.")
    | Some(result_) => Ok(result_)
    }
  } else {
    switch Js.Nullable.toOption(result.reason) {
    | None => Error("Unzipping the AWS event body failed, but JavaScript didn't tell us why.")
    | Some(reason_) => Error(reason_)
    }
  }
}
basicJSONParse
Now comes our 2nd round of JSON parsing. At this point, we’ve snagged out the original error message that triggered this Lambda. However, just because it has the word “error” in it doesn’t mean we can successfully parse it in a soundly typed way. Let’s use Jzon codecs again to define our types, and parsers for the Record form we’d like our JSON in:
We’ll define the cloudWatchLogMessage that we’ll hopefully be able to successfully parse. This is the CloudWatch log message that holds the original print/console.log statements, or exceptions, in JSON form from any Lambda logging to that CloudWatch stream.
type cloudWatchLogMessage = {
messageType: string,
owner: string,
logGroup: string,
logStream: string,
subscriptionFilters: Js.Array.t<string>,
logEvents: Js.Array.t<logEvent>
}
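I’m including the logEvent type as well because she’s the one that has the good stuff: the raw error info you can parse to hopefully figure out what happened. (In the real module, logEvent has to be declared before cloudWatchLogMessage, since ReScript types must be defined before they’re used.)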
type logEvent = {
id: string,
timestamp: float,
message: string
}
The Jzon parsing for it is pretty straightforward, so I’m not including it. The key to the above 2 is that together they form a common CloudWatch log message with a series of log events inside it. The message is your raw console.log; ideally it’s JSON, but in general it’s ok if it’s not. For us, though, a non-JSON message is fatal: we’ll just fail, heh. This is the first step in getting access to our raw errors from other Lambdas. Parsing is as follows:
let basicJSONParse = string =>
  switch Jzon.decodeString(Codecs.cloudWatchLogMessage, string) {
  | Error(reason) => Error(Jzon.DecodingError.toString(reason))
  | Ok(data) => Ok(data)
  }
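Outside of ReScript, the whole decode path up to this point fits in a few lines of Node: base64-decode, gunzip, JSON.parse, and a loose shape check. decodeAwsLogsData is a hypothetical name. Its check covers only two of the cloudWatchLogMessage fields, so it is far weaker than a Jzon codec. It’s handy for poking at a captured payload from a REPL.

```javascript
const zlib = require('zlib')

// Hypothetical helper: base64 -> gunzip -> JSON.parse -> loose shape check.
// Returns { ok: true, result } or { ok: false, reason } instead of throwing.
const decodeAwsLogsData = data => {
  try {
    const json = zlib.gunzipSync(Buffer.from(data, 'base64')).toString('utf8')
    const message = JSON.parse(json)
    // Far looser than a Jzon codec: only checks two of the fields we rely on.
    if (typeof message?.logGroup !== 'string' || !Array.isArray(message?.logEvents)) {
      return { ok: false, reason: 'missing logGroup or logEvents' }
    }
    return { ok: true, result: message }
  } catch (error) {
    // gunzipSync throws on data that isn't gzip; JSON.parse throws on bad JSON.
    return { ok: false, reason: error.message }
  }
}

console.log(decodeAwsLogsData('bm90IGd6aXA=').ok) // base64 of "not gzip" → false
```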
cleanUpLogEventMessages
Super dope, if you made it this far, you have your CloudWatch log messages parsed. However, each logEvent’s message is still stringified JSON, and we have to parse that. Sometimes, whether from bad logging on your part or other formatting reasons, the message will have leading whitespace or other text in front of the JSON that can trip up simple JSON parsers, ours included. So before we attempt to parse, we need to clean up all the logEvent message bodies. Let’s show the first part of doing that, where we map through all the logEvents and just modify each logEvent’s message:
let cleanUpLogEventMessages = cloudWatchLog => {
let cleanedLogEvents = Js.Array.map(
logEvent =>
{...logEvent, message: attemptToMakeMessageValidJSON(logEvent.message)},
cloudWatchLog.logEvents
)
...
}
The key in there is attemptToMakeMessageValidJSON; he’s just a function that strips everything before the first opening brace, such as leading whitespace. You want this step first because, in the beginning of your project, you’ll either be modifying it or finding ways to ensure you never need it. It depends on what logging mechanism you’re using in your programming language, so it’s nice to have this “before I attempt to JSON parse” function.
let attemptToMakeMessageValidJSON = logMessageText =>
  Js.String.substring(
    ~from=Js.String.indexOf("{", logMessageText),
    ~to_=Js.String.length(logMessageText),
    logMessageText
  )
To give you an example of what the above does, it’ll take a string like this:
'   { "foo": "bar" }'
And change it to this:
'{ "foo": "bar" }'
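The same idea in JavaScript, for comparison: drop everything before the first {, then try JSON.parse. cleanThenParse is a hypothetical wrapper. The sample line is illustrative only, loosely modeled on a log line with a timestamp, request ID, and level in front of the JSON. The exact prefix depends on your runtime and logger.

```javascript
// Mirror of attemptToMakeMessageValidJSON: keep everything from the first '{'.
// With no '{', indexOf returns -1 and substring treats -1 as 0,
// so the text comes back unchanged (and JSON.parse will likely reject it).
const attemptToMakeMessageValidJSON = text =>
  text.substring(text.indexOf('{'), text.length)

// Hypothetical convenience wrapper that also parses, without throwing.
const cleanThenParse = text => {
  try {
    return { ok: true, result: JSON.parse(attemptToMakeMessageValidJSON(text)) }
  } catch (error) {
    return { ok: false, reason: error.message }
  }
}

// Illustrative log line: timestamp, request id, level, then JSON.
const line = '2021-01-01T00:00:00.000Z\tabc-123\tERROR\t{"errorType":"Error","errorMessage":"boom"}'
console.log(cleanThenParse(line).result.errorMessage) // → boom
```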
The last part is to just return our updated cloudWatchLog type:
let cleanUpLogEventMessages = cloudWatchLog => {
let cleanedLogEvents = Js.Array.map(
logEvent =>
{...logEvent, message: attemptToMakeMessageValidJSON(logEvent.message)},
cloudWatchLog.logEvents
)
{ ...cloudWatchLog, logEvents: cleanedLogEvents } -> Ok
}
formatMessageForSNS
Last in the line, we’re going to do a few things in this function. Now that we have our CloudWatch logs all formatted and soundly typed, the last 3 things to do are all about “getting at the good stuff”. This is pretty much a business logic function, meaning, “What does your Lambda do when it parses an Error message?”
For us, we send to SNS so we have flexibility in downstream processing. Do we send to xMatters? PagerDuty? Slack? Email? All of them? Who knows, but SNS gives us the flexibility to choose now, AND change our minds later.
Finally, we’ll slice and dice the log messages into a “blob of JSON that has all you need to know to debug a Lambda failure at 3am on a Saturday”.
The Lambda runtime will attempt to format uncaught errors for you before they land in CloudWatch, which is super dope. Our errors are logged out in ECS format as JSON, but they can still nicely exist in the JSON format the runtime has chosen. Basically it’s the type of error, what the actual error message was, and the stack trace, if any.
Now, while we’re in a functional language, ReScript is kind of in the middle; you still can, and sometimes are encouraged to, write imperative style code in functional blocks. Exceptions exist and can be used. So while we’re using a Result monad here, and you’d assume we’d “just log out an error describing what function failed and why”, we’re… still kind of in an imperative world, so we’re going to have things like a stack trace (lol, JavaScript). It’s worthless most of the time, but occasionally you can backtrace from the compiled JavaScript if you’re really confused.
type errorEvent = {
errorType: string,
errorMessage: string,
stack: Js.Array.t<string>
}
Now that we have a type and an equivalent Jzon codec, let’s extract all the messages, and decode the error. We provide a fallback error in case our parsing fails. First, we get all the messages:
let formatMessageForSNS = cloudWatchLog => {
  let errorMessageResults =
    Js.Array.map(
      logMessage =>
        logMessage.message,
      cloudWatchLog.logEvents
    )
Second, we attempt to parse them into the CloudWatch error format. We’ll do it imperative style because we want to log errors with info in case we need to debug early in the project. Sometimes you’ll accidentally receive regular log messages and need to make your Error Lambda’s CloudWatch trigger filter words more strict:
    -> Js.Array.map(
      message => {
        let result = Jzon.decodeString(Codecs.errorEvent, message)
        switch result {
        | Error(reason) => Js.log2("wat:", Jzon.DecodingError.toString(reason))
        | Ok(_) => Js.log2("good to go:", result)
        }
        result
      },
      _
    )
Third, let’s abort if we had any parse errors. If the errorMessageResults array contains any Error, we just abort the whole thing and log out the first one that failed:
if(Js.Array.some(Belt.Result.isError, errorMessageResults) === true) {
switch Js.Array.filter(Belt.Result.isError, errorMessageResults) -> Belt.Array.get(0) {
| None => Js.log("impossible, but ok")
| Some(yup) => Js.log2("failed:", yup)
}
Error("Failed parsing 1 or more of the error messages.")
}
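Pausing on steps two and three for a moment, the same “parse everything, refuse to continue if anything fails” logic can be sketched in JavaScript. parseErrorEvent and parseErrorEvents are hypothetical names. The check only looks for the three errorEvent fields. Like the ReScript code, it logs just the first failure.

```javascript
// Hypothetical: parse one log message into { errorType, errorMessage, stack }.
const parseErrorEvent = message => {
  try {
    const { errorType, errorMessage, stack } = JSON.parse(message)
    const valid =
      typeof errorType === 'string' &&
      typeof errorMessage === 'string' &&
      Array.isArray(stack) &&
      stack.every(frame => typeof frame === 'string')
    return valid
      ? { ok: true, result: { errorType, errorMessage, stack } }
      : { ok: false, reason: `not an error event: ${message}` }
  } catch (error) {
    // Covers invalid JSON and JSON like null that can't be destructured.
    return { ok: false, reason: error.message }
  }
}

// All-or-nothing: any failure aborts the whole batch.
const parseErrorEvents = messages => {
  const results = messages.map(parseErrorEvent)
  const firstFailure = results.find(r => !r.ok)
  if (firstFailure) {
    console.log('failed:', firstFailure.reason)
    return { ok: false, reason: 'Failed parsing 1 or more of the error messages.' }
  }
  return { ok: true, result: results.map(r => r.result) }
}
```

Back in the ReScript version, once that check passes, the rest of formatMessageForSNS can assume every message parsed.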
Otherwise, if we got this far, we know it’s safe to start extracting the data and creating a new record for us to send to SNS. We need to pull the Lambda name out of the log group, which is a path-looking thing like /aws/lambda/<function name>; we just want the name:
let lambdaName = Js.String.split("/", cloudWatchLog.logGroup)
-> Belt.Array.get(3)
-> Belt.Option.getWithDefault("Unknown Lambda name.")
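For comparison, the same extraction in JavaScript. lambdaNameFromLogGroup is a hypothetical name. Splitting a Lambda log group name on / yields an empty first segment, so the function name sits at index 3, matching the Belt.Array.get(3) above.

```javascript
// Lambda log groups are named /aws/lambda/<function name>, so splitting on '/'
// gives ['', 'aws', 'lambda', '<function name>'] and index 3 is the name.
const lambdaNameFromLogGroup = logGroup =>
  logGroup.split('/')[3] ?? 'Unknown Lambda name.'

console.log(lambdaNameFromLogGroup('/aws/lambda/loanleasecalculatorapi-qa-getMinMax'))
// → loanleasecalculatorapi-qa-getMinMax
console.log(lambdaNameFromLogGroup('not-a-lambda-log-group'))
// → Unknown Lambda name.
```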
The SNS record itself has all the info someone would need to know the cause, which Lambda, and where to go to find out more information:
let snsMessageWereSending = {
lambdaName,
logGroup: cloudWatchLog.logGroup,
logStream: cloudWatchLog.logStream,
messages: errorMessages
}
Lastly, we’ll encode that into a JSON string; thankfully encoders never fail, so we’ll just shove the result into an Ok, along with a helpful subject for logging purposes, in the returned tuple. Most SNS messages are built around the idea of an email, so you’ll typically have 2 parts: a subject and a message. You’re more than welcome to combine them; I just separate them initially. Another way to look at it is (tl;dr, big ole wall of text):
(
  `Lambda Error for ${lambdaName}`,
  Jzon.encodeString(Codecs.snsErrorMessage, snsMessageWereSending)
) -> Ok
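In JavaScript terms, that step builds a subject/message pair and wraps it in an Ok. formatForSNS is a hypothetical name. The body’s field names follow the snsMessageWereSending record above, and errorMessages is assumed to already be the list of parsed error events.

```javascript
// Hypothetical: build the [subject, message] pair that gets published to SNS.
const formatForSNS = ({ logGroup, logStream }, lambdaName, errorMessages) => {
  const subject = `Lambda Error for ${lambdaName}`
  const message = JSON.stringify({
    lambdaName,
    logGroup,
    logStream,
    messages: errorMessages,
  })
  // JSON.stringify of plain, JSON-safe data (no cycles, no BigInt) doesn't
  // throw, so the pair goes straight into an Ok.
  return { ok: true, result: [subject, message] }
}
```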
Last part of sendErrorToSNS
The last part is confirming we got an Ok from that big ole parsing routine and if we did, send to SNS; the only real side-effect in this Lambda, and the only place we intentionally cause an Exception to ensure our Lambda fails.
switch result {
| Error(reason) => reject(SendErrorToSNSError(reason))
| Ok((subject, message)) =>
  snsPublish(getArnFromEnvironment(), subject, message)
  ...
Now, while we’re mostly done here, we still want to fail if the SNS fails to send… as the only real job of this Lambda is to send to SNS and if it can’t even do that, we should know about it quickly.
  ->then(
    snsResult =>
      switch snsResult {
      | Error(reason) => reject(SendErrorToSNSError(reason))
      | Ok(snsMessageID) => resolve(snsMessageID)
      }
  )
And with that, we’re done with the main code, just 1 last step:
sendErrorToSNSPartial & handler
For the handler, you’ve seen how we wired up the real dependencies, but there is a better pattern for it. By default, sendErrorToSNS expects you to inject its dependencies. It doesn’t import them and use them as closures inside the function; that would break purity and make things hard to test without strange mocks.
Broadly, there are 2 types of tests we’re interested in on the back-end (we won’t cover functional tests here, heh). For unit tests, we want them fast and deterministic. To do that, we need to give our function stubbed versions of anything that has a side effect so it can just run in memory. This’ll allow it to work the way we expect every time, locally, in CI/CD, offline, wherever. And that’s what we have currently:
let sendErrorToSNS = (snsPublishFunction, event) => {
Integration tests, however, “test if the code actually works in AWS”. It doesn’t matter if it’s self-contained or has 50 other APIs and databases it calls; what you’re interested in is “does my code work?”. To do that, you need real dependencies, not mocks/stubs. ReScript, by default, defines all functions as curried unless you manually mark them as uncurried (ReScript 11 later flipped that default to uncurried). This means we can create partial applications normally. This allows us to have the module inject the real dependency/ies it needs, and export both: 1 for unit testing, and 1 for integration testing and anyone else who uses the module.
let sendErrorToSNSPartial = sendErrorToSNS(publishSNS(publish))
This means your handler can be simplified to use the partial:
let handler = event => sendErrorToSNSPartial(event)
Or if you’re hardcore:
let handler = sendErrorToSNSPartial
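JavaScript doesn’t curry for you, so the same dependency-injection trick needs an explicitly curried function. This is a sketch with hypothetical names. sendErrorToSNS here takes the publish function and returns the event handler. parseEvent is a stand-in for the real parsing pipeline. The publish signature is simplified to (subject, message) rather than the ARN-first version above.

```javascript
// Stand-in for the parsing pipeline (hypothetical): returns
// { ok: true, result: [subject, message] } or { ok: false, reason }.
const parseEvent = event =>
  typeof event?.subject === 'string' && typeof event?.message === 'string'
    ? { ok: true, result: [event.subject, event.message] }
    : { ok: false, reason: 'event did not parse' }

// Curried by hand: pass the publish dependency first, get back a handler.
const sendErrorToSNS = publish => event => {
  const parsed = parseEvent(event)
  if (!parsed.ok) {
    // The one intentional failure point: reject so the Lambda invocation fails.
    return Promise.reject(new Error(parsed.reason))
  }
  const [subject, message] = parsed.result
  return publish(subject, message)
}

// Unit tests inject a stub that always "succeeds"...
const publishStub = (_subject, _message) => Promise.resolve('some message id')
const handlerUnderTest = sendErrorToSNS(publishStub)
// ...while the real module would export something like sendErrorToSNS(realPublish),
// where realPublish (hypothetical) wraps the AWS SDK's SNS publish call.

handlerUnderTest({ subject: 'Lambda Error for fn', message: '{}' })
  .then(id => console.log(id)) // → some message id
```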
Unit Testing ReScript Lambdas
Unit testing our code is done first in a Test Driven Development / Red Green Refactor scenario. This code runs locally, and requires no AWS LocalStack or AWS mock nonsense, just simple dependency injection (read: parameters to functions).
To unit test functions, you:
- give your function an input
- capture the output
- assert the output is what you expect for that input
To do that, we’re using rescript-test, which feels natural to those used to Jest, Mocha, etc. The only main differences are using testAsync vs test for async code, and having to define your own assertions vs. “Jest/Chai has ALL THE THINGS”.
testAsync("send error to SNS happy path", callback => {
The callback is a function, similar to how you can optionally “call a function when your async unit test is done”.
let _ = sendErrorToSNS(snsStub, eventStub)
The snsStub is a stub or a mock; a fake implementation of the AWS SDK + our wrapper code that “takes some JSON, sends it to an SNS topic, and returns a unique message ID”, aka “a function that returns a string”. We define it with 3 holes (i.e. the underscore, _) because we don’t care about, nor use, the parameters; we just want the SNS publish to always succeed for this happy path test:
let snsStub = (_, _, _) => resolve(Ok("some message id"))
The event is called a stub, but you can call it a fixture too; it’s just a realistic JSON Object we’ll get from AWS when our CloudWatch error log stream invokes the Lambda.
let eventStub = Js.Json.parseExn(`{
  "awslogs": {
    "data": "H4sIAAAA01..."
  }
}`)
To create that stub, you take a CloudWatch log JSON event, gzip it, then base64 encode it. Keep in mind a few things while you read the below. First, we need to assert the message IDs match; this is just basic unit testing stuff and ensures all of our parsing code works as we expect. Second, calling the callback with ~planned ensures we don’t have concurrency issues where we’re running this code and it accidentally crashes in some other unit test. It’s a really nice feature. Lastly, we do NOT care about the resolve(true) or whatever the Promise returns; ReScript is functional and requires all functions to return something, so we use resolve(true) to imply good and resolve(false) to imply bad.
->then(
result => {
stringEqual(result, "some message id")
callback(~planned=1, ())
resolve(true)
}
)
Lastly, we want a failure to intentionally fail the test.
->catch(
error => {
Js.log2("send error to SNS happy path failed:", error)
fail(())
callback(~planned=1, ())
resolve(false)
}
)
And that’s your happy path test; here is the whole thing:
testAsync("send error to SNS happy path", callback => {
let snsStub = (_, _, _) => resolve(Ok("some message id"))
let eventStub = Js.Json.parseExn(`{
  "awslogs": {
    "data": "H4sIAAAA01..."
  }
}`)
let _ = sendErrorToSNS(snsStub, eventStub)
->then(
result => {
stringEqual(result, "some message id")
callback(~planned=1, ())
resolve(true)
}
)
->catch(
error => {
Js.log2("send error to SNS happy path failed:", error)
fail(())
callback(~planned=1, ())
resolve(false)
}
)
})
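Building the eventStub fixture by hand (gzip, then base64) is tedious. Here is a small Node sketch that produces the { awslogs: { data } } shape from a plain CloudWatch Logs message. makeEventStub is a hypothetical helper, and every field value in the sample is made up. The shape follows the cloudWatchLogMessage type from earlier. Paste the printed JSON into your ReScript test.

```javascript
const zlib = require('zlib')

// Hypothetical fixture builder: JSON -> gzip -> base64, wrapped like the real event.
const makeEventStub = cloudWatchLogMessage => ({
  awslogs: {
    data: zlib.gzipSync(JSON.stringify(cloudWatchLogMessage)).toString('base64'),
  },
})

// Made-up sample values; the shape follows the cloudWatchLogMessage type.
const sample = {
  messageType: 'DATA_MESSAGE',
  owner: '123456789012',
  logGroup: '/aws/lambda/loanleasecalculatorapi-qa-getMinMax',
  logStream: '2021/01/01/[$LATEST]example',
  subscriptionFilters: ['error-filter'],
  logEvents: [
    {
      id: '1',
      timestamp: 1609459200000,
      message: JSON.stringify({ errorType: 'Error', errorMessage: 'boom', stack: ['Error: boom'] }),
    },
  ],
}

console.log(JSON.stringify(makeEventStub(sample)))
```

Every gzip stream starts with the same magic bytes, which is why the data string begins with H4sI, just like the stub above.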
Integration Testing ReScript Lambdas
Given types don’t really help much in integration testing, I just write ’em in Mocha + JavaScript. This one is a bit beefy because it’s gotta create a few side effects to trigger an error, then snag it out and parse it to verify it’s “the error we intentionally caused”. There are various improvements you could make here, but this should get you moving as a baseline. The original code was in Promise format, but async/await reads better for some people, so I’ll include that version here for brevity’s sake.
describe('errorParsing Lambda', function() {
this.timeout(20 * 1000)
it('should be able to blow up getMinMax, and then read the error from SNS', async () => {
We set it to 20 seconds because I’ve seen some latency between invoking a Lambda that triggers an error, that error being logged to CloudWatch, and our ability to pull it back out of CloudWatch Logs with the AWS SDK.
We’ll need a unique ID to know “this is the error we caused”, so we create a version 4 UUID:
const id = uuidv4()
We also need to know what time frame to look at, since CloudWatch has so many logs and we don’t want to loop through tens of thousands of them. We’ll use this timestamp to only look at logs written AFTER this test starts:
const now = Date.now()
We’re going to intentionally invoke another Lambda deployed on AWS in a QA environment with bad inputs (our ID instead of a real ID). We’re using the v2 style because the v3 AWS SDK is written by insane Object Oriented Programmers who are rewarded for verbosity:
await lambda.invoke({
FunctionName: getLambdaNameFromEnvironment(),
Payload: JSON.stringify({
arguments: {
merchantID: id
},
request: {
headers: {}
}
})
})
.promise()
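This’ll trigger the Lambda to fail and write an error to CloudWatch. Next, we wait for filterLogEvents to find it, retrying up to 40 times. With a 1 second delay between tries, that can take longer than the 20 second Mocha timeout, so the timeout may fail the test first; adjust one or the other to taste…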
const results = await filterLogEvents(id, now, 40, 0)
We then assert we have at least 1 error:
expect(results.length > 0).to.equal(true)
getLambdaNameFromEnvironment
Before we part with final advice, let’s cover those 2 helper functions, as things get deep, quick. The getLambdaNameFromEnvironment function handles getting which Lambda name you need. My setup here is we have 3 environments: qa, stage, and prod. We run integration tests against QA after it deploys to QA and before it deploys to stage. We want stage more stable, while QA should be able to play. After stage deploys, we re-run the same integration tests against stage. To do that, you just need a different Lambda name per environment.
// backticks (a template literal), not single quotes, so ${process.env.NODE_ENV} is interpolated
const getLambdaNameFromEnvironment = () =>
  `loanleasecalculatorapi-${process.env.NODE_ENV}-getMinMax`
filterLogEvents is a complicated, recursive function that queries the CloudWatch logs over and over until it finds a match for what you’re looking for, or gives up after a certain number of tries. It’s long, so let’s break it down.
filterLogEvents
First, the definition needs to take in the ID we’re looking for, what time it is now so we can filter, how many times we’re willing to try if a request doesn’t find our UUID in the logs, and what the current iteration is (this is for recursion).
const filterLogEvents = (id, now, max, current) => {
Next up is to determine if we’ve exhausted how many times we’re willing to try, and if so, fail:
if(current >= max) {
  // backticks so ${current} and ${max} are interpolated into the message
  return Promise.reject(new Error(`Failed after ${current} tries, with max ${max} allowed.`))
}
Our big ole Promise chain attempts to load the CloudWatch logs from the stream and process them:
return logs.filterLogEvents({
  logGroupName: getLogGroupNameFromEnvironment(),
  limit: 3,
  startTime: now
})
.promise()
We then snag out just the log messages:
.then(
  ({ events }) =>
    events.map(
      ({ message }) =>
        message
    )
)
Next we filter on messages that match our ID, hopefully finding an error in there:
.then(
  messages =>
    messages.filter(
      message =>
        message.indexOf(id) > -1
    )
)
Lastly, we need to ensure we found some matches; otherwise, we recursively call ourselves to try again:
.then(
  matches => {
    if(matches.length > 0) {
      return Promise.resolve(matches)
    } else {
      // wait 1 second, then try again with the next iteration count
      return delay(1 * 1000).then( () => filterLogEvents(id, now, max, current + 1) )
    }
  }
)
} // closes filterLogEvents
Python
Python is a mature language that continues to be used everywhere and is continually updated. Its terse syntax, with a lot of built-in functionality, allows you to accomplish a lot with short code snippets. While it’s dynamic, you can use Python 3’s type hints in combination with mypy to get compiler help on types before you run your code. We’ll just focus on the dynamic powers here.
Pros:
- Shortest amount of Lambda code you’ll get with the official AWS Lambda runtimes. Lambda initialization time aside, this follows the non-Lambdalith philosophy of having 1 Lambda function do 1 thing, and it’s easier to get back up to speed when you revisit the code 6 months later. It’s 1 file, compared to the multiple files for ReScript, which are also split across 2 languages: ReScript and JavaScript.
- Python is notoriously hard to install and package compared to Node.js. However, given we’re not using any 3rd party libraries with compiled/native dependencies (PyDash and PyMonad are pure Python, and boto3 ships with the Lambda runtime), you can more or less “just push your Python code”, putting it on equal footing with Node.js, Ruby, or Go Lambdas in terms of ease of deployment.
- All our steps are linear, so we don’t have to deal with the difficult Python async code.
Cons:
- Ease of deployment, yes. Ease of avoiding “works on my machine”? No. Even in 2022, Python has 20 different ways to install for local development, and all are fraught with challenges/caveats. The language and runtime are great. The installation process is miserable.
- No types, so less confidence your code works. Type hints with mypy help, but they’re limited in what they can model, and their Sum/Variant/Union types are early days. Dynamic languages’ insanely fast “write, run, repeat” feedback loops are real, but types can remove a litany of bugs you’ll otherwise encounter; without them, you just have to power through those bugs with the help of unit and integration tests.
- Python does not have optional chaining like JavaScript, so we use PyDash for lens support when safely digging into JSON dictionaries we don’t own. You can use dict.get, but it only goes 1 property deep per call.
Let’s break down the Python code. We’re going to follow the same AWS Lambda contract and style of “pure core, imperative shell” as well as only raising exceptions and doing side effects inside the Lambda handler itself.
import base64
import json
import zlib

import boto3
from pydash import get
from pymonad.either import Left, Right
from pymonad.tools import curry

def handler(event, _):
    return (
        decode_event(event)
        .then( unzip_logs )
        .then( parse_log_messages )
        .then( format_message_for_sns )
        .then( publish(boto3.client('sns'), get_arn_from_environment() ) )
        .either( raise_error, identity )
    )
If you’re an imperative or Object Oriented programmer in Python, this may look not just unpythonic, but super weird. We write functional code in Python much like we’d do it in JavaScript or ReScript: by using a Monadic interface provided by either the language (e.g. Promise in JavaScript or Result in ReScript) or a library (e.g. Folktale for Result in JavaScript). In Python, we’re using PyMonad. While dry-python/returns is super legit and has great integration with mypy, we’re using raw Python here with no type hints. The hope is “something as simple as an Error log parser for a single Lambda is perfect for dynamic Python”, ya? While we may compromise on strong/sound types, we’re not reneging on Functional Programming and its benefits.
Let’s break down each function; it’s mostly similar to the ReScript, but there’s a bit less of it because Python has no types and is a lot less safe… and Python is just a much more terse language anyway, one of its strengths. Additionally, there is no external JavaScript to integrate with; this is 100% Python.
decode_event
There is a lot going on in this function despite its small size (Python in general, heh). Let’s first cover lenses…
Crash Course in Optics
In languages that aren’t soundly typed, you don’t have any guarantees at runtime. This is why many languages, even strictly typed ones like C# and Java, still have null pointer exceptions: A, they have a null type, and B, their types allow null to slip in when something is supposed to be a string, int, or some custom class.
This flies in the face of pure functions. If you want a function to have no side effects, an exception raised because you accessed something missing certainly feels like one: your function doesn’t return a value, your program can crash, and it “affects things” after it runs. It’s debatable whether exceptions are side effects, but suffice to say, functional programming eschews exceptions (don’t get me started on OCaml/ReScript/F#…).
While soundly typed languages can guarantee no null pointer exceptions because some of them don’t have null/undefined at all, Python has no such guarantees, and has None, which is one of the many things that can trigger null-pointer-style exceptions at runtime (e.g. TypeError: 'NoneType' object is not subscriptable).
However, dynamic languages make it a lot easier to access things without risking an exception than strictly/soundly typed ones do. For example, in Python, you can write:
cow = { "name": "foo" }
"name" in cow # True
"age" in cow # False
This is one of the many built-in ways to dynamically check at runtime if a property exists before you start reading/writing it. These are called null checks, and are a royal pain to write, clutter up the code, and kind of defeat the purpose of using a dynamic language. If something breaks, you’re supposed to fix it, and re-run; that’s the whole fast feedback loop speed of dynamic languages.
So what do you do when you want to avoid them entirely while sticking to functional programming principles? You have a ton of options, but the lowest hanging fruit is using lenses. The math version is called “Optics”, and if you’re interested, there are some wonderful docs describing the different types of lenses you can use, not just in Python, but in any language.
Instead of writing the above, we can use a pure lens function from PyDash (the Python version of JavaScript’s Lodash). It has a function called get which allows us to safely access not just properties, but deeply nested ones, including a mix and match of Dictionaries and Lists. Instead of the above style for a deeply nested property:
if "awslogs" in event:
    if "data" in event["awslogs"]:
        data = event["awslogs"]["data"]
Using get in PyDash, you can go:
data = get(event, "awslogs.data")
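If you’d rather see what a deep getter like PyDash’s get does under the hood, here’s a minimal, hypothetical stand-in using only the standard library. safe_get is an illustrative name, not PyDash’s API; it walks dotted keys and [n] list indexes, and returns a default instead of raising when any step along the path is missing.

```python
import re
from typing import Any


def safe_get(obj: Any, path: str, default: Any = None) -> Any:
    """Hypothetical stand-in for pydash.get: walk a path like 'a.b[0].c'."""
    # 'awslogs.data' -> ['awslogs', 'data']; '[3]' -> ['3']
    keys = re.findall(r'[^.\[\]]+', path)
    current = obj
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # missing key, out-of-range index, or None somewhere along the way
            return default
    return current


event = {"awslogs": {"data": "some-base64-data"}}
print(safe_get(event, "awslogs.data"))          # some-base64-data
print(safe_get(event, "awslogs.missing"))       # None
print(safe_get({}, "awslogs.data", "no data"))  # no data
print(safe_get("/aws/lambda/getMinMax".split("/"), "[3]", "Unknown Lambda name."))  # getMinMax
```

Unlike PyDash, this sketch ignores negative indexes, object attributes, and escaped dots; it’s only meant to show why a lens-style getter never throws.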
Note we’re not going full ReScript Option or Elm Maybe here. While PyMonad does offer a Maybe, it requires us to unwrap it and write a lot more code. 99% of the Maybes we’ll encounter are basically “we’re screwed” scenarios, so there isn’t any point in handling the null path. Rather, we’re more interested in processing that data. If you have the time in the Green part of Red Green Refactor, YES YES, you should use Maybes because it’s way more clear at runtime what is actually failing, but for now, this is good enough (that’s how Python tries to snare you, ya see…).
One final note: sometimes using a lens is bad. For example, if you’re unsure of the shape of the JSON you’re getting from an AWS service, you WANT it to blow up (a KeyError in Python) so you can look at the logs, learn the shape, and fix your code. Other times, you may not care if a piece of data is there. For example, which Lambda caused the error? If I fail to get that name, that’s not that important compared to sending the _actual_ error itself along to SNS. We can fix name parsing issues later; for now, do our job as best we can and send that message! You’ll see a mix and match in the Python where object['property'] is used for things we want to explode, and get(object, 'property', 'default value') for things we don’t care about. It’s very tactical, and you can change your mind in certain parts of the code.
Crash Course in Result in Python
The 2nd philosophy is “How do I use Result in Python?”. There are 2 things to solve here: we need to capture errors, and we need to be able to chain them. PyMonad gives us Either’s Right and Left (its equivalents of Ok and Error) and a then method… but we’ll have to convert Exceptions to Lefts ourselves (it doesn’t have a try method like Folktale does). If you’re interested in how to implement it yourself, here’s an example.
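To make the idea concrete without depending on PyMonad, here’s a deliberately tiny, hypothetical stand-in for its Either: Left/Right constructors, a then that binds when the function returns an Either and maps otherwise (mirroring how PyMonad’s then combines map and bind), an either to unwrap, and a try_either helper (an illustrative name) that turns a raised Exception into a Left. It’s a teaching sketch, not PyMonad’s implementation; in the real Lambda you’d import Left and Right from pymonad.either.

```python
from typing import Any, Callable


class Either:
    """Tiny, hypothetical stand-in for PyMonad's Either (teaching sketch only)."""

    def __init__(self, value: Any, is_right: bool) -> None:
        self.value = value
        self._is_right = is_right

    def is_right(self) -> bool:
        return self._is_right

    def is_left(self) -> bool:
        return not self._is_right

    def then(self, fn: Callable[[Any], Any]) -> "Either":
        # A Left short-circuits: every later step in the chain is skipped.
        if self.is_left():
            return self
        result = fn(self.value)
        # bind if fn returned an Either, otherwise map (wrap the plain value in Right)
        return result if isinstance(result, Either) else Right(result)

    def either(self, on_left: Callable[[Any], Any], on_right: Callable[[Any], Any]) -> Any:
        # Unwrap: run on_left for a Left, on_right for a Right.
        return on_right(self.value) if self._is_right else on_left(self.value)


def Right(value: Any) -> Either:
    return Either(value, True)


def Left(value: Any) -> Either:
    return Either(value, False)


def try_either(fn: Callable[..., Any], *args: Any) -> Either:
    """Call fn(*args): Right(result) on success, Left(exception) if it raises."""
    try:
        return Right(fn(*args))
    except Exception as e:
        return Left(e)


ok = try_either(int, "42").then(lambda n: n * 2)
bad = try_either(int, "not a number").then(lambda n: n * 2)
print(ok.either(repr, lambda value: value))  # 84
print(bad.is_left())                         # True
```

The try/except lives in exactly one place, so every step in a chain can be a plain function that either returns a value or raises.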
Ok, got the basics? Let’s snag out the data, base64 decode it, and wrap the whole thing in a Result:
def decode_event(event):
    try:
        result = base64.b64decode(get(event, "awslogs.data"))
        return Right(result)
    except Exception as e:
        return Left(e)
unzip_logs
Next up is unzipping the content. Same as before, we wrap dangerous operations that could raise an Exception in a try/except so we return a proper Either (Result) type.
def unzip_logs(event):
    try:
        # 16 + MAX_WBITS: the CloudWatch payload is gzip, not raw zlib
        result = zlib.decompress(event, 16+zlib.MAX_WBITS)
        return Right(result)
    except Exception as e:
        return Left(e)
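Why 16 + zlib.MAX_WBITS? CloudWatch Logs delivers subscription data gzip-compressed, and adding 16 to the window bits tells zlib to expect a gzip header and trailer instead of a zlib one. This quick standard-library check (the payload is made up) shows it yields the same bytes as gzip.decompress, while a plain zlib.decompress rejects the gzip header.

```python
import gzip
import zlib


def unzip_with_zlib(data: bytes) -> bytes:
    # 16 + MAX_WBITS: expect a gzip header/trailer around the deflate stream
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)


def plain_zlib_rejects(data: bytes) -> bool:
    """True if zlib.decompress with default settings (zlib header) refuses the data."""
    try:
        zlib.decompress(data)
        return False
    except zlib.error:
        return True


compressed = gzip.compress(b'{"logEvents": []}')  # made-up payload
print(unzip_with_zlib(compressed))                                 # b'{"logEvents": []}'
print(unzip_with_zlib(compressed) == gzip.decompress(compressed))  # True
print(plain_zlib_rejects(compressed))                              # True
```

So gzip.decompress would work just as well in unzip_logs; the zlib form is a lower-level spelling of the same thing.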
parse_log_messages
This next function is super imperative. That’s no excuse to get lazy, but while PyDash offers chaining of list operations, AND it’s much more terse than Python’s built-in ones… I just hate Python’s lambda functions. When you start chaining functions together, smaller functions defined inline are awesome. That’s one of the nice things about JavaScript’s fat arrow functions and Elm’s anonymous functions. Instead, I just whip out the imperative code and smash it all together using local state. I justify it because “the function is still pure”. The only thing that can really fail in this function, though, is the parsing of JSON, so at least you won’t have to hunt down where an Exception came from.
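def parse_log_messages(decompressed_data):
    try:
        data = json.loads(decompressed_data)
        fixed = list(map(fix_json_log_message, get(data, 'logEvents')))
        parsed = list(map(parse_json_log_message, fixed))
        # in PyMonad 2, Left/Right are functions, not classes, so ask each Either instead of isinstance
        failures = list(filter(lambda x: x.is_left(), parsed))
        if len(failures) > 0:
            return Left(failures)
        # unwrap the Rights and keep logGroup/logStream for format_message_for_sns
        messages = [p.either(identity, identity) for p in parsed]
        return Right({**data, 'logEvents': messages})
    except Exception as e:
        return Left(e)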
While they’re about the same as in ReScript, let’s take a look at those 2 mapper functions we created. The first, fix_json_log_message, is like the ReScript equivalent: it chops off everything before the first { so we can safely parse the rest as JSON.
def fix_json_log_message(log_message):
    msg = log_message['message']
    # drop any prefix before the JSON object starts
    index_of_brace = msg.index('{')
    return msg[index_of_brace:]
The 2nd, parse_json_log_message, is similar to the ReScript one; it attempts to parse the JSON in the log message.
def parse_json_log_message(log_message):
    try:
        result = json.loads(log_message)
        return Right(result)
    except Exception as e:
        return Left(e)
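Here are the two mappers run on a single log event, sketched with only the standard library (plain json.loads in place of the Right/Left wrapping). The log line is made up, in the tab-separated shape Node.js Lambdas typically write (timestamp, request ID, level, then the JSON error); your runtime’s prefix may differ, which is exactly why we slice from the first {.

```python
import json


def fix_json_log_message(log_message):
    msg = log_message['message']
    # slice from the first '{'; raises ValueError if there's no JSON at all
    return msg[msg.index('{'):]


def parse_json_log_message(log_message):
    # stdlib-only variant: returns the dict or raises, instead of Right/Left
    return json.loads(log_message)


# made-up CloudWatch log event
log_event = {
    'message': ('2022-03-01T12:00:00.000Z\t6f1c9a2e-0000-example\tERROR\t'
                '{"errorType":"Error","errorMessage":"merchant not found"}')
}

fixed = fix_json_log_message(log_event)
print(fixed)  # {"errorType":"Error","errorMessage":"merchant not found"}
print(parse_json_log_message(fixed)['errorMessage'])  # merchant not found
```

In the real parse_log_messages, that ValueError from index lands in the surrounding try/except and becomes a Left.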
format_message_for_sns
While we’re modifying data, it’s way less verbose in Python, and requires no logging for errors. If anything screws up, we’ll just check the error log + stack trace. This is a key difference between working with Python vs a soundly typed language like ReScript, or even a strictly typed one like TypeScript.
The justification from the sound/strict typed Functional crowd is “spend your time getting the types right, use the compiler to help you get as close to correctness as possible, then use unit tests and manually running the code to get the rest”.
Python, on the other hand, follows the dynamic creed, much like JavaScript/Ruby/Lua: run the code over and over and over and over. Because there is no compilation step, running the code is fast; you can quickly run it, get feedback on whether it worked, and keep tweaking until it does. For small things, or easy problems, this can be viewed as superior because you get results more quickly.
This means our formatting of data is about the same as ReScript; the only difference is we aren’t parsing to records, and are accessing dictionary properties with wild abandon:
First, we have to extract the Lambda name that’s trapped in the logGroup name:
lambda_name = get(cloud_watch_log['logGroup'].split('/'), '[3]', 'Unknown Lambda name.')
Second, build the message dictionary:
sns_message_were_sending = {
    'lambdaName': lambda_name,
    'logGroup': cloud_watch_log['logGroup'],
    'logStream': cloud_watch_log['logStream'],
    'messages': cloud_watch_log['logEvents']
}
Third, we return a tuple containing both the subject and the message:
return Right((f'Lambda Error for {lambda_name}', sns_message_were_sending))
We wrap the whole thing in a try/except in case the split, or any of the strict ['key'] lookups, fails:
except Exception as e:
    return Left(e)
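Put together, the three steps form one function. This is a standard-library-only sketch for experimenting: it returns the plain (subject, message) tuple instead of Right/Left, and swaps PyDash’s get for a small hypothetical helper, lambda_name_from_log_group, that falls back to the same default name. Strict ['key'] lookups still raise KeyError, matching the explode-vs-default split described earlier. The sample log group and stream are made up.

```python
def lambda_name_from_log_group(log_group, default='Unknown Lambda name.'):
    # '/aws/lambda/<name>'.split('/') -> ['', 'aws', 'lambda', '<name>']
    parts = log_group.split('/')
    return parts[3] if len(parts) > 3 else default


def format_message_for_sns(cloud_watch_log):
    # strict lookups: a missing key raises KeyError, which we want to see in the logs
    lambda_name = lambda_name_from_log_group(cloud_watch_log['logGroup'])
    sns_message_were_sending = {
        'lambdaName': lambda_name,
        'logGroup': cloud_watch_log['logGroup'],
        'logStream': cloud_watch_log['logStream'],
        'messages': cloud_watch_log['logEvents'],
    }
    return (f'Lambda Error for {lambda_name}', sns_message_were_sending)


subject, message = format_message_for_sns({
    'logGroup': '/aws/lambda/loanleasecalculatorapi-qa-getMinMax',
    'logStream': '2022/03/01/[$LATEST]abc123',
    'logEvents': [{'errorMessage': 'merchant not found'}],
})
print(subject)  # Lambda Error for loanleasecalculatorapi-qa-getMinMax
```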
publish
The publish function should look familiar; it’s just a wrapper around the AWS SDK (called Boto3 in Python) that handles errors and makes it easier for us to unit test our code.
… but let’s talk about that. Doing dependency injection in Python, aka passing arguments to functions, is pretty straightforward, and Python gives you lots of options when it comes to function parameter types, amount, position, etc. using things like *args and **kwargs. However, Python does not have an easy, built-in way to compose functions together. Its OOP features have been improved throughout the years, but the basics of FP have not.
PyMonad has a few decorators you can use to help make it simpler, @curry being one in particular. Since PyMonad added the then method, a combination of map and bind, we now have a way to compose functions together into larger functions. You no longer need to resort to imperative style with local state (although you still can if you want; sometimes it helps to think in minute steps and prototype like that). Leveraging curry makes it even easier for functions where you’re doing data last and want to inject some dependencies later on. Instead of calling curry as a function, we use it as a decorator below, passing the function’s arity (how many parameters it takes) as the first and only argument to curry:
@curry(3)
def publish(sns, arn, subject_and_message):
This pattern of converting try/except into Right/Left (Ok/Error) works well with the AWS SDK in Python because boto3 returns values for successful calls and raises exceptions when things break, both intentionally by the library and unexpectedly for network/permission issues.
One special note, we’re getting a tuple as that 3rd parameter. We’ll destructure it to get our subject and message as 2 separate variables.
try:
    subject, message = subject_and_message
    response = sns.publish(
        TopicArn=arn,
        Message=json.dumps(message),
        Subject=subject
    )
    # boto3's SNS publish response returns the ID under 'MessageId'
    return Right(
        get(response, 'MessageId', 'unknown message id')
    )
except Exception as e:
    return Left(e)
.either(raise_error, identity)
We’ve reached the end of our chain. One special note is in ReScript, we did all the pure stuff in a chain, then separated out the sns publishing in a separate switch statement, mainly for logging purposes, AND to show where the side effect is in the code. In Python, it’s way less verbose, and since this is just for teaching, I figured I’d show you how to do it all in 1 chain vs. inspecting the Either result in the handler method.
When you get a big chain of functions where a Right or a Left can come out, you can convert it back to imperative coding land using an either function, also known as unwrapping the monad. PyMonad provides a nice method called either which takes 2 parameters: the 1st is a function to run when a Left comes out (typically an error), and the 2nd is a function to run when a Right comes out (typically a good result with your value). Like ReScript’s Result, think Either Left == Result.Error(error reason), and Either Right == Result.Ok(value).
Our raise_error is a function that raises an Exception so AWS Lambda knows we failed and can signal that to the trigger, Alarms, etc.
def raise_error(reason):
    raise Exception(reason)
Our identity function is the most boring, weird, edge function in all of functional programming; a function that just returns the value it was given:
def identity(arg):
    return arg
This strange setup means if any of your functions fail, the handler raises an Exception with the reason they failed. If all of them succeed, the Lambda will return the value that comes out (in our case an SNS message ID). This is how you go from Functional Programming world to Imperative World, and it follows the AWS Lambda contract of “return a value if you work, crash to signal you didn’t work”.
Exposing the Partial Application
The last step is to make it easier to unit test and integration test. We’ll wrap our handler like we did in ReScript (notice we’re replacing the concrete implementation boto3.client('sns') with whatever dependency is passed in as a parameter):
@curry(2)
def send_error_to_sns(sns, event):
    return (
        decode_event(event)
        .then(unzip_logs)
        .then(parse_log_messages)
        .then(format_message_for_sns)
        .then(publish(sns, get_arn_from_environment()))
        .either(raise_error, identity)
    )
Then define a module level variable:
send_error_to_sns_partial = send_error_to_sns(boto3.client('sns'))
Finally, we have a better looking handler:
def handler(event, _):
    return send_error_to_sns_partial(event)
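If @curry feels like magic, here’s a minimal, hypothetical stand-in written with only the standard library (curry_n and send_error are illustrative names, not PyMonad’s implementation or the real handler): it keeps collecting arguments until the declared arity is reached, then calls the function. That’s what lets send_error_to_sns(boto3.client('sns')) hand back a function still waiting for its event.

```python
from functools import wraps


def curry_n(arity):
    """Hypothetical stand-in for PyMonad's @curry(n)."""
    def decorator(fn):
        @wraps(fn)
        def curried(*args):
            if len(args) >= arity:
                return fn(*args)
            # not enough arguments yet: return a function waiting for the rest
            return lambda *more: curried(*(args + more))
        return curried
    return decorator


@curry_n(2)
def send_error(sns, event):
    # stand-in body: just report which dependency handled which event
    return f'{sns} handled {event}'


send_error_partial = send_error('fake-sns')  # dependency injected, waiting for event
print(send_error_partial('event-1'))         # fake-sns handled event-1
print(send_error('other-sns', 'event-2'))    # other-sns handled event-2
```

For an undecorated function, functools.partial(fn, dependency) gives you the same one-step injection without a decorator.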
Unit Testing Python Lambdas
Assuming you’re using pytest, the unit test is:
# send_error_to_sns_test.py
from error_parsing import send_error_to_sns

def test_handler():
    assert send_error_to_sns(SNSStub(), event_stub) == 'some message id'
Again, this is where Python gets you… it’s all “look how simple and short I am, yet so powerful”. * sigh *
Let’s show you the SNSStub, as you have to switch your brain to Object Oriented Programming and build a stub that’s a fake class vs. what we usually do, which is write fake functions:
class SNSStub():
    def publish(self, **kwargs):
        # same shape as boto3's SNS publish response
        return { 'MessageId': 'some message id' }
The above class, once instantiated, will expose a publish method. No matter what you call it with, it’ll always return a dictionary whose MessageId == 'some message id'. This ensures our test always works and follows mostly the same interface as the AWS SDK, boto3.
The event stub is the same one from the ReScript test: a gzipped and Base64 encoded JSON CloudWatch event message. I’m only including part of it here for brevity’s sake:
event_stub = {
    "awslogs": {
        "data": "H4sIAAA01..."
    }
}
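If you need to make your own event_stub rather than copy one out of real logs, you can build it by reversing what decode_event and unzip_logs undo: JSON-encode a payload, gzip it, then Base64 it. The field names follow the CloudWatch Logs subscription payload format; every value is made up, and make_cloudwatch_event is a hypothetical helper name.

```python
import base64
import gzip
import json


def make_cloudwatch_event(payload: dict) -> dict:
    """Hypothetical helper: JSON-encode, gzip, then Base64 a payload into the awslogs envelope."""
    compressed = gzip.compress(json.dumps(payload).encode('utf-8'))
    return {'awslogs': {'data': base64.b64encode(compressed).decode('ascii')}}


# Field names follow the CloudWatch Logs subscription format; all values are made up.
fake_payload = {
    'messageType': 'DATA_MESSAGE',
    'owner': '123456789012',
    'logGroup': '/aws/lambda/loanleasecalculatorapi-qa-getMinMax',
    'logStream': '2022/03/01/[$LATEST]abc123',
    'subscriptionFilters': ['lambda-errors'],
    'logEvents': [
        {
            'id': '1',
            'timestamp': 1646136000000,
            'message': '2022-03-01T12:00:00.000Z\treq-1\tERROR\t{"errorMessage":"boom"}',
        },
    ],
}

event_stub = make_cloudwatch_event(fake_payload)
print(event_stub['awslogs']['data'][:4])  # H4sI (Base64 of the gzip magic bytes)

# Round trip: the same steps decode_event and unzip_logs perform
decoded = gzip.decompress(base64.b64decode(event_stub['awslogs']['data']))
print(json.loads(decoded) == fake_payload)  # True
```

That H4sI prefix is why real stubs, like the one above, start the way they do.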
Integration testing is about the same as the ReScript version I wrote in JavaScript, just without the async nonsense “because Python is blocking by default, which makes programming easier, and only hardcore mofo’s use async/await or ThreadPool”.
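For a rough idea of the Python side, here’s a blocking sketch of the JavaScript filterLogEvents poller from earlier. It’s a hypothetical helper, written as a loop rather than recursion since Python doesn’t optimize tail calls. The logs argument is anything with a filter_log_events method, so you can pass boto3.client('logs') in a real test (its filter_log_events accepts logGroupName, startTime in epoch milliseconds, and limit) or the LogsStub below when trying it out locally.

```python
import time


def filter_log_events(logs, log_group, id_, now_ms, max_tries, delay_s=1.0):
    """Poll CloudWatch Logs until a message containing id_ appears (hypothetical helper)."""
    for _ in range(max_tries):
        response = logs.filter_log_events(
            logGroupName=log_group, startTime=now_ms, limit=3
        )
        matches = [event['message'] for event in response.get('events', [])
                   if id_ in event['message']]
        if matches:
            return matches
        time.sleep(delay_s)  # give CloudWatch time to ingest the log
    raise RuntimeError(f'Failed after {max_tries} tries.')


class LogsStub:
    """Fake logs client: returns canned responses in order, then empty ones."""

    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = 0

    def filter_log_events(self, **kwargs):
        self.calls += 1
        return self.responses.pop(0) if self.responses else {'events': []}


stub = LogsStub([
    {'events': []},  # first poll: log not ingested yet
    {'events': [{'message': 'ERROR merchant abc-123 not found'}]},
])
print(filter_log_events(stub, '/aws/lambda/getMinMax', 'abc-123', 0, 5, delay_s=0))
# ['ERROR merchant abc-123 not found']
```

In a real integration test you’d create the ID with uuid.uuid4(), take now_ms = int(time.time() * 1000), trigger the failure with boto3.client('lambda').invoke(FunctionName=..., Payload=json.dumps(...)), and then poll with boto3.client('logs').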
Conclusions
As you can see, Error handling has concepts that can work well for soundly typed functional languages, or dynamic data parsing languages like Python. Both can work and leverage their strengths, but it’s SUPER hard not to be seduced by the brevity and simplicity of Python. You can hear it saying “Do you really care if your AWS CloudWatch log fails to deterministically parse to a soundly typed Record? Like, when would AWS actually screw that up ever? Additionally, your errors in your other Lambdas are guaranteed to get formatted into AWS format; do you really need to guarantee those are parsed too?”
You start asking a lot of these types of questions, wondering if all that super hardcore typing and correctness really leads to code you can be confident in. Remember, this is a cornerstone of your monitoring strategy for your serverless API. You want to know, immediately and before your customers (and your boss), when things break, what broke, and why, so you can quickly ascertain whether you need to panic or not, while possibly being distracted or sleep deprived. Would you really want to leave something that important to a language that gives no guarantees on the shape of your data at runtime, and can’t easily tell you exactly where something failed to parse without verbose handling of dictionaries or copious PyDash get wrappers around data access? ReScript, Haskell, F#, Rust… they’re king here.
That cognitive dissonance is why I wrote this article. I love how both languages can shine at what they do best while doing the same task, and it’s fascinating to compare them and how you’d approach the same thing in each. For reference, all code is up on my Github.