Consulting Chronicles #3: Preventing Fire Drills & Crises by Removing Land-mines and Using Diagnostic Tools

Preface

When you’re brought into an existing project in a consulting role, there will often be the perception that the project is “mostly done”, or “90% there”.  Opening up the hood, you sigh.  You marvel at the wonders of modern programming technology, how it has empowered even the shoddiest, hastily thrown together, duct taped code to work and work well, fooling many into a sense of functional complacency.  You also wonder when, not if, it’ll explode in someone’s face.

A lot of short lived software written for conferences and trade shows can get away with this.  It just needs to work long enough to work once or twice.  After that, who cares.  In longer term products & services for companies, most software follows the traditional rule of living 3 times longer than its intended life span.  In consulting, you’re brought on to finish it, fix it, and prep her for long term maintenance, hell, maybe even add additional features to it.

Sometimes you and your team can inadvertently become victims of your own success.  Once you’ve wrangled the problem areas of the code into a stable state and earned trust with your client, things start to settle down.  At this point in the project, the code really does work as advertised, or you’ve merely stopped all the noticeable and constant explosions.  The land-mines, or unexpected breakages, haven’t occurred in a while to remind people, in a sad way, that you still have a lot more work to do.

Introduction

This post assumes you’re a consultant brought in to fix and help release an existing code base.  The client needs to release, and you didn’t write the original code base, yet now you are responsible for its architecture & direction.  The perception is, either because of your efforts or merely because there haven’t been any explosions lately, that the code works just fine and just needs maintenance to get them to launch.  Then, something horrible happens.  The code doesn’t work, and NO ONE knows why.  Suddenly mass insecurity sets in since the reality people thought they knew doesn’t exist anymore.

As an architect who didn’t architect the code, you may not be able to immediately remove that insecurity, but you CAN use it to your advantage.  This post will show you how by defining what fire drills, crises, and land-mines in code are.  It’ll also show you how to prevent them by using diagnostic tools, some of which you’ll have to write.  A lot of these can proactively catch problems before anyone else notices them.  Doing this will better equip you to prevent the code from suddenly not working, identify with confidence WHY it’s not working, and help build the team’s trust in your word.

What are Fire Drills and Crises?

A Fire Drill is a slang phrase used in corporate culture.  It’s used to refer to situations where a manager/leader perceives something as important when it really isn’t, and demands an individual or team work on it.  This work that comes up suddenly is chaotic, and in the end often accomplishes nothing.  In terms of software, this often happens when a feature suddenly needs to be implemented in an extremely short time frame.  This bypasses the traditional processes used by the team (Waterfall/Scrum), causing much commotion, and often damaging the code base short term and long term.  The more common ones, though, are when unannounced demos are given by high level executives, a key feature doesn’t work, and suddenly something the team had been spending 6 weeks on now must be done in 1 day.  Complete bullshit, I know, but it happens all the time.

Crises are very similar, but valid.  In a fire drill, the executives may not know that a newer production build fixes the key feature they wanted to demo, and a miscommunication just happened.  In a crisis, everyone, including the developers, is under the delusion the feature actually works, and in a key demo, it does not.  THAT’s a crisis.  Fire drills, while perceived as important, aren’t.  Crises ARE important.  Your code doesn’t work… or worse, your code doesn’t work, it’s live on production, and customers are flooding your call center.

Both can cause harm to the code, team morale, and trust in you as a consultant.

Preventing Fire Drills & Crises

A lot of fire drills can be prevented by good communication.  As a developer, your primary job is to kick some ass writing code, not give directors and executives transparency into the project; that’s the Project Manager’s job.  As a consultant/architect, however, there is a LOT you can do to help empower the managers, PM’s, and testers with insight into the application, and good information on what’s happening.

If a PM knows what’s wrong with the application, they can confidently assuage the fears of those wondering why things are broken.  The more information they have, the easier it is to articulate the problem.  Sometimes it’s a recurring problem.  The confidence, and perhaps slight indifference, in their voice when communicating to those above, arising from this commonality, will go far in ensuring people in charge don’t freak out.

Sometimes, a PM or tester will know before YOU do.  This proactive action on both known and new, unknown problems not only lets you deal with problems before higher ups ever see them, but also lets you leverage the entire team in debugging your application.  It’s one thing to have a developer and a PM duplicate a problem; it’s another when you have 5 people all getting the same results, with logs to confidently prove it.

During a crisis, a lot of fear is because of the unknown surrounding the situation.  Why is this happening?  Who’s responsible?  Is it my fault, the back-end, or our 3rd party data provider?  The worst thing anyone can do in a crisis is panic.  You need to be calm, collected, and strategize how to diagnose the problem to inform those in charge, and then allow yourself time to actually attempt to fix the problem.

That’s easier said than done when the suits have a gun to you and your PM’s heads… and perhaps you even LIKE your PM.  Maybe you feel like the performance of the application is directly tied to team members’ perception of your ability, and helping them determine whether they like you or not?  Perhaps you’re right? ZOMG!!!

Your application needs to talk.  Your application needs to report what is going on with the various aspects of itself.  It needs to tell the truth, or the truth of what it thinks it knows.  You need to have external tools at your disposal to corroborate the application’s built in reporting and diagnostic tools.  These can be off the shelf, open source, or ones you’ve built yourself specifically tailored to the application at hand.  These need to be quickly & easily accessible, and require little to no maintenance.  They need to be relatively easy to use and understandable by not just developers.  Reports generated need to be easily portable text.

These reports and tools allow proactive action against problems, help empower management with good information, and help prevent fire drills and reduce the severity of crisis situations, often preventing them.

Reporting & Logging

How do you get this information?  Your application needs to talk to not just you, but anyone who asks.  It needs to have a semblance of an agreed upon vocabulary.  It needs to generate fine grained reports about volatile areas.

How do you get it to do that?

Logging.  Logging is another way to say “trace” in Flash or Flex.  It’s a lot more than just tracing out simple messages, though.  You need a formalized way to send them as well as allow them to be readable when you have thousands of lines of messages.  Here are some criteria of a helpful logging strategy:

  1. It shouldn’t require any fragile or complicated configuration to get it working.
  2. It should fall back to Flash Player’s trace command so it works with the traditional Flash & mxmlc debug players.
  3. It should have a special GUI created around the log messages to display, filter, and allow extraction.
  4. The GUI should be in the application itself.  This allows anyone to access it easily.  For widgets and small screens, if you’re showing a GUI, you clearly have enough room to show log messages.
  5. The GUI should be able to be opened & closed easily from within the application.  Right click is usually the most unobtrusive, and easily removed for production code.  Closing should not remove the log messages from memory, nor affect the logs in any adverse way.
  6. The log messages should be able to be easily extracted, copied, and pasted.
  7. The log messages should have built in, cross platform formatting to make them readable when pasted into email, text messages, and text files.
  8. You should be able to scroll through older messages without being interrupted by the logger.
  9. The log window should not adversely affect the application (in Flex 1, overriding Object.toString() is what some debuggers did, and this broke Flex’s String formatters and validators… but ONLY when the log window was running).
  10. Bonus points: per developer filtering, easy to turn on/off, color coded messages, and works in multiple compilers/IDE’s.

Wow… more than just a simple trace in the Output window, ya?  Why all the guidelines?  Let’s break it down.

Stay on Target…

During a crisis, stress levels are high.  You want to make it as simple as possible to quickly get logs from your application.  This shouldn’t have to make you think or concentrate hard on getting it to work; it should just work.  In the case it DOES screw up, or another developer doesn’t have/refuses to have your custom setup, as long as all of your special trace commands still output a trace command, you can utilize the existing debuggers in Flex/Flash.

When Things Break, Look Here for Answers

The special GUI is important.  This clearly delineates where log messages go, how you interact with them, and is the gateway for non-developers into the app when something goes wrong.  They will turn to this window when something breaks or doesn’t work correctly.  This is the information they will be combing through for insight into why something broke.  Sometimes this information is so helpful, it’s self-correcting, and you’ll never hear about their problems.  Examples include clearly stating you don’t have a session, and thus the user needs to log back in.  If you haven’t captured all session errors in your application’s Service layer yet, the PM/tester can re-login and try again.  This, as opposed to the 3 email conversation, or the 1 minute phone call, all to say “just re-login dude…. your session is probably expired”.  It doesn’t interrupt your focus, and ensures that when problems do arise, those working with you know where to look.

Filter Out the Noise

As your application grows, both in size of code and number of developers, so too will the frequency of messages.  Sometimes all of those messages are an important indication & insight into the health of your application.  Thus, you need an easy way to filter out the good ones from the bad ones.  This is where a simple trace won’t do.  You cannot easily differentiate one trace from another since it’s just text.  No, using RegEx and clever time stamps doesn’t count.  You need to create messages: a Value Object class that represents the message, the time it occurred, who sent it, where it came from, what it’s saying, and its type… at a minimum.  You can go overboard here if you want, but just be aware of the old adage, more code == more problems.  You want to make sure you’re only adding what you need, you don’t make the logging API a pain to use for the developers, and you don’t have to debug your logger for more than a day to ensure it’s not lying to you, or broken.
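
As a rough sketch of such a Value Object, assuming nothing beyond the fields listed above: the LogMessage class name, its field names, and the type constants are hypothetical, not taken from any existing logging library.

```actionscript
package
{
    /**
     * Hypothetical value object for a single log entry.
     * All names here are illustrative only.
     */
    public class LogMessage
    {
        public static const LOG:String   = "log";
        public static const DEBUG:String = "debug";
        public static const INFO:String  = "info";
        public static const WARN:String  = "warn";
        public static const ERROR:String = "error";
        public static const FATAL:String = "fatal";

        public var type:String;   // one of the constants above
        public var text:String;   // what it's saying
        public var origin:String; // where it came from, e.g. "ServiceLoader::onComplete"
        public var author:String; // who sent it, for per developer filtering
        public var time:Date;     // when it occurred

        public function LogMessage(type:String, text:String,
                                   origin:String = "", author:String = "")
        {
            this.type   = type;
            this.text   = text;
            this.origin = origin;
            this.author = author;
            this.time   = new Date();
        }

        /** Plain text form built only from tabs, so it survives copy/paste. */
        public function toString():String
        {
            return time.toTimeString() + "\t[" + type.toUpperCase() + "]\t"
                 + (origin ? origin + ", " : "") + text;
        }
    }
}
```

Because each field is a real property rather than part of a string, a log window can filter on type or author without any parsing.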

Filters can include differentiating between logs, warnings, and explosions… also known as errors.  This allows anyone to see the entire state of the app, or just the problems, known and unknown.  As multiple developers get on the project, you’ll sometimes get log messages you didn’t know about; this can sometimes make it easier or harder for you to debug your own issues.  These should be filterable as well, whether at compile time or runtime.  This, too, should be easy to configure.

Portability & Ease of Use

Many debug windows in the Flash community are external.  There are a few advantages to this, namely it makes it easy to debug applications running in a browser.  Your application isn’t affected, nor necessarily tied to the browser’s state.  Other times, it’s completely separate from your code base, making integration cleaner.  Finally, it’s easy to debug multiple applications using the same debug window.

I’ve found, though, in consulting & contracting over the years, these never work with PM’s and testers as well as custom, home grown, in-application ones do.  Installing things like Thunderbolt or DeMonsterDebugger and confirming they work are challenging endeavors unto themselves, requiring 3 technologies.  This vs. “just run the app”.  The last thing you need is for Firefox to wig out on you during a production push; it does, and suddenly you have zero insight into why your application isn’t working on your staging server… and people are freaking out because you’ll miss your production push.

The other nice thing is no matter where your application runs, so too will your logger run.  This includes on your local box, your local web server, your QA server, and even production if you’re so inclined… no configuration needed.  Nothing like adding confidence to your code base.

This also means that if something goes wrong for ANYONE using the application, from testers to PM’s to suits and bobs in a board meeting, you can quickly diagnose what it is without needing the Flash Debug Player installed.

Copy Pasta

Adding just a simple “Add to Clipboard” button goes a long way to making it easier for others to get errors to you when problems occur.  If you’ve ever tried to copy and paste in Flash with its awesome focus-fun, you know what I’m talking about.
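
A minimal sketch of such a button, assuming the log lines are already available as an Array of plain text Strings.  The CopyLogsButton class is hypothetical; System.setClipboard is the real Flash Player call, and from Flash Player 10 on it has to be invoked from a user initiated event such as this click.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.events.MouseEvent;
    import flash.system.System;
    import flash.text.TextField;
    import flash.text.TextFieldAutoSize;

    /** Hypothetical "Add to Clipboard" button for a log window. */
    public class CopyLogsButton extends Sprite
    {
        private var logLines:Array; // assumed: one plain text String per log message

        public function CopyLogsButton(logLines:Array)
        {
            this.logLines = logLines;

            var label:TextField = new TextField();
            label.autoSize = TextFieldAutoSize.LEFT;
            label.selectable = false;
            label.text = "Add to Clipboard";
            addChild(label);

            buttonMode = true;
            mouseChildren = false; // clicks on the label count as clicks on the button
            addEventListener(MouseEvent.CLICK, onClick);
        }

        private function onClick(event:MouseEvent):void
        {
            // Newlines keep the messages readable when pasted into an email.
            System.setClipboard(logLines.join("\n"));
        }
    }
}
```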

Moveable & Capable of Being Closed or Hidden

The log window needs to be moveable and capable of being hidden.  Some applications have buttons in certain places and you don’t want the log window to get in the way.  A lot of times it’s just easier to close/hide it.  Whether you visible false it, or just removeChild, whatever works.  This should NOT affect the log messages.  You should still be able to get log messages.  This means that the logger class usually has no GUI, and the log window has knowledge of how to display those messages.  That way, you can kill the GUI, and the messages are still being retained.  More importantly, before Flash/Flex has even started, in the case of static initializer methods (which even the Flash IDE can’t debug), you can still get log information about them.
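
Here is one way that separation could look, as a sketch only.  The Debug name and its log/warn/error style methods mirror the calls used in this post, but the internals below (the static message Array, the event name, the listener methods) are assumptions for illustration.

```actionscript
package
{
    import flash.events.Event;
    import flash.events.EventDispatcher;

    /** Sketch of a logger with no GUI; a log window is just one possible listener. */
    public class Debug
    {
        public static const MESSAGE_ADDED:String = "messageAdded";

        // Static, so messages are retained whether or not a log window exists.
        private static var _messages:Array = [];
        private static var _dispatcher:EventDispatcher = new EventDispatcher();

        public static function log(text:String):void   { add("log", text); }
        public static function debug(text:String):void { add("debug", text); }
        public static function info(text:String):void  { add("info", text); }
        public static function warn(text:String):void  { add("warn", text); }
        public static function error(text:String):void { add("error", text); }
        public static function fatal(text:String):void { add("fatal", text); }

        /** Returns a copy, so a log window cannot alter the stored history. */
        public static function get messages():Array
        {
            return _messages.concat();
        }

        /** A log window subscribes when it opens... */
        public static function addListener(listener:Function):void
        {
            _dispatcher.addEventListener(MESSAGE_ADDED, listener);
        }

        /** ...and unsubscribes when it closes; the messages stay put. */
        public static function removeListener(listener:Function):void
        {
            _dispatcher.removeEventListener(MESSAGE_ADDED, listener);
        }

        private static function add(type:String, text:String):void
        {
            _messages.push({type: type, text: text, time: new Date()});
            trace("[" + type.toUpperCase() + "] " + text); // fall back to plain trace
            _dispatcher.dispatchEvent(new Event(MESSAGE_ADDED));
        }
    }
}
```

When the window opens, it reads Debug.messages to catch up on everything logged before it existed, then listens for MESSAGE_ADDED to stay current.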

This should be easy to do.  Usually making it “look” like a window is enough for most people to try to drag it.  Setting useHandCursor and buttonMode to true on the top part will show the pointy hand.  Although the wrong gesture, it’s better than the regular cursor b/c it gives a hint the user can “do something with this part”.

In Flex, it’s easiest to just use PopUpManager.  This puts your logger on top of everything, and if you kill it, you don’t affect other Views.  Bonus points if she remembers where you dragged it when it opens back up (hint: local SharedObject).
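
For plain Flash (no PopUpManager), a draggable window that remembers its position might be sketched like this.  The DraggableLogWindow class, its sizes and colors, and the “logWindowSettings” SharedObject name are all made up for the example.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.events.MouseEvent;
    import flash.net.SharedObject;

    /** Hypothetical log window shell: drag by the title bar, position is remembered. */
    public class DraggableLogWindow extends Sprite
    {
        private var titleBar:Sprite = new Sprite();
        private var settings:SharedObject;

        public function DraggableLogWindow()
        {
            graphics.beginFill(0xEEEEEE);
            graphics.drawRect(0, 0, 300, 200);
            graphics.endFill();

            titleBar.graphics.beginFill(0x666666);
            titleBar.graphics.drawRect(0, 0, 300, 20);
            titleBar.graphics.endFill();
            titleBar.buttonMode = true;    // show the pointy hand over the title bar
            titleBar.useHandCursor = true;
            addChild(titleBar);

            titleBar.addEventListener(MouseEvent.MOUSE_DOWN, onTitleDown);
            titleBar.addEventListener(MouseEvent.MOUSE_UP, onTitleUp);

            restorePosition();
        }

        private function restorePosition():void
        {
            try
            {
                settings = SharedObject.getLocal("logWindowSettings");
                if (settings.data.hasOwnProperty("x"))
                {
                    x = settings.data.x;
                    y = settings.data.y;
                }
            }
            catch (error:Error)
            {
                trace("[WARN] DraggableLogWindow::restorePosition, couldn't read saved position: " + error);
            }
        }

        private function onTitleDown(event:MouseEvent):void
        {
            startDrag();
        }

        private function onTitleUp(event:MouseEvent):void
        {
            stopDrag();
            if (settings == null) return;
            try
            {
                settings.data.x = x;
                settings.data.y = y;
                settings.flush();
            }
            catch (error:Error)
            {
                trace("[WARN] DraggableLogWindow::onTitleUp, couldn't save position: " + error);
            }
        }
    }
}
```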

Log Message Formatting

Log message formatting is very much a matter of opinion.  Thus, using standard formatting options such as newline (“\n”) and tab (“\t”) will ensure that whatever your team decides on, it’ll work no matter where it’s copy pasta’d from.  One convention I use to prevent needing strange metadata in your code is the following:

  • have log, debug, info, warn, error, and fatal messages
  • color code them in the GUI with filters
  • all log messages start with “ClassName::methodName” where ClassName is the name of the class you are in, and methodName is the current method the log message is in.  It seems a major pain at first, but I guarantee you when you start removing them 2 months later, you know EXACTLY where to find it vs. that one trace that stays there for weeks because no one found where the bastard is.  It’s not that they couldn’t do a search/grep for Debug.log(, but they’d find 500 of his friends… so not worth the effort.
  • I provide a header method for each type of message; logHeader, debugHeader, infoHeader, warnHeader, errorHeader, and fatalHeader.  These are usually just “———————” color coded.  When you put multiple logs in a method, especially for/while loops, this helps visually break up the hundreds of lines of log messages.  This also enforces consistency amongst the team when other members will just use their own like “>>>>>>>>>” and “**************”, and it becomes cool from an artistic ASCII perspective, but a pain from a debugging standpoint.
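
The convention above boils down to a few string helpers.  This is only a sketch: the LogFormat class and its method names are hypothetical, and the header is plain hyphens so it pastes cleanly anywhere.

```actionscript
package
{
    /** Hypothetical helpers for the "ClassName::methodName" and header conventions. */
    public class LogFormat
    {
        // One shared header, so the team doesn't invent ">>>>" and "****" variants.
        public static const HEADER:String = "---------------------";

        /** Example result: [WARN]<tab>LoginService::onResult, password change requested */
        public static function line(type:String, origin:String, text:String):String
        {
            return "[" + type.toUpperCase() + "]\t" + origin + ", " + text;
        }

        /** A header followed by several lines, handy inside for/while loops. */
        public static function block(type:String, origin:String, texts:Array):String
        {
            var out:Array = [HEADER];
            for each (var text:String in texts)
            {
                out.push(line(type, origin, text));
            }
            return out.join("\n");
        }
    }
}
```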

Writing Log Message Types

You need helpful log messages that aren’t written in g33k talk.  It’s not just you that will be reading them, but testers and PM’s from a variety of backgrounds.  You also need to write clearly enough that if a bug crops up 3 months later, you’ve written a message that clearly identifies the problem.  Trust me, you’ll forget as you focus on other things.  Even if you don’t, a confidently written message goes a long way in assuaging people’s fears and showing that you know what is wrong, and have it covered.  Here, I’ll cover the 6 message types, and where & why you’d use them.

LOG

Logs are used for common stuff that is integral to the app, and always runs every time you run the application.  This includes loading external service definitions, logging into the application, and other integral data your application needs to get from an external place before it runs.  Logs should get cliche over time in larger applications; you should only start worrying when you don’t see them, and instead see errors.  This, also, is a great indication that something totally wrong… is wrong.

Logs are checklist items to ensure your application is in working condition, or to confirm an action did in fact happen successfully.

Debug.log("ServiceLoader::onComplete, we've successfully loaded our services.xml, and we're ready to rock this mic!");

DEBUG

Debug messages are your primary source of insight into new code.  These are messages you’ll put into new code to ensure it works… or into old code that’s acting up.  If you use Test Driven Development and/or unit tests, these messages are used in tandem to be iron-clad sure that the things you think are happening are actually happening.

Debug.debugHeader();
Debug.debug("Alright... my method is actually running... amazing, I'm not fired.");

INFO

Info messages are used in strange, insecure situations.  When something happens that you want to know about, but that doesn’t adversely affect the application, you use an info message.  They can also be used in tandem with debug messages.  If you have a lot of debug messages, it’s sometimes hard to filter them alone, so you throw an info message into the middle to confirm/deny what you were testing/debugging.

Debug.info("Dude, I didn't get a security error this time!");

WARN

Warning messages are used to report errors that don’t negatively affect the continued operation of your application.  These include security errors when loading images, failure to save a file, or when a web service goes AWOL.  These can lead to, or hint at, bigger problems, but sometimes occur so often they don’t hurt anyone; you and your team just need to be aware they happened.

Debug.warn("LoginService::onResult, succeeded logging in, but I'm getting yelled at to change my password, and we don't have the popup wired in here yet.");

ERROR

Errors are usually a blanket message for all errors.  These include synchronous, asynchronous, and custom problems that arise.  You’ll often log these within try/catch blocks, asynchronous error event handlers, or when your code is expecting something to be true, and it’s not, and you’re screwed because it’s not.  Sometimes you’ll just use a warning instead because you don’t care right now, or it’s not your fault, or you just can’t do anything about it.  Errors usually imply something needs to be, and can be, fixed.

Examples:

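// Synchronous failure: the write throws, so we catch and log it
try
{
        fileStream.writeObject(obj);
}
catch(ioError:Error)
{
        Debug.error("FileSaver::onSave, Couldn't save the file, ioError: " + ioError);
}

// Asynchronous failure: the problem arrives as an event instead of a thrown error
function onIOError(event:IOErrorEvent):void
{
        Debug.error("FileSaver::onIOError, Couldn't save the file, ioError: " + event.text);
}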

FATAL

You shouldn’t have to ever use these.  You’ll find that WARN and ERROR messages on their own cause concern amongst the non-developers on your team.  Even messages written with a concerned tone using LOG/DEBUG/INFO can arouse suspicion, and fill your inbox unnecessarily.  You should also NEVER write fatal messages when you’re emotional, like when a 3rd party web service breaks for the umpteenth billionth time, and it’s not your fault… yet your team always gets blamed.  Maybe the code isn’t in a solid enough place yet to actually debug it.

Fatals should really be saved for situations in which you’re royally screwed.  If you fail to save a file, and your app is a Notepad-like app… you’re pretty screwed, but you’re not royally f’ed… there is a big difference.  Maybe the user is out of hard drive space, or the file is locked.  If you can’t recover from an error, but perhaps can wait it out, you still have a chance.

If your external services.xml file doesn’t load, and thus your entire app doesn’t work?  Yeah, that’s a fatal message.

More on Messages

Litter your application with these, but only the ones you need.  Sorting through hundreds of messages to solve a simple null pointer exception is painful; don’t make your job harder than it has to be… but don’t skimp on helpful details either.  It’s an art, and you’ll get better over time.  Sometimes your messages will be great for 6 weeks… and then after that section of code base always works, you can comment them out.

Also, try to be proactive with errors & warnings.  If something breaks, and you have an idea why, provide insight into the likely cause.  Examples include ExternalInterface.call or setting stage.displayState to “fullScreen”.  Both can fail because the HTML/JavaScript embedding them didn’t have the proper parameters set.  You can explain what these are.  Even if that isn’t the problem, knowing they are gone usually hints that perhaps the html-template in Flex Builder got messed up during a merge, or perhaps the wrong HTML was pushed to the server, etc.  These are really nice when they pop up months later, and you immediately go, “I’ve never seen this message before… we clearly did something out of the ordinary.”
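
A sketch of what a proactive message could look like for ExternalInterface.  The JSBridge wrapper is hypothetical, and trace stands in for Debug.error so the snippet compiles on its own; ExternalInterface.available, ExternalInterface.call, and the allowScriptAccess embed parameter are the real Flash Player pieces.

```actionscript
package
{
    import flash.external.ExternalInterface;

    /** Hypothetical wrapper that explains WHY a JavaScript call probably failed. */
    public class JSBridge
    {
        /** Calls a JavaScript function; returns null and logs a hint if it can't. */
        public static function call(functionName:String):*
        {
            if (!ExternalInterface.available)
            {
                trace("[ERROR] JSBridge::call, ExternalInterface isn't available in this container, so '"
                    + functionName + "' was never called.  Are we running outside the browser?");
                return null;
            }
            try
            {
                return ExternalInterface.call(functionName);
            }
            catch (securityError:SecurityError)
            {
                trace("[ERROR] JSBridge::call, security error calling '" + functionName
                    + "'.  Check the allowScriptAccess parameter in the embedding HTML; "
                    + "the wrong HTML template may have been pushed.  " + securityError);
            }
            catch (error:Error)
            {
                trace("[ERROR] JSBridge::call, unexpected error calling '" + functionName + "': " + error);
            }
            return null;
        }
    }
}
```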

Finally, keep in mind you should try not to push debug messages to your staging & production servers.  You need at least 2 servers without messages in case they negatively affect things.  Accessing objects inside of a debug message itself can cause a null pointer for example.  Also, trace is basically a write to the disk in some situations, and slows your app down.  While removing the messages and speeding your app up may seem like a good thing, sometimes strange race conditions your debugger fixed will arise.  Better to see them before you move to production.

Reporting

Logging is a form of reporting, but reports specifically are run on certain sections of the code base and data to get finer grained information, and ONLY on that section.  Sometimes you can write a unit test to get the report you want, or other times just a custom class/application.  Examples include validating, en masse, all data coming from your back-end is valid.  On my current project, I loop through 500 videos and ensure they play within 10 minutes; all I do is hit a URL in the browser and it runs.  It prints out custom log information on the status so I know clearly what’s going on. It’s ONLY for that particular report, so I don’t need to do any filtering.

Reports can be run in specific SWF’s tailor made specifically to run them.  Making these easily accessible to others allows increased insight to various parts of the application.  Example:

“Jesse, the application isn’t displaying images from the image server again.”

“How do you know it’s the image server that’s failing?”

“Your debug window said it was.”

“Did you run the image server tester app?”

“Yes, and it confirmed that our internal development image server works, but when we hit the production 3rd party one, it fails, so it’s definitely their fault again, not ours.  I’ve already told the build master to switch to our local server for our noon demo until the 3rd party can get their act together.  Just wanted to let you know in case you start getting old images.”

Within seconds, someone from your team can run a SWF that ISN’T your application, and determine it’s not the application that’s broken.  Sometimes, applications take a while to run, as well as a while to navigate to the problem section you want to test.  If the problem occurs repeatedly, you make it dead simple for all to diagnose.  Win.

Using these reports in tandem with other data allows PM’s to place pressure where they need to, with the team’s confidence behind them.

Other Examples of Fire Drill & Crisis Prevention

This crap happens to me ALL the time.  An IM pops up:

“Dude, the app isn’t showing any data!!!!”

“I know.  I already told our boss.  It’s not our fault, it’s the server team having migration issues.  They’ll have it up in about 10 minutes working again.”

“Whoa… that was quick.”

That, vs. panicking, and then you have to run the app.  This assumes you aren’t in the middle of something and the app can even compile.  Within seconds, you clearly identified the problem, informed the necessary parties, and set the team at ease.  That as opposed to a fire drill which breaks team focus.  As we know, developer focus is EXTREMELY valuable to maintain.

Another is:

“Jesse, Java middle tier dev here.  We’re not seeing videos work here, but a quick test in the browser shows they are coming from Amazon’s CDN no problem.  I didn’t want to spend an hour running scripts to sort through failed video FTP logs, so was curious if you knew of anything before I did so?”

“Huh?”

:: runs video tester ::

“Hrm… works in the tester; you checking production?”

“Yep.”

“Ugh… hold on, lemme test….”

:: tests ::

“Yep, my code broke it… I’m getting error messages from a totally different server in the app itself.  Give me an hour, I’ll push a new build.”

“Thanks!”

First, I saved a middle tier developer’s time.  Second, I quickly ascertained the core video services worked; it’s just my latest code change broke it.  This WITHOUT having to resort to a debugging session, or even comparing tagged builds in SVN.

More on Video Diagnosis

I use video diagnosis as an example reporting application merely because streaming video is really complicated.  There are a lot of failure points, and it helps to know which point failed.  Without verbose logging, it’s hard to tell what the error really was.  In dealing with 3rd party CDN’s, they will NOT respond to you unless you can easily point the finger.  Thus, you need verbose coverage of errors to effectively communicate all bases on your side are covered.

For example, does the NetConnection work?  If he hits one of his 7 failures, excluding the 10 billion ones you have to manually parse (yes, PARSE) out of NetStatus, do you know which one failed, and why?  What about NetStream?  Are they failing because of your session?  Your token?  A malformed URL that botched the whole process?
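
A hedged sketch of covering those NetConnection failure points: every event the connection can dispatch gets a log line, and the NetStatus code is reported verbatim.  The LoggedNetConnection wrapper is hypothetical, and trace stands in for the Debug calls so the snippet is self-contained.

```actionscript
package
{
    import flash.events.AsyncErrorEvent;
    import flash.events.ErrorEvent;
    import flash.events.IOErrorEvent;
    import flash.events.NetStatusEvent;
    import flash.events.SecurityErrorEvent;
    import flash.net.NetConnection;

    /** Hypothetical wrapper that reports which NetConnection failure point was hit. */
    public class LoggedNetConnection
    {
        public var connection:NetConnection = new NetConnection();
        private var lastUrl:String;

        public function LoggedNetConnection()
        {
            connection.addEventListener(NetStatusEvent.NET_STATUS, onNetStatus);
            connection.addEventListener(IOErrorEvent.IO_ERROR, onErrorEvent);
            connection.addEventListener(SecurityErrorEvent.SECURITY_ERROR, onErrorEvent);
            connection.addEventListener(AsyncErrorEvent.ASYNC_ERROR, onErrorEvent);
        }

        public function connect(url:String):void
        {
            lastUrl = url;
            try
            {
                connection.connect(url);
            }
            catch (error:Error)
            {
                // A malformed URL or a sandbox violation is thrown synchronously.
                trace("[ERROR] LoggedNetConnection::connect, couldn't even start connecting to '"
                    + url + "': " + error);
            }
        }

        private function onNetStatus(event:NetStatusEvent):void
        {
            // info.code is e.g. "NetConnection.Connect.Rejected"; info.level is "status" or "error".
            var prefix:String = (event.info.level == "error") ? "[ERROR]" : "[LOG]";
            trace(prefix + " LoggedNetConnection::onNetStatus, " + event.info.code + " for '" + lastUrl + "'");
        }

        private function onErrorEvent(event:ErrorEvent):void
        {
            trace("[ERROR] LoggedNetConnection::onErrorEvent, " + event.type + " for '"
                + lastUrl + "': " + event.text);
        }
    }
}
```

The same pattern applies to NetStream; the point is that the exact code string ends up in text you can paste into a CDN help ticket.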

This is where a complex section of code is better tested in isolation.  Unit tests help here, yes, but a lot of times you need to run a lot of stateful code that gets way too complex to test in just 1 unit test.  That, and you need verbose reporting data, in order.  Data that you can send to those in charge and to CDN’s for help tickets.

Having a simple Flex app that plays a bunch of videos and reports verbosely on their successes and failures is invaluable without having to:

  1. run the app (assuming it compiles, and assuming it’s working on a specific server where the problem is reported)
  2. login
  3. navigate to section
  4. select video play

…slow!  If you do this more than 2 times, you’ll probably be doing it a lot more.  Things that are complex break more often, thus having diagnostic tools around them helps you more quickly ascertain the problem.  Remember, while unit tests test specific units of code, diagnostic tools test application functionality.  While you can write unit tests to do this, writing a simple GUI anyone can run quickly to see if something works is invaluable.

Service Layer Diagnostic Tools

For most services (web service, REST, SOAP, Remoting, etc), unit tests will suffice.  If you can quickly test 30 web services, and see that not only do they work, but the data they are sending back is valid, it’s really nice to blame the middle tier guys & gals so you can get back to work.  It’s always easy to blame the client because that is the main GUI used to access & use the services.  Granted, the middle tier developers can access your unit tests as well, but sometimes it’s helpful to provide a GUI for them to quickly test as well.  This includes customizable parameters that can be sent, with verbose logs that ensure your code is sending & getting what it needs.  Trust me, the more empowered your middle tier developers are, the better off you’ll be in the long run.

Tools you don’t have to build here include Firebug for Firefox, Charles HTTP Proxy, and Wireshark.  All work right now, and all aren’t your code.

Some tools are built upon a specific service.  For example, on one project we were showing a bunch of images for a certain account.  Each account had certain images show on certain dates & times.  It was INTEGRAL that the GUI correctly represented this.  Since there were a lot of failure points, I created a simple GUI to easily validate that the data I had not only was visually valid for the middle tier developer to confirm in the database, but also matched our application’s use of the data.  It was just a simple DataGrid, in a TitleWindow, with 1 custom itemRenderer.  However, that one component found sooooo many bugs, and really gave us a lot of confidence our back-end was working as it should… and my parsing code blew chunks.

Performance Diagnostic Tools

While Flex Builder comes with a profiler, you can create your own profilers as well.  You can access the sampler classes, and/or make your own profiling tools to constantly update you (or manually) on the current performance of your application.  As RIA’s tend to push a lot of browser based app limits, it helps to know just how far you are really pushing things.

Existing tools include Stats; making it accessible via the right click menu is awesome.  There’s also Grant Skinner’s PerformanceTest.  Getting familiar with how Flex’s profiler works, at least for memory, is a wonderful start.

More on Tools

Most diagnostic tools I create are simple, quick, and exist in a self-contained window visually, and package in the code base.  This ensures the code is easily removed if it causes a problem, it has a low risk of causing a dependency, and I can easily work on the tool in relative isolation.  I ONLY create a tool when the area in question is risky and/or hard to test.  Additionally, sometimes I need a GUI to provide simple functionality to test that I can’t do easily on my own.

For example, I’ve had 2 applications where we implemented client side caching via Local SharedObjects.  Have you ever tried to delete these things?  Pain in the ass, and slow.  What if you could just click a button IN your app?  That’s right, 3 clicks and they are dead?  What if you could see ALL SO’s your application uses, how much room they take up, and what their contents are… IN your application?  What if you could edit them?  Exactly.  Flash Player allows you to delete them, but it’s tedious, requires an external website, and there are no good, x-platform apps that quickly give you business insight into your local cache.  When I say “business” I mean relevant to your application.
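
The core of such a cache tool is small.  In this sketch the CacheTools class is hypothetical, and it assumes your application already knows the names of the SharedObjects it creates, since Flash Player gives you no API to enumerate them; SharedObject.getLocal, size, and clear are the real calls.

```actionscript
package
{
    import flash.net.SharedObject;

    /** Hypothetical helpers behind an in-app "Cache Viewer". */
    public class CacheTools
    {
        /** One line per SharedObject: its name, a tab, and its size in bytes. */
        public static function report(names:Array):String
        {
            var lines:Array = [];
            for each (var name:String in names)
            {
                try
                {
                    var so:SharedObject = SharedObject.getLocal(name);
                    lines.push(name + "\t" + so.size + " bytes");
                }
                catch (error:Error)
                {
                    lines.push(name + "\tcouldn't be opened: " + error);
                }
            }
            return lines.join("\n");
        }

        /** Wipes every named SharedObject, in memory and on disk. */
        public static function clearAll(names:Array):void
        {
            for each (var name:String in names)
            {
                try
                {
                    SharedObject.getLocal(name).clear();
                    trace("[LOG] CacheTools::clearAll, cleared '" + name + "'");
                }
                catch (error:Error)
                {
                    trace("[ERROR] CacheTools::clearAll, couldn't clear '" + name + "': " + error);
                }
            }
        }
    }
}
```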

For example, just because a Local SO has a ByteArray of BitmapData (since BitmapData doesn’t serialize properly) doesn’t mean it’s actually an image your GUI should show.  Perhaps you’re utilizing BitmapData as a faster multi-dimensional array, and saving them to the disk this way.  Your GUI knows what to display, and how.

Another reason is ease of use.  Having PMs and QA run testing sessions in a browser-based app is pretty straightforward in Safari and Firefox.  Same with new GUI changes; just clear your cache.  But what about Flash cookies?  “Right Click, choose ‘Cache Viewer’, hit the ‘Delete’ button”.  Hell, those instructions will fit in a Tweet!
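The guts of such a cache viewer can be tiny.  This is a rough sketch, not the tool from those projects: the class name CacheTool and the SO names in cacheNames are made up, and it assumes your app keeps a list of the SharedObject names it uses (it needs one, since you can only open a Local SO by name):

```actionscript
package
{
    import flash.net.SharedObject;

    // Hypothetical helper: reports on and clears the Local SharedObjects an app uses.
    public class CacheTool
    {
        // Assumed names; replace with the Local SO names your application really uses.
        public static var cacheNames:Array = ["imageCache", "userSettings"];

        // Logs the size and top-level contents of each known SharedObject.
        public static function report():void
        {
            for each (var cacheName:String in cacheNames)
            {
                try
                {
                    var so:SharedObject = SharedObject.getLocal(cacheName);
                    trace("CacheTool::report, " + cacheName + " is " + so.size + " bytes");
                    for (var key:String in so.data)
                    {
                        trace("    " + key + " = " + so.data[key]);
                    }
                }
                catch (error:Error)
                {
                    trace("CacheTool::report, couldn't open " + cacheName + ": " + error.message);
                }
            }
        }

        // Purges the data of each known SharedObject and deletes it from disk.
        public static function clearAll():void
        {
            for each (var cacheName:String in cacheNames)
            {
                try
                {
                    SharedObject.getLocal(cacheName).clear();
                    trace("CacheTool::clearAll, cleared " + cacheName);
                }
                catch (error:Error)
                {
                    trace("CacheTool::clearAll, couldn't clear " + cacheName + ": " + error.message);
                }
            }
        }
    }
}
```

Wire CacheTool.clearAll() to a ‘Delete’ button and CacheTool.report() to whatever list or grid you like, and you have the 3-click version.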

The most infamous: “What version of Flash Player are you running?”

“How do I know?”

“Right click, and it’ll say.”

Booya.  Tools don’t have to be complicated windows… they could just be simple log messages that can be triggered, or information inside your right click menu.
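For the right click menu idea, something like the following would do.  It’s only a sketch: DiagnosticMenu and the BUILD label are invented for illustration, and the captions deliberately avoid the words “Flash Player”, since as far as I recall the player refuses custom menu items that contain them:

```actionscript
package
{
    import flash.display.InteractiveObject;
    import flash.events.ContextMenuEvent;
    import flash.system.Capabilities;
    import flash.ui.ContextMenu;
    import flash.ui.ContextMenuItem;

    // Hypothetical helper: puts diagnostic info into the right click menu.
    public class DiagnosticMenu
    {
        // Assumed build label; in a real app this would come from your build process.
        public static const BUILD:String = "build 42";

        public static function attachTo(target:InteractiveObject):void
        {
            var menu:ContextMenu = new ContextMenu();

            // Disabled item: purely informational, shown greyed out.
            var info:ContextMenuItem = new ContextMenuItem(
                "App " + BUILD + ", runtime " + Capabilities.version, false, false);

            // Clickable item that writes environment details to the log.
            var dump:ContextMenuItem = new ContextMenuItem("Dump diagnostics to log", true);
            dump.addEventListener(ContextMenuEvent.MENU_ITEM_SELECT, onDump);

            menu.customItems = [info, dump];
            target.contextMenu = menu;
        }

        private static function onDump(event:ContextMenuEvent):void
        {
            trace("DiagnosticMenu::onDump, app: " + BUILD
                + ", runtime: " + Capabilities.version
                + ", os: " + Capabilities.os
                + ", playerType: " + Capabilities.playerType
                + ", debugger: " + Capabilities.isDebugger);
        }
    }
}
```

Calling DiagnosticMenu.attachTo(this) from your main application or document class is enough; after that the answer to “what version are you running?” is one right click away.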

Landmines

The last thing to touch on is land mines.  Logging, reporting, and diagnostic tools prepare you for the worst.  The worst is when you start removing land mines.  Sometimes you don’t know you have land-mines until someone steps on one.  Examples include the login service timing out.  The ENTIRE APP fails to work merely because of a hiccup in the server.  The client wasn’t written to… oh I don’t know, try again a couple of times.  Once it happens, you need to remove it.  Hopefully you’ll have a log message… like the login service never reporting an error, NOR a success.  If you don’t, you’ll learn your lesson, and log that shiz.  It’ll never get by you again.

…oh crap, a timeout.  No seriously, THIS TIME it won’t get by me again.  I’ll log when the timeout occurs AND whether or not they worked.

Now you’re talking.
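To make that concrete, here is one way a login call could try again a couple of times and log every outcome.  Treat it as a sketch under assumptions: RetryingLogin, the 3 attempts, and the 10 second timeout are all made up for this example, the real login service surely looks different, and trace() stands in for your logger:

```actionscript
package
{
    import flash.events.ErrorEvent;
    import flash.events.Event;
    import flash.events.EventDispatcher;
    import flash.events.IOErrorEvent;
    import flash.events.SecurityErrorEvent;
    import flash.events.TimerEvent;
    import flash.net.URLLoader;
    import flash.net.URLRequest;
    import flash.utils.Timer;

    // Hypothetical login call that retries and logs every attempt, success, failure, and timeout.
    public class RetryingLogin extends EventDispatcher
    {
        private static const MAX_ATTEMPTS:int = 3;     // assumed value
        private static const TIMEOUT_MS:int = 10000;   // assumed value

        private var request:URLRequest;
        private var loader:URLLoader;
        private var timeout:Timer;
        private var attempt:int = 0;

        public function RetryingLogin(url:String)
        {
            request = new URLRequest(url);
            timeout = new Timer(TIMEOUT_MS, 1);
            timeout.addEventListener(TimerEvent.TIMER_COMPLETE, onTimeout);
        }

        public function login():void
        {
            attempt++;
            trace("RetryingLogin::login, attempt " + attempt + " of " + MAX_ATTEMPTS);
            loader = new URLLoader();
            loader.addEventListener(Event.COMPLETE, onComplete);
            loader.addEventListener(IOErrorEvent.IO_ERROR, onError);
            loader.addEventListener(SecurityErrorEvent.SECURITY_ERROR, onError);
            timeout.reset();
            timeout.start();
            try
            {
                loader.load(request);
            }
            catch (error:Error)
            {
                fail("load threw " + error.name + ": " + error.message);
            }
        }

        private function onComplete(event:Event):void
        {
            stopListening();
            trace("RetryingLogin::onComplete, success on attempt " + attempt);
            dispatchEvent(new Event(Event.COMPLETE));
        }

        private function onError(event:Event):void
        {
            fail(event.toString());
        }

        private function onTimeout(event:TimerEvent):void
        {
            // Abandon the pending request; log it if even that fails.
            try
            {
                loader.close();
            }
            catch (error:Error)
            {
                trace("RetryingLogin::onTimeout, close failed: " + error.message);
            }
            fail("no response after " + TIMEOUT_MS + " ms");
        }

        // Logs the failure, then either retries or gives up with an ErrorEvent.
        private function fail(reason:String):void
        {
            stopListening();
            trace("RetryingLogin::fail, attempt " + attempt + " failed: " + reason);
            if (attempt < MAX_ATTEMPTS)
            {
                login();
            }
            else
            {
                trace("RetryingLogin::fail, giving up after " + attempt + " attempts");
                dispatchEvent(new ErrorEvent(ErrorEvent.ERROR, false, false, reason));
            }
        }

        private function stopListening():void
        {
            timeout.stop();
            loader.removeEventListener(Event.COMPLETE, onComplete);
            loader.removeEventListener(IOErrorEvent.IO_ERROR, onError);
            loader.removeEventListener(SecurityErrorEvent.SECURITY_ERROR, onError);
        }
    }
}
```

Listen for Event.COMPLETE and ErrorEvent.ERROR on the instance, then call login().  The point isn’t the retry count; it’s that every path, including “nothing came back”, leaves a line in the log.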

Other more common land-mines include code that blatantly assumes no errors will ever be thrown.  Things like Loader.load, NetConnection.connect, or navigateToURL with no try/catch blocks around them.  Those things are just WAITING to explode.  If you don’t have a try/catch, they’ll blow up eventually.  If you do, but no log, this is MUCH MUCH worse.  It’ll explode, and no one will know.  Like when the tree falls in the woods, but no one is there to hear it, it doesn’t make a sound.  You need to log all errors.

…except for NetStream.close()… he’s the exception to the rule.
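Defusing these is mostly boilerplate.  A minimal sketch, assuming trace() as the logger and with SafeCalls being a name I made up for this example:

```actionscript
package
{
    import flash.display.Loader;
    import flash.events.IOErrorEvent;
    import flash.events.SecurityErrorEvent;
    import flash.net.URLRequest;
    import flash.net.navigateToURL;

    // Hypothetical wrappers: same calls as before, but every failure gets logged.
    public class SafeCalls
    {
        // Returns true if the call went through, false (plus a log message) if it threw.
        public static function openLink(url:String, window:String = "_blank"):Boolean
        {
            try
            {
                navigateToURL(new URLRequest(url), window);
                return true;
            }
            catch (error:Error)
            {
                trace("SafeCalls::openLink, failed for " + url + ", " + error.name + ": " + error.message);
            }
            return false;
        }

        // Loader.load can fail synchronously (throws) or asynchronously (events); log both.
        public static function loadImage(url:String):Loader
        {
            var loader:Loader = new Loader();
            loader.contentLoaderInfo.addEventListener(IOErrorEvent.IO_ERROR,
                function(event:IOErrorEvent):void
                {
                    trace("SafeCalls::loadImage, IO error for " + url + ": " + event.text);
                });
            loader.contentLoaderInfo.addEventListener(SecurityErrorEvent.SECURITY_ERROR,
                function(event:SecurityErrorEvent):void
                {
                    trace("SafeCalls::loadImage, security error for " + url + ": " + event.text);
                });
            try
            {
                loader.load(new URLRequest(url));
            }
            catch (error:Error)
            {
                trace("SafeCalls::loadImage, load threw for " + url + ", " + error.name + ": " + error.message);
            }
            return loader;
        }
    }
}
```

The catch blocks don’t fix anything; they just make sure the tree makes a sound when it falls.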

Bigger land mines include sections of code that break, and don’t tell anyone.  A lot of the time developers won’t get a list of error messages, nor a GUI element to utilize in those cases.  Feeling insecure, they’ll either just log it, or perhaps LET it break on purpose to get someone motivated to provide design/UX direction.  It’s better to log those messages with proactive verbiage, and even use built-in alert controls (like Flex’s) until a designer/management can provide the developer with what they need.  Informing the user of a problem with an ugly dialogue, and a log for QA testers/PMs, is MUCH better than a hidden log message with no visual indication of what went wrong.
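The ugly-but-honest version takes a few lines in Flex.  A sketch, where ProblemReporter and the wording of the message are placeholders until a designer gives you something better:

```actionscript
package
{
    import mx.controls.Alert;

    // Hypothetical stop-gap: log the details for the team, tell the user something went wrong.
    public class ProblemReporter
    {
        public static function report(where:String, error:Error):void
        {
            // The full details go to the log for QA testers/PMs.
            trace(where + ", " + error.name + ": " + error.message);

            // The user gets a plain, built-in Flex alert instead of silence.
            Alert.show("Sorry, we couldn't complete that action.  Please try again, "
                + "and let us know if it keeps happening.", "Something went wrong");
        }
    }
}
```

Inside any catch block, a call like ProblemReporter.report("LoginView::onSubmit", error) covers both audiences; the "LoginView::onSubmit" label is just an example of saying where it happened.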

It may not seem like a land-mine, but if no one knows why clicking a button didn’t work, they’ll feel insecure about using the app.  If they are a stakeholder, this insecurity can lead to a fire drill.  Even if you and/or your team knows the problem, they don’t since you didn’t tell them.

Again, most land-mines from a consulting perspective are the ones you don’t know about.  A lot of times, the code base is too large for you to properly dig in and ascertain potential problems.  At least providing diagnostic tools for that region, or empowering others on your team to use/build them, keeps you abreast of what’s going on.  I’m not talking about null pointers; I’m talking about dependencies that you don’t know about breaking, and no one, including you, immediately knowing what’s wrong.

Using my JXLLogger v2

I’ve provided a sample logger I use in Flex & AIR apps.  You can use it in pure AS3 projects if you just utilize a simple LocalConnection.  It doesn’t follow all of my rules above, but it’s good enough to get the job done.  Simply drop the SWC into your libs folder, and go:

import mx.managers.PopUpManager;

PopUpManager.createPopUp(this, DebugMax, false);
DebugMax.log("Application::init, DebugMax ready and able, SIR!");

Thenceforth, anywhere in your application, you can go, DebugMax.log(), or debug, etc.

Cons

Having your application clearly talk to you and your team, with associated diagnostic & reporting tools to help augment your existing debugging capabilities… what could be wrong with that?

It’s more code.

More code means more things that can possibly go wrong.  More code means more things to maintain, stealing your focus from your core objective.  It’s even worse in Agile/Scrum: creating code to help debug other code may be a necessity, yet you’re technically only getting the user story points for one particular feature.  While you get better at recognizing what tools you need to create, and thus factor that into your perception of the challenge associated with specific user stories, it’s sometimes hard to justify.

“Did you finish the login service?”

“No.”

“Why not?”

“…well, my server-reporting app that allows me to see if the server is awake took longer than anticipated.”

“You were only supposed to code 1 class that logged into the server, not write a server monitoring application.”

“Dude, that server goes down ALL the time and the LoginService class has no GUI; we don’t have an easy way yet to debug which errors are real, which are our code, and which are that bloody server.”

“I don’t care, you’re clearly off task.”

“And you clearly are a douche!!!!”

See what I mean?  The same arguments used against Test Driven Development apply here; specifically, you’ve created more code to maintain.  Worse, that code isn’t often factored into project planning & budget.  It’s just assumed that if your design or data model changes, you go update your test classes because they “pay for themselves” in the future.  Hopefully that future arrives before your next UAT/deadline, and both they and the things the client/your boss actually cares about are finished by then.

It’s really hard to argue against good logging.  Everyone benefits, and you can do it whilst you code.  Creating diagnostic tools can sometimes be justified as simple prototypes for a larger GUI.  If you have to create a real-time graph for example, that is REALLY complicated.  Creating a simple tool that just shows the values as they are is much easier, and VERY valuable in the long run to visually compare against your in-development graphing component.

Full blown diagnostic tools, however, can usually only be justified to those who’ve been doing this for awhile, are confident in your ability to not go off on a tangent (assuming you don’t have better things to do), and/or actually have a business need.

For example, Flex & Flash applications are SWFs.  As such, they are a binary format that is self-contained on the page, and doesn’t have massive integration with the browser like HTML/JS/CSS does.  Thus, common debugging and reporting tools & plugins available on the web (i.e. Firebug, and the various web developer toolbars built into some browsers) do not work as well with Flex/Flash apps as they do with HTML ones.  Therefore, sometimes clients will specifically request monitoring tools and/or ways of testing the health of the application and/or getting reporting metrics from it.

Again, these tools & techniques are most valuable when you’re consulting on a large code base you don’t know.  Like wrapping strange code with unit tests, they are created to give insight into how the code behaves; if the code were good, it’d be easy to find out, and you wouldn’t need custom diagnostic tools.  You’re creating these tools for yourself, so you can compare what your good, working code tells you against the “great unknown”.  The more solid information you have in chaotic situations, the better.  You can be the calm in the middle of the storm, and slowly spread that calm over time.

In those situations, while management/stakeholders still want fixed features and new functionality, you’re clearly there as a consultant for a reason.  Establishing a strong beachhead that you can confidently stand on to effectively assess the situation is the right thing to do.

Conclusions

A simple test to ascertain if you need to utilize logging & diagnostic tools in your application is to ask yourself: Does my current team have fire drills and crises?  If yes, then yes, you need logging & diagnostic tools.

A simple test to ascertain if your application has land-mines is to do a little digging yourself.  Do you commonly see sections of the code where obvious try/catch blocks are needed?  Does the code assume a lot of things will never be null, yet have no centralized factory functions to ensure they aren’t?  Do things blow up in the app randomly, and no one on the team has a good idea as to why?

Using logging to make your application talk will help you develop and debug with confidence.  It’ll also have the wonderful side effect of having your team members, even the non-developers, getting insight into the application, sometimes solving their own problems, and providing you with helpful information to solve errors when they occur.  Nothing like having a verbose log with a JIRA ticket.

Using reporting tools can empower others to get finer-grained information about problem areas, and gives everyone optional tools to corroborate problems, especially with 3rd party libraries & services.

Removing land-mines from the code ensures they won’t blow up; and if they do, you know, or knew, it was going to happen via a log message and/or diagnostic tool.  The real challenge is figuring out which land-mines you focus on, and which you don’t.  That just comes from experience.

Preventing stressful situations and empowering the team to have insight into the workings of the application goes a long way to earning trust with your client.  The app may be full of hundreds of little time bombs, but eventually you’ll know about all of ’em, and those you don’t, you’ll know why.  As will your team.  That allows you to kick ass… even if the current code base doesn’t.

7 Replies to “Consulting Chronicles #3: Preventing Fire Drills & Crises by Removing Land-mines and Using Diagnostic Tools”

  1. Very nice and comprehensive round-up. I’m definitely passing this along inside my team.

    Started me thinking about how to (re-)implement a lot of these ideas in our current projects.
    We do do a lot of these things already, but again, it’s nice to see everything concerning these areas rounded up here, good reference material.

  2. This is scary.

    For the last two weeks I’ve been working on adding features to a flash portion of a large retail website. The code is written as if someone started out with 10 different ideas of how to properly program (50% of the code lines miss the semi-colon). Dozens of design patterns thrown together. 30% of the classes doesn’t seem to be used. One of the main .fla’s contain tons of files for different campaign sites done by the same production company.
    Then it’s as if the code has gone through the washing machine a few times, totally whacked. And yet it magically works.

    I have in short time gone from making small banners to dealing with this, so I’m learning a lot and your posts have somehow synced with my current experiences, that’s what’s so scary!

    Today, for the first time, I wrote a specific helper class just for this project alone, although it’ll be useful later on.

    Class is called ChildSupport, and you simply feed it a DisplayObjectContainer and it parses through it and recursively lists all children, optionally the x & y values of the children and also you can choose how many levels deep you want to go.
    (At first I ran it on the whole project container and it of course timed out with hundreds of movieclips filling flashlog.txt).

    Nothing groundbreaking, but for me it’s huge, and quite fun indeed! While at the same time I was worried I was straying too far from the main objective. Again, you’re nailing it spot on in your post!

    Are you sure you don’t want to come to Stockholm and work with us? In code nerd land, I’d probably want to marry you.

    Keep the blog alive, it’s awesome!

  3. Empowering PM’s and AE’s is key. I totally agree with your post and use these practices daily. The more verbose, the better. This is especially helpful when dealing with 3rd party API’s.
