Software Observability, Structure, and Automation – Why and How

Observability and controllability are two important goals in system design. In software design and development, these goals are often forgotten, confused with debuggability, or assumed to be relevant only to the testing phase. I am writing this article to share some of my raw thinking on software observability and related topics.

Observability is first of all a topic for the development phase. Too often we take for granted that our code is straightforward and free of bugs. Only when we run it under a debugger and track its execution closely do we find that it has unintended effects, which are harmless at best and cause random bugs at worst. Hooking up a debugger is an extra step when the software shows no symptoms, so we often skip it. If we build the observers into the software, observing it becomes a routine part of development, and we can be more confident that the code really does what it was designed to do.

Software Observability on the Web

Bryan Cantrill, in his article “Hidden in Plain Sight” in ACM Queue, describes the observability problem as twofold:

  • Software is usually observable in its development version but not in production. The production version usually has all observer instrumentation stripped out for performance reasons.
  • Software problems are usually introduced at a very high level of abstraction, but the level that can be observed is close to the bottom. A high-level problem usually causes a cascade of problems that are observed far away from the root cause, and tracing back from the symptoms to the root cause can be difficult and time-consuming.

Bryan argues that to overcome these two problems, the observability infrastructure must make two profound shifts: from development to production, and from programs to systems. As the way forward, he suggests DTrace, a Sun technology.

Practical Observability Observations

Without resorting to DTrace, we average developers have to look for more pragmatic and practical solutions.

In my experience, observability problems have these characteristics:

  • Instrumentation code is buggy. It is usually written once and then never touched or used again. Most developers (myself included) assume our code is perfect as long as we see no buggy behaviour, so we don’t bother to watch closely what logs or traces our software produces. When the software evolves (is modified), the observers become obsolete.
  • Plain logging cannot serve the needs of different component levels. Some components inherently execute more often than others, or often enough to produce more logs than anyone can read. Those high-frequency logs flush out the useful logs generated by other components. Also, when component roles are carefully designed, logs from some inner components are not as useful as logs from the interfacing components.
  • As developers we don’t bother (or are not given the time) to fully test our code, because we lack testing facilities (instrumentation code and external test drivers). Too often we leave the code to be tested only in the testing phase, when it is too costly to test fully and even more costly to fix any bugs found. Too often products carry bugs that could have been avoided if the software had been more observable.

Based on these observations, I explain below what I’ve tried as a solution and why.

Pragmatic Observability and Resilience Experience

To overcome the practical observability problems and to make a software system resilient against software defects, I have tried these practices in my previous jobs and hobby projects (taken from my resume):

  • Semi-automated a built-in data inspector and a unit test generator: Applied open source tools (Make and Perl pstruct) to generate data-structure descriptions (C) in the release build process. Implemented a built-in data inspector (C) using the generated data descriptions; this lets developers easily see what is going on inside the system (through a UART connection or virtual shell). Also generated tests in Java from the same data descriptions. Implemented a desktop Java Swing application to run the external tests. The external Java-driven testing greatly extended the practical unit test coverage.
  • Extensively fine-tuned and customized tracking and logging mechanisms: Designed common components for resource tracking and event logging (C), and customized and incorporated them into other feature components (C). With this structured tracking/logging mechanism, most defects left observable history traces (retrieved via the UART shell or a core dump), so we were able to pinpoint the causes of many bugs without reproducing them (by logging into the shell of a live customer router or decoding the saved core dump file offline). Fixed many one-time or transient bugs with help from this feature.
  • Introduced a domain-specific language for structured execution: Converted the subsystem from a pure event-driven model (very long call paths) to a scheduled event-driven model (non-blocking, shallow calls) to align the data context with the execution context (this makes auto-recovery easier). Designed and implemented a domain-specific language (DSL, a mini language) for specifying the scheduling flow of different API requests, so that the state-machine logic is separated from the application logic. Together with table-driven event handlers, this prevented scope creep of components (e.g. running top-level functions at the bottom of the call path).
  • Automated and customized core dump and multi-tier recovery: Designed and implemented a GDB stub (C), a core dumper (C on a proprietary OS and VxWorks), a core decoder (Perl and C), and a multi-tier recovery mechanism (C on a proprietary OS and VxWorks). Extended the auto-recovery mechanism from inter-processor to intra-processor components. These measures greatly reduced troubleshooting and debugging time, and reduced the impact of defects.
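
The built-in data inspector from the first item can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the original implementation: the description table is written by hand here, whereas in the practice described above it would be emitted by a generator during the build. The names `conn_stats`, `field_desc`, `FIELD`, and `inspect_struct` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical component state that we want to observe. */
struct conn_stats {
    uint32_t rx_packets;
    uint32_t tx_packets;
    uint16_t state;
};

/* One entry per field: name, offset, and size. A generator would emit
 * this table in the build; here it is written by hand (assumption). */
struct field_desc {
    const char *name;
    size_t offset;
    size_t size;
};

#define FIELD(type, member) \
    { #member, offsetof(type, member), sizeof(((type *)0)->member) }

const struct field_desc conn_stats_desc[] = {
    FIELD(struct conn_stats, rx_packets),
    FIELD(struct conn_stats, tx_packets),
    FIELD(struct conn_stats, state),
};

/* Generic hand-written decoder: walks any description table and writes
 * one "name=value" line per field into out. Returns the number of
 * fields decoded. Only 16-bit and 32-bit unsigned fields are handled. */
int inspect_struct(const void *obj, const struct field_desc *desc,
                   size_t nfields, char *out, size_t outlen)
{
    size_t used = 0;
    int decoded = 0;

    if (outlen > 0)
        out[0] = '\0';
    for (size_t i = 0; i < nfields; i++) {
        const unsigned char *p = (const unsigned char *)obj + desc[i].offset;
        unsigned long value;

        if (desc[i].size == sizeof(uint16_t)) {
            uint16_t v;
            memcpy(&v, p, sizeof v);
            value = v;
        } else if (desc[i].size == sizeof(uint32_t)) {
            uint32_t v;
            memcpy(&v, p, sizeof v);
            value = v;
        } else {
            continue;   /* field sizes this sketch does not decode */
        }

        int n = snprintf(out + used, outlen - used, "%s=%lu\n",
                         desc[i].name, value);
        if (n < 0 || (size_t)n >= outlen - used)
            break;      /* output buffer is full */
        used += (size_t)n;
        decoded++;
    }
    return decoded;
}
```

A shell command on the target could call `inspect_struct(&stats, conn_stats_desc, 3, buf, sizeof buf)` and print `buf` over the UART. The decoder itself never changes when fields are added; only the table does.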

Pragmatic Observability Solutions Reasoning

The reasons for the above pragmatic practices can be further explained by the following rules of thumb:

  • Do not code all the observers by hand. Code the framework or template by hand, then use generators to produce the data description part. When we design software, we design it on top of some abstractions. Make sure each observed behavior has a data structure representation kept in memory (even on a garbage list), have the compiler dump the data structures, and generate the data structure decoder from the compiler dump. The observed data structures are at the intended abstraction level, so they reflect the root of a defect more closely should a bug exist. Since the framework or template code is executed by every observer, it is code that runs often and is therefore less likely to be buggy. The actual data decoders all come from the same generator, so they are correct as long as the generator is correct. Once any one observing function runs correctly, we can be confident that the other observers are correct too.
  • Separate the logging areas of different components by running frequency, criticality, and type of logging (string or raw data). Design a common logging and tracing component, instantiate it with configuration and customization for each observed component, and attach the instances to the components.
  • Separate logging formats. For very high-frequency events, log less information in compact formats. For components that run less often or are watched less, use more descriptive logs. For OS-related exceptions, generate a specially designed core dump (do not use a full memory core dump, as it consumes too many resources).
  • Build only the most basic facilities into the runtime image. Use external observers and decoders that work together with the internal facilities. An external decoder on the desktop is easier to maintain and modify than a built-in one. Use sockets or other communication links to connect the external and internal parts. The external part can be coded in a language that is more productive to work in, such as Java.
  • Be sure to keep the first occurrence of a problem forever. When using a ring-buffer logging area, make sure the first occurrence of an alarm is not overwritten.
  • Add auto-recovery where components interact. Make sure the error status returned from other components is handled properly and, when possible, not propagated to upstream components. A small defect should then produce only a few error logs rather than a cascade of effects.
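
Two of the rules above, separate logging areas per component and keeping the first occurrence of an alarm, can be sketched together. This is a minimal sketch under my own assumptions: the record layout, the tiny `LOG_SLOTS` size, and the names `log_area`, `log_event`, and `log_alarm` are all hypothetical and chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define LOG_SLOTS 4   /* deliberately tiny so wrap-around is easy to see */

/* Compact fixed-size record, suited to high-frequency events. */
struct log_rec {
    uint32_t seq;    /* running sequence number within this area */
    uint16_t code;   /* event or alarm code */
    uint16_t arg;    /* one small argument */
};

/* One logging area per observed component, so a chatty component
 * cannot overwrite the history of a quiet one. */
struct log_area {
    struct log_rec ring[LOG_SLOTS];  /* recent history, overwritten in a ring */
    uint32_t next_seq;               /* sequence number of the next record */
    struct log_rec first_alarm;      /* pinned: never overwritten once set */
    int has_first_alarm;
};

void log_init(struct log_area *a)
{
    memset(a, 0, sizeof *a);
}

/* Record an ordinary event; the oldest ring entry is overwritten. */
void log_event(struct log_area *a, uint16_t code, uint16_t arg)
{
    struct log_rec *r = &a->ring[a->next_seq % LOG_SLOTS];

    r->seq = a->next_seq++;
    r->code = code;
    r->arg = arg;
}

/* Record an alarm: it enters the ring like any event, and the first
 * alarm is also copied into the pinned slot outside the ring. */
void log_alarm(struct log_area *a, uint16_t code, uint16_t arg)
{
    if (!a->has_first_alarm) {
        a->first_alarm.seq = a->next_seq;
        a->first_alarm.code = code;
        a->first_alarm.arg = arg;
        a->has_first_alarm = 1;
    }
    log_event(a, code, arg);
}
```

Each component owns its own `struct log_area`, so a high-frequency component wrapping its ring many times leaves the areas of other components untouched, and `first_alarm` survives any number of wrap-arounds.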

Structure and Automation as Important Software Quality Factors

Observability is important to software quality. Two other key factors are also very important to software quality: structure and automation. I’ll explain more below.

Structure

By structure, we just mean any structure in software, including but not limited to the data structures, execution paths, multi-tasking design, locking and synchronization, debugging, coredumping, logging, tracing, etc. It’s all the important structures of the implementation.

For structure, we need to:

  • Make sure the software is designed with a sound structure that supports the implementation of its features and accommodates change.
  • Make sure that each time a new feature is added, the existing structure is not accidentally broken. Too often this happens when we add new features without understanding the existing structure, or when we focus only on the logic of the new feature.
  • Build observers around the core structure of the software. Furthermore, make sure the structures are represented by data structures so that observers can be generated automatically.
  • Design the structures to be defect-tolerant. Be prepared to handle exceptions at all times, contain the damage locally, and restart downstream components if possible.
  • Design software so that it has structured complexity rather than random complexity. Real software can never be simple; it is always a complex monster. What differs is how well you structure the complexity so that it stays under control; it never becomes less complex. Simplicity is an illusion that well-designed software presents to its users.
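
The defect-tolerance point (contain the damage locally and restart downstream components) might look like the following in its simplest form. This is a sketch under assumptions, not a description of the multi-tier recovery mentioned earlier: the `component` type, its `healthy` flag, and the retry-once policy are all hypothetical.

```c
enum status { ST_OK = 0, ST_FAIL = -1 };

/* Hypothetical downstream component with restartable state. */
struct component {
    int healthy;    /* 0 once the component has hit a defect */
    int restarts;   /* how many times it has been restarted */
};

enum status component_request(struct component *c, int input, int *output)
{
    if (!c->healthy)
        return ST_FAIL;
    *output = input * 2;   /* stand-in for the real work */
    return ST_OK;
}

void component_restart(struct component *c)
{
    c->healthy = 1;
    c->restarts++;
}

/* Caller-side containment: on failure, log once, restart the downstream
 * component, retry once, and only report upward if the retry also fails. */
enum status contained_request(struct component *c, int input, int *output,
                              int *error_logs)
{
    if (component_request(c, input, output) == ST_OK)
        return ST_OK;

    (*error_logs)++;        /* a single log entry for this failure */
    component_restart(c);
    return component_request(c, input, output);
}
```

The caller of `contained_request` sees a failure only when the restart did not help, so one defect produces one log entry rather than an error at every layer above it.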

Automation

By automation, we mean automatic build, code generation, automatic differentiation of logs, automatic transfer of logs and dumps, and much more.

Regarding automation, we can say:

  • In the implementation phase, code the core logic by hand and generate the repetitive parts automatically. For example, if you have tens of API functions and each performs very similar pre-condition tests, generate the pre-condition tests automatically. Another example: when you need to build data observers, the data descriptions can be generated from the source code. Only the core decoding logic needs to be coded by hand.
  • Use a domain-specific language (DSL) to help with coding and debugging. A DSL sits a little higher than raw coding (in C, etc.) and much lower than the catch-all high-level languages introduced by many CASE tools. The high-level languages in CASE tools have failed because they simply cannot capture enough detail of any real software system. A DSL is domain-specific, meaning it is closely tailored to the needs of a certain coding task in a very specific scope, to simplify the life of the coder. DSLs have existed since the early days of UNIX and Lisp, in the form of scripts, frameworks, templates, even coding styles, and more. They are a tried and true means of augmenting software development: little effort, big gain.
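
As a small illustration of generating repetitive pre-condition tests, the sketch below uses the C preprocessor's X-macro idiom in place of an external generator; that substitution is mine, chosen only to keep the example self-contained. The API names, the `session` type, and the status codes are hypothetical.

```c
#include <stddef.h>

enum api_status { API_OK = 0, API_EINVAL = -1, API_ENOTREADY = -2 };

struct session { int ready; };

/* Hand-written core logic, one function per API (stand-ins here). */
int do_open(struct session *s, int arg)  { (void)s; return arg + 1; }
int do_close(struct session *s, int arg) { (void)s; return arg - 1; }

/* The API list is written once. Each row: public name, core function,
 * and whether the session must be ready before the call. */
#define API_LIST(X)                 \
    X(api_open,  do_open,  0)       \
    X(api_close, do_close, 1)

/* Expands each row into a wrapper that runs the common pre-condition
 * tests and then calls the hand-written core. */
#define GEN_WRAPPER(name, core, needs_ready)                     \
    int name(struct session *s, int arg, int *result)            \
    {                                                            \
        if (s == NULL || result == NULL) return API_EINVAL;      \
        if ((needs_ready) && !s->ready)  return API_ENOTREADY;   \
        *result = core(s, arg);                                  \
        return API_OK;                                           \
    }

API_LIST(GEN_WRAPPER)
```

Adding a new API means adding one row to `API_LIST` and writing its core function; the checks stay identical across all wrappers because they come from a single template.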

Design by Overlay, Aspect Oriented, Design by Contract

Design-by-overlay is a way of viewing how the structures of software designs are put together.

Aspect-oriented design is a practical and formal method to practice design-by-overlay automatically.

Design-by-contract is about where in the code you should put the code relating to observers, sometimes observers with automatic actions. It is about how you validate the code against the design intentions, expressed as contracts, and trigger observable events when a contract is followed or violated.
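
A contract check that doubles as an observer could be sketched like this. It is a minimal sketch, assuming that simple counters are an acceptable form of "observable event"; the `CONTRACT` macro, `contract_stats`, and `safe_divide` are hypothetical names introduced only for this example.

```c
#include <stddef.h>

/* Counters make contract checks observable even when nobody reads logs. */
struct contract_stats {
    unsigned long checked;
    unsigned long violated;
    const char *first_violation;   /* text of the first failed contract */
};

struct contract_stats g_contracts;

/* Records the outcome of one contract check and returns 1 if the
 * contract held, 0 otherwise, so the caller can decide how to recover. */
int contract_check(int held, const char *text)
{
    g_contracts.checked++;
    if (!held) {
        g_contracts.violated++;
        if (g_contracts.first_violation == NULL)
            g_contracts.first_violation = text;
    }
    return held;
}

/* The macro passes the source text of the condition along with its value. */
#define CONTRACT(cond) contract_check((cond) != 0, #cond)

/* Example: a pre-condition at function entry and a post-condition
 * just before the normal return. */
int safe_divide(int num, int den, int *out)
{
    if (!CONTRACT(den != 0))                        /* pre-condition */
        return -1;
    *out = num / den;
    (void)CONTRACT(*out * den + num % den == num);  /* post-condition */
    return 0;
}
```

A built-in inspector or shell command can then report `g_contracts`, which shows both that the contracts are being exercised and which one was violated first.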

All these practices serve the same goals: to maintain a sound structure so that a larger piece of software can be handled by a single developer; to use automation extensively so that developers are relieved of tedious details and can focus on the more intelligent logic of the code; and to facilitate effective and accurate communication between developers and users.

The ultimate goal is to produce nearly perfect software in practice.

Creative Commons License
This work by minghuasweblog.wordpress.com is licensed under a Creative Commons Attribution 3.0 Unported License.
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.
Published on minghuasweblog.wordpress.com on Nov 2, 2011 @ 21:18 GMT
About minghuasweblog

a long time coder