Solution Architecture Inspirations

Many believe that Solution Architecture is a relatively new field; however, like any other new field, it builds on previously accumulated knowledge. Countless architects, and more recently product designers, have worked to develop architectural knowledge since the dawn of time, gradually evolving their practice from one level of complexity to the next. One can easily draw many parallels between Solution Architecture and Product Design; no wonder solution architects are often referred to as solution designers.

Personally, I believe there is a lot a solution architect can learn from studying the design disciplines that preceded Solution Design. One of my main inspirations is Dieter Rams, an industrial designer whose designs are as ubiquitous as can be; some of them have become the norm and are even considered the main design inspiration behind the look and feel of Apple's products and applications.

Dieter Rams came up with 10 design principles that carry over nicely from industrial design to solution design.

ACID Solution Designs

In database systems, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee database reliability while processing transactions. Databases, being the foundation of most -if not all- IT solutions, can be considered the superclass of any transactional system, so such properties are usually inherited by these systems. One way to abstract a complex system is as a database, with each subsystem being a table.

Keeping these concepts in mind while designing transaction-based systems is quite valuable, as it yields a stable, operable system that runs as expected at a low operational cost. Naturally, a cost-benefit analysis should precede this, as sometimes it is actually cheaper to let the system fail and handle failures operationally. Within the transactional system design context, these properties are:

Atomicity: All or nothing; the transaction should either be completed successfully or rolled back. This can be implemented through exception handling and checkpointing. A partially successful transaction shouldn't exist: the transaction either completes successfully or is rolled back to reverse the impact it had on the system.

The complexity of atomicity is a function of the number of systems involved and the actions required to roll back a transaction's partial impact. Furthermore, recursive failures must be kept in mind, as there is always the potential for the rollback itself to fail. Usually such functionality is achieved by using an order management system (such as OSM) which oversees the execution of a transaction across the subsystems and makes sure that failures are handled or rolled back.
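
As a minimal sketch of that idea (the class and method names below are hypothetical, not taken from OSM or any other product), an orchestrator can record which steps succeeded and compensate them in reverse order when a later step fails, escalating to operations when a rollback itself fails:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal sketch of compensation-based atomicity across subsystems.
// All names here are illustrative assumptions, not a real product API.
public class SimSwapOrchestrator {

    /** One step of the order, paired with its compensating (rollback) action. */
    interface OrderStep {
        String name();
        void execute() throws Exception;
        void rollback() throws Exception;
    }

    public void process(Iterable<OrderStep> steps) {
        Deque<OrderStep> completed = new ArrayDeque<>();
        try {
            for (OrderStep step : steps) {
                step.execute();        // forward action against one subsystem
                completed.push(step);  // remember what succeeded in case it must be undone
            }
        } catch (Exception failure) {
            // All or nothing: undo every step that already ran, in reverse order.
            while (!completed.isEmpty()) {
                OrderStep done = completed.pop();
                try {
                    done.rollback();
                } catch (Exception rollbackFailure) {
                    // The rollback itself failed: park this step for manual handling.
                    escalateToOperations(done.name(), rollbackFailure);
                }
            }
        }
    }

    private void escalateToOperations(String stepName, Exception cause) {
        System.err.println("Manual intervention needed for step " + stepName + ": " + cause);
    }
}
```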

Consistency: Any transaction must bring the entire system from one valid state to another. For instance, if we are doing a SIM swap, the updated SIM must be reflected across all systems. Inconsistencies can arise in many ways other than failures (assuming the atomicity concepts above are well guarded).

One of the most common causes of data drift is human operational intervention: operations staff going in and updating some system manually, using a database update statement or by invoking an internal system service. Another cause is bad design: instead of relying on the system's "public" services, subsystems' "private" micro-services are invoked directly.

Consistency can be maintained by restricting database access and by setting standards regarding micro-service invocation and reliance on public services invoked through the middleware. Constructing a protected/private micro-service is a pattern that should be used more often, to guarantee that future developers will not directly invoke a micro-service in a way that would compromise the system's consistency.
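
To make this concrete, here is a hedged Java sketch (the service names are made up): the subsystem micro-services are kept package-private, and the only public entry point is a service that updates every subsystem holding the SIM, so a caller cannot leave the stack half-updated.

```java
// A minimal sketch of the "public service over private micro-services" idea.
// All class names are illustrative.

// Package-private: callers outside this package cannot invoke it directly.
class SimInventoryService {
    void assignSim(String accountId, String newIccid) {
        // ... update the SIM inventory subsystem ...
    }
}

// Also package-private for the same reason.
class CrmAssetService {
    void updateSimAsset(String accountId, String newIccid) {
        // ... update the CRM asset ...
    }
}

// The only public entry point: it updates every subsystem that holds the SIM,
// so a caller cannot leave the stack half-updated.
public class SimSwapPublicService {
    private final SimInventoryService inventory = new SimInventoryService();
    private final CrmAssetService crm = new CrmAssetService();

    public void swapSim(String accountId, String newIccid) {
        inventory.assignSim(accountId, newIccid);
        crm.updateSimAsset(accountId, newIccid);
        // Network provisioning, billing, etc. would be invoked here as well.
    }
}
```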

Isolation: The concurrent execution of transactions should have the same result as the same transactions executed serially. A customer updating his bundle and changing the voicemail language should see the same outcome whether the two orders are executed in parallel or in series. Siebel, for example, usually copies the customer's assets into any newly constructed order and doesn't update the assets until the order is complete; if both orders run in parallel, one will end up rolling back the other (on the Siebel assets only). Hence you often find that Siebel allows only a single open order at a time, and if a customer has a failed order that requires operational intervention, he can't do anything until that order has been completed successfully.

Maintaining isolation in a complex environment can be rather complicated. The simplistic solution that has become the de facto best practice is locking out parallel process execution altogether: piping everything through a single system (order management/CRM) and making sure that transactions are executed serially. More advanced approaches are available, such as intelligent business rules about which actions can be conducted in parallel, but the cost of such approaches is high and operating them is a nightmare.
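
A minimal sketch of that "single open order per customer" rule, with hypothetical names: an atomic set of in-flight accounts acts as the gate, so orders for the same customer are forced into serial execution.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// A toy sketch of serializing orders per customer; names are illustrative.
public class OrderGate {

    // Accounts that currently have an in-flight order.
    private final Set<String> openOrders = ConcurrentHashMap.newKeySet();

    public void submit(String accountId, Runnable order) {
        // add() is atomic: only one caller wins the right to open an order.
        if (!openOrders.add(accountId)) {
            throw new IllegalStateException(
                "Account " + accountId + " already has an open order; retry once it completes.");
        }
        try {
            order.run();                  // execute the order end to end
        } finally {
            openOrders.remove(accountId); // release only when the order is closed
        }
    }
}
```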

Durability: Once a transaction is completed successfully, it remains so even in the event of a power outage or a crash. This mainly affects in-flight orders with multiple subflows: in case of an outage, the order should resume from where it stopped. Oracle AIA (Oracle's pre-Fusion middleware) achieved this through the "dehydration points" concept, in which a snapshot of the flow is stored in the database as a checkpoint, while Oracle AQs (Advanced Queues) guarantee that messages sent between subsystems are kept in non-volatile storage and are processed once the outage ends.

Designing for durability while working on a high-level design can be challenging, given that the design should be technology agnostic; still, there is a set of best practices, such as checkpointing and avoiding exceptionally long flows.
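
As an illustration of the dehydration-point idea (a toy sketch only, not how Oracle AIA actually implements it; the in-memory map stands in for a database table), the flow records how far it got after each step and resumes from that point after a crash:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of checkpointed, resumable flows. The checkpoint store is
// kept in memory here purely for illustration; real durability requires it to
// live in non-volatile storage such as a database table.
public class DurableFlow {

    private final Map<String, Integer> checkpointStore = new ConcurrentHashMap<>();

    public void run(String orderId, List<Runnable> steps) {
        int startAt = checkpointStore.getOrDefault(orderId, 0);
        for (int i = startAt; i < steps.size(); i++) {
            steps.get(i).run();                  // execute the next step of the flow
            checkpointStore.put(orderId, i + 1); // "dehydrate": record the progress made
        }
        checkpointStore.remove(orderId);         // flow finished; checkpoint no longer needed
    }
}
```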

Cost Benefit Analysis: Can a $20 solution replace a multimillion-dollar solution?

“For a man with a hammer, every problem is a nail.” The EA team, having access to a very large hammer, often leans towards very elaborate solutions to problems that could have been solved with a much simpler solution, or even no solution at all. As a solution architect I’ve often encountered such solutions -and even was part of a few of them- elaborate designs built to resolve problems that should have been handled operationally.

A Cost Benefit Analysis conducted during the requirements phase of a project can prevent such scenarios from taking place. What’s more important, though, is keeping an open mind and accepting that the problem at hand often doesn’t require the hammer the EA team is wielding.

Here is a parable I like to share when I encounter such situations; it is said to be a true story.

A toothpaste factory had a problem: because of the way its production line was set up, it sometimes shipped boxes without the tube inside, which was annoying its customers. Understanding how important that was, the CEO of the toothpaste factory got the top people in the company together, and they decided to start a new project in which they would hire an external engineering company to solve their empty-boxes problem, as their own engineering department was already too stretched to take on any extra effort.

The project followed the usual process: budget and project sponsor allocated, RFP issued, third parties selected, and six months (and $8 million) later they had a fantastic solution: on time, on budget, high quality, and everyone on the project had a great time. They solved the problem by using high-tech precision scales that would sound a bell and flash lights whenever a toothpaste box weighed less than it should. The line would stop, and someone had to walk over, yank the defective box off the belt, and press another button when done.

A while later, the CEO decided to have a look at the project’s ROI: amazing results! No empty boxes had shipped out of the factory since the scales were put in place, there were very few customer complaints, and they were gaining market share. “That’s some money well spent!” he said, before looking more closely at the other statistics in the report.

It turned out that the number of defects picked up by the scales was zero after three weeks of production use. The scales should have been picking up at least a dozen a day, so maybe something was wrong with the report. He filed a bug against it, and after some investigation the engineers came back saying the report was actually correct: the scales really weren’t picking up any defects, because every box that reached that point on the conveyor belt was good.

Puzzled, the CEO travelled down to the factory and walked up to the part of the line where the precision scales were installed. A few feet before the scales stood a $20 desk fan, blowing the empty boxes off the belt and into a bin.

“Oh, that — one of the guys put it there ’cause he was tired of walking over every time the bell rang”, said one of the workers.

The Law of Diminishing Returns

As an architect you often encounter requirements that are better off not implemented. Requirements are triaged through several activities, one of which is impact analysis. The law of diminishing returns comes into play here, given that the complexity of a requirement and its return are often inversely proportional. It is often more productive to implement a requirement partially rather than take it to the full extent, as the cost of that last stretch will far exceed the return.

The law of diminishing returns states that in all productive processes, adding more of one factor of production, while holding all others constant, will at some point yield lower incremental per-unit returns.

This behaviour can be represented using the following formula with X being the unit of incrementation and i the number of increments.

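One common way to express the idea (a hedged sketch, not necessarily the exact formula in the original figure) is to write the return of the i-th increment of X as a term r(X, i) that shrinks beyond some point, so the total return flattens out as more increments are added:

```latex
% Illustrative formalisation of diminishing returns:
% r(X, i) is the return contributed by the i-th increment of size X,
% and beyond some point i_0 each additional increment returns less.
\[
  R(n) \;=\; \sum_{i=1}^{n} r(X, i),
  \qquad r(X, i+1) \;<\; r(X, i) \quad \text{for all } i \ge i_0 .
\]
```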

To put it more colloquially: the more seeds you plant in a field, the less yield you get per seed. Another example is developers per project; there is a certain inflection point after which it doesn’t matter how many developers you add to the project, the return remains the same.

An example of requirements that exhibit highly diminishing returns are service assurance requirements: requiring 99.9% accuracy would cost far more than having a manual workflow to verify the deviations. OCR (optical character recognition) projects come to mind, hence you find projects like reCAPTCHA relying on users to verify the output of the OCR algorithm.

This is not very different from NP-complete problems and approximation algorithms: reaching an acceptable solution at a fraction of the cost is favoured over computing the exact solution, which for some problem sets can take a practically unbounded amount of time.

Fault Tolerant Process Design Patterns

Designing fault tolerance into a loosely coupled system based on async calls can be quite challenging; certain trade-offs usually must be made between resilience and performance. The usual challenge faced while designing such a system is missed or unprocessed calls resulting in data drift, which compounds over time and eventually renders the system unusable.

Use Case:

GSM customer swapping his SIM card.

Async SIM swap scenario:
  1. SIM migration order is created.
  2. Order processing starts, and SIM swap call is sent to network elements.
  3. Customer’s SIM is swapped but response from network elements is missed/not sent.
  4. CRM order is cancelled by customer care.
  5. The customer now has two different SIMs associated with his account: the one he is actually using listed in the network, and his old SIM card in CRM.
  6. All subsequent orders will fail, since the customer’s service account is inconsistent across the BSS stack.

One way to prevent such an issue altogether is to lock the customer for editing until the SIM swap request is confirmed by the network; if a failure happens during the SIM swap, the customer remains locked until the issue is resolved manually. This approach is called Fault Avoidance, and it is quite costly performance-wise; it also provides a really poor customer experience.

Fault Tolerance, on the other hand, allows such incidents to take place but prevents them from turning into failures. In my opinion the best pattern for handling faults in loosely coupled systems is check-pointing.

Checkpointing is a technique in which the system periodically checks for faults or inconsistencies and attempts to recover from them, thus preventing a failure from happening.

The check-pointing pattern is based on a four-stage approach (a code sketch follows the list):

  1. Error detection
  2. Damage assessment and confinement (sometimes called “firewalling”)
  3. Error recovery
  4. Fault treatment and continued service
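
As noted above, here is a hedged Java sketch of the four stages applied to the SIM-swap drift scenario; the client interfaces and the choice of the network as the source of truth for the active SIM are assumptions made purely for illustration.

```java
// A toy sketch of the four-stage check-pointing pattern; all names are hypothetical.
public class SimConsistencyCheckpoint {

    interface CrmClient     { String activeSim(String accountId); void setActiveSim(String accountId, String iccid); }
    interface NetworkClient { String activeSim(String accountId); }

    private final CrmClient crm;
    private final NetworkClient network;

    public SimConsistencyCheckpoint(CrmClient crm, NetworkClient network) {
        this.crm = crm;
        this.network = network;
    }

    /** Runs periodically, e.g. for every account with a recently closed or cancelled order. */
    public void check(String accountId) {
        String crmSim = crm.activeSim(accountId);
        String netSim = network.activeSim(accountId);

        // 1. Error detection: the two subsystems disagree on the active SIM.
        if (crmSim.equals(netSim)) {
            return; // consistent, nothing to do
        }

        // 2. Damage assessment and confinement: block new orders for this account
        //    so the inconsistency cannot spread into further transactions.
        confine(accountId);

        // 3. Error recovery: treat the network as the source of truth for the SIM
        //    actually in use, and align CRM with it.
        crm.setActiveSim(accountId, netSim);

        // 4. Fault treatment and continued service: lift the confinement and let
        //    subsequent orders proceed against a consistent stack.
        release(accountId);
    }

    private void confine(String accountId) { /* flag the account as under repair */ }
    private void release(String accountId) { /* clear the flag */ }
}
```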

If this approach sounds familiar, it’s because it has been in use for quite some time now in SQL databases (a loosely coupled system between client and DB server). To retain DB consistency in the event of a fault during long-running queries, the following steps are taken:

  1. Client session termination is detected (stage 1: error detection).
  2. The session is checked for any uncommitted DML statements (stage 2: damage assessment).
  3. The undo log is accessed to pull out the data needed to roll back the changes (stage 3: error recovery).
  4. The changes are rolled back and data consistency is restored (stage 4: fault treatment).

Checkpoint Roll-Back:

This is the pattern used by DBMSs: checkpoint-rollback relies on taking a snapshot of the system at certain checkpoints throughout the process flow and, upon failure between two checkpoints, restoring the snapshot. However, this pattern can become too complex to implement in multi-tiered systems.
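
A toy sketch of checkpoint-rollback (illustrative names, with an in-memory map standing in for the real system state): the state is snapshotted at each checkpoint and restored if the work between two checkpoints fails.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of checkpoint-rollback over a simple key/value state.
public class CheckpointRollback {

    private Map<String, String> state = new HashMap<>();    // current system state
    private Map<String, String> snapshot = new HashMap<>(); // last good checkpoint

    public void checkpoint() {
        snapshot = new HashMap<>(state);     // copy the state at a known-good point
    }

    public void runStep(Runnable step) {
        try {
            step.run();                      // mutate the state between two checkpoints
        } catch (RuntimeException failure) {
            state = new HashMap<>(snapshot); // restore the last checkpoint
            throw failure;                   // surface the fault after rolling back
        }
    }

    public Map<String, String> currentState() {
        return state;
    }
}
```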

Checkpoint Recovery Block:

This pattern relies on alternative flows selected based on the type of fault: the checkpoint recognizes the fault type and picks the correct block to recover from the error and complete the process.

This approach is used extensively while coding: a try block with multiple catch blocks, each handling a different type of exception. Here, instead of using it within the code of a single layer, it is taken one step further and applied at the process level.
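
A hedged sketch of what that looks like when lifted to the process level, with made-up fault types and flow names: the type of fault selects the alternative flow that completes the SIM provisioning.

```java
// A toy sketch of a recovery-block checkpoint: the fault type decides which
// alternative flow completes the process. Exception and flow names are illustrative.
public class SimSwapRecoveryBlock {

    static class NetworkTimeout extends RuntimeException {}
    static class InvalidSimState extends RuntimeException {}

    public void provisionSim(String accountId, String iccid) {
        try {
            primaryProvisioningFlow(accountId, iccid);
        } catch (NetworkTimeout t) {
            // Alternative flow 1: ask the network whether the swap actually went
            // through, then resume from the confirmed state.
            reconcileWithNetwork(accountId, iccid);
        } catch (InvalidSimState s) {
            // Alternative flow 2: the SIM must be reset before retrying the swap.
            resetAndRetry(accountId, iccid);
        }
    }

    private void primaryProvisioningFlow(String accountId, String iccid) { /* ... */ }
    private void reconcileWithNetwork(String accountId, String iccid)    { /* ... */ }
    private void resetAndRetry(String accountId, String iccid)           { /* ... */ }
}
```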