Debunking the Trend of “Open Box” AI & Machine Learning in AIOps

Will Cappelli | Monday, September 10, 2018

IT deserves clear answers, but let’s be careful not to equate transparency in how technology works with optimal results.


Three years ago, at Gartner, I warned vendors and users alike that unless AIOps technologies provided the means to trace and make explicit the key steps executed by AI algorithms on the path from data to pattern, anomaly, and causal analysis, IT operations teams would not take up those technologies with any enthusiasm. There were a couple of reasons for this warning.

First, memories are short in the vendor community, but it is important to recall that AIOps represents the second major attempt to commercialize AI for IT operations use cases. The first attempt, in the late 1980s and early 1990s, did lead to market successes (the previous generation of help desk technologies, the Prolog-based Tivoli Management Environment) but also to some undeniable failures (CA’s Neugents). Unfortunately, the failures seared the experience of IT operations teams more deeply than the successes and are preserved in the institutional memory of many data centers. Hence, there is a predisposition to skepticism about AIOps which vendors and practitioners need to overcome.


Second, most of the algorithms that drive modern AIOps are based on mathematics and statistical theory that go beyond what most computer science undergraduates have been exposed to. Hence, the algorithms themselves are a psychological “black box” to many IT ops professionals.

Now, even if these professionals come to trust the technology they are deploying, they will frequently find themselves having to defend cost-incurring decisions to executives who are, in most cases, even more math-phobic than they are. If an IT ops professional cannot, at least at a high level, make plausible the rationale for a decision to, say, take down a given server in order to fix a problem, the executive is unlikely to give his or her approval.

The Need for Sufficient Transparency with AI and Machine Learning

Note that it is algorithmic transparency that is critical here — the ability to see the steps, starting with the data set you are working from, through the various operations applied to the data in sequence, until you get the final result.
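To make that idea concrete, here is a minimal sketch of my own (not any vendor’s implementation): a pipeline object that records each named operation and its intermediate output, so the route from data set to final result can be replayed for anyone who asks.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class TransparentPipeline:
    """Run named operations in sequence, recording every intermediate
    result so the path from data to conclusion can be replayed and
    explained after the fact."""
    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def add_step(self, name: str, fn: Callable[[Any], Any]) -> None:
        self.steps.append((name, fn))

    def run(self, data: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
        trace = []  # the auditable record: (step name, intermediate output)
        for name, fn in self.steps:
            data = fn(data)
            trace.append((name, data))
        return data, trace

# Hypothetical usage: each step is visible, named, and inspectable.
# pipeline = TransparentPipeline()
# pipeline.add_step("deduplicate", deduplicate)
# pipeline.add_step("correlate", correlate)
# result, trace = pipeline.run(raw_events)
```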

Transparency is, of course, relative. Each operation applied in sequence to a given data set can be sub-analyzed into a more fine-grained sequence of sub-operations. Furthermore, at a certain point, this analysis will fork. One branch will yield fine-grained structures that ultimately take us into the realm of combinators or lambda calculus terms. Another branch will yield machine code and, ultimately, electrical signals routed through silicon.

So when I originally warned the market about the need for algorithmic transparency, I should have been a bit more precise. Total algorithmic transparency is probably an impossible goal and, even if it were achievable, it would not be very helpful.

“OK, executive decision maker — this algorithm uses the Newton-Cotes method to approximate a Gaussian quadrature. But you don’t care about that. What it’s actually doing is performing a reduction on a fixed-point lambda term and making use of the side effects! Now can I turn off that server?”

So my general warning about algorithmic transparency was problematic in the form that I first gave it. What I should have said was that algorithmic transparency is required up to the point where an executive decision maker can grasp the rationale behind the pattern, anomaly, or causal analysis.

There is No Such Thing as an ‘Open Box’ Solution

For better or worse, many other analysts, and vendors as well, have recently decided to echo my 2015-vintage warning.

I am flattered, of course, but would collegially remind them to take into account the modification I have just offered. Now, it is also important to stress what algorithmic transparency is NOT. It is NOT the ability to look at the results delivered by a pattern discovery algorithm, decide that one does not like the results, and arbitrarily introduce changes into the pattern to make it better accord with one’s intuitions.

This, by the way, is how other companies deal with algorithmic transparency, calling it the ‘Open Box’ approach to machine learning. Let’s unpack the absurdity of this ‘capability.’

First of all, it renders whatever mathematical integrity the original algorithm had completely inoperative. If you change the results, you have undermined the rationale of the algorithm. Remember, one deploys such algorithms in the first place precisely because human intuitions are useless in the face of large, complex, evolving data sets. If a practitioner feels free to alter the results based on intuition (and, whatever they may ‘feel,’ they are just randomly altering the results), there was no reason to deploy the algorithm in the first place. Why not just look at the data and draw a curve that comports with your feelings? Using the algorithm to kick off the process, so to speak, adds nothing to the rationale of your result.

Secondly, and in some sense more importantly, adding such a capability has not solved the problem of algorithmic transparency at all. Let’s return to the situation where one needs an OK for a significant intervention. The executive decision maker, still math phobic, asks the practitioner to justify the action and the practitioner replies: “Well, some black box algorithm using math that neither you nor I understand yielded the result. I didn’t like the outcome so I tweaked it because … well, the curve just did not look right.”

The bottom line is that no explanation — not even at a high level — is being provided. There is no ‘open box.’ There is only the same old black box algorithm with some completely unjustified tinkering after the fact. Pro tip: The executive decision maker will not OK a significant intervention on that basis.

Four Steps for Achieving Algorithmic Transparency

Vendors and practitioners alike need to work towards the goal of sufficient algorithmic transparency by making explicit the steps by which a given platform or complex, multi-part algorithm transforms data into a result. The Moogsoft AIOps platform does indeed show at least one way in which this can be done.

The Moogsoft AIOps platform takes data through a sequence of four distinct operations. As a preliminary stage, the platform ingests a raw data stream. We have discovered that this data stream, although voluminous, is in fact highly redundant. So the first operation prunes this highly redundant data stream into a much less voluminous, but now information-rich, data stream using information-theoretic principles. Furthermore, we allow practitioners to decide how vigorously the original data stream is to be pruned and, thanks to the math, that vigor can be ratcheted up or down while still preserving the validity of the result.
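To give a flavor of what such pruning can look like (this is a toy sketch of my own, not Moogsoft’s actual algorithm), one can score each event by its self-information and let a single hypothetical `vigor` threshold control how aggressively low-information events are dropped:

```python
import math
from collections import Counter
from typing import List

def prune_stream(events: List[str], vigor: float) -> List[str]:
    """Keep only events whose self-information, -log2(p), meets a
    threshold. `vigor` is an illustrative knob: raising it prunes
    more aggressively, dropping frequent, redundant messages first."""
    counts = Counter(events)
    total = len(events)
    kept = []
    for event in events:
        p = counts[event] / total
        surprise = -math.log2(p)  # rare events carry more information
        if surprise >= vigor:
            kept.append(event)
    return kept

# With vigor=2.0, any message forming more than a quarter of the
# stream (p > 0.25, i.e. under 2 bits of surprise) is pruned.
```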

The next operation works on the pruned data stream. The platform now has a collection of information-rich data items captured over a given interval of time. But clearly there are relationships among these data items, patterns and anomalies, that can be surfaced by algorithmic means. The operation applied at this stage, then, seeks to surface such correlations.

There are, of course, many types of correlation worth considering. The Moogsoft AIOps platform, in addition to supporting the ability to create and apply custom-created patterns, looks for correlations based on time (how close in time and in what temporal patterns data items appeared in the stream), correlations based on topology (ingested topology information mapped against information contained within the data items), and, because data items almost always take the form of alphanumeric text strings, correlations based on text (using a number of standard metrics to measure the distance of strings from one another). Once again, this operation is explicit, and practitioners can decide which of the time, topology, or text correlations are the most important for them (or create their own patterns, if they so choose).
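Purely as an illustration (the platform’s real correlators are configurable and, unlike this sketch, also exploit topology), here is one naive way to combine temporal and textual correlation using a standard string-similarity metric:

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, message text)

def correlate(events: List[Event],
              window_secs: float = 30.0,
              min_text_sim: float = 0.7) -> List[List[Event]]:
    """Greedy sketch: an event joins an existing cluster if it arrived
    within `window_secs` of that cluster's newest member AND its text
    is at least `min_text_sim` similar; otherwise it starts a cluster."""
    clusters: List[List[Event]] = []
    for ts, text in sorted(events):
        for cluster in clusters:
            last_ts, last_text = cluster[-1]
            similar = SequenceMatcher(None, text, last_text).ratio() >= min_text_sim
            if ts - last_ts <= window_secs and similar:
                cluster.append((ts, text))
                break
        else:  # no cluster accepted the event
            clusters.append([(ts, text)])
    return clusters
```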

As a result of the second operation, the platform is now working with collections or sets of correlated data items — the Moogsoft term for these sets is Situation. Until recently, that was as far as the algorithmic pattern discovery of our platform went. Recently, however, Moogsoft has introduced another vital operation.

This third operation works directly on the Situations — the sets of correlated data items — and seeks to surface the causal texture implicit in those sets. Remember, as I discussed in some detail in a recent blog, correlation is not causality. Knowing that two or more data items were generated at the same time, originated from the same place, or “look” like one another does not allow a practitioner to determine which of those data items indicate the underlying system state actually responsible for all of the correlated data items (and the system states they represent).

This is important to know precisely because, if one wants not only to observe what is happening but also to make changes that fix or prevent problems, one needs to know what the causes are and what their effects are. In other words, causality is critical to making an algorithmic analysis actionable.

The operation that discovers causal texture has two basic modes. On the one hand, the platform uses neural-network-based learning to discover, over time, which data items are likely to represent causal events and which represent events better treated as effects. On the other hand, the platform uses a unique, patent-pending technique called vertex entropy to determine which nodes of a topology are most likely to serve as points of origin for a complex causal sequence.

These two techniques can be used, at the practitioner’s discretion, in isolation from one another or in concert. Experience to date shows that the two techniques tend to triangulate on a single causal source, adding further credibility to the results obtained by the platform.
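The details of vertex entropy are patent-pending and not public, so the following is only a generic, entropy-flavored stand-in of my own devising: score each node of the topology by the Shannon entropy of its neighbors’ degree distribution, on the intuition that structurally distinctive nodes are more plausible points of origin.

```python
import math
from collections import Counter
from typing import Dict, List

def neighbor_degree_entropy(graph: Dict[str, List[str]]) -> Dict[str, float]:
    """Illustrative stand-in only, NOT Moogsoft's patent-pending vertex
    entropy: score each node by the Shannon entropy of its neighbors'
    degree distribution. Assumes an undirected adjacency dict in which
    every node appears as a key."""
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    scores: Dict[str, float] = {}
    for node, nbrs in graph.items():
        if not nbrs:
            scores[node] = 0.0
            continue
        dist = Counter(degree.get(n, 0) for n in nbrs)
        total = len(nbrs)
        scores[node] = -sum((c / total) * math.log2(c / total)
                            for c in dist.values())
    return scores

# Nodes with unusually mixed neighborhoods score highest and would be
# ranked first as candidate points of origin in this toy version.
```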

Finally, as a fourth operation, the causally analyzed Situations are turned over to our collaborative intelligence environment, where practitioners can work on them and resolve the problems the Situations indicate. Within the Situation Room itself, a further layer of AI is applied, which will be discussed in future blogs. Since this AI has more to do with workflow optimization and the preservation and refinement of institutional memory, however, it takes us beyond our immediate concern with sufficient algorithmic transparency.

So, in the end, a good case can be made that the Moogsoft AIOps platform does, in fact, embody the algorithmic transparency which, years ago, I argued was critical for market acceptance and enthusiasm.

Of course, it will only get better over time. Even now, however, it represents a substantially better way of addressing the concern than a BigPanda-like alternative: using a black box to generate patterns and then encouraging the practitioner to tinker with the result. As argued above, that methodology renders the algorithm originally deployed pointless and makes the whole exercise more obscure than ever.

Remember: unprincipled tinkering will never be allowed to justify significant interventions.

Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.

About the Author

Will Cappelli

Will studied math and philosophy at university, has been involved in the IT industry for over 30 years, and for most of his professional life has focused on both AI and IT operations management technology and practices. As an analyst at Gartner, he is widely credited with having been the first to define the AIOps market, and he recently joined Moogsoft as CTO, EMEA and VP of Product Strategy. In his spare time, he dabbles in ancient languages.
