## Modeling and Predictives

By JAdP on February 9th, 2010

In Predictions, OSS DSS Study Guide

Here's a personal perspective and a bit of a personal history regarding mathematical modeling and predictives.

The 1980s were an exciting time for mathematical modeling of complex systems. At the time, there were two basic types of modeling: deterministic and stochastic (probability or statistics models). Within stochastic modeling, traditional statistics vs. Bayesian statistics was a burgeoning battleground. Physical simulations (often based upon deterministic models) were giving way to computer simulations (often based upon stochastic models, especially Monte Carlo Simulations). Two theories were popularized during this time: catastrophe theory and chaos theory; ultimately though, both of these theories proved incapable of prediction - the hallmark of a good mathematical model. A different type of modeling technique, based upon relational algebra, was also moving from the theoretical work of Ted Codd, to the practical implementations at (the company now known as) Oracle: data modeling.

Mathematical models are attempts to understand the complex by making simplifying assumptions. They are always a balance between complexity and accuracy. One nice example of the evolution of a deterministic mathematical model can be found in the gas laws, progressing from Boyle's Law through Charles' Law, Gay-Lussac's Law and Avogadro's Law, and culminating in the Ideal Gas Law, which all of us saw in high school chemistry: PV=nRT.
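As a compact reminder of how those special cases fit together, here are the standard textbook relations. The constants $k_1$ through $k_4$ are labels introduced here purely for illustration; each law holds only while the quantities named in parentheses are held fixed.

$$
\begin{aligned}
\text{Boyle (fixed } n, T\text{):} \quad & PV = k_1 \\
\text{Charles (fixed } n, P\text{):} \quad & \frac{V}{T} = k_2 \\
\text{Gay-Lussac (fixed } n, V\text{):} \quad & \frac{P}{T} = k_3 \\
\text{Avogadro (fixed } P, T\text{):} \quad & \frac{V}{n} = k_4 \\
\text{Combined:} \quad & PV = nRT
\end{aligned}
$$

Each earlier law is recovered from $PV = nRT$ by freezing two of the four variables, which is the sense in which the simpler models were stepping stones to the more general one.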

Mathematical models are used in pretty much all fields of endeavor: physical sciences, all types of engineering, behavioral studies, and business. In the 1970's, I used deterministic electrochemical models to understand and predict the behaviour of various chemical stoichiometries for fuel cells and photovoltaic cells. In the 1980's, I used Bayesian statistics, sometimes combined with Monte Carlo Simulations, to predict the reliability and risk associated with complex aerospace, utility and other systems.

The most popular use of Bayesian statistics was to expand the *a priori* knowledge of a complex system with subjective opinions. Likely the most famous application of Bayesian Statistics, at the time I became involved with the branch, was the Rand Corporation's Delphi Method. There was actually a joke in the Aerospace Industry about the Delphi Method:

A team of Rand consultants went to Wernher von Braun to seek the expert opinion of the engineers working on a new rocket motor. The consultants explained their Delphi Method thusly. Prior to the first static test of the new rocket motor, they would ask, separately, each of the five engineers working on the new design their opinion of the rocket's reliability. Their opinions would form the Bayesian *a priori* distribution. After the test, they would reveal the results of the first survey and the test results, and ask the five engineers, collectively, their new opinion of the rocket's reliability. This would form the Bayesian *a posteriori*, from which the rocket's reliability would be predicted. Doctor von Braun said that he could save them some time. He gathered his team of rocket engineers, and asked them if they thought that the new rocket motor would fail. Each answered, as did Doctor von Braun, "no" in German. "There, you see, five nines reliability, as specified," declared the good Doctor to the Rand consultants. "No need for any further study on your part."

Yep, it's a side splitter.

I didn't like this method, and did things a bit differently. My method involved gathering all the data for similar test and production models, weighting each relevant engineering variable, creating the *a priori*, fitting with Weibull Analysis, designing the Bayesian mathematical conjugate, using a detailed post-mortem of the first and subsequent tests of the system being analyzed, updating and learning as we went, to finally predict the reliability and risk for the system. I first used this on the Star48 perigee kick motor, and went on to refine and use this method for:

- a variety of apogee and perigee kick motors
- several components of the Space Transportation System
- the Extreme Ultraviolet Explorer
- Gravity Probe-B
- a halogen lamp pizza oven
- a methodology for failure mode, effects and criticality analysis of the electrical grid
- and many more systems
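To make the "prior from similar systems, then update on each test" idea concrete, here is a minimal sketch in Python. It is not the method described above: it swaps in the simplest conjugate pair (a Beta prior with pass/fail test data) in place of the Weibull-based analysis, and the class name, the `weight` parameter and all of the numbers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPrior:
    """Beta(alpha, beta) belief about the per-firing success probability."""
    alpha: float
    beta: float

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, successes: int, failures: int) -> "BetaPrior":
        # Conjugacy: a Beta prior combined with binomial (pass/fail) data
        # yields a Beta posterior with the counts simply added on.
        return BetaPrior(self.alpha + successes, self.beta + failures)


def prior_from_history(successes: int, failures: int, weight: float) -> BetaPrior:
    """Build a prior from similar systems, discounted by a relevance weight in (0, 1]."""
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must be in (0, 1]")
    # Start from a uniform Beta(1, 1) and add down-weighted pseudo-counts.
    return BetaPrior(1.0 + weight * successes, 1.0 + weight * failures)


# Hypothetical numbers, for illustration only.
prior = prior_from_history(successes=95, failures=5, weight=0.2)
posterior = prior.update(successes=3, failures=0)  # three clean static tests

print(f"prior mean reliability:     {prior.mean():.4f}")
print(f"posterior mean reliability: {posterior.mean():.4f}")
```

The relevance weight is a crude stand-in for weighting each engineering variable: history from a closely related design would get a weight near 1, while a distant cousin would contribute far fewer pseudo-counts.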

I started to call this method "objective Bayes", but that name was already taken by a branch of Bayesian statistics that uses a non-informative *a priori*. Several of my projects resulted in software programs, all in FORTRAN. The first was used as a justification for a 1 MB [no, not a mistake] "box" [memory] for the corporate mainframe. NASA had sent us detailed data on over 4,000 solid propellant rocket motors. Talk about "big data". I had a lot of fun doing this into the 1990's.

The next paradigm shift, for me personally, was learning data modeling, and focusing on business processes rather than engineering systems. Spending time at Oracle, including time with Richard Barker and his computer-aided systems engineering methods, I felt right at home. Rather than Bayesian statistics, I would be using relational algebra and calculus for deterministic mathematical models of the data for the business processes being stored in a relational database management system. I very quickly got involved in very large databases, decision support systems, data warehousing and business intelligence.

I was surprised, and, after 17 years, continue to be surprised, how few data modelers agree with the statement in the preceding paragraph. I'm surprised how few data modelers go beyond entity-relationship diagrams; how few know or care about relational algebra and relational calculus. I'm amazed how few people realize that the arithmetic average computed in most "analytic" systems is a fairly useless measure of the underlying data, for most systems. I'm amazed that BI and analytic systems are still deterministic, and always go with simplicity over accuracy.
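A small sketch of why the arithmetic average can mislead on skewed data. The order values below are made up for illustration; the point is only that one large value drags the mean far from what a typical record looks like, while the median stays put.

```python
import statistics

# Hypothetical order values: seven routine orders and one very large one.
order_values = [12, 15, 14, 13, 16, 15, 14, 950]

mean_value = statistics.mean(order_values)
median_value = statistics.median(order_values)

print(f"mean:   {mean_value}")    # pulled upward by the single large order
print(f"median: {median_value}")  # close to what a typical order looks like
```

Here the mean is 131.125 while the median is 14.5, so a dashboard reporting only the average would describe an order size that none of the eight orders comes close to.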

But computer power continues to expand. Moore's Law still rules. We can do better now. Things that used to take powerful mainframes or even supercomputers can be done on laptops now. We no longer need to settle for simplicity over accuracy.

More importantly, the R Statistical Language has matured. Literally thousands and thousands of mathematical, graphical and statistical packages have been added to the CRAN, Omegahat and BioConductor repositories. Even the New York Times has published pieces about R.

It's once again time to move from deterministic to stochastic models.

Over the next few weeks, I hope to post a series of "study guides" that will focus on setting up a web-based environment consolidating SQL and MDX based analytics, as expressed in Pentaho and LucidDB open source projects, with R, and possibly SQLStream.

Updated 20100314 to correct links (typos). Thanks to Doug Moran of Pentaho for catching this.

There have been many articles as well on "Big Data". As I commented on Merv Adrian's blog post request for "Ideas for SF Big Data Summit":

One area of discussion, which may appear to be for the “newbies” but is actually a matter of some debate, would be the definition of “big data”.

It really isn’t about the amount of data (TB & PB & more) so much as it is about the volumetric flow and timeliness of the data streams.

It’s about how data management systems handle the various sources of data as well as the interweaving of those sources.

It means treating data management systems in the same way that we treat the Space Transportation System, as a very large, complex system.

-- Comment by Joseph A. di Paolantonio, February 1, 2010 at 4:09 pm

I believe this because there is a huge amount of data about to come down the pipe. I'm not talking about the Semantic Web or the piddly little petabytes of web log and click-through data. I'm talking about the instrumented world. Something that's been in the making for ten years, and more: RFID, SmartDust, ZigBee, and more wired and wireless sensors, monitors and devices that will become a part of everything, everywhere.

Let me just cite two examples from something that is coming, is hyped, but not yet standardized, even if solid attempts at definition are being made: the SmartGrid. First, consider the fact that utility companies are distributing and using smart meters to replace manually read mechanical meters at homes and businesses; this will result in thousands of data points per day as opposed to one per month **PER METER**. The second is EPRI's copper-riding robot, as explained in a recent Popular Science. Think of the petabytes of data that these two examples will generate monthly.

The desire, the need, to analyze and make inferences from this data will be great. The need to actually predict from this data will be even greater, and will be a necessary element of the coming SmartGrid, and in making the instrumented world a better world for all of humanity.