## Category: "Predictions"

## BayAreaUseR October Special Event

By JAdP on October 8th, 2010

In Open Source, software, Data Mining, Agile, Predictions, Data Science, Big Data

Zhou Yu organized a great special event for the San Francisco Bay Area Use R group, and has asked me to post the slide decks for download. Here they are:

- RedR presented by Anup Parikh with contributions from Kyle Covington
- Cross-Channel by R presented by Zhou Yu
- Use of R DataFrames: An Illustrations with Chinese Dialects Data presented by Norm Matloff
- Update 20101021: Distribution of Proton Dissociation Constants for Model Humic and Fulvic Acid Molecules Before and After R presented by Yasemin B. Atalay

No longer missing is the very interesting presentation by Yasemin Atalay showing the difference in plotting analysis using the Windermere Humic Aqueous Model for river water environmental factors, without using R and then the increased in variety and accuracy of analysis and plotting gained by using R.

## R the next Big Thing or Not

By JAdP on April 20th, 2010

In Computers and Internet, Predictions, OSS DSS Study Guide

Recently, AnnMaria De Mars, PhD (multiple) and Dr. Peter Flom, PhD have stirred up a bit of a tempest in a tweet-pot, as well as in the statistical blogosphere, with comparisons of R and SAS, IBM/SPSS and the like. I've commented on both of their blogs, but decided to expand a bit here, as the choice of R is something that we planned to cover in a later post to our Open Source Solutions Decision Support Systems Study Guide. First, let me say that Dr. De Mars and Dr. Flom appear to have posted completely independently of each other, and further, that their posts have different goals.

In The Next Big Thing, Dr. De Mars is looking for the **next big thing**, both to keep her own career on-track, and to guide students into areas of study that will be survive in the job market in the coming decades. This is always difficult for mentors, as we can't always anticipate the "black swan" events that might change things drastically. The tempestuous nature of her post came from one little sentence:

Contrary to what some people seem to think, R is definitely not the next big thing, either. -- AnnMaria De Mars, The Next Big Thing, AnnMaria's Blog

In SAS vs. R, Introduction and Request, Dr. Flom starts a series comparing R and SAS from the standpoint of a statistician deciding upon tools to use.

There are several threads in Dr. De Mars post. I agree with Dr. De Mars that two of the "next big things" in data management & analysis are data visualization and dealing with unstructured data. I'm of the opinion that there is a third area, related to the "Internet of Things" and the tsunami of data that will be generated by it. These are conceptual areas, however. Dr. De Mars quickly moves on to discussing the tools that might be a part of the solutions of these next big things. The concepts cited are neither software packages nor computing languages. The software packages SAS, IBM/SPSS, Stata, Pentaho and the like, and the computing language S, with its open source distribution R, and its proprietary distribution S+ are none likely to be the next big things, as they are currently useful tools to know.

I find it interesting that both Dr. De Mars and Dr. Flom, as well as the various commenters, tweeters, and other posters, are comparing software suites and applications with a computing language. I think that a bit more historical perspective might be needed in bringing these threads together.

In 1979, when I first sat down with a FORTRAN programmer to turn my Bayesian methodologies into practical applications to determine the reliability and risk associated with the STAR48 kick motor and associated Payload Assist Module (PAM), the statistical libraries for FORTRAN seemed amazing. The ease with which we were able to create the program and churn through decades of NASA data (after buying a 1MB memory box for the mainframe) was wondrous

Today, there's not so much wonder from such a feat. The evolution of computing has drastically affected the way in which we apply mathematics and statistics today. Several of the comments to these posts argue both sides of the statement that anyone doing statistics today should be a programmer, or shouldn't. It's an interesting argument, that I've also seen reflected in chemistry, as fewer technicians are used in the lab, and the Ph.D.s work directly with the robots to prepare the samples and interpret the results.

Approximately 15 years ago, I moved from solving scientific and engineering problems directly with statistics, to solving business problems through vendor's software suites. The marketing names for this endeavor have gone through several changes: Decision Support Systems, Very Large Databases, Data Warehousing, Data Marts, Corporate Information Factory, Business Intelligence, and the like. Today, Data Mining, Data Visualization, Sentiment Analysis, "Big Data", SQL Streaming, and similar buzzwords reflect the new "big thing". Software applications, from new as well as established vendors, both open source and proprietary, are coming to the fore to handle these new areas that represent real problems.

So, one question to answer for students, is which, if any, of these software packages will best survive with and aid the growth of, their maturing careers. Will Tableau, LyzaSoft, QlikView or Viney@rd be in a better spot in 20 years, through growth or acquisition, than SAS or IBM/SPSS? Will the open source movement take down the proprietary vendors or be subsumed by them? Is Pentaho/Weka the BI & data mining solution for their career? Maybe, maybe not. But what about that other beast of which everyone speaks? Namely, R, the r-project, the R Statistical Language. What is it? Is it a worthy alternative to SAS or IBM/SPSS or Pentaho/Weka? Or is it a different genus altogether? That's a question I've been seeking to answer for myself, in my own career evolution. After 15 years, software such as SAP/Business Objects and IBM/Cognos, haven't evolved into anything that I like, with their pinnacle of statistical computation being the "average", the arithmetic mean. SAS and IBM/SPSS are certainly better, and with data mining, machine learning and predictives becoming important to business, certainly likely to be a good choice for the future. But are they really powerful enough? Are they flexible enough? Can they be used to solve the next generation of data problems? They're very likely to evolve into software that can do so. But how quickly? And like all vendor software, they have limitations based upon the market studies and business decisions of the corporation.

How is R different?

Well, first, R is a computing language. Unlike SAP/Business Objects, IBM/Cognos, IBM/SPSS, SAS, Pentaho, JasperSoft, SpagoBI, or Oracle, it's not a company, nor a BI Suite, nor even a collection of software applications. Second, R is an open source project. It's an open source implementation of S. Like C, and the other single letter named languages, S came out of Bell Labs, and in the case of R, in 1976. The open source implementation, R comes from R. Ihaka and R. Gentleman, first revealed in 1996 through the article, R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, and is often associated with the Department of Statistics, University of Auckland.

While I'm not a software engineer, R is a very compelling statistical tool. As a language, it's very intuitive… for a statistician. It's an interactive, interpretive, functional, object oriented, statistical programming language. R itself is written in R, C, C++ and FORTRAN; it's powerful. As an open source project, it has attracted thousands upon thousands of users who have formed a strong community. There are thousands upon thousands of community contributed packages for R. It's flexible, and growing. One of the main goals of R was data visualization, and it has a wonderful new package for data visualization in ggplot2. It's ahead of the curve. There are packages for parallel processing (some quite specific), for big data beyond in-memory capacity, for servers, and for embedding in a web site. Get the idea? If you think you need something in R, search CRAN, RForge, BioConductor or Omegahat.

As you can tell, I like R. However, in all honesty, I don't think that the SAS vs. R controversy is an either/or situation. SAS, IBM/SPSS and Pentaho complement R and vice-versa. Pentaho, IBM/SPSS and some SAS products support R. R can read data from SAS, IBM/SPSS , relational databases, Excel, mapReduce and more. The real question isn't is one tool better than another but rather selecting the best tool to answer a particular question. That being said, I'm looking forward to Dr. Flom's comparison, as well as the continuing discussion on Dr. De Mars' blog.

For us, the question is building a decision support system or stack from open source components. It looks like we'll have a good time doing so.

## Questions and Commonality

By JAdP on March 19th, 2010

In books, Open Source, Business Intelligence, Predictions, OSS DSS Study Guide

In the introduction to our open source solutions (OSS) for decision support systems (DSS) study guide (SG), I gave a variety of examples of activities that might be considered using a DSS. I asked some questions as to what common elements exist among these activities that might help us to define a modern platform for DSS, and whether or not we could build such a system using open source solutions.

In this post, let's examine the first of those questions, and see if we can start answering those questions. In the next post, we will lay out a syllabus of sorts for this OSS DSS SG.

The first common element is that in all cases, we have an individual doing the activity, not a machine nor a committee.

Secondly, the individual has some resources at their disposal. Those resources include current and historical information, structured and unstructured data, communiqués and opinions, and some amount of personal experience, augmented by the experience of others.

Thirdly, though not explicit, there's the idea of digesting these resources and performing formal or informal analyses.

Fourthly, though again, not explicit, the concept of trying to predict what might happen next, or as a result of the decision is inherent to all of the examples.

Finally, there's collaboration involved. Few of us can make good decisions in a vacuum.

Of course, since the examples are fictional, and created by us, they represent our biases. If you had fingered our domain server back in 1993, or read our .project and .plan files from that time, you would have seen that we were interested in sharing information and analyses, while providing a framework for making decisions using such tools as email, gopher and electronic bulletin boards. So, if you identify any other commonalities, or think anything is missing, please join the discussion in the comments.

From these commonalities, can we begin to answer the first question we had asked: "What does this term [DSS] really mean?". Let's try.

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.

That's nice and vague; generic enough to almost meaningless, but provides some key points that will help us to bound the specifics as we go along. For example, if a process or technology doesn't help us to make a better decision, than it doesn't fit. If something allows us to make a better decision, but we can't define the process or identify the technology involved, it doesn't belong (e.g. "my gut tells me so").

Let's create a list from all of the above.

- Individual Decision Maker
- Process
- Technology
- Structured Data
- Unstructured Data
- Historical Information
- Current Information
- Communication
- Opinion
- Collaboration
- Analysis
- Prediction
- Personal Experience
- Other's Experience

What do you think? Does a modern system to support decisions need to cover all of these elements and no others? Is this list complete and sufficient? The comments are open.

## First DSS Study Guide

By JAdP on March 15th, 2010

In Business Intelligence, Predictions, OSS DSS Study Guide

Someone sitting in their study, looking at their books, journals, piles of scholarly periodicals and files of correspondence with learned colleagues probably didn't think that they were looking at their decision support system, but they were.

Someone sitting on the plains, looking at the conditions around them, smoke signals from distant tribe members, records knotted into a string, probably didn't think that they were looking at their decision support system, but they were.

Someone at the nexus of a modern military command, control, communications, computing and intelligence system, probably didn't think that they were looking at their decision support system, but they were.

Someone pulling data from transactional systems, and dumping the results of reports & analyses from BI tool into a spreadsheet to feed a dashboard for the executives of a huge corporation probably didn't think that they were looking at their decision support system, but they were.

The term "decision support system" has been in use for over 50 years, perhaps longer.

- But what does this term really mean?
- What do all of my examples have in common?
- How can we build a reasonable decision support system from open source solutions?
- What resources exist to help us learn?

I'm starting a series of posts, essentially a "study guide" to help answer these questions.

I'll be drawing from and pointing to the following books and online resources as we install, configure and use open source systems to create a technical platform for a decision support system.

- Bayesian Computation in R by Jim Albert, Springer Series in UseR!, ISBN: 0-38-792297-0, Purchase from Amazon, you can also purchase the Kindle ebook from Amazon
- R in a Nutshell by Joseph Adler, ISBN: 0-59-68017-0X, Purchase from Amazon
- Pentaho Solutions; Business Intelligence and Data Warehousing with Pentaho and MySQL, by Roland Bouman and Jos van Dongen, ISBN: 0-47-048432-2, Purchase from Amazon
- Pentaho Reporting 3.5 for Java Developers by Will Gorman, ISBN: 1-84-719319-6, Purchase from Amazon
- Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration by Matt Casters, Roland Bouman & Jos van Dongen, ISBN: 0-47-063517-7 due 2010 September, Pre-Order from Amazon
- Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank, Second Edition, Morgan-Kaufmann Series in Data Management Systems, ISBN: 0-12-088407-0 a.k.a. "The Weka Book", Purchase from Amazon, Pre-Order the Third Edition, you can also purchase the Kindle ebook from Amazon
- LucidDB online documentation
- Pertinent information from Eigenbase
- LudidDB mailing list archive on Nabble
- Anything I can find on PAT
- Pentaho Community Forums, Wiki, WebEx Events, and other community sources
- R Mailing Lists and Forums
- Various Books in PDF from The R Project
- Information Management and Open Source Solution Blogs from our side-column linkblogs

In this study guide series of posts:

- I'll show how the datawarehousing (DW) and business intelligence (BI) can be extended to include all the elements held in common from my DSS examples.
- We'll examine the open source solutions Pentaho, R, Rserve, Rapache, LucidDB and possibly Map-Reduce & Key-value-stores, and the related open source projects, communities and companies in terms of how they can be used to create a DSS.
- I would like to add a collaboration tool to the mix, as we do in our implementation projects, possibly Mindtouch, a ReSTful Wiki Platform.
- We may add one non-open source package, SQLStream, that's built upon open source elements from Eigenbase. This will allow us to add a real-time component to our DSS.
- I'll give my own experience in installing these packages and getting them to work together, with pointers to the resources listed above.
- We'll explore sample and public data sets with the DSS environment we created, again with pointers to and help from the resources listed.

The purpose of this series of posts is a study guide, not an online book written as a blog. The goal is to help us to define a modern DSS and build it out of open source solutions, while using existing resources.

Please feel free to comment, especially if there is anything that you feel should be included beyond what I've outlined here.

## Modeling and Predictives

By JAdP on February 9th, 2010

In Predictions, OSS DSS Study Guide

Here's a personal perspective and a bit of a personal history regarding mathematical modeling and predictives.

The 1980s were an exciting time for mathematical modeling of complex systems. At the time, there were two basic types of modeling: deterministic and stochastic (probability or statistics models). Within stochastic modeling, traditional statistics vs. Bayesian statistics was a burgeoning battleground. Physical simulations (often based upon deterministic models) were giving way to computer simulations (often based upon stochastic models, especially Monte Carlo Simulations). Two theories were popularized during this time: catastrophe theory and chaos theory; ultimately though, both of these theories proved incapable of prediction - the hallmark of a good mathematical model. A different type of modeling technique, based upon relational algebra, was also moving from the theoretical work of Ted Codd, to the practical implementations at (the company now known as) Oracle: data modeling.

Mathematical models are attempts to understand the complex by making simplifying assumptions. They are always a balance between complexity and accuracy. One nice example of the evolution of a deterministic mathematical model can be found in the Ideal Gas Laws, starting with Boyle's Law to Charles' Law to Gay-Lussac's Law to Avogadro's Law, culminating in the Ideal Gas Law, which all of saw in high school chemistry: PV=nRT.

Mathematical models are used in pretty much all fields of endeavor: physical sciences, all types of engineering, behavioral studies, and business. In the 1970's, I used deterministic electrochemical models to understand and predict the behaviour of various chemical stoichiometry for fuel cells and photovoltaic cells. In the 1980's, I used Bayesian statistics, sometimes combined with Monte Carlo Simulations to predict the reliability and risk associated with complex aerospace, utility and other systems.

The most popular use of Bayesian statistics was to expand the *a priori* knowledge of a complex system with subjective opinions. Likely the most famous application of Bayesian Statistics, at the time I became involved with the branch, was the Rand Corporation's Delphi Method. There was actually a joke in the Aerospace Industry about the Delphi Method:

A team of Rand consultants went to Werner von Braun to seek the expert opinion of the engineers working on a new rocket motor. The consultants explained their Delphi Method thusly. Prior to the first static test of the new rocket motor, they would ask, separately, each of the five engineers working on the new design their opinion of the rocket's reliability. Their opinions would form the Bayesian

a prioridistribution. After the test, they would reveal the results of the first survey and the test results, and ask the five engineers, collectively, their new opinion of the rocket's reliability. This would form the Bayesiana posteriori, from which the rocket's reliability would be predicted. Doctor von Braun said that he could save them some time. He gathered his team of rocket engineers, and asked them if they thought that the new rocket motor would fail. Each answered, as did Doctor Von Braun, "no" in German. "There, you see, five nines reliability, as specified." declared the good Doctor to the Rand consultants, "No need for any further study on your part."

Yep, it's a side splitter.

I didn't like this method, and did things a bit differently. My method involved gathering all the data for similar test and production models, weighting each relevant engineering variable, creating the *a priori*, fitting with Weibull Analysis, designing the Bayesian mathematical conjugate, using a detailed post-mortem of the first and subsequent tests of the system being analyzed, updating and learning as we went, to finally predict the reliability and risk for the system. I first used this on the Star48 perigee kick motor, and went on to refine and use this method for:

- a variety of apogee and perigee kick motors
- several components of the Space Transportation System
- the Extreme Ultraviolet Explorer
- Gravity Probe-B
- a halogen lamp pizza oven
- a methodology for failure mode, effects and criticality analysis of the electrical grid
- and many more systems

This was, essentially, a combination of empirical Bayes and Bayesian weighted variables. Several of my projects resulted in software programs, all in FORTRAN. The first was used as a justification for a 1 MB [no, not a mistake] "box" [memory] for the corporate mainframe. NASA had sent us detailed data on over 4,000 solid propellant rocket motors. Talk about "big data". I had a lot of fun doing this into the 1990's.

The next paradigm shift, for me personally, was learning data modeling, and focusing on business processes rather than engineering systems. Spending time at Oracle, including Richard Barker and his computer aided system engineering methods, I felt right at home. Rather than Bayesian Statistics, I would be using relational algebra and calculus for deterministic mathematical models of the data for the business processes being stored in a relational database management system. I very quickly got involved in very large databases, decision support systems, data warehousing and business intelligence.

I was surprised, and, after 17 years, continue to be surprised, how few data modelers agree with the statement in the preceding paragraph. I'm surprised how few data modelers go beyond entity-relationship diagrams; how few know or care about relational algebra and relational calculus. I'm amazed how few people realize that the arithmetic average computed in most "analytic" systems is a fairly useless measure of the underlying data, for most systems. I'm amazed that BI and analytic systems are still deterministic, and always go with simplicity over accuracy.

But computer power continues to expand. Moore's Law still rules. We can do better now. Things that used to take powerful main frames or even supercomputers can be done on laptops now. We no longer need to settle for simplicity over accuracy.

More importantly, the R Statistical Language has matured. Literally thousands and thousands of mathematical, graphical and statistical packages have been added to the CRAN, Omegahat and BioConductor repositories. Even the New York Times has published pieces about R.

It's once again time to move from deterministic to stochastic models.

Over the next few weeks, I hope to post a series of "study guides" that will focus on setting up a web-based environment consolidating SQL and MDX based analytics, as expressed in Pentaho and LucidDB open source projects, with R, and possibly SQLStream. Updated 20100314 to correct links (typos). Thanks to Doug Moran of Pentaho for catching this.

There have been many articles as well on "Big Data". As I commented on Merv Adrian's blog post request for "Ideas for SF Big Data Summit":

One area of discussion, which may appear to be for the “newbies” but is actually a matter of some debate, would be the definition of “big data”.

It really isn’t about the amount of data (TB & PB & more) so much as it is about the volumetric flow and timeliness of the data streams.

It’s about how data management systems handle the various sources of data as well as the interweaving of those sources.

It means treating data management systems in the same way that we treat the Space Transportation System, as a very large, complex system.

-- Comment by Joseph A. di Paolantonio, February 1, 2010 at 4:09 pm

I believe this because there is a huge amount of data about to come down the pipe. I'm not talking about the Semantic Web or the pidly little petabytes of web log and click-through data. I'm talking about the instrumented world. Something that's been in the making for ten years, and more: RFID, SmartDust, ZigBee, and more wired and wireless sensors, monitors and devices that will become a part of everything, everywhere.

Let me just cite two examples from something that is coming, is hyped, but not yet standardized, even if solid attempts at definition are being made: the SmartGrid. First, consider the fact that utility companies are distributing and using smart meters to replace manually read mechanical meters at homes and businesses; this will result in thousands of data points per day as opposed to one per month **PER METER**. The second is EPRI's copper-riding robot, as explained in a recent Popular Science. Think of the petabytes of data that these two examples will generate monthly. [Order the Smart Grid Dictionary: First Edition on Amazon]

The desire, the need, to analyze and make inferences from this data will be great. The need to actually predict from this data will be even greater, and will be a necessary element of the coming SmartGrid, and in making the instrumented world a better world for all of humanity.