
Technology for the OSS DSS Study Guide

'Tis been longer than intended, but we finally have the technology, time and resources to continue with our Open Source Solutions Decision Support System Study Guide (OSS DSS SG).

First, I want to thank SQLstream for allowing us to use SQLstream as a part of our solution. As mentioned in our "First DSS Study Guide" post, we were hoping to add a real-time component to our DSS. SQLstream is not open source, and not readily available for download. It is, however, a co-founder of and core contributor to the open source Eigenbase Project, and has incorporated Eigenbase technology into its product. So, what is SQLstream? To quote their web site, "SQLstream enables executives to make strategic decisions based on current data, in flight, from multiple, diverse sources". And that is why we are so interested in having SQLstream as a part of our DSS technology stack: to have the capability to capture and manipulate data as it is being generated.

Today, there are two very important classes of technologies that should belong to any DSS: data warehousing (DW) and business intelligence (BI). What actually comprises these technologies is still a matter of debate. To me, they are quite interrelated and provide the following capabilities.

  • The means of getting data from one or more sources to one or more target storage & analysis systems. Regardless of the details for the source(s) and the target(s), the traditional means in data warehousing is Extract from the source(s), Transform for consistency & correctness, and Load into the target(s), that is, ETL. Other means are also possible, such as data services within a service-oriented architecture (SOA), using either provider-consumer contracts & Web Services Description Language (WSDL) or representational state transfer (REST).
  • Active storage over the long term of historic and near-current data. Active storage as opposed to static storage, such as a tape archive. This storage should be optimized for reporting and analysis through both its logical and physical data models, and through the database architecture and technologies implemented. Today we're seeing an amazing surge of data storage and management innovation, with column-store relational database management systems (RDBMS), map-reduce (M-R), key-value stores (KVS) and more, especially hybrids of one or several old and new technologies. The innovation is coming so thick and fast that the terminology is even more confused than in the rest of the BI world. NoSQL has become a popular term for all non-RDBMS, and even some RDBMS like column-store. But even here, what once meant No Structured Query Language now is often defined as Not only Structured Query Language, as if SQL was the only way to create an RDBMS (can someone say Progress and its proprietary 4GL?).
  • Tools for reporting, including gathering the data, performing calculations, graphing (or perhaps more accurately, charting), formatting and disseminating.
  • Online Analytical Processing (OLAP) also known as "slice and dice", generally allowing forms of multi-dimensional or pivot analysis. Simply put, there are three underlying concepts for OLAP: the cube (a.k.a. hypercube, multi-dimensional database [MDDB] or OLAP engine), the measures (facts) & dimensions, and aggregation. OLAP provides much more flexibility than reporting, though the two often work hand-in-hand, especially for ad-hoc reporting and analysis.
  • Data Mining, including machine learning and the ability to discover correlations among disparate data sets.
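
To make the ETL and OLAP ideas above concrete, here is a minimal sketch in Python using only the standard library. It is not how any of the suites discussed in this post implement these capabilities. The sample data, the `sales` table, and the function names (`extract`, `transform`, `load`, `roll_up`, `slice_by`) are all hypothetical, chosen for illustration. It extracts rows from a CSV source, transforms them for consistency and correctness, loads them into an in-memory SQLite table, and then aggregates one measure (units) along two dimensions (region and product).

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, one bad measure.
RAW_CSV = """region,product,units
 West ,widget,10
west,gadget,5
East,widget,7
east,widget,n/a
East,gadget,3
"""


def extract(text):
    """Extract: read rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize dimension values, drop rows with a non-numeric measure."""
    clean = []
    for row in rows:
        units = row["units"].strip()
        if not units.isdigit():
            continue  # correctness: skip rows whose measure cannot be parsed
        region = row["region"].strip().title()    # consistency: "west" -> "West"
        product = row["product"].strip().lower()
        clean.append((region, product, int(units)))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def roll_up(conn, dimension):
    """Aggregate the measure along a single dimension."""
    if dimension not in ("region", "product"):
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(units) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query).fetchall()


def slice_by(conn, region):
    """Slice: fix one dimension value, then aggregate along the other."""
    query = (
        "SELECT product, SUM(units) FROM sales "
        "WHERE region = ? GROUP BY product ORDER BY product"
    )
    return conn.execute(query, (region,)).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

print(roll_up(conn, "region"))   # totals per region
print(roll_up(conn, "product"))  # totals per product
print(slice_by(conn, "West"))    # per-product totals within one region
```

A real OLAP engine such as Mondrian works from a schema of cubes, dimensions and measures rather than hand-written GROUP BY queries, but the underlying operation, aggregating measures along chosen dimensions, is the same.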

For our purposes, an important question is whether or not there are open source, or at least open source based, solutions for all of these capabilities. The answer is yes. As a matter of fact, there are three complete open source BI Suites [there were four, but the first, the Bee Project from the Czech Republic, written in Perl, is no longer being updated]. Here's a brief overview of SpagoBI, JasperSoft, and Pentaho.

Capability    | SpagoBI            | JasperSoft                   | Pentaho
--------------|--------------------|------------------------------|-------------
ETL           | Talend             | Talend, JasperETL            | KETTLE, PDI
Included DBMS | HSQLDB             | MySQL                        |
Reporting     | BIRT, JasperReport | JasperReports, iReports      | jFreeReports
Analyzer      | jPivot, PaloPivot  | JasperServer, JasperAnalysis | jPivot, PAT
OLAP          | Mondrian           | Mondrian                     | Mondrian
Data Mining   | Weka               | None                         | Weka

We'll be using Pentaho, but you can use any of these, or any combination of the OSS projects that are used by these BI Suites, or pick and choose from the more than 60 projects in our OSS Linkblog, as shown in the sidebar to this blog. All of the OSS BI Suites have many more features than shown in the simple table above. For example, SpagoBI has good tools for geographic & location services. Also, JasperSoft Professional and Enterprise Editions have many more features than their Community Edition, such as Ad Hoc Reporting and Dashboards. Pentaho's Enterprise Edition has a different analyzer than either jPivot or PAT: Pentaho Analyzer, based upon the SaaS ClearView from the now-defunct LucidEra. It also adds ease-of-use tools, such as an OLAP schæma designer, and enterprise class security and administration tools.

Data warehousing using a general purpose RDBMS such as Oracle, EnterpriseDB, PostgreSQL or MySQL is gradually giving way to analytic database management systems (ADBMS), or, as we mentioned above, the catch-all NoSQL data storage systems, or even hybrid systems. For example, Oracle recently introduced hybrid column-row store features, and Aster Data has a Massively Parallel Processing (MPP) DBMS|map-reduce hybrid [updated 20100616 per comment from Seth Grimes]. Pentaho supports Hadoop, as well as traditional general purpose RDBMS and column-store ADBMS. In the open source world, there are two columnar storage engines for MySQL, Infobright and Calpont InfiniDB, as well as one column-store ADBMS purpose-built for BI, LucidDB. We'll be using LucidDB, and just for fun, may throw some data into Hadoop.

In addition, a modern DSS needs two more primary capabilities. The first is predictives, sometimes called predictive intelligence or predictive analytics (PA): the ability to go beyond inference and trend analysis and assign a probability, with associated confidence, to the likelihood of an event occurring in the future. The second is full statistical analysis, which includes determining the probability density or distribution function that best describes the data. Of course, there are OSS projects for these as well, such as The R Project, the Apache Commons Math library, and other GNU projects that can be found in our Linkblog.

For statistical analysis and predictives, we'll be using the open source R statistical language and the open standard Predictive Model Markup Language (PMML), both of which are also supported by Pentaho.
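
As a rough illustration of what "determining a distribution that describes the data" and "assigning a probability to a future event" can mean, here is a minimal sketch. It is written in Python rather than R, uses only the standard library, and assumes that a normal distribution is a reasonable fit, which real data may well violate. The sample values and the `probability_above` helper are hypothetical.

```python
from statistics import NormalDist

# Hypothetical daily page-view counts; illustrative values only.
daily_views = [120, 135, 128, 142, 131, 125, 138, 129, 133, 127]

# Statistical analysis: estimate the parameters of a normal distribution
# (sample mean and sample standard deviation) from the observations.
model = NormalDist.from_samples(daily_views)


def probability_above(dist, threshold):
    """Probability, under the fitted distribution, of a value above threshold."""
    return 1.0 - dist.cdf(threshold)


print(f"fitted mean = {model.mean:.1f}, stdev = {model.stdev:.2f}")
print(f"P(views > 140) = {probability_above(model, 140):.3f}")
```

This sketch leaves out the harder parts: choosing among candidate distributions, and attaching a confidence to the estimate. Those are the kinds of tasks R is designed for.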

We have all of these OSS projects installed on a Red Hat Enterprise Linux machine. The trick will be to get them all working together. The magic will be in modeling and analyzing the data to support good decisions. There are several areas of decision making that we're considering as examples. One is fairly prosaic, one is very interesting and far-reaching, and the others are somewhat in between.

  1. A fairly simple example would be to take our blog statistics, a real-time stream using SQLstream's Twitter API, and run experiments to determine whether or not, and possibly how, Twitter affects traffic to and interaction with our blogs. Possibly, we could get to the point where we can predict how our use of Twitter will affect our blog.
  2. A much more far-reaching idea was presented to me by Ken Winnick, via Twitter, and has created an on-going Twitter conversation and hashtag, #BPgulfDB. Let's take crowdsourced, government, and other publicly available data about the recent oil spill in the Gulf of Mexico, and analyze it.
  3. Another idea is to take historical home utility usage plus current smart meter usage data, and create a real-time dashboard, and even predictives, for reducing and managing energy usage.
  4. We also have the opportunity of using public data to enhance reporting and analytics for small, rural and research hospitals.
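
For the first example, a natural starting point is to check whether daily tweet counts and daily blog visits move together. The sketch below computes a Pearson correlation coefficient in plain Python. The two series are made-up numbers for illustration, not measurements from our blogs, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical week of data: tweets sent and blog visits per day.
tweets_per_day = [0, 2, 5, 1, 3, 4, 0]
visits_per_day = [80, 95, 140, 90, 110, 120, 85]

r = pearson(tweets_per_day, visits_per_day)
print(f"Pearson r = {r:.2f}")
```

A high coefficient only shows that the two series rise and fall together. Deciding whether Twitter actually drives the traffic would take controlled experiments, for example varying tweet activity on purpose and comparing the results.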

Except where otherwise noted, this content is licensed under a Creative Commons License: Attribution, Non-Commercial, Share-Alike.


9 comments

Comment from: Seth Grimes [Visitor] · http://twitter.com/sethgrimes
Joseph, one small technical point: Aster Data's nCluster stores data by rows. The DBMS is not a column store.

Seth
06/16/10 @ 14:29
Comment from: Joseph A. di Paolantonio [Member] Email · http://press.teleinteractive.net/index.php/tiapress?author=4
Seth,

Thank you for correcting my misunderstanding. I had thought I heard the words "column-store" at the Aster Big Data Summit, but perhaps, it was my own internal filter, as when I read the following from their FAQ, and somehow saw column-store:
It provides a very strong data management layer - ANSI SQL interface, ACID transactions, Information Lifecycle Management, indexes, cost-based query optimizer, compression, security and other database features. It provides a very strong application processing framework - multiple language support, workload management, security, statistics collection, error logging and other application server features. It is architected to co-locate both data management and application processing as first-class citizens on the same infrastructure.
-- What is Aster Data nCluster?

I'll correct this in the main post.
06/16/10 @ 15:13
Hi Joseph, gr8 post! Thanks also for mentioning SpagoBI and its capabilities in terms of location intelligence. Just one thing about the table: SpagoBI also supports PaloPivot (aka JPalo web client) as an OLAP client.

Andrea Gioia
06/16/10 @ 17:16
Comment from: Joseph A. di Paolantonio [Member] Email · http://press.teleinteractive.net/index.php/tiapress?author=4
Andrea,

SpagoBI supports many more things than I can fit in the table. ;) Is there a better link than the one that I gave to list them all?
06/16/10 @ 18:03
Comment from: Seth Grimes [Visitor] · http://twitter.com/sethgrimes
As you know, you have to be careful with vendor claims. For instance, SAS described, at the Aster Big Data Summit in Washington DC in May, the ability to run a version of the SAS Data Step on Aster nodes. SAS described this as if it is shipping. In fact, it will not ship this year.
06/16/10 @ 20:07
Comment from: Vladislav Malicevic [Visitor] Email · http://www.jedox.com
Hi Joseph!
Did you have a look at our Palo Suite? Plays well with all of the above and is OS too - http://www.jedox.com/en/products/Palo-Suite.html

Regards,
Vlado
06/17/10 @ 01:43
Joseph,
for a complete and up-to-date list of all the analytical engines supported by SpagoBI you can look here:

http://www.spagoworld.org/xwiki/bin/view/SpagoBI/TheSuite

Andrea
06/17/10 @ 02:48
Comment from: Joseph A. di Paolantonio [Member] Email · http://press.teleinteractive.net/index.php/tiapress?author=4
Vladislav,

We've been aware of the OSS Palo for Excel for some time, but haven't kept up with Jedox's other OSS for BI. We'll add these products to our OSSLinks, shown in the sidebar to this blog.

Thank you for taking the time to update us.
06/17/10 @ 16:11
Comment from: Joseph A. di Paolantonio [Member] Email · http://press.teleinteractive.net/index.php/tiapress?author=4
Andrea,

Thank you for the more comprehensive link.
06/17/10 @ 16:11


The Open Source Solutions Blog is a companion to the Open Source Solutions for Business Intelligence Research Project, sponsored by InterActive Systems & Consulting, Inc. This Blog, a Wiki and Lens will be used to develop, support and publish the findings of our research into enterprise open source projects.

InterActive Systems & Consulting, Inc. (IASC) performs research in the areas of data analytics, collaboration and remote access.

InterASC Professional Services, a service mark of IASC, provides strategic consulting and project management for data warehousing, business intelligence and collaboration projects using proprietary and open source solutions. We formulate vendor-independent strategies and implement solutions for information management in an increasingly complex and distributed business environment, allowing secure data analysis and collaboration that provides enterprise information in the most valuable form to the right person, whenever and wherever needed.

TeleInterActive Networks, a service mark of IASC, hosts open source applications for small and medium enterprises including CMS, blogs, wikis, database applications, portals and mobile access. We provide the tools for SME to put their customer at the center of their business, and leverage information management in a way previously reserved for larger organizations.
