Category: "Computers and Internet"

Reading Pentaho Kettle Solutions

On a rainy day, there's nothing better than to be sitting by the stove, stirring a big kettle with a finely turned spoon. I might be cooking up a nice meal of Abruzzo Maccheroni alla Chitarra con Polpettine, but actually, I'm reading the ebook edition of Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration on my iPhone.

Some of my notes made while reading Pentaho Kettle Solutions:

…45% of all ETL is still done by hand-coded programs/scripts… made sense when… tools have 6-figure price tags… Actually, some extractions and many transformations can't be done natively in high-priced tools like Informatica and Ab Initio.

Jobs, transformations, steps and hops are the basic building blocks of KETTLE processes.

It's great to see the Agile Manifesto quoted at the beginning of the discussion of AgileBI.

Search Terms for Data Management & Analytics

Recently, for a prospective customer, I created a list of some search terms to provide them with some "late night" reading on data management & analytics. I've tried these terms out on Google, and as suspected, for most, the first hit is for Wikipedia. While most articles in Wikipedia need to be taken with a grain of salt, they will give you a good overview. [By the way, I use the "Talk" page on the articles to see the discussion and arguments about the article's content as an indicator of how big a grain of salt is needed for that article] ;) So plug these into your favorite search engine, and happy reading.

  • Reporting - top two hits on Google are Wikipedia, and, interestingly, Pentaho
  • Ad-hoc reporting
  • OLAP - one of the first page hits is for the blog of Julian Hyde, creator of the open source OLAP engine Mondrian, as well as of the real-time analytics engine SQLstream
  • Enterprise dashboard - interestingly, Wikipedia doesn't come up in the top hits for this term on Google, so here's a link for Wikipedia: http://en.wikipedia.org/wiki/Dashboards_(management_information_systems)
  • Analytics - isn't very useful as a search term, but the product page from SAS gives a nice overview
  • Advanced Analytics - is mostly marketing buzz, so be wary of anything that you find using this as a search term

Often, Data Mining, Machine Learning and Predictives are used interchangeably. This isn't really correct, as you can see from the following five search terms…

  • Data Mining
  • Machine Learning
  • Predictive Analytics
  • Predictive Intelligence - is an earlier term for Predictives that has mostly been supplanted by Predictive Analytics. I actually prefer just "Predictives".
  • PMML - Predictive Model Markup Language - is a way of transporting predictive models from one software package to another. Few packages will both export and import PMML; the lack of that capability can lock you into a solution, making it expensive to change vendors. The first hit for PMML on Google today is the Data Mining Group, which is a great resource. One company listed, Zementis, is a start-up that is becoming a leader in running data mining and predictive models that have been created anywhere (see the R sketch following this list)
  • R - the R statistical language, is difficult to search on Google. Go to http://www.r-project.org/ and http://www.rseek.org/ … instead. R is useful for writing applications for any type of statistical analysis, and is invaluable for creating new algorithms and predictive models
  • ETL - Extract, Transform & Load, is the most common way of getting information from source systems to analytic systems
  • ReSTful Web Services - Representational State Transfer - can expose data as a web service using the four verbs of the web: GET, POST, PUT and DELETE
  • SOA
  • ADBMS - Analytic Database Management Systems doesn't work well as a search term. Start with the Eigenbase.org site and follow the links from the Eigenbase subproject, LucidDB. Also, check out AsterData
  • Bayes - The Reverend Thomas Bayes came up with this interesting approach to statistical analysis in the 1700s. I first created Bayesian statistical methods and algorithms for predicting the reliability and risk associated with solid propellant rockets. You'll find good articles using Bayes as a search term in Google. A somewhat denser article can be found at http://www.scholarpedia.org/article/Bayesian_statistics and some interesting research using Bayes can be found at Andrew Gelman's Blog. You're likely familiar with one common Bayesian algorithm, naïve Bayes, which is used by most anti-spam email programs (also shown in the sketch following this list). Other forms are objective Bayes, with non-informative priors, and the original subjective Bayes. I have an old aerospace joke about the Rand Corporation's Delphi method, based on subjective Bayes :-) I created my own methodology, and don't really care for either naïve Bayes or non-informative priors.
  • Sentiment Analysis - which is one of Seth Grimes' current areas of research
  • Decision Support Systems - in addition to searching on Google, you might find my recent OSS DSS Study Guide of interest
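
As promised in the PMML and Bayes entries above, here's a minimal sketch in R tying the two together: it trains the sort of naïve Bayes classifier that anti-spam programs use, then exports a fitted model as PMML. This is just an illustration under assumptions: the e1071, rpart, pmml and XML packages from CRAN, R's built-in iris data standing in for real data, and an export call whose exact form varies with the pmml package version.

    library(e1071)  # provides naiveBayes()
    library(rpart)  # provides a decision tree model to export
    library(pmml)   # converts fitted models to PMML documents
    library(XML)    # provides saveXML() for writing the document out

    # Train a naive Bayes classifier -- the same family of algorithm
    # used by most anti-spam email programs.
    nb <- naiveBayes(Species ~ ., data = iris)
    predict(nb, newdata = head(iris[, -5]))  # score a few observations

    # Export a fitted model as PMML so another package can import it.
    tree <- rpart(Species ~ ., data = iris)
    saveXML(pmml(tree), file = "iris_tree.pmml")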

Let me know if I missed your favorite search term for data management & analytics.

Technology for the OSS DSS Study Guide

'Tis been longer than intended, but we finally have the technology, time and resources to continue with our Open Source Solutions Decision Support System Study Guide (OSS DSS SG).

First, I want to thank SQLstream for allowing us to use SQLstream as a part of our solution. As mentioned in our "First DSS Study Guide" post, we were hoping to add a real-time component to our DSS. SQLstream is not open source, and not readily available for download. It is, however, a co-founder of and core contributor to the open source Eigenbase Project, and has incorporated Eigenbase technology into its product. So, what is SQLstream? To quote their web site, "SQLstream enables executives to make strategic decisions based on current data, in flight, from multiple, diverse sources". And that is why we are so interested in having SQLstream as a part of our DSS technology stack: to have the capability to capture and manipulate data as it is being generated.

Today, there are two very important classes of technologies that should belong to any DSS: data warehousing (DW) and business intelligence (BI). What actually comprises these technologies is still a matter of debate. To me, they are quite interrelated and provide the following capabilities.

  • The means of getting data from one or more sources to one or more target storage & analysis systems. Regardless of the details for the source(s) and the target(s), the traditional means in data warehousing is Extract from the source(s), Transform for consistency & correctness, and Load into the target(s), that is, ETL (see the toy R sketch following this list). Other means are also possible, such as using data services within a service-oriented architecture (SOA), either with provider-consumer contracts & the Web Services Description Language (WSDL) or with representational state transfer (ReST).
  • Active storage over the long term of historic and near-current data. Active storage as opposed to static storage, such as a tape archive. This storage should be optimized for reporting and analysis through both its logical and physical data models, and through the database architecture and technologies implemented. Today we're seeing an amazing surge of data storage and management innovation, with column-store relational database management systems (RDBMS), map-reduce (M-R), key-value stores (KVS) and more, especially hybrids of one or several of the old and new technologies. The innovation is coming so thick and fast that the terminology is even more confused than in the rest of the BI world. NoSQL has become a popular term for all non-RDBMS data stores, and even for some RDBMS variants such as column-stores. But even here, what once meant "No Structured Query Language" is now often defined as "Not only Structured Query Language", as if SQL were the only way to create an RDBMS (can someone say Progress and its proprietary 4GL).
  • Tools for reporting, including gathering the data, performing calculations, graphing (or, perhaps more accurately, charting), formatting and disseminating.
  • Online Analytical Processing (OLAP) also known as "slice and dice", generally allowing forms of multi-dimensional or pivot analysis. Simply put, there are three underlying concepts for OLAP: the cube (a.k.a. hypercube, multi-dimensional database [MDDB] or OLAP engine), the measures (facts) & dimensions, and aggregation. OLAP provides much more flexibility than reporting, though the two often work hand-in-hand, especially for ad-hoc reporting and analysis.
  • Data Mining, including machine learning and the ability to discover correlations among disparate data sets.
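
To make the first and fourth capabilities above concrete, here's a toy R sketch of the ETL pattern followed by an OLAP-style aggregation. The file name, table and columns are invented for the example, and SQLite (via the CRAN packages DBI and RSQLite) stands in for a real data warehouse target.

    library(DBI)
    library(RSQLite)

    # Extract: read raw data from a source system (a hypothetical CSV export).
    sales <- read.csv("sales_export.csv", stringsAsFactors = FALSE)

    # Transform: enforce consistency & correctness.
    sales$order_date <- as.Date(sales$order_date)
    sales <- sales[!is.na(sales$amount), ]

    # Load: write the cleaned data into the target store.
    con <- dbConnect(RSQLite::SQLite(), "warehouse.db")
    dbWriteTable(con, "fact_sales", sales, overwrite = TRUE)

    # OLAP in miniature: aggregate a measure (amount) over two
    # dimensions (region and product) -- the cube's aggregation step.
    cube <- dbGetQuery(con, "
      SELECT region, product, SUM(amount) AS total_amount
      FROM fact_sales
      GROUP BY region, product")
    dbDisconnect(con)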

For our purposes, an important question is whether or not there are open source, or at least open source based, solutions for all of these capabilities. The answer is yes. As a matter of fact, there are three complete open source BI suites [there were four, but the first, the Bee Project from the Czech Republic, written in Perl, is no longer being updated]. Here's a brief overview of SpagoBI, JasperSoft, and Pentaho.

Capability     | SpagoBI              | JasperSoft                    | Pentaho
ETL            | Talend               | Talend (JasperETL)            | KETTLE (PDI)
Included DBMS  | HSQLDB               | MySQL                         | HSQLDB
Reporting      | BIRT, JasperReports  | JasperReports, iReports       | jFreeReports
Analyzer       | jPivot, PaloPivot    | JasperServer (JasperAnalysis) | jPivot, PAT
OLAP           | Mondrian             | Mondrian                      | Mondrian
Data Mining    | Weka                 | None                          | Weka

We'll be using Pentaho, but you can use any of these, or any combination of the OSS projects that are used by these BI suites, or pick and choose from the more than 60 projects in our OSS Linkblog, as shown in the sidebar to this blog. All of the OSS BI suites have many more features than shown in the simple table above. For example, SpagoBI has good tools for geographic & location services. Also, the JasperSoft Professional and Enterprise Editions have many more features than their Community Edition, such as Ad Hoc Reporting and Dashboards. Pentaho's Enterprise Edition has a different Analyzer than either jPivot or PAT, Pentaho Analyzer, based upon ClearView from the now-defunct SaaS vendor LucidEra, as well as ease-of-use tools such as an OLAP schema designer, and enterprise-class security and administration tools.

Data warehousing using general purpose RDBMS such as Oracle, EnterpriseDB, PostgreSQL or MySQL is gradually giving way to analytic database management systems (ADBMS), or, as we mentioned above, the catch-all NoSQL data storage systems, or even hybrid systems. For example, Oracle recently introduced hybrid column-row store features, and Aster Data has a column-store Massive Parallel Processing (MPP) DBMS / map-reduce hybrid [updated 20100616 per comment from Seth Grimes]. Pentaho supports Hadoop, as well as traditional general purpose RDBMS and column-store ADBMS. In the open source world, there are two columnar storage engines for MySQL, Infobright and Calpont InfiniDB, as well as one column-store ADBMS purpose-built for BI, LucidDB. We'll be using LucidDB, and just for fun, may throw some data into Hadoop.

In addition, a modern DSS needs two more primary capabilities: Predictives, sometimes called predictive intelligence or predictive analytics (PA), which is the ability to go beyond inference and trend analysis, assigning a probability, with associated confidence, or likelihood to an event occurring in the future; and full Statistical Analysis, which includes determining the probability density or distribution function that best describes the data. Of course, there are OSS projects for these as well, such as The R Project, the Apache Commons Math libraries, and other GNU projects that can be found in our Linkblog.

For statistical analysis and predictives, we'll be using the open source R statistical language and the open standard Predictive Model Markup Language (PMML), both of which are also supported by Pentaho.
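
As a small taste of that statistical analysis capability, here's a sketch in R of finding the distribution that best describes a data set. MASS ships with R, and the data are simulated, so the example is self-contained rather than drawn from our actual project.

    library(MASS)  # provides fitdistr() for maximum-likelihood fitting

    x <- rgamma(500, shape = 2, rate = 0.5)  # simulated stand-in for observed data

    # Fit candidate probability distributions by maximum likelihood...
    fit_gamma <- fitdistr(x, "gamma")
    fit_lnorm <- fitdistr(x, "lognormal")

    # ...and compare the fits; lower AIC is better.
    AIC(fit_gamma, fit_lnorm)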

We have all of these OSS projects installed on a Red Hat Enterprise Linux machine. The trick will be to get them all working together. The magic will be in modeling and analyzing the data to support good decisions. There are several areas of decision making that we're considering as examples. One is fairly prosaic, one is very interesting and far-reaching, and the others are somewhat in between.

  1. A fairly simple example would be to take our blog statistics and a real-time stream using SQLstream's Twitter API, and run experiments to determine whether or not, and possibly how, Twitter affects traffic to and interaction with our blogs. Possibly, we could get to the point where we can predict how our use of Twitter will affect our blog (a toy R sketch follows this list).
  2. A much more far-reaching idea was presented to me by Ken Winnick, via Twitter, and has created an ongoing Twitter conversation and hashtag, #BPgulfDB. Let's take crowd-sourced, government, and other publicly available data about the recent oil spill in the Gulf of Mexico, and analyze it.
  3. Another idea is to take historical home utility usage plus current smart meter usage data, and create a real-time dashboard, and even predictives, for reducing and managing energy usage.
  4. We also have the opportunity of using public data to enhance reporting and analytics for small, rural and research hospitals.
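
For the first example, the analysis could start as simply as the following R sketch. The numbers here are invented; in practice the daily page views would come from our blog statistics and the tweet counts from the SQLstream Twitter feed.

    # Hypothetical daily data: tweets mentioning the blog, and page views.
    traffic <- data.frame(
      tweets    = c(0, 2, 5, 1, 0, 3, 7, 2, 4, 6),
      pageviews = c(120, 150, 240, 130, 115, 180, 310, 160, 220, 280)
    )

    # Is there a statistically significant correlation?
    cor.test(traffic$tweets, traffic$pageviews)

    # A simple linear model as a first step toward prediction.
    fit <- lm(pageviews ~ tweets, data = traffic)
    summary(fit)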

OSS DSS Formalization

The next step in our open source solutions (OSS) for decision support systems (DSS) study guide (SG), according to the syllabus, is to make our first decision: a formal definition of "Decision Support System". Next, and soon, will be a post listing the technologies that will contribute to our studies.

The first stop in looking for a definition of anything today is Wikipedia. And indeed, Wikipedia does have a nice article on DSS. One of the things that I find most informative about Wikipedia articles is the "Talk" page for an article. The DSS discussion is rather mild though, with no ongoing debate such as can be found on some other talk pages, like the discussion about Business Intelligence. The talk pages also change more often, and provide insight into the thoughts that go into the main article.

And of course, the second stop is a Google search for Decision Support System; a search on DSS is not nearly as fruitful for our purposes. :)

Once upon a time, we might have gone to a library and thumbed through the card catalog to find some books on Decision Support Systems. A more popular approach today would be to search Amazon for Decision Support books. There are several books in my library that you might find interesting for different reasons:

  1. Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL by Roland Bouman & Jos van Dongen provides a very good overview of data warehousing, business intelligence and data mining, all key components to a DSS, and does so within the context of the open source Pentaho suite
  2. Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions by James Taylor & Neil Raden introduces business concepts for truly managing information and using decision support systems, as well as being a primer on data warehousing and business intelligence, but goes beyond this by automating the data flow and decision making processes
  3. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss & Shaku Atre takes a business, program and project management approach to implementing DSS within a company, introducing fundamental concepts in a clear, though simplistic, manner
  4. Competing on Analytics: The New Science of Winning by Thomas H. Davenport & Jeanne G. Harris in many ways goes into the next generation of decision support by showing how data, statistical and quantitative analysis within a context specific processes, gives businesses a strong lead over their competition, albeit, it does so at a very simplistic, formulaic level

These books range from being technology focused to being general business books, but they all provide insight into how various components of DSS fit into a business, and different approaches to implementing them. None of them actually provide a complete DSS, and only the first focuses on OSS. If you followed the Amazon search link given previously, you might also have noticed that there are books that show Excel as a DSS, and there is a preponderance of books that focus on the biomedical/pharmaceutical/healthcare industry. Another focus area is in using geographic information systems (actually one of the first uses for multi-dimensional databases) for decision support. There are several books in this search that look good, but haven't made it into my library as yet. I would love to hear your recommendations (perhaps in the comments).

From all of this, and our experiences in implementing various DW, BI and DSS programs, I'm going to give a definition of DSS. From a previous post in this DSS SG, we have the following:

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.
-- Questions and Commonality

As we stated, this is vague and generic. Now that we've done some reading, let's see if we can do better.

A DSS assists an individual in reaching the best possible conclusion, resolution or course of action in stand-alone, iterative or interdependent situations, by using historical and current structured and unstructured data, collaboration with colleagues, and personal knowledge to predict the outcome or infer the consequences.

I like that definition, but your comments will help to refine it.

Note that we make no mention of specific processes, nor any technology whatsoever. It reflects my bias that decisions are made by individuals, not groups (electoral systems notwithstanding). To be true to our "TeleInterActive Lifestyle" ;) I should point out that the DSS must be available when and where the individual needs to make the decision.

Any comments?

R the next Big Thing or Not

Recently, AnnMaria De Mars, PhD (multiple) and Dr. Peter Flom, PhD have stirred up a bit of a tempest in a tweet-pot, as well as in the statistical blogosphere, with comparisons of R and SAS, IBM/SPSS and the like. I've commented on both of their blogs, but decided to expand a bit here, as the choice of R is something that we planned to cover in a later post to our Open Source Solutions Decision Support Systems Study Guide. First, let me say that Dr. De Mars and Dr. Flom appear to have posted completely independently of each other, and further, that their posts have different goals.

In The Next Big Thing, Dr. De Mars is looking for the next big thing, both to keep her own career on track, and to guide students into areas of study that will survive in the job market in the coming decades. This is always difficult for mentors, as we can't always anticipate the "black swan" events that might change things drastically. The tempestuous nature of her post came from one little sentence:

Contrary to what some people seem to think, R is definitely not the next big thing, either. -- AnnMaria De Mars, The Next Big Thing, AnnMaria's Blog

In SAS vs. R, Introduction and Request, Dr. Flom starts a series comparing R and SAS from the standpoint of a statistician deciding upon tools to use.

There are several threads in Dr. De Mars' post. I agree with Dr. De Mars that two of the "next big things" in data management & analysis are data visualization and dealing with unstructured data. I'm of the opinion that there is a third area, related to the "Internet of Things" and the tsunami of data that will be generated by it. These are conceptual areas, however. Dr. De Mars quickly moves on to discussing the tools that might be a part of the solutions to these next big things. The concepts cited are neither software packages nor computing languages. The software packages SAS, IBM/SPSS, Stata, Pentaho and the like, and the computing language S, with its open source distribution R and its proprietary distribution S+, are none of them likely to be the next big thing, though they are currently useful tools to know.

I find it interesting that both Dr. De Mars and Dr. Flom, as well as the various commenters, tweeters, and other posters, are comparing software suites and applications with a computing language. I think that a bit more historical perspective might be needed in bringing these threads together.

In 1979, when I first sat down with a FORTRAN programmer to turn my Bayesian methodologies into practical applications to determine the reliability and risk associated with the STAR48 kick motor and associated Payload Assist Module (PAM), the statistical libraries for FORTRAN seemed amazing. The ease with which we were able to create the program and churn through decades of NASA data (after buying a 1MB memory box for the mainframe) was wondrous ;)

Today, there's not so much wonder in such a feat. The evolution of computing has drastically affected the way in which we apply mathematics and statistics today. Several of the comments to these posts argue both sides of the statement that anyone doing statistics today should, or shouldn't, be a programmer. It's an interesting argument that I've also seen reflected in chemistry, as fewer technicians are used in the lab, and the Ph.D.s work directly with the robots to prepare the samples and interpret the results.

Approximately 15 years ago, I moved from solving scientific and engineering problems directly with statistics, to solving business problems through vendors' software suites. The marketing names for this endeavor have gone through several changes: Decision Support Systems, Very Large Databases, Data Warehousing, Data Marts, Corporate Information Factory, Business Intelligence, and the like. Today, Data Mining, Data Visualization, Sentiment Analysis, "Big Data", SQL Streaming, and similar buzzwords reflect the new "big thing". Software applications, from new as well as established vendors, both open source and proprietary, are coming to the fore to handle these new areas that represent real problems.

So, one question to answer for students is which, if any, of these software packages will best survive with, and aid the growth of, their maturing careers. Will Tableau, LyzaSoft, QlikView or Viney@rd be in a better spot in 20 years, through growth or acquisition, than SAS or IBM/SPSS? Will the open source movement take down the proprietary vendors or be subsumed by them? Is Pentaho/Weka the BI & data mining solution for their career? Maybe, maybe not. But what about that other beast of which everyone speaks? Namely, R, the r-project, the R Statistical Language. What is it? Is it a worthy alternative to SAS or IBM/SPSS or Pentaho/Weka? Or is it a different genus altogether? That's a question I've been seeking to answer for myself, in my own career evolution. After 15 years, software such as SAP/Business Objects and IBM/Cognos hasn't evolved into anything that I like, with their pinnacle of statistical computation being the "average", the arithmetic mean. SAS and IBM/SPSS are certainly better, and with data mining, machine learning and predictives becoming important to business, certainly likely to be a good choice for the future. But are they really powerful enough? Are they flexible enough? Can they be used to solve the next generation of data problems? They're very likely to evolve into software that can do so. But how quickly? And like all vendor software, they have limitations based upon the market studies and business decisions of the corporation.

How is R different?

Well, first, R is a computing language. Unlike SAP/Business Objects, IBM/Cognos, IBM/SPSS, SAS, Pentaho, JasperSoft, SpagoBI, or Oracle, it's not a company, nor a BI suite, nor even a collection of software applications. Second, R is an open source project. It's an open source implementation of S. Like C and the other single-letter-named languages, S came out of Bell Labs, in 1976. The open source implementation, R, comes from R. Ihaka and R. Gentleman, and was first revealed in 1996 through the article "R: A language for data analysis and graphics", Journal of Computational and Graphical Statistics, 5:299–314; it is often associated with the Department of Statistics, University of Auckland.

While I'm not a software engineer, I find R a very compelling statistical tool. As a language, it's very intuitive… for a statistician. It's an interactive, interpreted, functional, object-oriented, statistical programming language. R itself is written in R, C, C++ and FORTRAN; it's powerful. As an open source project, it has attracted thousands upon thousands of users who have formed a strong community. There are thousands upon thousands of community-contributed packages for R. It's flexible, and growing. One of the main goals of R was data visualization, and it has a wonderful new package for data visualization in ggplot2. It's ahead of the curve. There are packages for parallel processing (some quite specific), for big data beyond in-memory capacity, for servers, and for embedding in a web site. Get the idea? If you think you need something in R, search CRAN, RForge, BioConductor or Omegahat.
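
Since data visualization was one of R's main goals, here's a minimal ggplot2 sketch. It uses the mpg data set that ships with ggplot2, so it's self-contained; consider it a taste rather than a tutorial.

    library(ggplot2)

    # Engine displacement vs highway mileage, coloured by vehicle class,
    # with a loess smoother overlaid on all of the points.
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(colour = class)) +
      geom_smooth(method = "loess", se = FALSE) +
      labs(x = "Engine displacement (litres)", y = "Highway mileage (mpg)")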

As you can tell, I like R. :) However, in all honesty, I don't think that the SAS vs. R controversy is an either/or situation. SAS, IBM/SPSS and Pentaho complement R, and vice-versa. Pentaho, IBM/SPSS and some SAS products support R. R can read data from SAS, IBM/SPSS, relational databases, Excel, mapReduce and more. The real question isn't whether one tool is better than another, but rather which is the best tool to answer a particular question. That being said, I'm looking forward to Dr. Flom's comparison, as well as the continuing discussion on Dr. De Mars' blog.
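
Here's a quick sketch of that interoperability, with hypothetical file and table names: the foreign package (which ships with R) reads SPSS and SAS transport files, while DBI and RSQLite stand in for any relational source.

    library(foreign)
    spss_data <- read.spss("survey.sav", to.data.frame = TRUE)  # IBM/SPSS file
    sas_data  <- read.xport("study.xpt")                        # SAS transport file

    library(DBI)
    library(RSQLite)
    con <- dbConnect(RSQLite::SQLite(), "warehouse.db")  # a relational source
    orders <- dbGetQuery(con, "SELECT * FROM fact_sales")
    dbDisconnect(con)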

For us, the question is building a decision support system or stack from open source components. It looks like we'll have a good time doing so.

Why Open Source for a Friend

This post is in response to "Volunteer for the Greater Good" written by S. Kleiman. I remember that village in Pennsylvania, and the attitudes of my friend at that time. I'm not surprised that you're attracted to open source; I am surprised that you're having trouble with embracing its ideals. We've had an email exchange on this subject, and, as you know, I'm fairly attracted to open source solutions myself. ;) I hadn't seen your blog prior to answering your email, so let me go into a bit more detail here.

"The model contributor is a real geek – a guy in his 20-30’s, single, lives in his parent’s basement, no mortgage, no responsibility other than to pick up his dirty socks (some even have mothers who will do that)." -- "Volunteer for the Greater Good" by S. Kleiman

Wow. What a stereotype, and one that couldn't be further from the truth. Admittedly, during economic downturns, when software developers are forced to take whatever job they can find to put food on the table, many contribute to open source projects, ones that don't have commercial support and ones that do. This helps that open source project and its community. But, it also helps the developers to keep their skills sharp and maintain credibility. Most open source developers get paid. Some are students. Some are entrepreneurs. But most get paid, it's their job. And even if it's not their job, projects have learned to give back to communities.

While there are hundreds of thousands of open source projects on Sourceforge.net and other forges, many have never gone beyond the proposal stage, and have nothing to download. The number of active open source projects does number in the tens of thousands, and that is still pretty amazing. The idea that the great unwashed contribute to these projects whilst Mom does laundry... Well, that just doesn't wash. :p The vast majority of open source communities are started by one to five developers who have a common goal that can be attained through that specific open source project. They have strict governance in place to assure that source code in the main project tree can be submitted only by those who founded the project, or by those who have gained a place of respect and trust in the community (a meritocracy) through the value of the code that they have contributed for plugins, through forums, and the like.

Most active open source projects fall into two categories, and many have slipped back and forth between these two.

  1. A labour of love, creating something that no one else has created for the sheer joy of it, or to solve a specific pain point for the lead developer
  2. A commercial endeavor, backed by an organization or organizations to solve their own enterprise needs or those of a specific market

While there are thousands of examples of both types, let me give just a few examples of some developers that I know personally, or companies with which I'm familiar.

Mondrian was founded by Julian Hyde, primarily as a labour of love. I know Julian, and he's an incredibly bright fellow. [And public congratulations to you, Julian, and to your wife, on the recent birth of Sebastian.] In addition to being the father of Sebastian and Mondrian, Julian is also the Chief Architect of SQLstream, and a contributor to the Eigenbase project. Not exactly sitting around in the basement, coding away and waiting for Mom to clean up after him. :>> You can read Julian's blog on Open Source OLAP and Stuff, and follow Julian's Twitter stream too. By the way, while Mondrian can still be found on Sourceforge.net under its original license, it is also sponsored by Pentaho, and can be found as Pentaho Analysis, the analytical heart of the Pentaho BI Suite, the JasperSoft BI Suite and SpagoBI.

Two other fellows had somewhat similar problems to solve and felt that the commercial solutions designed to move data around were simply too bloated, too complex, and prone to failure to boot. I don't believe that these two knew each other, and their problems were different enough to take different forms in the open source solutions that they created. I'm talking about Matt Casters, founder of the KETTLE ETL tool for data warehousing, and Ross Mason, founder of the Mule ESB. Both of them had an itch to scratch, and felt that the best way to scratch it was to create their own software, and leverage the power of the open source communities to refine their back scratchers. KETTLE, too, can now be found in Pentaho, as Pentaho Data Integration. Ross co-founded both Ricston and MuleSource to monetize his brain child, and has done an excellent job with the annual MuleCons. Matt still lives in Belgium, and has been known to share the fine beers produced by a local monastery [Thanks Matt]. You should follow Matt's blog too. Ross lives on the Island of Malta, and Ross blogs about Mule and the Maltese lifestyle.

Let's look at two other projects: Talend and WSO2. Both of these are newer entrants into the ETL and SOA spaces, respectively, and both were started as commercial efforts by companies of the same name. I haven't had the opportunity to sit down with the Talend folk. I have spoken with the founders of WSO2, and they have an incredible passion that simply couldn't be fulfilled with their prior employer. So they founded their company, and their open source product, and haven't looked back. You can follow Sanjiva's Blog to learn more about WSO2 and their approach to open source.

And just one more, and somewhat different example: projects started by multiple educational institutions to meet their unique needs: Kuali for ERP and Sakai for learning management. For another take on commercialization, The rSmart Group contributes to these projects, but is commercializing them as appliances sold to educational institutions. You can read more about this rather different approach to monetizing open source at Chris Coppola's blog.

There are many, many more such examples. Just in the area of data management & analysis, we cover over 60 related open source projects [take a look at the blogroll in the sidebar to the right].

..."they organize themselves into groups of developers and maintainers on an adhoc basis, and on a world-wide basis. And the end products are robust, well developed, and well tested." -- "Volunteer for the Greater Good" by S. Kleiman

I think we've covered my rebuttal to your posting between the first quote and this one. I very much agree with this statement. I'm surprised by your surprise. The organizational dynamics that result in the excellent code comprising open source projects are the subject of much thought, admiration and research. Here are a few places that you can go for more information.

And just for completeness' sake, here's our email exchange:

From S. Kleiman: "OS is the current bug in my head. I'm trying to understand why my intellectual property should be "open" to the world (according to Richard Stallman).

Yes, I've read the copious amounts of literature on open software and the economics thereof - but I still don't get it. If I apply for a patent on a gadget, and then license companies to make that gadget - isn't that intellectual property? To copy my design, while it doesn't destroy my design, does limit any profit I might gain.

Anyway - how are you? Are you one of the original hackers?
I realized that all this time I thought I had a great practical engineering degree. Instead I realize they made us into hackers - in the best sense of the word.

What is your experience with OS? What are you talking about (besides the title)?
How is the "snow" in CA? "

And my response:

Discussions around open source often get very passionate, so we should be having this conversation on a warm beach cooled by ocean breezes, fueled with lots of espresso ristretto followed by rounds of grappa to lower inhibitions and destroy preconceptions ;-)

But email is all we have.

Most open source projects are software, though there are a few examples of hardware projects such as Bug Labs, TrollTech (bought by Nokia, I think), OpenMoko and one for UAVs.

I should start by pointing out that I'm not presenting at the Open Source Business Conference, but am moderating a panel.

http://www.infoworld.com/event/osbc/09/osbc_agenda.html

Session Title: Moving Open Source Up the Stack

Session Abstract: Open Source Solutions for IT infrastructure have shown great success in organizations of all types and sizes. OSS for business applications have seen greater difficulties in penetrating the glass ceiling of the enterprise stack. We have put together a panel representing the EU and the US, system integrators, vendors and buyers, and corporate focus vs. education focus. We'll explore how the OSS application strategy has changed over the past four years. We will also look at successes and failures, the trade-offs and the opportunities in solving business/end-user needs with OSS enterprise applications.

Learning Objective 1: Most buyers know the 80% capability for 20% cost mantra of most OSS vendors, but we'll focus on what that lower cost actually buys.

Learning Objective 2: Where does OSS fit in the higher levels of the application stack? Learn how flexibility & mashups can improve the end user experience.

Learning Objective 3: Learn how to come out ahead on the trade-offs of up-front cost vs. operational cost, experience and learning curves, maintenance and replacement, stagnation and growth.

Here are the confirmed panelists:

(1) Tim Golden, Vice President - Unix Engineering, Security & Provisioning, Bank of America
(2) Gabriele Ruffatti, Architectures & Consulting Director, Research & Innovation Division, Engineering Group, Engineering Ingegneria Informatica S.p.A.
(3) Aaron Fulkerson, CEO/Founder, mindtouch
(4) Lance Walter, Vice President - Marketing, Pentaho
(5) Christopher D. Coppola, President, The rSmart Group
(Moderator) Joseph A. di Paolantonio, Principal Consultant/Blogger/Analyst, InterActive Systems & Consulting, Inc.

So, back to the "Why open source" discussion.

You might want to listen to a couple of our podcasts:

http://press.teleinteractive.net/tialife/2005/06/30/what_is_open_source

http://press.teleinteractive.net/tialife/2005/07/01/why_open_source

or not :-D

Historically, there were analog computers programmed by moving around jumper cables and circuits. Then there were general purpose computers programmed in machine language. Companies like IBM got the idea of adding operating systems, compilers and even full applications to their new mainframes to make them more useful and "user friendly", with languages like COBOL for the average business person and FORTRAN for those crazy engineers. Later, Sun, Apple, HP and others designed RISC-based CPUs with tightly integrated operating systems for great performance. Throughout all this, academicians and data processing folk would send each other paper or magnetic tapes, enhancing the general body of knowledge concerning running and programming computers. There eventually grew close to 100 flavours of Unix, from either the freely available BSD version or the more tightly licensed AT&T version.

Then a little company called Microsoft changed the game, showing that hardware was a commodity, and that by patenting, copyrighting and using restrictive licenses, the money in computers could be made to come from software sales.

Fast forward ~15 years, and the principals at Netscape decided to take a page from the Free Software Foundation & their GNU (GNU's Not Unix) General Public License and the more permissive Berkeley license for BSD: as a final recourse in their lost battle with the Microsoft monopoly, they coined the term "open source" and released the Gecko web rendering engine under the Mozilla Public License. And the philosophical wars were on.

When I was the General Manager of CapTech IT Services, I had a couple of SunOS Sys Admins who spent their spare time writing code to improve FreeBSD & NetBSD. I let them use their beach time to further contribute to these projects. Then a young'un came along who wanted to do the same for this upstart variant of minix called Linux. :-D. All of this piqued my interest in F/LOSS.

Today, I feel that F/LOSS is a development method, not a distribution method nor a business model. If you look at IBM, HP, Oracle and others, you'll find that >50% of their money comes from services. Just as M$ commodified hardware and caused the Intel CISC architecture to win over proprietary RISC chips, software has become a commodity. Services are how one makes money in the computer market. With an open source development methodology, a company can create and leverage a community, not just for core development but for plugins and extensions; more importantly, that community can be leveraged as thousands of QA testers at all levels: modules, regression & UAT, for thousands of use cases, and for forum-level customer support (people, people helping people, are the happiest people in the world ;-)

Can the functions in your application be replicated by someone else without duplicating a single line of your code? Are the margins on your software sales being forced below 10%? Does most of your profit come from support, system integration, customizations or SaaS? Then why not leverage your community?

So, this is a really short answer to a really complex issue.

To answer some of your other questions...

I'm not a hacker nor a programmer of any type. I have started to play around with the open source R statistical language to recreate my Objective Bayes assessment technique and grow beyond the (FORTRAN on OS/360 or VAX/VMS) applications that I caused to be created from it.

I haven't gotten to the snow in a couple of years, but we're in a drought cycle. Though it is storming as I write this.

I hope this helps you with your open source struggle, my friend. And thank you for putting up with me being a wordy bastard for the past /cough /harumph years. :D Oh, and note the Creative Commons license for this post. This must really cause you great consternation as a writer. Oh, and I'm not going to touch your post on Stallman. B)

SQLstream v2 Real Time BI

Today, SQLstream announced version 2.0 of their Real Time BI solution. SQLstream comes from the fertile creativity of Julian Hyde, who is also the founder of the open source Mondrian OLAP engine. While SQLstream is not open source, it does stem from the open source Eigenbase community, leveraging the user-defined transforms that were originally developed for LucidDB to operate on traditional stored relational data with SQL:2003-compliant syntax. SQLstream extends this to handle streaming relational data.

In addition to capturing standard, structured data while "on the wire", SQLStream also includes adapters for feeds, such as Atom and RSS, and for Twitter.

Methinks Julian and I need to schedule another lunch soon, so that I can learn more about how this unstructured data, especially from Twitter, can fit into real time analytics provided by SQLStream v2.0.

BTW, you can follow me on Twitter as @JAdP.

Microsoft Acquires DATAllegro: Whither Ingres?

I've been "hearing" all day on Twitter that Microsoft would be announcing something big at OSCON2008. Perhaps this is it:

Microsoft today announced that it intends to acquire DATAllegro, provider of breakthrough data warehouse appliances. The acquisition will extend the capabilities of Microsoft’s mission-critical data platform, making it easier and more cost effective for customers of all sizes to manage and glean insight from the ever expanding amount of data generated by and for businesses, employees and consumers.
-- Press Release "Microsoft to Acquire DATAllegro"

This is very interesting given the progress that Microsoft has made with its analytic services binding MS Office and SQL Server. Further quoting from the press release:

“Microsoft SQL Server 2008 delivers enterprise-class capabilities in business intelligence and data warehousing and the addition of the DATAllegro team and their technology will take our data platform to the highest scale of data warehousing.”
-- Ted Kummert, corporate vice president of the Data and Storage Platform Division at Microsoft

The direction for DATAllegro's data warehouse appliance is also made clear in the press release:

“DATAllegro's integration with SQL Server is the optimal next generation solution and the acquisition by Microsoft is a great conclusion for the company.”
-- Lisa Lambert, Intel Capital managing director, Software and Solutions Group.

For those who don't know, DATAllegro is a data warehousing appliance company that utilizes "EMC® storage, Dell™ servers, Cisco® InfiniBand switches, Intel® multi-core CPUs and the Ingres® open source database".

So, whither Ingres in this acquisition? As we've written before here, Ingres is one of the earliest and strongest RDBMS products; it was absorbed by CA and then spun off again with an open source play in 2005. MS SQL Server, of course, started out as a rebranding of Sybase SQL Server, until that partnership dissolved in the mid-1990s. Since then, MS SQL Server has been geared mostly as a workgroup and data mart server. It seems that a switch from Ingres to MS SQL Server could heavily undermine DATAllegro's business. In addition, the switchover in code to T-SQL will be a nightmare for developers. Add to that the challenges of moving from Linux to MS Windows, and from C/C++ to C#, and it will take quite some time in production environments to iron out all the wrinkles.

In addition, while most seem to think that this puts Microsoft in a good position to challenge Oracle for the Enterprise Data Warehouse lead, it actually puts Microsoft directly into competition with other DW appliance vendors, such as Teradata. I truly doubt that this move will position Microsoft strongly into competition with either Oracle or Teradata; it merely marks another tactical error in Microsoft's increasingly desperate acquisition strategy to move deeper into the Enterprise on one hand, while striving to move further into the online space on the other.

More can be read at:

SOAP vs REST and OSBI News

I recently joined Twitter. I must share the following:

Roebot: #e20 note to organizers: If your panelists do NOT know what SOAP and REST are they prolly shouldn't be on a mashup panel!!! WTF!! about 5 hours ago from twhirl
-- Aaron Roe Fulkerson on Twitter

To which I responded:

Joseph_di_P: @Roebot wiki(SOAP) is what you use in tub to get clean; wiki(REST) is what you do in tub when not using SOAP :-D Easy, yah! about 1 hour ago from Hahlo in reply to Roebot
-- my response on Twitter

I know, I know, all the important stuff happening in the Open Source BI related world this week, and this is what I blog about. Is it a sign of dementia when you crack yourself up? :)) :crazy:

Here's some of the more important stuff that's been happening:

There's much else to do, including some additions to our linkblog with open source for MDM and more open source communities. But, I'm tweeting. :D

Making Sense of Open Source

The highlight of JavaOne for me has become supper with Gianugo Rabellino, the founder and CEO of SourceSense. For now, each year…

"'The time has come,' the Walrus said, 'To talk of many things: Of shoes—and ships—and sealing-wax— Of cabbages—and kings— And why the sea is boiling hot— And whether pigs have wings.'"
-- from the poem "The Walrus and the Carpenter" within Through the Looking Glass by Lewis Carroll

… For our conversations are wide ranging and thoroughly engaging as we indulge our enjoyment of fine food and open source.

This year, Gianugo has been in the USA several times, for the Open Source Think Tank and the Open Source Business Conference, and on business. An employee of SourceSense even had a gig in the town in which I was raised, unfortunately discovering it to be the armpit of the United States. Ah well, it's why I live in the SF Bay Area now. ;)

We talked of tempering chocolate, unusual eating habits of various cultures, the worsening economy, European vs USA political views, and Gianugo even had me read, while sober, the Pronunciation Poem. [I flubbed a few words, and disagree with some of the rhymes, given that my dialect is as Philly as a Cheesesteak.]

But we also talked about Open Source; all the open source related conferences and blogging and work and newsworthy activities of the past few months. Here are some of my thoughts.

Open Source is a Philosophy

This one has been bubbling around in my head for years. Open Source, in and of itself, is NOT

  • a licensing template
  • a business model
  • a development methodology

It is a philosophy that can provide the framework for those three things with which it is most often identified. The variations among what the open source philosophy means to each of its followers can best be seen from the proliferation of licenses, business models and methodologies all claiming to be open source.

To me, the open source philosophy is very simple, and it can be applied to solutions for some very complex problems and concerns: the source is available to anyone who obtains the end-product, whether one has obtained the end-product through a no-cost download or through a purchase agreement of whatever type. The "source" may be the "source code" for software, but, to me, it should be whatever specifications and design documentation are required to recreate the end-product in either the original or a modified form.

There are a variety of reasons that a developer, an engineer, an inventor, a creator, might want to subscribe to an open source philosophy. And each individual or business must decide if those reasons make creative and economic sense for them. Once one has decided to subscribe to open source, or any philosophy, then all other decisions will only lead to success if they are logically consistent within that philosophy. The licensing language, business model, economic forecasting, internal processes and external relationships should form a coherent whole within the underlying philosophical framework.

Business Models Will be Different for Different Markets

Organizations often look for the silver bullet or the golden mantra or the platinum ring that will solve their problems, lead to dominance in the marketplace, or allow them to rule them all. ;) So we hear a lot of talk about the "best" open source business model. This search ignores the fact that there are open source solutions for many markets. Without going into specific verticals, let's just consider four general, horizontal markets that are addressed by open source software.

  1. Information Technology Infrastructure: including operating systems such as the various flavours of linux and BSD unix, application servers, and other middleware such as the Mule ESB, WSO2 SOA solutions & KETTLE for ETL, web servers, email MTA, the Funambol Mobile Server platform, and many more.
  2. End User Applications on the Desktop/Laptop/Mobile-Device: with OpenOffice.org and its MacOSX offshoot NeoOffice, Projity OpenProj, and Mozilla Firefox and Thunderbird likely being the most well known, but also including many thousands of open source projects for artistic creation and enjoyment, as well as office and personal productivity, gaming, and all things that one personally does on a computer
  3. Software Development Tools: from the Eclipse IDE to hundreds and thousands of specific language libraries and everything else a developer might need; this is even more "by geeks for geeks" than the infrastructure area
  4. Enterprise Applications: end-user facing applications, often fulfilling mission critical needs, such as ERP, CRM and BI; this is the least mature market for open source providers and the hardest area in which to gain acceptance

No one business model, no matter how generically expressed, could satisfy these four disparate markets. Monetizing these areas will come from combining innovative and traditional packaging of support, customizations, system integration, training, licensing, and subscriptions.

Automated maintenance, repair and update networks might work very well for monetizing IT infrastructure, but might be insufficient for the other markets. Ad based monetization, directly or through partnerships with ad networks like Google's, might work for some end-user applications. Putting out the begging bowl, asking for PayPal contributions might also work. But for many open source projects, there simply isn't any path to monetization.

One area that I think has been insufficiently explored, and might well be the only path to success for the Enterprise Application open source vendors is Software as a Service. The SaaS approach, whether through partner channels or directly, is the most sensible means of monetizing a wide variety of open source applications. Embrace the open source philosophy, leverage the strengths of flexibility and community in the licensing, business models and processes, and monetize through SaaS delivery into vertical and niche markets.

And for the rest, just acknowledging that your company is a software company embracing an open source philosophy, and building appropriate support and licensing structures will be the best path.

Communities are Strongest when Open

Developing a community around a product is not unique to the open source world. Microsoft, Oracle, SAP, SAS, SPSS, IBM, etc, etc, etc have all developed and supported great communities made up of ISVs, VARs and Users. The difference however, is that open source projects are very dependent on their community. Conversely, communities are strongest when there are no artificial limits to communication.

I think that most companies are embracing at least this one ramification of the open source philosophy. Though it seems to me that when the source isn't open, the community can only converse and support each other on surface issues.

One area where open source vendors still seem to be uncertain, though, is the definition of community: what groups make up their community, and whether they must develop different communities for different categories of members: developers, users, non-software contributors, paying customers, or partners. I think this is self-defeating. There is only one community for each project: those who want to be involved and will benefit from that involvement, ultimately to the betterment of the project, whether directly or indirectly, whether monetary or not, whether contributing code or documentation or use cases or testing or playing around with the project or complaining about it or just soaking up the ambience. Can any open source project community manager give a valid reason why any of these should be excluded from the community or ignored?

Open Source Sales are Very Foreign to Most IT Buyers

Whether we're talking about senior IT management, the CIO, or the purchasing department, the idea of freely downloading the fully operational, unlimited version of a product, and using it for research, prototyping or production, without handholding, cosseting or sales incentives is complete anathema to those making purchasing decisions.

Did you ever see a "vendor bake-off" among products from Oracle, IBM and SAP without vendor involvement, without vendor sales engineers fine-tuning and providing weeks of free labour? Is there any open source company that provides a corporate jet and box seats at the Super Bowl? And is there anyone who has ever been part of selling (internally or as a vendor) a large, big-budget enterprise project to a CIO or purchasing committee, who doesn't believe that such freebies and perks are essential elements of the approval process?

Many, too many, open source vendor CEOs believe that bloatware (more features than most users will use), vendor lock-in and lack of interoperability, and high prices of initial licenses are the problems that CIOs worry about and that open source software solves. They think success will come from "80% of the capabilities for 20% of the cost" of their closed source competitors. I don't believe any of it. CIOs worry about being beat up by the business side and about greasing the squeaky wheel. The two highest costs in a data center are people and energy, not licenses or maintenance agreements. Implementation costs: hardware, software licenses, personnel and training, are spread out over three to five years with appropriate tax implications. For really large projects, such as a full ERP or data warehouse, there is at least a ten-year life expectancy. Why do you think that there are so many mainframes and COBOL applications still around? The implementation costs were written off long ago, and the on-going support costs are minimal because they still JUST WORK. Those with an eye to economics are more concerned about the total cost of ownership over the complete 10-year lifecycle of a capital project than with the 10% or less of those costs that go towards initial software licenses.

So, if the only way for open source vendors to become big players, or as one CIO put it, to "move out of the junior varsity", is to look like the big players, to provide pre-sales engineering and collateral beyond a T-shirt, then what will this do to the business model and the belief in "20% of the cost"? It will destroy it. This is why there are more openings for sales & marketing than for engineers at open source companies today.

Open Source Vendors will grow and evolve and come to operate more and more like the big players they're destined to become. As long as they continue to adhere to their form of the open source philosophy, are internally consistent and true to the logic of their open source framework, they'll continue to benefit from the true value of open source: flexibility and community knowledge to a greater degree than can be achieved by any closed-source vendor or any isolated, enterprise data center with home-grown solutions. Speaking of which…

Open Source is the Middle Ground between Build and Buy

This is another point I've been hammering for years. Traditionally, IT shops were either build or buy. The build shops were very development oriented and created custom solutions to implement business processes and support business users. The buy shops bought COTS software, and either convinced the business that the software they bought implemented industry best practices, or spent millions of dollars in customization.

Build shops suffer from isolation and the total burden of maintaining and updating the software they built. When the only developer left who remembers and understands the code, retires, faith often becomes the best practice for ongoing support. :>>

Buy shops suffer from ongoing dependency on the vendor(s) and ongoing compromises for the users.

Build shops should be (but often aren't) more flexible to respond to changing business needs, and can be more valuable to the business.

Buy shops can be more reliable and cost-effective - really. But not always.

One result of Y2K [remember that?] was that most IT departments became much more of a blend of build and buy, with buy decisions winning out. Couple this with the distributed workgroups facilitated by the world-wide expansion of the Internet, and offsourcing [outsourcing operations to offshore companies] decimated many corporate data centers by solving the personnel and energy cost problems. Guess what? Implementation and licensing and maintenance costs remain the same. Some CIOs use offsourcing as the reason that they don't use open source; the outsourced vendor isn't familiar with it, and has no incentive to use it.

As companies bring IT back inside, and as its importance to the business is once again realized, open source offers a third path to the traditional build or buy. However, offsourcing will still be a substantial part of the solution, and should lead OSVs to recognize the importance of offsourcing vendors and embrace them as partners and very important channels.

The IT shop that responds to the business using open source can be both flexible and well-supported. The open source vendors that make such IT shops succeed through flexibility and reliability will be the most successful ones. This can be achieved through the OSVs growing their support, professional services and training organizations, or by partnering with all sizes of PS, VAR and outsourcing firms, or both.

Conclusions

The real conclusion is that bringing the open source philosophy to fruition in business is still an evolving process. The advantages gained by the openness of the source and the strength of the community are being recognized by both the IT shops and the OSVs. While there are cost savings to be had, OSVs need to stop relying on initial licensing cost reductions as their main selling point, and begin to market the advantages to the IT shop of using open source: responsiveness to changing business needs and increasing reliability over time, all while providing the best TCO and ROI.

There's a lot more to be said on all of these topics and opinions, and maybe I'll even get the time do so. :p

