Category: "Open Source"

Reading Pentaho Kettle Solutions

On a rainy day, there's nothing better than to be sitting by the stove, stirring a big kettle with a finely turned spoon. I might be cooking up a nice meal of Abruzzo Maccheroni alla Chitarra con Polpettine, but actually, I'm reading the ebook edition of Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration on my iPhone.

Some of my notes made while reading Pentaho Kettle Solutions:

…45% of all ETL is still done by hand-coded programs/scripts… made sense when… tools have 6-figure price tags… Actually, some extractions and many transformations can't be done natively in high-priced tools like Informatica and Ab Initio.

Jobs, transformations, steps and hops are the basic building blocks of KETTLE processes

It's great to see the Agile Manifesto quoted at the beginning of the discussion of AgileBI.

BayAreaUseR October Special Event

Zhou Yu organized a great special event for the San Francisco Bay Area useR group, and has asked me to post the slide decks for download. Here they are:

No longer missing is the very interesting presentation by Yasemin Atalay, showing the difference in plotting analyses using the Windermere Humic Aqueous Model for river water environmental factors without R, and then the increase in the variety and accuracy of analysis and plotting gained by using R.

Technology for the OSS DSS Study Guide

'Tis been longer than intended, but we finally have the technology, time and resources to continue with our Open Source Solutions Decision Support System Study Guide (OSS DSS SG).

First, I want to thank SQLstream for allowing us to use SQLstream as a part of our solution. As mentioned in our "First DSS Study Guide" post, we were hoping to add a real-time component to our DSS. SQLstream is not open source, and not readily available for download. It is, however, a co-founder and core contributor to the open source Eigenbase Project, and has incorporated Eigenbase technology into its product. So, what is SQLstream? To quote their web site, "SQLstream enables executives to make strategic decisions based on current data, in flight, from multiple, diverse sources". And that is why we are so interested in having SQLstream as a part of our DSS technology stack: to have the capability to capture and manipulate data as it is being generated.

Today, there are two very important classes of technologies that should belong to any DSS: data warehousing (DW) and business intelligence (BI). What actually comprises these technologies is still a matter of debate. To me, they are quite interrelated and provide the following capabilities.

  • The means of getting data from one or more sources to one or more target storage & analysis systems. Regardless of the details for the source(s) and the target(s), the traditional means in data warehousing is Extract from the source(s), Transform for consistency & correctness, and Load into the target(s), that is, ETL. Other means, such as using data services within a service-oriented architecture (SOA), either using provider-consumer contracts & the Web Services Description Language (WSDL) or representational state transfer (ReST), are also possible.
  • Active storage over the long term of historic and near-current data. Active storage as opposed to static storage, such as a tape archive. This storage should be optimized for reporting and analysis through both its logical and physical data models, and through the database architecture and technologies implemented. Today we're seeing an amazing surge of data storage and management innovation, with column-store relational database management systems (RDBMS), map-reduce (M-R), key-value stores (KVS) and more, especially hybrids of one or several of old and new technologies. The innovation is coming so thick and fast that the terminology is even more confused than in the rest of the BI world. NoSQL has become a popular term for all non-RDBMS data stores, and even some RDBMSs, like column-stores. But even here, what once meant No Structured Query Language is now often defined as Not only Structured Query Language, as if SQL were the only way to create an RDBMS (can someone say Progress and its proprietary 4GL?).
  • Tools for reporting, including gathering the data, performing calculations, graphing (or, perhaps more accurately, charting), formatting and disseminating.
  • Online Analytical Processing (OLAP) also known as "slice and dice", generally allowing forms of multi-dimensional or pivot analysis. Simply put, there are three underlying concepts for OLAP: the cube (a.k.a. hypercube, multi-dimensional database [MDDB] or OLAP engine), the measures (facts) & dimensions, and aggregation. OLAP provides much more flexibility than reporting, though the two often work hand-in-hand, especially for ad-hoc reporting and analysis.
  • Data Mining, including machine learning and the ability to discover correlations among disparate data sets.
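The ETL pattern in the first bullet can be sketched in a few lines. Here is a toy Python illustration, with made-up source rows and an in-memory SQLite target; it is only meant to show the Extract, Transform (consistency & correctness), and Load stages, not how Kettle or any real ETL tool works internally:

```python
import sqlite3

# Hypothetical source rows: inconsistent case and a missing value --
# exactly the kinds of problems the Transform step cleans up.
source_rows = [
    {"customer": "ACME Corp", "region": "west", "amount": "1200.50"},
    {"customer": "acme corp", "region": "West", "amount": "980.00"},
    {"customer": "Globex", "region": "east", "amount": None},
]

def extract():
    # In a real job this would read from files, APIs, or OLTP tables.
    yield from source_rows

def transform(rows):
    # Enforce consistency & correctness: normalize case, type the
    # amounts, and drop incomplete rows.
    for row in rows:
        if row["amount"] is None:
            continue
        yield (row["customer"].title(), row["region"].title(), float(row["amount"]))

def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone())
```

The same three-stage shape scales up to jobs, transformations, steps and hops in Kettle; only the plumbing gets more sophisticated.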

For our purposes, an important question is whether or not there are open source, or at least open source based, solutions for all of these capabilities. The answer is yes. As a matter of fact, there are three complete open source BI Suites [there were four, but the first, written in PERL, the Bee Project from the Czech Republic, is no longer being updated]. Here's a brief overview of SpagoBI, JasperSoft, and Pentaho.

Capability     | SpagoBI             | JasperSoft                      | Pentaho
ETL            | Talend              | Talend (JasperETL)              | KETTLE (PDI)
Included DBMS  | HSQLDB              | MySQL                           | HSQLDB
Reporting      | BIRT, JasperReports | JasperReports, iReports         | jFreeReports
Analyzer       | jPivot, PaloPivot   | JasperServer (JasperAnalysis)   | jPivot, PAT
OLAP           | Mondrian            | Mondrian                        | Mondrian
Data Mining    | Weka                | None                            | Weka

We'll be using Pentaho, but you can use any of these, or any combination of the OSS projects that are used by these BI Suites, or pick and choose from the more than 60 projects in our OSS Linkblog, as shown in the sidebar to this blog. All of the OSS BI Suites have many more features than shown in the simple table above. For example, SpagoBI has good tools for geographic & location services. Also, the JasperSoft Professional and Enterprise Editions have many more features than their Community Edition, such as Ad Hoc Reporting and Dashboards. Pentaho's Enterprise Edition has a different Analyzer than either jPivot or PAT: Pentaho Analyzer, based upon the SaaS ClearView from the now-defunct LucidEra, as well as ease-of-use tools such as an OLAP schæma designer, and enterprise-class security and administration tools.

Data warehousing using general purpose RDBMSs such as Oracle, EnterpriseDB, PostgreSQL or MySQL is gradually giving way to analytic database management systems (ADBMS), or, as we mentioned above, the catch-all NoSQL data storage systems, or even hybrid systems. For example, Oracle recently introduced hybrid column-row store features, and Aster Data has a column-store Massive Parallel Processing (MPP) DBMS|map-reduce hybrid [updated 20100616 per comment from Seth Grimes]. Pentaho supports Hadoop, as well as traditional general purpose RDBMSs and column-store ADBMSs. In the open source world, there are two columnar storage engines for MySQL, Infobright and Calpont InfiniDB, as well as one column-store ADBMS purpose-built for BI, LucidDB. We'll be using LucidDB, and just for fun, may throw some data into Hadoop.
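To see why column-stores appeal for analytics, here's a toy Python sketch (in no way the internals of LucidDB or Infobright): the same table held row-wise and column-wise. An aggregate over one column in the columnar layout scans only that column's array, instead of visiting every field of every row.

```python
# Toy illustration of why column-stores suit analytics.
rows = [
    {"id": i, "region": "west" if i % 2 else "east", "amount": float(i)}
    for i in range(1000)
]

# Row store: every row record is visited just to read a single field.
row_total = sum(r["amount"] for r in rows)

# Column store: the same data held as parallel arrays, one per column.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
col_total = sum(columns["amount"])  # scans one contiguous array only

assert row_total == col_total
```

Real columnar engines add compression and vectorized execution on top of this layout, which is where most of the speedup for reporting and analysis comes from.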

In addition, a modern DSS needs two more primary capabilities. The first is predictives, sometimes called predictive intelligence or predictive analytics (PA): the ability to go beyond inference and trend analysis, assigning a probability, with associated confidence, or likelihood of an event occurring in the future. The second is full statistical analysis, which includes determining the probability density or distribution function that best describes the data. Of course, there are OSS projects for these as well, such as The R Project, the Apache Commons Math libraries, and other GNU projects that can be found in our Linkblog.

For statistical analysis and predictives, we'll be using the open source R statistical language and the open standard predictive model markup language (PMML), both of which are also supported by Pentaho.
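As a small taste of what "determining the distribution that best describes the data" means in practice, here's a minimal sketch in stdlib Python (the study guide itself will use R); the daily page-view counts are made up, and we simply assume they are roughly normal:

```python
from statistics import NormalDist, fmean, stdev

# Hypothetical daily page-view counts for a blog.
views = [112, 98, 105, 120, 99, 110, 102, 95, 118, 101]

# "Fitting" a normal distribution here is just estimating its two
# parameters, the mean and the standard deviation, from the sample.
model = NormalDist(mu=fmean(views), sigma=stdev(views))

# Predictive question: how likely is a day with more than 115 views?
p_over_115 = 1 - model.cdf(115)
print(round(model.mean, 1), round(p_over_115, 3))
```

In R this would be a one-liner with `pnorm`, and a PMML document could then carry the fitted parameters from the modeling tool into the production scoring environment.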

We have all of these OSS projects installed on a Red Hat Enterprise Linux machine. The trick will be to get them all working together. The magic will be in modeling and analyzing the data to support good decisions. There are several areas of decision making that we're considering as examples. One is fairly prosaic, one is very interesting and far-reaching, and the others are somewhat in between.

  1. A fairly simple example would be to take our blog statistics, a real-time stream using SQLstream's Twitter API, and run experiments to determine whether or not, and possibly how, Twitter affects traffic to and interaction with our blogs. Possibly, we could get to the point where we can predict how our use of Twitter will affect our blog.
  2. A much more far-reaching idea was presented to me by Ken Winnick, via Twitter, and has created an ongoing Twitter conversation and hashtag, #BPgulfDB. Let's take crowd-sourced, government, and other publicly available data about the recent oil spill in the Gulf of Mexico, and analyze it.
  3. Another idea is to take historical home utility usage plus current smart meter usage data, and create a real-time dashboard, and even predictives, for reducing and managing energy usage.
  4. We also have the opportunity of using public data to enhance reporting and analytics for small, rural and research hospitals.
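For the first example above, the core statistical question is whether tweet activity and blog traffic move together at all. A back-of-the-envelope Pearson correlation in Python, with entirely made-up daily counts (the real analysis would run over the SQLstream Twitter feed, and in R):

```python
from math import sqrt

# Hypothetical daily counts: tweets mentioning the blog, and blog visits.
tweets = [3, 1, 4, 0, 5, 2, 6]
visits = [130, 95, 155, 80, 170, 110, 190]

def pearson(xs, ys):
    # Pearson's r: covariance normalized by the two standard deviations.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(tweets, visits)
print(round(r, 2))  # close to 1.0 for these invented numbers
```

Correlation alone won't tell us whether Twitter *causes* the traffic, of course; that's what the designed experiments mentioned above are for.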

OSS DSS Formalization

The next step in our open source solutions (OSS) for decision support systems (DSS) study guide (SG), according to the syllabus, is to make our first decision: a formal definition of "Decision Support System". Next, and soon, will be a post listing the technologies that will contribute to our studies.

The first stop in looking for a definition of anything today is Wikipedia. And indeed, Wikipedia does have a nice article on DSS. One of the things that I find most informative about Wikipedia articles is the "Talk" page for an article. The DSS discussion is rather mild though, with no ongoing debate as can be found on some other talk pages, such as the discussion about Business Intelligence. The talk pages also change more often, and provide insight into the thoughts that go into the main article.

And of course, the second stop is a Google search for Decision Support System; a search on DSS is not nearly as fruitful for our purposes. :)

Once upon a time, we might have gone to a library and thumbed through the card catalog to find some books on Decision Support Systems. A more popular approach today would be to search Amazon for Decision Support books. There are several books in my library that you might find interesting for different reasons:

  1. Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL by Roland Bouman & Jos van Dongen provides a very good overview of data warehousing, business intelligence and data mining, all key components to a DSS, and does so within the context of the open source Pentaho suite
  2. Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions by James Taylor & Neil Raden introduces business concepts for truly managing information and using decision support systems, as well as being a primer on data warehousing and business intelligence, but goes beyond this by automating the data flow and decision making processes
  3. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss & Shaku Atre takes a business, program and project management approach to implementing DSS within a company, introducing fundamental concepts in a clear, though simplistic level
  4. Competing on Analytics: The New Science of Winning by Thomas H. Davenport & Jeanne G. Harris in many ways goes into the next generation of decision support by showing how data, statistical and quantitative analysis within a context specific processes, gives businesses a strong lead over their competition, albeit, it does so at a very simplistic, formulaic level

These books range from being technology focused to being general business books, but they all provide insight into how various components of DSS fit into a business, and different approaches to implementing them. None of them actually provide a complete DSS, and only the first focuses on OSS. If you followed the Amazon search link given previously, you might also have noticed that there are books that show Excel as a DSS, and there is a preponderance of books that focus on the biomedical/pharmaceutical/healthcare industry. Another focus area is in using geographic information systems (actually one of the first uses for multi-dimensional databases) for decision support. There are several books in this search that look good, but haven't made it into my library as yet. I would love to hear your recommendations (perhaps in the comments).

From all of this, and our experiences in implementing various DW, BI and DSS programs, I'm going to give a definition of DSS. From a previous post in this DSS SG, we have the following:

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.
-- Questions and Commonality

As we stated, this is vague and generic. Now that we've done some reading, let's see if we can do better.

A DSS assists an individual in reaching the best possible conclusion, resolution or course of action in stand-alone, iterative or interdependent situations, by using historical and current structured and unstructured data, collaboration with colleagues, and personal knowledge to predict the outcome or infer the consequences.

I like that definition, but your comments will help to refine it.

Note that we make no mention of specific processes, nor any technology whatsoever. It reflects my bias that decisions are made by individuals, not groups (electoral systems notwithstanding). To be true to our "TeleInterActive Lifestyle" ;) I should point out that the DSS must be available when and where the individual needs to make the decision.

Any comments?

Syllabus for OSS DSS Studies

As promised, here's the syllabus for our study guide to decision support systems using open source solutions. We'll start with a first draft on 2010-03-23, and update and change it based on ideas, comments and lessons learned. So, please comment. :) The updates will be marked. Deletions will be marked with a strike-through and not removed.

  1. Introduction
    1. Continuing the discussion of the processes and technologies that constitute a decision support system
    2. Formalizing a definition of DSS as well as the components, such as business intelligence (BI) that contribute to a DSS
    3. Providing [and updating] the list of references for this study guide
  2. Preparation
    1. Discussing the technology for use in this study guide including the client(s) and server (Red Hat Enterprise Linux 5)
    2. Checking for prerequisites for the open source solutions that will be used
    3. Hands-on exercises for preparing the system
  3. Installation
    1. Pointers and examples for installing the open source server-side packages including but not limited to:
      1. LucidDB
      2. Pentaho BI-Server, including PAT, and Administrative Console
      3. RServe and/or RApache
    2. Pointers for installation of client-side software and some examples on MacOSX
  4. Modeling
    1. Generally, we would determine the models, the architecture and then one (or more competing) design(s) to satisfy that architecture, including selecting the right technical solutions for the job at hand. Here, we're creating a learning environment for certain tools, so we're introducing the architecture and design studies after the technology installs.
    2. In general, this section will explore the various means of modeling processes, systems and data, specifically as these relate to making decisions.
    3. Decision Making Processes
      1. Decision Theory
      2. Game Theory
      3. Machine Learning & Data Mining
      4. Bayes and Iterations
      5. Predictives
    4. Information Flow
    5. Mathematical Modeling
    6. Data Modeling
    7. UML
    8. Dimensional Modeling
    9. PMML
  5. Architecture and Design
    1. In this section, we'll examine the differences between enterprise and system architecture, and between architecture and design. We'll look at various architectural and design elements that might influence both policy and technology directions.
    2. Discussing Enterprise Architecture, especially the translation between the user needs and technology/operational realities
    3. System Architecture
    4. SOA, ReST, WSDL, and Master Data Management
    5. Technology selection and vendor bake-offs
  6. Implementation Considerations
    1. Discussing the various philosophies and considerations for implementing any DSS, or really, any system integration project. We'll look at our own three-track implementation methodology, as well as how the new Pentaho Agile BI tools support our method. In addition, we'll consider how we'll get all these OSS tools working together, on the same data sets, as well as the importance of managing data about the data.
    2. Pentaho Agile BI and our own 8D™ Method
    3. System and Data Integration
    4. Metadata
  7. Using the Tools
    1. This is the vaguest part of our syllabus. We'll be using the examples from our various references, but with the system we've set-up here, rather than the exact systems that the references use. For example, we'll be using LucidDB and not MySQL for the examples from Pentaho Solutions. Remember too, that this is a study guide, and not a oops meant to be a book written as a series of blog posts, so while we might vary from the reference materials, we'll always refer to them.
    2. ETL
    3. Reporting
    4. OLAP
    5. Data Mining & Machine Learning
    6. Statistical Analysis
    7. Predictives
    8. Workflow
    9. Collaboration
    10. Hmm, this should take years :D

Questions and Commonality

In the introduction to our open source solutions (OSS) for decision support systems (DSS) study guide (SG), I gave a variety of examples of activities that might be considered using a DSS. I asked some questions as to what common elements exist among these activities that might help us to define a modern platform for DSS, and whether or not we could build such a system using open source solutions.

In this post, let's examine the first of those questions, and see if we can start answering those questions. In the next post, we will lay out a syllabus of sorts for this OSS DSS SG.

The first common element is that in all cases, we have an individual doing the activity, not a machine nor a committee.

Secondly, the individual has some resources at their disposal. Those resources include current and historical information, structured and unstructured data, communiqués and opinions, and some amount of personal experience, augmented by the experience of others.

Thirdly, though not explicit, there's the idea of digesting these resources and performing formal or informal analyses.

Fourthly, though again, not explicit, the concept of trying to predict what might happen next, or as a result of the decision is inherent to all of the examples.

Finally, there's collaboration involved. Few of us can make good decisions in a vacuum.

Of course, since the examples are fictional, and created by us, they represent our biases. If you had fingered our domain server back in 1993, or read our .project and .plan files from that time, you would have seen that we were interested in sharing information and analyses, while providing a framework for making decisions using such tools as email, gopher and electronic bulletin boards. So, if you identify any other commonalities, or think anything is missing, please join the discussion in the comments.

From these commonalities, can we begin to answer the first question we had asked: "What does this term [DSS] really mean?". Let's try.

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.

That's nice and vague; generic enough to be almost meaningless, but it provides some key points that will help us to bound the specifics as we go along. For example, if a process or technology doesn't help us to make a better decision, then it doesn't fit. If something allows us to make a better decision, but we can't define the process or identify the technology involved, it doesn't belong either (e.g. "my gut tells me so").

Let's create a list from all of the above.

  1. Individual Decision Maker
  2. Process
  3. Technology
  4. Structured Data
  5. Unstructured Data
  6. Historical Information
  7. Current Information
  8. Communication
  9. Opinion
  10. Collaboration
  11. Analysis
  12. Prediction
  13. Personal Experience
  14. Others' Experience

What do you think? Does a modern system to support decisions need to cover all of these elements and no others? Is this list complete and sufficient? The comments are open.

An Open Source Children's Story

On Twitter today, Lance Walter asked me to go into the Ark Business with him, and Gareth Greenaway asked for entertainment. It must be a rainy Friday afternoon ;)

I'm not sure about Lance's offer, but I did tell Gareth the following story, from tweet-start to tweet-end. This isn't word for word as I tweeted. 'Tis a bit expanded, but the tale is the same.

Once upon a time there was a young penguin named Tux. Tux decided to set off on a journey through IT Land. Now IT Land is a dangerous place, full of hackers fighting crackers, and ruled by those in the Ivory Tower and the acolytes of the Megaliths.

Along the way, the adventurous Tux met the Dolphin, the Elephant and the Beekeeper. They made a pact on the Lucid glyph to become a Dynamo of IT, bringing power to the datasmiths of the Land.

They met many Titans from the Megaliths on their Quest. The Beekeeper used the open source bees to open the scrum along the way, blocking the hookers with their sharp claws.

Some of the Titans were helpful, some, not so much.

The Dolphin was empowered by the Sun. But the Sun was consumed by a powerful Oracle. The Elephant, too, gained a powerful ally, and they do Enterprise against the Oracle. The band of the Quest was broken, and Tux was sad.

The Era of Lucid thought ended, but the Dynamo yet powers the Lucid Glyph, and Tux can rely on the Dynamo and the Beekeeper to predict a future clear of the Oracle.

And thus this quest ends, but another soon begins, where Tux will meet new friends and new foes. Will Beastie and the dæmons be allies? Will the Paladin in the Red Hat be stalwart?

Perhaps we'll find out at OSCON, for Gareth suggested that an assemblage of geeks would enjoy this story, and we'll see if OSCON thinks our tales worthy of a keynote slot in 2010.

Do you recognize all the characters in this tale? Maybe the links will help.

What say you, OSCON? Would these tales make a worthy Keynote?

Pentaho Reporting Review

As promised in my post, "Pentaho Reporting 3.5 for Java Developers First Look", I've taken the time to thoroughly grok Pentaho Reporting 3.5 for Java Developers by Will Gorman [direct link to Packt Publishing][Buy the book from Amazon]. I've read the book, cover-to-cover, and gone through the [non-Java] exercises. As I said in my first look at this book, it contains nuggets of wisdom and practicalities drawn from deep insider knowledge. This book does best serve its target audience, Java developers with a need to incorporate reporting into their applications. But it is also useful for report developers who wish to know more about Pentaho, and Pentaho users who wish to make their use of Pentaho easier and the resulting reporting experience richer.

The first three chapters provide a very good introduction to Pentaho Reporting and its relationship to the Pentaho BI Suite and the company Pentaho: historical, technical and practical. These three chapters are also the ones that have clearly marked sections for Java-specific information and exercises. By the end of Chapter Three, you'll have installed Pentaho Report Designer, and built several rich reports. If you're a Java developer, you'll have had the opportunity to incorporate these reports into both Tomcat J2EE web applications and Swing desktop applications. You'll have been introduced to the rich reporting capabilities of Pentaho, accessing data sources, the underlying Java libraries, and the various output options that include PDF, Excel, CSV, RTF, XML and plain text.

Chapters 4 through 8 are all about the WYSIWYG Pentaho Report Designer, the pixel-level control that it gives you over the layout of your reports, and the many wonderful capabilities provided by Pentaho Reporting, from a wide range of chart types, to embedding numeric and text functions, to cross-tabs and sub-reports. Other than Chapter 5, these chapters are as useful for a business user creating their own reports as they are for a report developer. Chapter 5 is a very deep, very technical look at incorporating various data sources. The two areas that really stand out are the charts (Chapter 6) and functions (Chapter 7).

There are a baker's dozen types of charts covered, with an example for each type. Some of the more exotic are Waterfall, Bar-Line, Radar and Extended XY Series charts.

There are hundreds of parameters, functions and expressions that can be used in Pentaho Reports, and Will covers them all. The formula capability of Pentaho Reporting follows the OpenFormula standard, similar to the support for formulæ in Microsoft Excel, and the same as that followed by OpenOffice.org. One can provide computed text or numeric values within Pentaho reports to a fairly complex extent. Chapter 7 provides a great introduction to using this feature.

Chapters 9 through 11 are very much for the software developer, covering the development of interactive reports in Swing and HTML, the use of Pentaho's APIs, and extension of Pentaho Reporting's capabilities. It's all interesting stuff that really explains the technology of Pentaho Reporting, but there's little here that is of use to the business user or non-Java report developer.

The first part of Chapter 12, on the other hand, is of little use to the Java developer, as it shows how to take reports created in Pentaho Report Designer and publish them through the Pentaho BI-Server, including formats suitable to mobile devices, such as the iPhone. The latter part of Chapter 12 goes into the use of metadata, and is useful both for the report developer and the Java developer.

So, as I said in my first look, the majority of the book is useful even if you're not a Java developer who needs to incorporate sophisticated reports into your application. That being said, Will Gorman does an excellent job in explaining Pentaho Reporting, and making it very useful for business users, report designers, report developers and, his target audience, Java developers. I heartily recommend that you buy this book. [Amazon link]

Pentaho Reporting 3.5 for Java Developers First Look

I was approached by Richard Dias of Packt Publishing to review "Pentaho Reporting 3.5 for Java Developers" written by Will Gorman. (Link is to Amazon.com)

LinkedIn
Richard Dias has indicated you are a Friend:

Hi Joseph,

My name is Richard Dias and I work for Packt Publishing which specializes in publishing focused IT related books.

I was wondering if you would be interested in reviewing the book "Pentaho Reporting for Java Developers" written by Will Gorman.

- Richard Dias

After some back and forth, I decided to accept the book in exchange for my review.

Hi Joseph,

Thanks for the reply and interest in reviewing the book. I have just placed an order for a copy of the book and it should arrive at your place within 10 days. Please do let me know when you receive it.

I have also created a unique link for you. It is http://www.packtpub.com/pentaho-reporting-3-5-for-java-developers?utm_source=press.teleinteractive.net&utm_medium=bookrev&utm_content=blog&utm_campaign=mdb_001537. Please feel free to use this link in your book review.

In the meanwhile, if you could mention about the book on your blog and tweet about the book, it would be highly appreciated. Please do let me know if it is fine with you.

I’m also sending you the link of an extracted chapter from the book (Chapter 6 Including Charts and Graphics in Reports). It would be great if you could put up the link on your blog. This would act as first hand information for your readers and they will also be able to download the file.

Any queries or suggestions are always welcome.

I look forward to your reply.

Best Regards,

Richard

Richard Dias
Marketing Research Executive | Packt Publishing | www.PacktPub.com

Shortly thereafter, I received notification that the book had shipped. It arrived within two weeks.

Of course, I've been too busy to do more than skim through the book. Anyone who follows me as JAdP on Twitter knows that in the past few weeks, I've been:

  • helping customers with algorithm development and implementing Pentaho on LucidDB,
  • working with Nicholas Goodman with his planning for commercial support of LucidDB through Dynamo Business Intelligence, and roadmaps for DynamoDB packages built on LucidDB's plugin architecture, and
  • migrating our RHEL host at ServerBeach from our old machine to a new one, while dealing with issues brought about by ServerBeach migrating to Peer1's tools.

None of which has left any time for a thorough review of "Pentaho Reporting for Java Developers".

I hope to have a full review up shortly after the holidays, which for me runs from Solstice to Epiphany, and maybe into the following weekend.

First, a little background. Will Gorman, the author, works for Pentaho, in software engineering, as a team lead, and works primarily on Pentaho Reporting products, a combination of server-side (Pentaho BI-Server), Desktop (MacOSX, Linux and Windows platforms) and Web-based software (Reporting Engine, Report Designer, Report Design Wizard and Pentaho Ad Hoc Reporting), which stems from the open source JFreeReport and JFreeChart. While I don't know Will personally, I do know quite a few individuals at Pentaho, and in the Pentaho community. I very much endorse their philosophy towards open source, and the way they've treated the open source projects and communities that they've integrated into their Pentaho Business Intelligence Suite. I do follow Will on Twitter, and on the IRC Freenode channel, ##pentaho.

I myself am not a Java developer, so at first I was not attracted to a book with a title that seemed geared to Pentaho developers. Having skimmed through the book, I think that the title was poorly chosen. (Sorry, Richard.) I find that I can read through the book without stumbling, and that there is plenty of good intelligence that will help me better serve and instruct my customers through the use of Pentaho Report Designer.

My initial impressions are good. The content seems full of golden nuggets of "how-tos" and background information not commonly known among the Pentaho community. Will's knowledge of Pentaho Reporting and how it fits into the rest of the Pentaho tools, such as KETTLE (Pentaho Data Integration) and Mondrian (Pentaho Analysis), along with a clear writing style makes all aspects of Pentaho more accessible to the BI practitioner, as well as those that wish to embed Pentaho Reporting into their own application.

This book is not just for Java developers, but for anyone who wishes to extend their abilities in BI, Reporting and Analysis, with Pentaho as an excellent example.

I'll be following up with the really exciting finds as I wend my way through Will's gold mine of knowledge, and, will do my best to fulfill my promise of a full review by mid-January.

You can also click through the Chapter 6 (a PDF) as mentioned in Richard's email.

Thank you, Richard. And most especially, thank you, Will.

Why Open Source for a Friend

This post is in response to "Volunteer for the Greater Good" written by S. Kleiman. I remember that village in Pennsylvania, and the attitudes of my friend at that time. I'm not surprised that you're attracted to open source; I am surprised that you're having trouble with embracing its ideals. We've had an email exchange on this subject, and, as you know, I'm fairly attracted to open source solutions myself. ;) I hadn't seen your blog prior to answering your email, so let me go into a bit more detail here.

"The model contributor is a real geek – a guy in his 20-30’s, single, lives in his parent’s basement, no mortgage, no responsibility other than to pick up his dirty socks (some even have mothers who will do that)." -- "Volunteer for the Greater Good" by S. Kleiman

Wow. What a stereotype, and one that couldn't be further from the truth. Admittedly, during economic downturns, when software developers are forced to take whatever job they can find to put food on the table, many contribute to open source projects, ones that don't have commercial support and ones that do. This helps that open source project and its community. But, it also helps the developers to keep their skills sharp and maintain credibility. Most open source developers get paid. Some are students. Some are entrepreneurs. But most get paid, it's their job. And even if it's not their job, projects have learned to give back to communities.

While there are hundreds of thousands of open source projects on Sourceforge.net and other forges, many have never gone beyond the proposal stage, and have nothing to download. Still, the number of active open source projects does run to the tens of thousands, and that is pretty amazing. The idea that the great unwashed contribute to these projects whilst Mom does laundry... Well, that just doesn't wash. :p The vast majority of open source communities are started by one to five developers who have a common goal that can be attained through that specific open source project. They have strict governance in place to ensure that code in the main project tree can be submitted only by those who founded the project, or by those who have gained a place of respect and trust in the community (a meritocracy) through the value of the code they have contributed for plugins, through forums, and the like.

Most active open source projects fall into two categories, and many have slipped back and forth between these two.

  1. A labour of love, creating something that no one else has created for the sheer joy of it, or to solve a specific pain point for the lead developer
  2. A commercial endeavor, backed by an organization or organizations to solve their own enterprise needs or those of a specific market

While there are thousands of examples of both types, let me give just a few examples of some developers that I know personally, or companies with which I'm familiar.

Mondrian was founded by Julian Hyde, primarily as a labour of love. I know Julian, and he's an incredibly bright fellow. [And public congratulations to you, Julian, and to your wife, on the recent birth of Sebastian]. In addition to being the father of Sebastian and Mondrian, Julian is also the Chief Architect of SQLstream, and a contributor to the Eigenbase project. Not exactly sitting around in the basement, coding away and waiting for Mom to clean up after him. :>> You can read Julian's blog on Open Source OLAP and Stuff, and follow Julian's Twitter stream too. By the way, while Mondrian can still be found on Sourceforge.net under its original license, it is also sponsored by Pentaho, and can be found as Pentaho Analysis, and as the analytical heart of the Pentaho BI Suite, JasperSoft BI Suite and SpagoBI.

Two other fellows had somewhat similar problems to solve and felt that the commercial solutions designed to move data around were simply too bloated, too complex, and prone to failure to boot. I don't believe that these two knew each other, and their problems were different enough to take different forms in the open source solutions that they created. I'm talking about Matt Casters, founder of the KETTLE ETL tool for data warehousing, and Ross Mason, founder of the Mule ESB. Both of them had an itch to scratch, and felt that the best way to scratch it was to create their own software, and leverage the power of the open source communities to refine their back scratchers. KETTLE, too, can now be found in Pentaho, as Pentaho Data Integration. Ross co-founded both Ricston and MuleSource to monetize his brain child, and has done an excellent job with the annual MuleCons. Matt still lives in Belgium, and has been known to share the fine beers produced by a local monastery [Thanks Matt]. You should follow Matt's blog too. Ross lives on the Island of Malta, and Ross blogs about Mule and the Maltese lifestyle.

Let's look at two other projects: Talend and WSO2. Both of these are newer entrants into the ETL and SOA spaces, respectively, and both were started as commercial efforts by companies of the same name. I haven't had the opportunity to sit down with the Talend folk. I have spoken with the founders of WSO2, and they have an incredible passion that simply couldn't be fulfilled with their prior employer. So they founded their company, and their open source product, and haven't looked back. You can follow Sanjiva's blog to learn more about WSO2 and their approach to open source.

And just one more, and somewhat different example: projects started by multiple educational institutions to meet their unique needs: Kuali for ERP and Sakai for learning management. For another take on commercialization, The rSmart Group contributes to these projects, but is commercializing them as appliances sold to educational institutions. You can read more about this rather different approach to monetizing open source at Chris Coppola's blog.

There are many, many more such examples. Just in the area of data management & analysis, we cover over 60 related open source projects [take a look at the blogroll in the sidebar to the right].

..."they organize themselves into groups of developers and maintainers on an adhoc basis, and on a world-wide basis. And the end products are robust, well developed, and well tested." -- "Volunteer for the Greater Good" by S. Kleiman

I think we've covered my rebuttal to your posting between the first quote and this one. I very much agree with this statement. I'm surprised by your surprise. The organizational dynamics that result in the excellent code comprising open source projects are the subject of much thought, admiration and research. Here are a few places that you can go for more information.

And just for completeness' sake, here's our email exchange:

From S. Kleiman: "OS is the current bug in my head. I'm trying to understand why my intellectual property should be "open" to the world (according to Richard Stallman).

Yes, I've read the copious amounts of literature on open software and the economics thereof - but I still don't get it. If I apply for a patent on a gadget, and then license companies to make that gadget - isn't that intellectual property? To copy my design, while it doesn't destroy my design, does limit any profit I might gain.

Anyway - how are you? Are you one of the original hackers?
I realized that all this time I thought I had a great practical engineering degree. Instead I realize they made us into hackers - in the best sense of the word.

What is your experience with OS? What are you talking about (besides the title)?
How is the "snow" in CA? "

And my response:

Discussions around open source often get very passionate, so we should be having this conversation on a warm beach cooled by ocean breezes, fueled with lots of espresso ristretto followed by rounds of grappa to lower inhibitions and destroy preconceptions ;-)

But email is all we have.

Most open source projects are software, though there are a few examples of hardware projects such as Bug Labs, Trolltech (bought by Nokia, I think), OpenMoko and one for UAVs.

I should start by pointing out that I'm not presenting at the Open Source Business Conference, but am moderating a panel.

http://www.infoworld.com/event/osbc/09/osbc_agenda.html

Session Title: Moving Open Source Up the Stack

Session Abstract: Open Source Solutions for IT infrastructure have shown great success in organizations of all types and sizes. OSS for business applications have seen greater difficulties in penetrating the glass ceiling of the enterprise stack. We have put together a panel representing the EU and the US, system integrators, vendors and buyers, and corporate focus vs. education focus. We'll explore how the OSS application strategy has changed over the past four years. We will also look at successes and failures, the trade-offs and the opportunities in solving business/end-user needs with OSS enterprise applications.

Learning Objective 1: Most buyers know the 80% capability for 20% cost mantra of most OSS vendors, but we'll focus on what that lower cost actually buys.

Learning Objective 2: Where does OSS fit in the higher levels of the application stack? Learn how flexibility & mashups can improve the end user experience.

Learning Objective 3: Learn how to come out ahead on the trade-offs of up-front cost vs. operational cost, experience and learning curves, maintenance and replacement, stagnation and growth.

Here are the confirmed panelists:

(1) Tim Golden, Vice President - Unix Engineering, Security & Provisioning, Bank of America
(2) Gabriele Ruffatti, Architectures & Consulting Director, Research & Innovation Division, Engineering Group, Engineering Ingegneria Informatica S.p.A.
(3) Aaron Fulkerson, CEO/Founder, MindTouch
(4) Lance Walter, Vice President - Marketing, Pentaho
(5) Christopher D. Coppola, President, The rSmart Group
(Moderator) Joseph A. di Paolantonio, Principal Consultant/Blogger/Analyst, InterActive Systems & Consulting, Inc.

So, back to the "Why open source" discussion.

You might want to listen to a couple of our podcasts:

http://press.teleinteractive.net/tialife/2005/06/30/what_is_open_source

http://press.teleinteractive.net/tialife/2005/07/01/why_open_source

or not :-D

Historically, there were analog computers programmed by moving around jumper cables and circuits. Then there were general purpose computers programmed in machine language. Companies like IBM got the idea of adding operating systems, compilers and even full applications to their new mainframes to make them more useful and "user friendly", with languages like COBOL for the average business person and FORTRAN for those crazy engineers. Later, Sun, Apple, HP and others designed RISC-based CPUs with tightly integrated operating systems for great performance. Throughout all this, academicians and data processing folk would send each other paper or magnetic tapes and enhance the general body of knowledge concerning running and programming computers. There eventually grew close to 100 flavours of Unix, either the freely available BSD version or the more tightly licensed AT&T version.

Then a little company called Microsoft changed the game, showing that hardware was a commodity and that the money in computers came from software sales, through patenting, copyrighting and restrictive licenses.

Fast forward ~15 years, and the principals at Netscape decided to take a page from the Free Software Foundation & their GNU (GNU's Not Unix) General Public License and the more permissive Berkeley license for BSD and, as a final recourse in their lost battle with the Microsoft monopoly, coined the term "open source" and released the Gecko web rendering engine under the Mozilla Public License. And the philosophical wars were on.

When I was the General Manager of CapTech IT Services, I had a couple of SunOS Sys Admins who spent their spare time writing code to improve FreeBSD & NetBSD. I let them use their beach time to further contribute to these projects. Then a young'un came along who wanted to do the same for this upstart variant of minix called Linux. :-D. All of this piqued my interest in F/LOSS.

Today, I feel that F/LOSS is a development method, not a distribution method nor a business model. If you look at IBM, HP, Oracle and others, you'll find that >50% of their money comes from services. Just as M$ commodified hardware and caused the Intel CISC architecture to win over proprietary RISC chips, software has become a commodity. Services are how one makes money in the computer market. With an open source development methodology, a company can create and leverage a community, not just for core development but for plugins and extensions; more importantly, that community can be leveraged as thousands of QA testers at all levels: modules, regression & UAT, for thousands of use cases, and for forum-level customer support (people helping people are the happiest people in the world ;-)

Can the functions in your application be replicated by someone else without duplicating a single line of your code? Are the margins on your software sales being forced below 10%? Does most of your profit come from support, system integration, customizations or SaaS? Then why not leverage your community?

So, this is a really short answer to a really complex issue.

To answer some of your other questions...

I'm not a hacker nor a programmer of any type. I have started to play around with the open source R statistical language to recreate my Objective Bayes assessment technique and grow beyond the (Fortran on OS/360 or VAX/VMS) applications that I caused to be created from it.
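To give a flavour of the kind of tinkering I mean, here is a minimal sketch of a conjugate Bayesian update, written in Python for readability. It is purely illustrative; the function and numbers are my own invention for this note, not the actual Objective Bayes assessment technique.

```python
# Hypothetical illustration: a conjugate Beta-Binomial update, the sort
# of assessment calculation a statistical language makes easy to prototype.

def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Return posterior Beta(a, b) parameters after observing binomial data."""
    return prior_a + successes, prior_b + (trials - successes)

# Start from a uniform Beta(1, 1) prior, then observe 7 successes in 10 trials.
a, b = beta_binomial_update(1.0, 1.0, successes=7, trials=10)
posterior_mean = a / (a + b)  # (1 + 7) / (2 + 10) = 8/12
print(a, b, round(posterior_mean, 3))  # → 8.0 4.0 0.667
```

The appeal is that the whole assessment lives in a few transparent lines, rather than in a compiled application one "causes to be created".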

I haven't gotten to the snow in a couple of years, but we're in a drought cycle. Though it is storming as I write this.

I hope this helps you with your open source struggle, my friend. And thank you for putting up with me being a wordy bastard for the past /cough /harumph years. :D Oh, and note the Creative Commons license for this post. This must really cause you great consternation as a writer. Oh, and I'm not going to touch your post on Stallman. B)


At the beginning, The Open Source Solutions Blog was a companion to the Open Source Solutions for Business Intelligence Research Project, and book. But back in 2005, we couldn't find a publisher. As Apache Hadoop and its family of open source projects proliferated, and in many ways, took over the OSS data management and analytics world, our interests became more focused on streaming data management and analytics for IoT, the architecture for people, processes and technology required to bring value from the IoT through Sensor Analytics Ecosystems, and the maturity model organizations will need to follow to achieve SAEIoT success. OSS is very important in this world too, for DMA, API and community development.



