Category: "Business Intelligence"

Reading Pentaho Kettle Solutions

On a rainy day, there's nothing better than to be sitting by the stove, stirring a big kettle with a finely turned spoon. I might be cooking up a nice meal of Abruzzo Maccheroni alla Chitarra con Polpettine, but actually, I'm reading the ebook edition of Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration on my iPhone.

Some of my notes made while reading Pentaho Kettle Solutions:

…45% of all ETL is still done by hand-coded programs/scripts… made sense when… tools have 6-figure price tags… Actually, some extractions and many transformations can't be done natively in high-priced tools like Informatica and Ab Initio.

Jobs, transformations, steps and hops are the basic building blocks of KETTLE processes.

It's great to see the Agile Manifesto quoted at the beginning of the discussion of AgileBI.

Search Terms for Data Management & Analytics

Recently, for a prospective customer, I created a list of some search terms to provide them with some "late night" reading on data management & analytics. I've tried these terms out on Google, and as suspected, for most, the first hit is for Wikipedia. While most articles in Wikipedia need to be taken with a grain of salt, they will give you a good overview. [By the way, I use the "Talk" page on the articles to see the discussion and arguments about the article's content as an indicator of how big a grain of salt is needed for that article] ;) So plug these into your favorite search engine, and happy reading.

  • Reporting - top two hits on Google are Wikipedia, and, interestingly, Pentaho
  • Ad-hoc reporting
  • OLAP - one of the first-page hits is the blog of Julian Hyde, creator of the open source tool for OLAP, Mondrian, as well as the real-time analytics engine, SQLstream
  • Enterprise dashboard - interestingly, Wikipedia doesn't come up in the top hits for this term on Google, so here's a link for Wikipedia:
  • Analytics - isn't very useful as a search term, but the product page from SAS gives a nice overview
  • Advanced Analytics - is mostly marketing buzz, so be wary of anything that you find using this as search term

Often, Data Mining, Machine Learning and Predictives are used interchangeably. This isn't really correct, as you can see from the following five search terms…

  • Data Mining
  • Machine Learning
  • Predictive Analytics
  • Predictive Intelligence - is an earlier term for Predictives that has mostly been supplanted by Predictive Analytics. I actually prefer just "Predictives".
  • PMML - Predictive Model Markup Language - is a way of transporting predictive models from one software package to another. Few packages will both export and import PMML. The lack of that capability can lock you into a solution, making it expensive to change vendors. The first hit for PMML on Google today is the Data Mining Group, which is a great resource. One company listed, Zementis, is a start-up that is becoming a leader in running data mining and predictive models that have been created anywhere
  • R - the R statistical language, is difficult to search on Google. Go to and … instead. R is useful for writing applications for any type of statistical analysis, and is invaluable for creating new algorithms and predictive models
  • ETL - Extract, Transform & Load, is the most common way of getting information from source systems to analytic systems
  • ReSTful Web Services - Representational State Transfer - can expose data as a web service using the four verbs of the web
  • SOA
  • ADBMS - Analytic Database Management Systems doesn't work well as a search term. Start with the site and follow the links from the Eigenbase subproject, LucidDB. Also, check out AsterData
  • Bayes - The Reverend Thomas Bayes came up with this interesting approach to statistical analysis in the 1700s. I first started creating Bayesian statistical methods and algorithms for predicting reliability and risk associated with solid propellant rockets. You'll find good articles using Bayes as a search term in Google. A bit denser article can be found at And some interesting research using Bayes can be found at: Andrew Gelman's Blog. You're likely familiar with one common Bayesian algorithm, naïve Bayes, which is used by most anti-spam email programs. Other forms are objective Bayes with non-informative priors and the original Subjective Bayes. I have an old aerospace joke about the Rand Corporation's Delphi method, based on subjective Bayes :-) I created my own methodology, and don't really care for naïve Bayes nor non-informative priors.
  • Sentiment Analysis - which is one of Seth Grimes' current areas of research
  • Decision Support Systems - in addition to searching on Google, you might find my recent OSS DSS Study Guide of interest
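For readers who have met naïve Bayes only as a search term, here is a minimal sketch of the idea behind the anti-spam use mentioned in the Bayes entry above. It is written in Python purely for illustration; the training messages, the labels and the function names are all hypothetical, and real spam filters are considerably more elaborate.

```python
import math
from collections import Counter

def train(messages):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest posterior log-probability."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_msgs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: how common this label is among the training messages.
        score = math.log(label_counts[label] / total_msgs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # The "naive" step: treat each word as independent given the label.
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, far too small for real use.
TRAINING = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(TRAINING)
```

Calling classify("win a prize", *model) picks the "spam" label for this toy data, because "win" and "prize" were seen only in spam messages.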

Let me know if I missed your favorite search term for data management & analytics.

Technology for the OSS DSS Study Guide

'Tis been longer than intended, but we finally have the technology, time and resources to continue with our Open Source Solutions Decision Support System Study Guide (OSS DSS SG).

First, I want to thank SQLstream for allowing us to use SQLstream as a part of our solution. As mentioned in our "First DSS Study Guide" post, we were hoping to add a real-time component to our DSS. SQLstream is not open source, and not readily available for download. It is, however, a co-founder of and core contributor to the open source Eigenbase Project, and has incorporated Eigenbase technology into its product. So, what is SQLstream? To quote their web site, "SQLstream enables executives to make strategic decisions based on current data, in flight, from multiple, diverse sources". And that is why we are so interested in having SQLstream as a part of our DSS technology stack: to have the capability to capture and manipulate data as it is being generated.

Today, there are two very important classes of technologies that should belong to any DSS: data warehousing (DW) and business intelligence (BI). What actually comprises these technologies is still a matter of debate. To me, they are quite interrelated and provide the following capabilities.

  • The means of getting data from one or more sources to one or more target storage & analysis systems. Regardless of the details for the source(s) and the target(s), the traditional means in data warehousing is Extract from the source(s), Transform for consistency & correctness, and Load into the target(s), that is, ETL. Other means, such as using data services within a service-oriented architecture (SOA), either using provider-consumer contracts & Web Services Description Language (WSDL) or representational state transfer (ReST), are also possible.
  • Active storage over the long term of historic and near-current data. Active storage as opposed to static storage, such as a tape archive. This storage should be optimized for reporting and analysis through both its logical and physical data models, and through the database architecture and technologies implemented. Today we're seeing an amazing surge of data storage and management innovation, with column-store relational database management systems (RDBMS), map-reduce (M-R), key-value stores (KVS) and more, especially hybrids of one or several of old and new technologies. The innovation is coming so thick and fast, that the terminology is even more confused than in the rest of the BI world. NoSQL has become a popular term for all non-RDBMS, and even some RDBMS like column-store. But even here, what once meant No Structured Query Language now is often defined as Not only Structured Query Language, as if SQL was the only way to create an RDBMS (can someone say Progress and its proprietary 4GL).
  • Tools for reporting, including gathering the data, performing calculations, graphing (or perhaps more accurately, charting), formatting and disseminating.
  • Online Analytical Processing (OLAP) also known as "slice and dice", generally allowing forms of multi-dimensional or pivot analysis. Simply put, there are three underlying concepts for OLAP: the cube (a.k.a. hypercube, multi-dimensional database [MDDB] or OLAP engine), the measures (facts) & dimensions, and aggregation. OLAP provides much more flexibility than reporting, though the two often work hand-in-hand, especially for ad-hoc reporting and analysis.
  • Data Mining, including machine learning and the ability to discover correlations among disparate data sets.
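To make the Extract, Transform, Load sequence from the first capability above concrete, here is a minimal sketch in Python. It is not how Kettle or Talend work internally. The CSV content, the sales table and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a query result.
SOURCE_CSV = """date,region,amount
2010-06-01, west ,100.50
2010-06-01,East,200
2010-06-02,WEST,
2010-06-02,east,50.25
"""

def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize region names and reject rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete row, left out of the target
        clean.append((row["date"].strip(), row["region"].strip().lower(), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real target database
load(transform(extract(SOURCE_CSV)), conn)

# Aggregating a measure (amount) over a dimension (region).
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

The final query is the simplest form of the aggregation of measures over dimensions described in the OLAP capability; an OLAP engine generalizes this across many dimensions at once.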

For our purposes, an important question is whether or not there are open source, or at least open source based, solutions for all of these capabilities. The answer is yes. As a matter of fact, there are three complete open source BI Suites [there were four, but the first, written in PERL, the Bee Project from the Czech Republic, is no longer being updated]. Here's a brief overview of SpagoBI, JasperSoft, and Pentaho.

Capability SpagoBI JasperSoft Pentaho
ETL Talend Talend
Reporting BIRT
Analyzer jPivot
OLAP Mondrian Mondrian Mondrian
Data Mining Weka None Weka

We'll be using Pentaho, but you can use any of these, or any combination of the OSS projects that are used by these BI Suites, or pick and choose from the more than 60 projects in our OSS Linkblog, as shown in the sidebar to this blog. All of the OSS BI Suites have many more features than shown in the simple table above. For example, SpagoBI has good tools for geographic & location services. Also, JasperSoft Professional and Enterprise Editions have many more features than their Community Edition, such as Ad Hoc Reporting and Dashboards. Pentaho has a different Analyzer in their Enterprise Edition than either jPivot or PAT, Pentaho Analyzer, based upon the SaaS ClearView from the now-defunct LucidEra, as well as ease-of-use tools such as an OLAP schæma designer, and enterprise class security and administration tools.

For data warehousing, general purpose RDBMS such as Oracle, EnterpriseDB, PostgreSQL or MySQL are gradually giving way to analytic database management systems (ADBMS), or, as we mentioned above, the catch-all NoSQL data storage systems, or even hybrid systems. For example, Oracle recently introduced hybrid column-row store features, and Aster Data has a column-store Massively Parallel Processing (MPP) DBMS|map-reduce hybrid [updated 20100616 per comment from Seth Grimes]. Pentaho supports Hadoop, as well as traditional general purpose RDBMS and column-store ADBMS. In the open source world, there are two columnar storage engines for MySQL, Infobright and Calpont InfiniDB, as well as one column-store ADBMS purpose built for BI, LucidDB. We'll be using LucidDB, and just for fun, may throw some data into Hadoop.

In addition, a modern DSS needs two more primary capabilities. The first is Predictives, sometimes called predictive intelligence or predictive analytics (PA), which is the ability to go beyond inference and trend analysis, assigning a probability, with associated confidence, or likelihood of an event occurring in the future. The second is full Statistical Analysis, which includes determining the probability density or distribution function that best describes the data. Of course, there are OSS projects for these as well, such as The R Project, the Apache Commons Math library, and other GNU projects that can be found in our Linkblog.

For statistical analysis and predictives, we'll be using the open source R statistical language and the open standard Predictive Model Markup Language (PMML), both of which are also supported by Pentaho.
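As a small illustration of what "determining the probability density or distribution function that best describes the data" can look like, here is a sketch that compares two candidate distributions by log-likelihood. It is written in Python with only the standard library to keep it self-contained, although the study guide itself uses R. The function names are hypothetical, and a real analysis would consider more candidate distributions and a criterion that accounts for the number of parameters.

```python
import math
import statistics

def normal_loglik(data):
    """Log-likelihood of the data under a normal fitted by maximum likelihood."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # the maximum likelihood estimate divides by n
    dist = statistics.NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in data)

def exponential_loglik(data):
    """Log-likelihood under an exponential fitted by maximum likelihood.

    Assumes every observation is positive.
    """
    rate = 1.0 / statistics.fmean(data)
    return sum(math.log(rate) - rate * x for x in data)

def best_fit(data):
    """Return the name of the better-fitting candidate and both scores."""
    scores = {
        "normal": normal_loglik(data),
        "exponential": exponential_loglik(data),
    }
    return max(scores, key=scores.get), scores
```

For example, best_fit([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]) favors the normal candidate, since the values cluster tightly around 10, while a strongly right-skewed sample such as [0.1, 0.2, 0.3, 0.5, 0.9, 1.4, 2.5, 4.1] favors the exponential.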

We have all of these OSS projects installed on a Red Hat Enterprise Linux machine. The trick will be to get them all working together. The magic will be in modeling and analyzing the data to support good decisions. There are several areas of decision making that we're considering as examples. One is fairly prosaic, one is very interesting and far-reaching, and the others are somewhat in between.

  1. A fairly simple example would be to take our blog statistics, a real-time stream using SQLstream's Twitter API, and run experiments to determine whether or not, and possibly how, Twitter affects traffic to and interaction with our blogs. Possibly, we could get to the point where we can predict how our use of Twitter will affect our blog.
  2. A much more far-reaching idea was presented to me by Ken Winnick, via Twitter, and has created an ongoing Twitter conversation and hashtag, #BPgulfDB. Let's take crowd-sourced, government, and other publicly available data about the recent oil spill in the Gulf of Mexico, and analyze it.
  3. Another idea is to take historical home utility usage plus current smart meter usage data, and create a real-time dashboard, and even predictives, for reducing and managing energy usage.
  4. We also have the opportunity of using public data to enhance reporting and analytics for small, rural and research hospitals.

OSS DSS Formalization

The next step in our open source solutions (OSS) for decision support systems (DSS) study guide (SG), according to the syllabus, is to make our first decision: a formal definition of "Decision Support System". Next, and soon, will be a post listing the technologies that will contribute to our studies.

The first stop in looking for a definition of anything today, is Wikipedia. And indeed, Wikipedia does have a nice article on DSS. One of the things that I find most informative about Wikipedia articles, is the "Talk" page for an article. The DSS discussion is rather mild though, no ongoing debate as can be found on some other talk pages, such as the discussion about Business Intelligence. The talk pages also change more often, and provide insight into the thoughts that go into the main article.

And of course, the second stop is a Google search for Decision Support System; a search on DSS is not nearly as fruitful for our purposes. :)

Once upon a time, we might have gone to a library and thumbed through the card catalog to find some books on Decision Support Systems. A more popular approach today would be to search Amazon for Decision Support books. There are several books in my library that you might find interesting for different reasons:

  1. Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL by Roland Bouman & Jos van Dongen provides a very good overview of data warehousing, business intelligence and data mining, all key components to a DSS, and does so within the context of the open source Pentaho suite
  2. Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions by James Taylor & Neil Raden introduces business concepts for truly managing information and using decision support systems, as well as being a primer on data warehousing and business intelligence, but goes beyond this by automating the data flow and decision making processes
  3. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss & Shaku Atre takes a business, program and project management approach to implementing DSS within a company, introducing fundamental concepts at a clear, though simplistic, level
  4. Competing on Analytics: The New Science of Winning by Thomas H. Davenport & Jeanne G. Harris in many ways goes into the next generation of decision support by showing how data, statistical and quantitative analysis, within the context of specific processes, gives businesses a strong lead over their competition, although it does so at a very simplistic, formulaic level

These books range from being technology focused to being general business books, but they all provide insight into how various components of DSS fit into a business, and different approaches to implementing them. None of them actually provide a complete DSS, and only the first focuses on OSS. If you followed the Amazon search link given previously, you might also have noticed that there are books that show Excel as a DSS, and there is a preponderance of books that focus on the biomedical/pharmaceutical/healthcare industry. Another focus area is in using geographic information systems (actually one of the first uses for multi-dimensional databases) for decision support. There are several books in this search that look good, but haven't made it into my library as yet. I would love to hear your recommendations (perhaps in the comments).

From all of this, and our experiences in implementing various DW, BI and DSS programs, I'm going to give a definition of DSS. From a previous post in this DSS SG, we have the following:

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.
-- Questions and Commonality

As we stated, this is vague and generic. Now that we've done some reading, let's see if we can do better.

A DSS assists an individual in reaching the best possible conclusion, resolution or course of action in stand-alone, iterative or interdependent situations, by using historical and current structured and unstructured data, collaboration with colleagues, and personal knowledge to predict the outcome or infer the consequences.

I like that definition, but your comments will help to refine it.

Note that we make no mention of specific processes, nor any technology whatsoever. It reflects my bias that decisions are made by individuals not groups (electoral systems notwithstanding). To be true to our "TeleInterActive Lifestyle" ;) I should point out that the DSS must be available when and where the individual needs to make the decision.

Any comments?

Syllabus for OSS DSS Studies

As promised, here's the syllabus for our study guide to decision support systems using open source solutions. We'll start with a first draft on 2010-03-23, and update and change based on ideas, comments and lessons learned. So, please comment. :) The updates will be marked. Deletions will be marked with a strike-through and not removed.

  1. Introduction
    1. Continuing the discussion of the processes and technologies that constitute a decision support system
    2. Formalizing a definition of DSS as well as the components, such as business intelligence (BI) that contribute to a DSS
    3. Providing [and updating] the list of references for this study guide
  2. Preparation
    1. Discussing the technology for use in this study guide including the client(s) and server (Red Hat Enterprise Linux 5)
    2. Checking for prerequisites for the open source solutions that will be used
    3. Hands-on exercises for preparing the system
  3. Installation
    1. Pointers and examples for installing the open source server-side packages including but not limited to:
      1. LucidDB
      2. Pentaho BI-Server, including PAT, and Administrative Console
      3. RServe and/or RApache
    2. Pointers for installation of client-side software and some examples on MacOSX
  4. Modeling
    1. Generally, we would determine the models, the architecture and then one (or more competing) design(s) to satisfy that architecture, including selecting the right technical solutions for the job at hand. Here, we're creating a learning environment for certain tools, so we're introducing the architecture and design studies after the technology installs.
    2. In general, this section will explore the various means of modeling processes, systems and data, specifically as these relate to making decisions.
    3. Decision Making Processes
      1. Decision Theory
      2. Game Theory
      3. Machine Learning & Data Mining
      4. Bayes and Iterations
      5. Predictives
    4. Information Flow
    5. Mathematical Modeling
    6. Data Modeling
    7. UML
    8. Dimensional Modeling
    9. PMML
  5. Architecture and Design
    1. In this section, we'll examine the differences between enterprise and system architecture, and between architecture and design. We'll look at various architectural and design elements that might influence both policy and technology directions.
    2. Discussing Enterprise Architecture, especially the translation between the user needs and technology/operational realities
    3. System Architecture
    4. SOA, ReST, WSDL, and Master Data Management
    5. Technology selection and vendor bake-offs
  6. Implementation Considerations
    1. Discussing the various philosophies and considerations for implementing any DSS, or really, any system integration project. We'll look at our own three track implementation methodology, as well as how the new Pentaho Agile BI tools support our method. In addition, we'll consider how we'll get all these OSS tools working together, on the same data sets, as well as, the importance of managing data about the data.
    2. Pentaho Agile BI and our own 8D™ Method
    3. System and Data Integration
    4. Metadata
  7. Using the Tools
    1. This is the vaguest part of our syllabus. We'll be using the examples from our various references, but with the system we've set up here, rather than the exact systems that the references use. For example, we'll be using LucidDB and not MySQL for the examples from Pentaho Solutions. Remember too, that this is a study guide, and not meant to be a book written as a series of blog posts, so while we might vary from the reference materials, we'll always refer to them.
    2. ETL
    3. Reporting
    4. OLAP
    5. Data Mining & Machine Learning
    6. Statistical Analysis
    7. Predictives
    8. Workflow
    9. Collaboration
    10. Hmm, this should take years :D

Questions and Commonality

In the introduction to our open source solutions (OSS) for decision support systems (DSS) study guide (SG), I gave a variety of examples of activities that might be considered using a DSS. I asked some questions as to what common elements exist among these activities that might help us to define a modern platform for DSS, and whether or not we could build such a system using open source solutions.

In this post, let's examine the first of those questions and see if we can start answering it. In the next post, we will lay out a syllabus of sorts for this OSS DSS SG.

The first common element is that in all cases, we have an individual doing the activity, not a machine nor a committee.

Secondly, the individual has some resources at their disposal. Those resources include current and historical information, structured and unstructured data, communiqués and opinions, and some amount of personal experience, augmented by the experience of others.

Thirdly, though not explicit, there's the idea of digesting these resources and performing formal or informal analyses.

Fourthly, though again, not explicit, the concept of trying to predict what might happen next, or as a result of the decision is inherent to all of the examples.

Finally, there's collaboration involved. Few of us can make good decisions in a vacuum.

Of course, since the examples are fictional, and created by us, they represent our biases. If you had fingered our domain server back in 1993, or read our .project and .plan files from that time, you would have seen that we were interested in sharing information and analyses, while providing a framework for making decisions using such tools as email, gopher and electronic bulletin boards. So, if you identify any other commonalities, or think anything is missing, please join the discussion in the comments.

From these commonalities, can we begin to answer the first question we had asked: "What does this term [DSS] really mean?". Let's try.

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.

That's nice and vague; generic enough to be almost meaningless, but it provides some key points that will help us to bound the specifics as we go along. For example, if a process or technology doesn't help us to make a better decision, then it doesn't fit. If something allows us to make a better decision, but we can't define the process or identify the technology involved, it doesn't belong (e.g. "my gut tells me so").

Let's create a list from all of the above.

  1. Individual Decision Maker
  2. Process
  3. Technology
  4. Structured Data
  5. Unstructured Data
  6. Historical Information
  7. Current Information
  8. Communication
  9. Opinion
  10. Collaboration
  11. Analysis
  12. Prediction
  13. Personal Experience
  14. Others' Experience

What do you think? Does a modern system to support decisions need to cover all of these elements and no others? Is this list complete and sufficient? The comments are open.

First DSS Study Guide

Someone sitting in their study, looking at their books, journals, piles of scholarly periodicals and files of correspondence with learned colleagues probably didn't think that they were looking at their decision support system, but they were.

Someone sitting on the plains, looking at the conditions around them, smoke signals from distant tribe members, records knotted into a string, probably didn't think that they were looking at their decision support system, but they were.

Someone at the nexus of a modern military command, control, communications, computing and intelligence system, probably didn't think that they were looking at their decision support system, but they were.

Someone pulling data from transactional systems, and dumping the results of reports & analyses from a BI tool into a spreadsheet to feed a dashboard for the executives of a huge corporation probably didn't think that they were looking at their decision support system, but they were.

The term "decision support system" has been in use for over 50 years, perhaps longer.

  • But what does this term really mean?
  • What do all of my examples have in common?
  • How can we build a reasonable decision support system from open source solutions?
  • What resources exist to help us learn?

I'm starting a series of posts, essentially a "study guide" to help answer these questions.

I'll be drawing from and pointing to the following books and online resources as we install, configure and use open source systems to create a technical platform for a decision support system.

  1. Bayesian Computation in R by Jim Albert, Springer Series in UseR!, ISBN: 0-38-792297-0, Purchase from Amazon, you can also purchase the Kindle ebook from Amazon
  2. R in a Nutshell by Joseph Adler, ISBN: 0-59-68017-0X, Purchase from Amazon
  3. Pentaho Solutions; Business Intelligence and Data Warehousing with Pentaho and MySQL, by Roland Bouman and Jos van Dongen, ISBN: 0-47-048432-2, Purchase from Amazon
  4. Pentaho Reporting 3.5 for Java Developers by Will Gorman, ISBN: 1-84-719319-6, Purchase from Amazon
  5. Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration by Matt Casters, Roland Bouman & Jos van Dongen, ISBN: 0-47-063517-7 due 2010 September, Pre-Order from Amazon
  6. Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank, Second Edition, Morgan-Kaufmann Series in Data Management Systems, ISBN: 0-12-088407-0 a.k.a. "The Weka Book", Purchase from Amazon, Pre-Order the Third Edition, you can also purchase the Kindle ebook from Amazon
  7. LucidDB online documentation
  8. Pertinent information from Eigenbase
  9. LucidDB mailing list archive on Nabble
  10. Anything I can find on PAT
  11. Pentaho Community Forums, Wiki, WebEx Events, and other community sources
  12. R Mailing Lists and Forums
  13. Various Books in PDF from The R Project
  14. Information Management and Open Source Solution Blogs from our side-column linkblogs

In this study guide series of posts:

  • I'll show how data warehousing (DW) and business intelligence (BI) can be extended to include all the elements held in common from my DSS examples.
  • We'll examine the open source solutions Pentaho, R, Rserve, Rapache, LucidDB and possibly Map-Reduce & Key-value-stores, and the related open source projects, communities and companies in terms of how they can be used to create a DSS.
  • I would like to add a collaboration tool to the mix, as we do in our implementation projects, possibly Mindtouch, a ReSTful Wiki Platform.
  • We may add one non-open source package, SQLstream, that's built upon open source elements from Eigenbase. This will allow us to add a real-time component to our DSS.
  • I'll give my own experience in installing these packages and getting them to work together, with pointers to the resources listed above.
  • We'll explore sample and public data sets with the DSS environment we created, again with pointers to and help from the resources listed.

The purpose of this series of posts is a study guide, not an online book written as a blog. The goal is to help us to define a modern DSS and build it out of open source solutions, while using existing resources.

Please feel free to comment, especially if there is anything that you feel should be included beyond what I've outlined here.

Pentaho Reporting Review

As promised in my post, "Pentaho Reporting 3.5 for Java Developers First Look", I've taken the time to thoroughly grok Pentaho Reporting 3.5 for Java Developers by Will Gorman [direct link to Packt Publishing][Buy the book from Amazon]. I've read the book, cover-to-cover, and gone through the [non-Java] exercises. As I said in my first look at this book, it contains nuggets of wisdom and practicalities drawn from deep insider knowledge. This book does best serve its target audience, Java developers with a need to incorporate reporting into their applications. But it is also useful for report developers who wish to know more about Pentaho, and Pentaho users who wish to make their use of Pentaho easier and the resulting reporting experience richer.

The first three chapters provide a very good introduction to Pentaho Reporting and its relationship to the Pentaho BI Suite and the company Pentaho: historical, technical and practical. These three chapters are also the ones that have clearly marked sections for Java specific information and exercises. By the end of Chapter Three, you'll have installed Pentaho Report Designer, and built several rich reports. If you're a Java developer, you'll have had the opportunity to incorporate these reports into both Tomcat J2EE web applications and Swing applications. You'll have been introduced to the rich reporting capabilities of Pentaho, accessing data sources, the underlying Java libraries, and the various output options that include PDF, Excel, CSV, RTF, XML and plain text.

Chapters 4 through 8 are all about the WYSIWYG Pentaho Report Designer, the pixel-level control that it gives you over the layout of your reports, and the many wonderful capabilities provided by Pentaho Reporting, from a wide range of chart types to embedding numeric and text functions, to cross-tabs and sub-reports. Other than Chapter 5, these chapters are as useful for a business user creating their own reports as they are for a report developer. Chapter 5 is a very deep, very technical look at incorporating various data sources. The two areas that really stand out are the charts (Chapter 6) and functions (Chapter 7).

There are a baker's dozen types of charts covered, with an example for each type. Some of the more exotic are Waterfall, Bar-Line, Radar and Extended XY Series charts.

There are hundreds of parameters, functions and expressions that can be used in Pentaho Reports, and Will covers them all. The formula capability of Pentaho Reporting follows the OpenFormula standard, similar to the support for formulæ in Microsoft Excel, and the same as that followed by One can provide computed text or numeric values within Pentaho reports to a fairly complex extent. Chapter 7 provides a great introduction to using this feature.

Chapters 9 through 11 are very much for the software developer, covering the development of Interactive Reports in Swing and HTML, the use of Pentaho's APIs and extension of Pentaho Reporting capabilities. It's all interesting stuff, that really explains the technology of Pentaho Reporting, but there's little here that is of use to the business user or non-Java report developer.

The first part of Chapter 12, on the other hand, is of little use to the Java developer, as it shows how to take reports created in Pentaho Report Designer and publish them through the Pentaho BI-Server, including formats suitable for mobile devices, such as the iPhone. The latter part of Chapter 12 goes into the use of metadata, and is useful for both the report developer and the Java developer.

So, as I said in my first look, the majority of the book is useful even if you're not a Java developer who needs to incorporate sophisticated reports into your application. That being said, Will Gorman does an excellent job in explaining Pentaho Reporting, and making it very useful for business users, report designers, report developers and, his target audience, Java developers. I heartily recommend that you buy this book. [Amazon link]

Pentaho Reporting 3.5 for Java Developers First Look

I was approached by Richard Dias of Packt Publishing to review "Pentaho Reporting 3.5 for Java Developers" written by Will Gorman. (Link is to

Richard Dias has indicated you are a Friend:

Hi Joseph,

My name is Richard Dias and I work for Packt Publishing which specializes in publishing focused IT related books.

I was wondering if you would be interested in reviewing the book "Pentaho Reporting for Java Developers" written by Will Gorman.

- Richard Dias

After some back and forth, I decided to accept the book in exchange for my review.

Hi Joseph,

Thanks for the reply and interest in reviewing the book. I have just placed an order for a copy of the book and it should arrive at your place within 10 days. Please do let me know when you receive it.

I have also created a unique link for you. It is Please feel free to use this link in your book review.

In the meanwhile, if you could mention about the book on your blog and tweet about the book, it would be highly appreciated. Please do let me know if it is fine with you.

I’m also sending you the link of an extracted chapter from the book (Chapter 6 Including Charts and Graphics in Reports). It would be great if you could put up the link on your blog. This would act as first hand information for your readers and they will also be able to download the file.

Any queries or suggestions are always welcome.

I look forward to your reply.

Best Regards,


Richard Dias
Marketing Research Executive | Packt Publishing |

Shortly thereafter, I received notification that the book had shipped. It arrived within two weeks.

Of course, I've been too busy to do more than skim through the book. Anyone who follows me as JAdP on Twitter knows that in the past few weeks, I've been:

  • helping customers with algorithm development and implementing Pentaho on LucidDB,
  • working with Nicholas Goodman on his planning for commercial support of LucidDB through Dynamo Business Intelligence, and roadmaps for DynamoDB packages built on LucidDB's plugin architecture, and
  • migrating our RHEL host at ServerBeach from our old machine to a new one, while dealing with issues brought about by ServerBeach migrating to Peer1's tools.

None of which has left any time for a thorough review of "Pentaho Reporting for Java Developers".

I hope to have a full review up shortly after the holidays, which for me run from Solstice to Epiphany, and maybe into the following weekend.

First, a little background. Will Gorman, the author, works for Pentaho, in software engineering, as a team lead, and works primarily on Pentaho Reporting products, a combination of server-side (Pentaho BI-Server), Desktop (MacOSX, Linux and Windows platforms) and Web-based software (Reporting Engine, Report Designer, Report Design Wizard and Pentaho Ad Hoc Reporting), which stems from the open source JFreeReport and JFreeChart. While I don't know Will personally, I do know quite a few individuals at Pentaho, and in the Pentaho community. I very much endorse their philosophy towards open source, and the way they've treated the open source projects and communities that they've integrated into their Pentaho Business Intelligence Suite. I do follow Will on Twitter, and on the Freenode IRC channel, ##pentaho.

I myself am not a Java Developer, so at first I was not attracted to a book with a title that seemed geared to Pentaho Developers. Having skimmed through the book, I think that the title was poorly chosen. (Sorry, Richard.) I find that I can read through the book without stumbling, and that there is plenty of good intelligence that will help me better serve and instruct my customers through the use of Pentaho Report Designer.

My initial impressions are good. The content seems full of golden nuggets of "how-tos" and background information not commonly known among the Pentaho community. Will's knowledge of Pentaho Reporting and how it fits into the rest of the Pentaho tools, such as KETTLE (Pentaho Data Integration) and Mondrian (Pentaho Analysis), along with a clear writing style, makes all aspects of Pentaho more accessible to the BI practitioner, as well as to those who wish to embed Pentaho Reporting into their own application.

This book is not just for Java developers, but for anyone who wishes to extend their abilities in BI, Reporting and Analysis, with Pentaho as an excellent example.

I'll be following up with the really exciting finds as I wend my way through Will's gold mine of knowledge, and will do my best to fulfill my promise of a full review by mid-January.

You can also click through to Chapter 6 (a PDF), as mentioned in Richard's email.

Thank you, Richard. And most especially, thank you, Will.

SQLStream v2 Real Time BI

Today, SQLStream announced version 2.0 of their Real Time BI solution. SQLStream comes from the fertile creativity of Julian Hyde, who is also the founder of the open source Mondrian OLAP engine. While SQLStream is not open source, it does stem from the open source Eigenbase community, leveraging the user-defined transforms that were originally developed for LucidDB to operate on traditional stored relational data, with SQL:2003-compliant syntax. SQLStream extends this to handle streaming relational data.

In addition to capturing standard, structured data while "on the wire", SQLStream also includes adapters for feeds, such as Atom and RSS, and for Twitter.

Methinks Julian and I need to schedule another lunch soon, so that I can learn more about how this unstructured data, especially from Twitter, can fit into real time analytics provided by SQLStream v2.0.

BTW, you can follow me on Twitter as @JAdP.


At the beginning, The Open Source Solutions Blog was a companion to the Open Source Solutions for Business Intelligence Research Project, and book. But back in 2005, we couldn't find a publisher. As Apache Hadoop and its family of open source projects proliferated, and in many ways, took over the OSS data management and analytics world, our interests became more focused on streaming data management and analytics for IoT, the architecture for people, processes and technology required to bring value from the IoT through Sensor Analytics Ecosystems, and the maturity model organizations will need to follow to achieve SAEIoT success. OSS is very important in this world too, for DMA, API and community development.



Our current thinking on sensor analytics ecosystems (SAE) bringing together critical solution spaces best addressed by Internet of Things (IoT) and advances in Data Management and Analytics (DMA) is here.
