Category: "Data Warehousing"

Information Architecture and DynamoBI

Anyone who follows either Nicholas Goodman or me on Twitter (links are to our Twitter handles), or who follows either this blog or Nick's Goodman on BI blog, knows that I've been helping Nick out here and there with his new business, Dynamo Business Intelligence Corporation, which offers support and commercial (and still open source) packages of the "best column-store database you never heard of", LucidDB.

One of the things that I'll be doing over the next few weeks is some website and community development. For all that I've been an executive type for decades, I love to keep hands-on with various technologies, and one of those technologies is "THE WEB". While I've never made a living as a web developer, I started with the web very early on, developing internal sites for the Lynx browser, as one of the internal web chiefs, learning from Comet, the Oracle web master. The first commercial site that I did, in 1994, for the local Eagle Express Flowers, is still up, with a few modernizations. :)

So, while waiting for the style guide from CORHOUSE, who designed the new Dynamo Business Intelligence Corporation logo [what do you think of it?]…

I've decided to go through an old friend: Information Architecture for the World Wide Web: Designing Large-Scale Web Sites.

This exercise has reminded me that Information Architecture isn't just important for websites, but also for all the ways that individuals and businesses organize their data, concepts, information and knowledge. I'm happy to be helping out DynamoBI, and glad that doing so led me to this reminder of something I've been taking for granted. Time to revisit those [Ever]notes, [Zotero] researches, files and what not.

OpenSQLcamp Play with Data

On November 14th and 15th, I attended openSQLcamp 2009 in Portland, OR. It was a great event, and I was honored to be accepted for a five-minute lightning talk: "I Play with Data". I would like to thank Sheeri (aka tcation) for providing YouTube videos of the lightning talks. Here's mine:

And here's a transcript, with links to things that were mentioned.

Hi, my name is Joseph, and I play with data.

It's good that I followed David [David J. Lutz, Director of Technical Sales Support, Infobright] because part of what I'm looking for, in the solution of how to do statistics with SQL, is column-store databases.

Way back in the 70's & 80's, I was doing pair programming with FORTRAN programmers [laughter in background] :D turning algorithms into software. I was, pair programming, we sat down together, I would write math, they would write software, we did things [mostly in Bayes], through the 80's [most with Wendy, who still works with me occasionally].

Then I started playing with data through other people's algorithms using SQL, and relational database management systems, and then later, Business Intelligence systems, and most recently playing a lot with Pentaho, using that.

And I'm going to make a lot of statements, but I really have a question. I know of three ways that I can start doing real statistics with SQL databases. And I want to do real statistics because the most you can get just with AVERAGE, is, assuming that I have a uniform distribution or a normal distribution, and even in many cases, an average isn't necessarily the mean, and the mean is certainly not the best descriptor of the underlying distribution of the data. Right?

So, I can start doing fancier algorithms in SQL, but they're painful. And you know the big-O number, and they're nasty big-O numbers, to do, even if I have a frequency function, to try to arrive at the mean or the mode, simple things.
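[A minimal sketch of the point above, added for illustration and not part of the talk. It uses Python's built-in sqlite3 module as a stand-in for any SQL database; the table name `obs`, the column `x`, and the function names are hypothetical. SQL builds the frequency function with GROUP BY, and the mean, mode and median are then derived from the much smaller frequency table.]

```python
import sqlite3


def median_from_freq(freq, total):
    """Walk the cumulative counts of a sorted (value, count) list to find the median."""
    lo_rank = (total - 1) // 2  # 0-based rank of the lower middle observation
    hi_rank = total // 2        # 0-based rank of the upper middle observation
    seen = 0
    lo_val = hi_val = None
    for x, n in freq:
        seen += n
        if lo_val is None and seen > lo_rank:
            lo_val = x
        if seen > hi_rank:
            hi_val = x
            break
    return (lo_val + hi_val) / 2


def frequency_summary(values):
    """Load values into a throwaway table and summarize them via a frequency table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL)")  # hypothetical table and column
    conn.executemany("INSERT INTO obs (x) VALUES (?)", [(v,) for v in values])
    # The "frequency function": one row per distinct value, with its count.
    freq = conn.execute(
        "SELECT x, COUNT(*) AS n FROM obs GROUP BY x ORDER BY x"
    ).fetchall()
    conn.close()
    total = sum(n for _, n in freq)
    mean = sum(x * n for x, n in freq) / total
    mode = max(freq, key=lambda row: row[1])[0]  # ties resolve to the smallest value
    return {"mean": mean, "mode": mode, "median": median_from_freq(freq, total)}


if __name__ == "__main__":
    # One outlier drags the mean far away from the mode and the median.
    print(frequency_summary([1, 2, 2, 3, 100]))
```

[With a single large outlier in the sample, the mean lands far from both the mode and the median, which is why an average alone can be a poor descriptor of the underlying distribution.]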

And if I want to do Bayesian statistics, and a Markov Chain Monte Carlo simulation to get at inferences on mathematical conjugates [snickering in the background] ;) … I'm not going to do this in SQL.

So, I have two other choices that I've been exploring.

Anyone here familiar with the R Project? [Several affirmative responses] Ya! Yeah! All right! I love the R Project, and I'm having a lot of fun with the R Project. The R Project is written in R and C and FORTRAN and there are thousands of packages written in FORTRAN and C and R and I'm doing a lot of nice math with it now, and that's a lot of fun. But everything in R is actually in data sets, and data sets are column-store databases, in memory. And even though you can get 8GB of memory on a laptop now, I run out of memory, frequently, with the type of stuff I do. So, what do I do? I use SQL, because relational database management systems manage data really, really well, and R analyzes the data really, really well, and R speaks SQL through either RODBC, or DBI… Off you go.
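[A rough sketch of the division of labor described above, added for illustration and not part of the talk. It is written in Python rather than R, with the built-in sqlite3 module standing in for the database connection that RODBC or DBI would provide; the function name `running_stats` and the `chunk_size` parameter are hypothetical. The database hands over one column in chunks, and a running (Welford) update keeps the mean and standard deviation without ever holding the whole column in memory.]

```python
import math
import sqlite3


def running_stats(conn, query, chunk_size=1000):
    """Stream a single numeric column from `query`; return (count, mean, sample std)."""
    cursor = conn.execute(query)
    count, mean, m2 = 0, 0.0, 0.0
    while True:
        rows = cursor.fetchmany(chunk_size)  # only one chunk is in memory at a time
        if not rows:
            break
        for (x,) in rows:
            # Welford's update: numerically stable running mean and sum of squares.
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else float("nan")
    return count, mean, std


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL)")  # hypothetical table and column
    conn.executemany("INSERT INTO obs (x) VALUES (?)", [(i,) for i in range(1, 11)])
    print(running_stats(conn, "SELECT x FROM obs", chunk_size=3))
    conn.close()
```

[This only works for statistics that can be updated incrementally; anything that needs the full data set at once still runs into the memory limit mentioned above.]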

So, I would like to use column-store databases, and one of my questions is that I'm looking for a way of speeding this up, so that I can match a column-store data set in R in memory with a column-store database such as Infobright or MonetDB or LucidDB. And do this one-to-one mapping much more efficiently than I can going through ODBC.

Does anyone have any thoughts on this?

[Discussion with someone in the audience - if you read this, please identify yourself in the comments, and thank you for talking to me] Have you heard of TL/R [my error in listening]?

I have not. I've never heard of TL/R.

It's R embedded in PostgreSQL.

OK, yes, I have. Did you say TL or PL?

PL. [PL/R by Joe Conway is back in development and becoming interesting again].

Yeah, PL/R I know. And there's a lot of things like that, but they're basically interfaces.



Yeah, which isn't all that mature. It tries to map the name of the dataframe in R, where you're doing your stuff in R, to a table in MySQL [in the weeds]. Which is really what you want, is to prodSQL, is that relationship of the sets, where basically you overloaded the dataframe… so you can access… overloaded the access operator… to go out to a SQL area, however it does it.


A third solution that I've been looking at is LucidDB, which is a column-store database with a plug-in architecture, written in Java. And there is the math commons on [oops] packages which have real statistic packages, probability distribution packages, all sorts of really neat packages, which are essentially Java libraries and I would like to see real statistics written into LucidDB as plug-ins for LucidDB [horn sounds] If anyone is interested. Thank you so much.

The notes taken during the lightning rounds were written by Ben Hengst, and can be found at openSQLcamp Lightening Talks.

That last part is really the most important to me. I'm working with Nick Goodman, who recently started Dynamo Business Intelligence, and with advice from Julian Hyde and others in the Eigenbase community, to develop plugins for LucidDB which might be bundled into ADBMS versions of DynamoDB to do real statistics, making real inferences and real predictions, using the math packages from the Apache Commons, and having a transparent interface to R, so that R isn't limited by in-memory constraints.

Why not join us on ##luciddb and discuss it?

Microsoft Acquires Datallegro whither Ingres

I've been "hearing" all day on Twitter that Microsoft would be announcing something big at OSCON2008. Perhaps this is it:

Microsoft today announced that it intends to acquire DATAllegro, provider of breakthrough data warehouse appliances. The acquisition will extend the capabilities of Microsoft’s mission-critical data platform, making it easier and more cost effective for customers of all sizes to manage and glean insight from the ever expanding amount of data generated by and for businesses, employees and consumers.
-- Press Release "Microsoft to Acquire DATAllegro"

This is very interesting given the progress that Microsoft has made with its analytic services binding MS Office and SQL Server. Further quoting from the press release:

“Microsoft SQL Server 2008 delivers enterprise-class capabilities in business intelligence and data warehousing and the addition of the DATAllegro team and their technology will take our data platform to the highest scale of data warehousing.”
-- Ted Kummert, corporate vice president of the Data and Storage Platform Division at Microsoft

The direction for DATAllegro's data warehouse appliance is also made clear in the press release:

“DATAllegro's integration with SQL Server is the optimal next generation solution and the acquisition by Microsoft is a great conclusion for the company.”
-- Lisa Lambert, Intel Capital managing director, Software and Solutions Group.

For those who don't know, DATAllegro is a data warehousing appliance company that utilizes "EMC® storage, Dell™ servers, Cisco® InfiniBand switches, Intel® multi-core CPUs and the Ingres® open source database".

So, whither Ingres in this acquisition? As we've written before here, Ingres is one of the earliest and strongest RDBMS products, which was absorbed by CA and then spun off again with an open source play in 2005. MS SQL Server, of course, started out as a rebranding of Sybase SQL*Server, until the partnership dissolved in the mid-1990's. Since then, MS SQL Server has been geared mostly as a workgroup and data mart server. It seems that a switch from Ingres to MS SQL Server could heavily undermine DATAllegro's business. In addition, the switchover in code to T-SQL will be a nightmare for developers. Add to that the challenges of moving from Linux to MS Windows, and from C/C++ to C# and it will take quite some time in production environments to iron out all the wrinkles.

In addition, while most seem to think that this puts Microsoft in a good position to challenge Oracle for the Enterprise Data Warehouse lead, it actually puts Microsoft directly into competition with other DW appliance vendors, such as Teradata. I truly doubt that this move will position Microsoft strongly into competition with either Oracle or Teradata, but merely marks another tactical error in Microsoft's increasingly desperate acquisition strategy to move deeper into the Enterprise on one hand, while striving to move further into the online space on the other.


John Sichi of LucidDB

Earlier today, I met John Sichi for coffee at the Half Moon Bay Coffee Co. in the Stone Pine Center. John is also a Coastsider and very involved in open source data management & analytics. We spoke of many things: our histories, folk we know in common such as Julian Hyde and Nicholas Goodman and some Oracle alums, happenings in the open source BI world, Pentaho, JasperSoft, SpagoBI, and lots of good story telling.

Mostly though, we spoke of LucidDB, LucidEra, & metadata management. I've been asked not to blog about some of the things we discussed, so I'm just going to be safe and say that I am very impressed with what LucidEra is doing in BI SaaS. I'm also looking toward including the amazing capabilities of the column-store open source LucidDB in some engagement, somewhere, as soon as I can.

Complex Data Visualization at MySQL BI DW BoaF

Just got home from the MySQL Data Warehousing and BI Birds of a Feather gathering (BoaF). I'm tired, but my mind is on overdrive. 'Tis a great feeling.

First, I want to thank Lance Walter of Pentaho for introducing Clarise and me to the group as publishers of the OSBI Lens on Squidoo, this blog and the OSS wiki.

Clarise and I had a great conversation with Dr. Jacob Nikom of the MIT Lincoln Laboratory. The conversation ranged from the great Chimay beer that Matt Casters, of Pentaho and Lead Architect of KETTLE, brought with him from Belgium, to

  • Data Modeling and Relational Algebra/Theory: the integrity of the model
  • Bayesian Statistics, Weibull Analysis and Tensor Calculus for mathematical modeling of complex systems [I love it when n-dimensional eigenvalues start floating in front of my eyes]
  • Meeting the needs of different types of users: managers, scientists, business folk
  • supplementing historical data warehouses with [near] real time data using ESB and dashboards
  • data visualization of complex data sets such that the analyses and limitations can be grasped at different levels by different users
  • collaboration among distributed workgroups of disparate career backgrounds and cultural pre-dispositions
  • use of Second Life and other virtual worlds for collaboration and data visualization
  • a calculator is to a computer [think if statement] as a flat file is to a relational database [think where clause]
  • early USSR vs. British knock-offs of IBM mainframes
  • Complexity as the balance of robustness and fragility

At various times in this discussion, we were joined by Sherman Wood, Director of BI at JasperSoft, and one of the legendary Mondrian developers, Julian Hyde of Pentaho and Mondrian Lead Architect, and Nicholas Goodman, Director of BI for Pentaho.

And if you put Nick and dashboards and virtual worlds in the same post, then you have to mention Discoverer meets Duke Nukem.

Jacob, et al., thank you so very much for a great conversation.

Campus Technology 2007 Schedule

The schedule for the Campus Technology 2007 conference is online; a PDF of the brochure is also available for download. In addition to our session, there are several other workshops and talks related to either BI/DW or open source solutions.

We're hoping to coordinate with other speakers, so that our sessions are complementary and avoid duplication.

Will you be going to CT 2007? What would you like to see discussed in terms of BI/DW and open source solutions? See you there.

Campus Technology 2007

Mary Grush has invited Clarise and me to speak at the Campus Technology 2007 conference to be held in Washington, D.C., USA. We'll be speaking on Wednesday, 2007 August 1 at 11:15am-12:15pm. Institutes of higher learning are, in general, only beginning to explore data warehousing and business intelligence technologies, and they don't like what they're seeing from traditional, proprietary vendors. From our initial conversations with Mary, here's our direction. We'll develop this here in the OSS Blog as much as we can. We would really appreciate any comments to help us refine our talk.

Cost Effective BI/DW Strategy


Our strategy for reporting, data management and analysis programs and projects responds to user needs quickly without blowing the budget. Using open source software, project management, and user involvement, this strategy economically and efficiently meets campus-wide and departmental data warehouse, data mart, and business intelligence needs through dashboards, reporting, OLAP, and data mining tools. Cost effective results can be in users' hands in as little as one week.

Points to be covered
  1. A framework leading to an economical strategy/BI-roadmap for data warehousing, data management or data analytics programs
  2. Program, Project & Risk Management methods
  3. Risks and advantages of using open source solutions for BI suites or DW/data mining components such as ETL/EAI/ESB, RDBMS & MDDB, meta data management, reporting, OLAP engines, multi-variate analysis (a.k.a. "slice & dice"), machine learning, portals, and dashboards
  4. User involvement for determining specifications and implementing quality control
  5. Costing, value and return
Take-away Points
  1. Strategy and tactics should be separated with a clear iteration plan for quick, economical response to user & organizational needs
  2. Agile development doesn't mean a lack of project management nor should it allow scope creep
  3. Open source software has matured to the point where it can certainly be used for prototyping and even production

Open Source: Closing thoughts of Vladimir Stojanovski

Over the past five years, our research into open source BI components has shown few projects supporting BI, and no BI suites, until recently. Bee is the oldest of the open source BI suites, starting in 2002. Five years ago, there was one open source project developing an Extract, Transform and Load (ETL) tool - Jetstream, one for reporting - JasperReports, one for analysis - Mondrian. Of the open source Relational Database Management Systems (RDBMS), none were optimized for very large databases, or for querying, until this year. There are now over 25 open source projects supporting every aspect of BI, from ETL to the user Portal, including reporting, on-line analytical processing (OLAP), advanced analytics and data mining, workflow, and dashboards. Six of these can be considered BI suites, with all but Bee having launched this year.

Vladimir Stojanovski has written a five-part article in his blog at ITtoolbox. Part of his conclusion is quoted below.

Call me shortsighted, but then this nomer could also apply to the CRM/BI industry indiscriminately (except for the brave souls at places like SugarCRM [see post Open Source: CRM and Business Intelligence (Part 2 - SugarCRM)] and Pentaho [see post Open Source: CRM and Business Intelligence (Part 3 - Pentaho, et al)]). The industry is finally being forced to take Open Source seriously not necessarily because we think it is a great movement, but because our clients are forcing us to do so. An increasing number of companies are adopting Open Source in fundamental areas such as operating systems (Linux), database platforms (MySQL, PostgreSQL), application servers (JBoss), and web servers (Apache). This foundational platform is then forcing itself onto enterprise-class applications, such as CRM.
-- Open Source: Closing thoughts, I think... (Part 5) by Vladimir Stojanovski

As shown in my opening paragraph, the open source movement is responding to the interest in open source solutions for enterprise applications, particularly BI. You can check out the links in the side column of this blog for a list of open source BI suites and tools being developed. We'll be continuing with our research and use of open source BI solutions over the next year, and I think it will be some time beyond that before we, or Vladimir, or anyone else, actually writes the final Closing Thoughts on open source BI.


Recently, Navica and InterASC teamed up on a project where the customer required that we use PostgreSQL as a central data warehouse. Clarise pointed out that PostgreSQL lacked essential attributes to be an efficient platform for data warehousing. In investigating alternatives, we discovered Bizgres. Bizgres is a separate distribution based on PostgreSQL with the primary purpose of filling exactly those gaps Clarise had highlighted, such as table partitioning and bitmap indexing, and the secondary purpose of building a BI suite. As we develop Open Source Business Intelligence, we'll be writing posts describing the enhancements that Bizgres is making to PostgreSQL, and comparing Bizgres to Oracle as a DW platform.



At the beginning, The Open Source Solutions Blog was a companion to the Open Source Solutions for Business Intelligence Research Project, and book. But back in 2005, we couldn't find a publisher. As Apache Hadoop and its family of open source projects proliferated, and in many ways, took over the OSS data management and analytics world, our interests became more focused on streaming data management and analytics for IoT, the architecture for people, processes and technology required to bring value from the IoT through Sensor Analytics Ecosystems, and the maturity model organizations will need to follow to achieve SAEIoT success. OSS is very important in this world too, for DMA, API and community development.
