Category: "OLAP"

OpenSQLcamp Play with Data

On November 14 and 15th, I attended openSQLcamp 2009 in Portland, OR. It was a great event, and I was honored to be accepted for a five-minute lightning talk: "I Play with Data". I would like to thank Sheeri (aka tcation) for providing YouTube videos of the lightning talks. Here's mine:

And here's a transcript, with links to things that were mentioned.

Hi, my name is Joseph, and I play with data.

It's good that I followed David [David J. Lutz, Director of Technical Sales Support, Infobright] because part of what I'm looking for, in the solution of how to do statistics with SQL, is column-store databases.

Way back in the '70s and '80s, I was doing pair programming with FORTRAN programmers [laughter in background] :D turning algorithms into software. We sat down together: I would write math, they would write software. We did things [mostly in Bayes] through the '80s [mostly with Wendy, who still works with me occasionally].

Then I started playing with data through other people's algorithms, using SQL and relational database management systems, and then later Business Intelligence systems, and most recently playing a lot with Pentaho.

And I'm going to make a lot of statements, but I really have a question. I know of three ways that I can start doing real statistics with SQL databases. And I want to do real statistics because the most you can get just with AVERAGE assumes that I have a uniform distribution or a normal distribution; in many cases the average isn't necessarily the mean, and the mean is certainly not the best descriptor of the underlying distribution of the data. Right?
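The point about averages is easy to see with a tiny example. This is a minimal Python sketch with made-up numbers (nothing from the talk), showing how a single outlier drags the mean far from the typical value while the median stays put:

```python
import statistics

# A small right-skewed sample (e.g. response times): one outlier
# pulls the mean well above the "typical" value.
data = [1, 2, 2, 3, 3, 3, 4, 100]

mean = statistics.mean(data)      # 14.75
median = statistics.median(data)  # 3.0

print(mean, median)
```

Reporting only the average here would badly misrepresent the underlying distribution.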

So, I can start doing fancier algorithms in SQL, but they're painful. And you know the big-O numbers, and they're nasty big-O numbers, even if I have a frequency function, just to arrive at the mean or the mode, simple things.
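What "painful" looks like in practice can be sketched in a few lines. This uses Python's built-in SQLite and a hypothetical `obs` table (both my own stand-ins, not anything from the talk): even the mode requires building a frequency table with GROUP BY, already clumsier than a one-call aggregate like AVG():

```python
import sqlite3

# Hypothetical table of observations. Computing the mode in SQL means
# building a frequency table first (GROUP BY), then taking the most
# frequent value -- workable, but clumsier than a built-in aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (value INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(v,) for v in [1, 2, 2, 3, 3, 3, 4]])

mode = conn.execute(
    "SELECT value FROM obs GROUP BY value ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]
print(mode)  # 3
```

And that's one of the *simple* statistics; anything distributional gets much worse.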

And if I want to do Bayesian statistics, and a Markov Chain Monte Carlo simulation to get at inferences on mathematical conjugates [snickering in the background] &#59;) … I'm not going to do this in SQL.

So, I have two other choices that I've been exploring.

Anyone here familiar with the R Project? [Several affirmative responses] Ya! Yeah! All right! I love the R Project, and I'm having a lot of fun with it. The R Project is written in R and C and FORTRAN, and there are thousands of packages written in FORTRAN and C and R, and I'm doing a lot of nice math with it now, and that's a lot of fun. But everything in R is actually in data sets, and data sets are column-store databases, in memory. And even though you can get 8GB of memory on a laptop now, I run out of memory, frequently, with the type of stuff I do. So, what do I do? I use SQL, because relational database management systems manage data really, really well, and R analyzes the data really, really well, and R speaks SQL through either RODBC, or DBI… Off you go.
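The division of labor described here — let the RDBMS manage the data, and pull only the slice you need into memory for analysis — can be sketched as follows. Python and an in-memory SQLite database stand in for R, RODBC/DBI, and a real database server; the table and values are invented for illustration:

```python
import sqlite3
import statistics

# Keep the full data set in the RDBMS; bring only the slice you are
# analyzing into memory (as R would via RODBC or DBI).
conn = sqlite3.connect(":memory:")  # stands in for a real server
conn.execute("CREATE TABLE measurements (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0)],
)

# The database does the filtering; only sensor "a" reaches memory.
rows = [r[0] for r in conn.execute(
    "SELECT reading FROM measurements WHERE sensor = 'a'")]
print(statistics.mean(rows))  # 2.0
```

The analysis layer never has to hold more than the working set, which is exactly what keeps the in-memory constraint from biting.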

So, I would like to use column-store databases, and one of my questions is that I'm looking for a way of speeding this up, so that I can match a column-store data set in R in memory with a column-store database such as Infobright or MonetDB or LucidDB. And do this one-to-one mapping much more efficiently than I can going through ODBC.

Does anyone have any thoughts on this?

[Discussion with someone in the audience - if you read this, please identify yourself in the comments, and thank you for talking to me] Have you heard of TL/R [my error in listening]?

I have not. I've never heard of TL/R.

It's R embedded in PostgreSQL.

OK, yes, I have. Did you say TL or PL?

PL. [PL/R by Joe Conway is back in development and becoming interesting again].

Yeah, PL/R I know. And there's a lot of things like that, but they're basically interfaces.

SQLDF?

SQLDF?

Yeah, which isn't all that mature. It tries to map the name of the dataframe in R, where you're doing your stuff in R, to a table in MySQL [in the weeds]. Which is really what you want, is to prodSQL, is that relationship of the sets, where basically you overloaded the dataframe… so you can access… overloaded the access operator… to go out to a SQL area, however it does it.

OK, so SQLDF.
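The sqldf idea — your in-memory data frame transparently becomes a SQL table named after the variable — can be approximated in a few lines. This is a hypothetical Python analogue (a dict of columns plays the data frame, SQLite runs the SQL); it is not how the R sqldf package is actually implemented:

```python
import sqlite3

def sqldf(query, frames):
    """Rough analogue of R's sqldf: load each in-memory 'data frame'
    (a dict of equal-length column lists) into a throwaway SQLite
    table named after the variable, then run SQL against it."""
    conn = sqlite3.connect(":memory:")
    for name, cols in frames.items():
        colnames = list(cols)
        conn.execute(f"CREATE TABLE {name} ({', '.join(colnames)})")
        placeholders = ", ".join("?" for _ in colnames)
        conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})",
                         zip(*cols.values()))
    return conn.execute(query).fetchall()

df = {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]}
print(sqldf("SELECT y FROM df WHERE x > 2", {"df": df}))  # [(30,), (40,)]
```

The appeal is exactly what's described above: the data-frame name maps straight onto a table name, so the SQL reads as if the frame had always lived in the database.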

A third solution that I've been looking at is LucidDB, which is a column-store database with a plug-in architecture, written in Java. And there are the Commons Math packages at Apache [I said apache.com, oops] which have real statistics packages, probability distribution packages, all sorts of really neat packages, which are essentially Java libraries, and I would like to see real statistics written into LucidDB as plug-ins [horn sounds] if anyone is interested. Thank you so much.

The notes taken during the lightning rounds were written by Ben Hengst, and can be found at openSQLcamp Lightning Talks

That last part is really the most important to me. I'm working with Nick Goodman, who recently started Dynamo Business Intelligence, and with advice from Julian Hyde and others in the Eigenbase community, to develop plugins for LucidDB which might be bundled into ADBMS versions of DynamoDB to do real statistics, making real inferences and real predictions, using the math packages from the Apache Commons, and having a transparent interface to R, so that R isn't limited by in-memory constraints.
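The plug-in idea can be shown in miniature: many databases let you register user-defined aggregates, so "real statistics" become callable inside SQL just like AVG. The sketch below uses Python's built-in SQLite as a stand-in for LucidDB's Java plug-in architecture, with an invented `obs` table:

```python
import sqlite3
import statistics

# A user-defined aggregate: once registered, MEDIAN() is available
# inside SQL exactly like the built-in AVG().
class Median:
    def __init__(self):
        self.values = []

    def step(self, value):      # called once per row
        self.values.append(value)

    def finalize(self):         # called once per group
        return statistics.median(self.values)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("median", 1, Median)
conn.execute("CREATE TABLE obs (value REAL)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(v,) for v in [1, 2, 2, 3, 100]])

result = conn.execute("SELECT median(value) FROM obs").fetchone()[0]
print(result)  # 2.0 -- the outlier 100 doesn't drag it upward
```

A LucidDB plug-in backed by Apache Commons Math would be the same shape, just in Java and with far richer statistics behind the aggregate.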

Why not join us on irc.freenode.net ##luciddb and discuss it?

Complex Data Visualization at MySQL BI DW BoaF

Just got home from the MySQL Data Warehousing and BI Birds of a Feather gathering (BoaF). I'm tired, but my mind is on overdrive. 'Tis a great feeling.

First, I want to thank Lance Walter of Pentaho for introducing Clarise and me to the group as publishers of the OSBI Lens on Squidoo, this blog and the OSS wiki.

Clarise and I had a great conversation with Dr. Jacob Nikom of the MIT Lincoln Laboratory. The conversation ranged from the great Chimay beer that Matt Casters of Pentaho and Lead Architect of KETTLE, brought with him from Belgium, to

  • Data Modeling and Relational Algebra/Theory: the integrity of the model
  • Bayesian Statistics, Weibull Analysis and Tensor Calculus for mathematical modeling of complex systems [I love it when n-dimensional eigenvalues start floating in front of my eyes]
  • Meeting the needs of different types of users: managers, scientists, business folk
  • Supplementing historical data warehouses with [near] real-time data using ESB and dashboards
  • Data visualization of complex data sets such that the analyses and limitations can be grasped at different levels by different users
  • Collaboration among distributed workgroups of disparate career backgrounds and cultural predispositions
  • Use of Second Life and other virtual worlds for collaboration and data visualization
  • A calculator is to a computer [think if statement] as a flat file is to a relational database [think WHERE clause]
  • Early USSR vs. British knock-offs of IBM mainframes
  • Complexity as the balance of robustness and fragility

At various times in this discussion, we were joined by Sherman Wood, Director of BI at JasperSoft and one of the legendary Mondrian developers; Julian Hyde of Pentaho, Mondrian Lead Architect; and Nicholas Goodman, Director of BI for Pentaho.

And if you put Nick and dashboards and virtual worlds in the same post, then you have to mention Discoverer meets Duke Nukem.

Jacob, et al., thank you so very much for a great conversation.

Cubulus Presentation

We discovered a new OLAP engine, Cubulus, registered on Sourceforge on 2007-03-07. The developer has an interesting presentation on the web, as well. [Just click anywhere on the page that isn't a link to somewhere else, and you'll go to the next slide. Or was that obvious to everyone but me?] Even though the project is pre-alpha, we've added it to our linkblog [look in the side column under OLAP]. The first database supported is MySQL, but they've recently added "hacks" for PostgreSQL support as well.

Update to our Palo Description

We were contacted by Stephanie Endlich of Jedox GmbH, who suggested a better description for Palo:

"Palo is a memory based Open Source MOLAP which is able to consolidate data hierarchies in real time and which supports write-back of data. It provides a MDDB and a free Microsoft Excel add-in. For flexible integration Palo offers APIs for Java, PHP, C and .NET. From their homepage... 'Palo is an advanced data store for Microsoft Excel that allows you to handle large amounts of Excel data on a small number of worksheets. In addition, it also allows you to share Excel data real-time with your colleagues.'"end quotation

We also have the following from their website on Palo Basics.

"Palo is made for Microsoft Excel. It is a cell-based database that is multidimensional, hierarchical and memory-based. Now what do those terms mean in particular?

"Trying to do Business Intelligence, Financial Analysis, Budgeting or Planning with Microsoft Excel? Looking at Excel workbooks that are difficult to maintain because of their size? Then Palo is for you.

"Palo is an advanced data store for Microsoft Excel that allows you to handle large amounts of Excel data on a small number of worksheets. In addition, it also allows you to share Excel data real-time with your colleagues. You get exciting new Excel features without losing Excel's flexibility. Works with your existing Excel 2000/XP/2003."

Let us know what you think in comments, and we'll update our OSBI Lens' linkblog "Links to OSS OLAP Tools" description accordingly.

Commercial Spreadsheet Services

Pentaho and Simba Technologies announced today their Spreadsheet Service allowing Microsoft Office users to access Pentaho OLAP using Microsoft Excel's Pivot Tables. Under the terms of their agreement, Pentaho becomes the exclusive distributor of Simba’s Microsoft Excel connectivity technology for use with Mondrian, which will be delivered and branded as Pentaho Spreadsheet Services.

We're trying to get more information about this new service.

[Quick Update:] Lance Walter of Pentaho quickly responded to our request for more information, and provided a link from the Pentaho Analysis page to the Pentaho Spreadsheet Services FAQ.

Spreadsheet Services is basically a client-side library. It sits on the client machine, and translates ODBO calls from Excel's PivotTable Service into XML/A calls that go to Pentaho Analysis Services / Mondrian. [Pentaho Spreadsheet Services] works with Mondrian "standalone" – meaning Mondrian deployed without any of the rest of the Pentaho platform. So if you're an existing, happy Mondrian user and want to use Pentaho Spreadsheet Services, you don't need to deploy or configure the Pentaho platform to use it.
-- From Lance Walter, VP Marketing, Pentaho



